Life Expectancy Across the World

February 25, 2023

ggplot2, scraping

I came across an interesting dataset on life expectancy for different countries on Worldometer, and thought that might make for some interesting visualizations.

First, I need to get ahold of the data, which is embedded in a web-page and isn’t in an immediately downloadable form. I’ll have to do some web-scraping. Fortunately, there’s rvest for that task.

The essential set of packages include:

library(tidyverse)
library(rvest)
library(hrbrthemes)

Data #

Let’s grab the data.

The process involves specifying the resource URL, then defining the elements on the page (in this case, I’m grabbing a distinct table), and then transforming it to a dataframe for use.

I’m also dumping the data to a .csv for good measure, in case the source page ever disappears or has a structural change that causes the web scraping logic to break.

I’m able to discern the element by using the browser’s inspector to zero in on the target.

url <- "https://www.worldometers.info/demographics/life-expectancy/"

table <- url %>%
  read_html() %>%
  html_elements("tbody") %>%
  html_table()

df <- as.data.frame(table) %>%
  rename(
    "index" = X1,
    "country" = X2,
    "life_expectancy" = X3,
    "life_expectancy_female" = X4,
    "life_expectancy_male" = X5
  )

# let's dump it out somewhere so we don't lose it
# write.csv(df, file = "df.csv")

Let’s take a look at that data now.

DT::datatable(df)

Show entries

Search:

	index	country	life_expectancy	life_expectancy_female	life_expectancy_male
1	1	Hong Kong	85.83	88.66	83
2	2	Macao	85.51	88.11	82.88
3	3	Japan	84.95	87.97	81.91
4	4	Switzerland	84.38	86.05	82.63
5	5	Singapore	84.27	86.42	82.13
6	6	Italy	84.2	86.13	82.15
7	7	South Korea	84.14	87.23	80.83
8	8	Spain	84.05	86.68	81.35
9	9	Malta	83.85	85.83	81.81
10	10	Australia	83.73	85.56	81.89

Showing 1 to 10 of 201 entries

Previous1 2 3 4 5…21Next

Lollipop #

A simple way to look at this data is with a lollipop plot, really just a glorified bar plot with a point at the end.

df %>%
  distinct(country, .keep_all = TRUE) %>%  # raw data has duplicates for some reason
  ggplot(aes(y = life_expectancy, x = reorder(country, life_expectancy))) +
  ggalt::geom_lollipop() +
  scale_y_continuous(expand = c(0,1), breaks = seq(0,100,10)) +
  coord_flip() +
  theme_ipsum_rc(grid = "X") +
  theme(
    axis.text.y = element_text(size = 8),
    plot.title.position = "plot"
  ) +
  labs(
    title = "2023 Life Expectancy Around the World",
    subtitle = "Data source: Worldometer",
    y = "Years",
    x = NULL,
    caption = "kwanlin.com"
  )

And there it is – a very simple but clear and easily printable plot of life expectancy around the world, ordered sequentially by value.

World Tile Grids #

Another way to look at this information is using a world tile grid, which blends together elements of color gradation mapped to values, and approximate geographic location. It does away with geospatial projections, since in this case, we’re not really focusing on the geographic form of countries.

A world tile grid is something I consider visually interesting, but it falls short in a number of respects:

For this to work, you need to know the ISO 2-character country codes by heart. I imagine most people don’t memorize that.
The country naming convention strictly adheres to ISO-standards. Worldometer’s country names don’t map to ISO standards, which makes mapping data to position difficult. Example: Hong Kong. (On a separate matter, trying to get different organizations to conform to standardized country naming conventions is a whole separate can of worms to deal with).

In the code snippet below, there is a whole lot of country renaming to get the Worldometer data to conform to ISO standards so it can in turn be presented in a world tile grid. It’s tedious but necessary for this sort of plot.

# devtools::install_github("hrbrmstr/worldtilegrid")
library(worldtilegrid)

df %>%
  mutate(
    country = case_when(
      grepl("Antigua", country, ignore.case = TRUE) ~ "Antigua & Barbuda",
      grepl("Czech", country, ignore.case = TRUE) ~ "Czech Republic",   
      grepl("Lucia", country, ignore.case = TRUE) ~ "St. Lucia",      
      grepl("Vincent", country, ignore.case = TRUE) ~ "St. Vincent & the Grenadines",
      grepl("United Kingdom", country, ignore.case = TRUE) ~ "Great Britain and Northern Ireland",
      grepl("United States", country, ignore.case = TRUE) ~ "United States of America",
      grepl("DR Congo", country, ignore.case = TRUE) ~ "Congo (Democratic Republic of the)",
      grepl("Russia", country, ignore.case = TRUE) ~ "Russian Federation",
      grepl("Moldova", country, ignore.case = TRUE) ~ "Moldova (Republic of)",
      grepl("Trinidad", country, ignore.case = TRUE) ~ "Trinidad & Tobago",
      grepl("Sao Tome", country, ignore.case = TRUE) ~ "Sao Tome and Principe",
      grepl("Bosnia", country, ignore.case = TRUE) ~ "Bosnia & Herzegovina",
      grepl("Macedonia", country, ignore.case = TRUE) ~ "Macedonia",
      grepl("Laos", country, ignore.case = TRUE) ~ "Lao People's Democratic Republic",
      grepl("Vietnam", country, ignore.case = TRUE) ~ "Viet Nam",
      grepl("Brunei", country, ignore.case = TRUE) ~ "Brunei Darussalam",
      grepl("Micronesia", country, ignore.case = TRUE) ~ "Micronesia (Federated States of)",	
      TRUE ~ country
    )
  ) %>%
  ggplot(aes(country = country, fill = life_expectancy)) +
  geom_wtg(border_size = .5, border_col = "white") +
  geom_text(
    aes(
      label = stat(alpha.2), 
      # colour = I(ifelse(`Thing Val` < 500, "white", "black"))
      # color = I(ifelse('life_expectancy < 50, "white", "black"))
      color = I(ifelse(life_expectancy > 80, "black", "white"))
    ),
    # color = "white",
    stat = "wtg", 
    size = 4
  ) + 
  coord_equal() +
  viridis::scale_fill_viridis(na.value = "white") + 
  labs(
    title = "2023 Life Expectancy Around the World",
    subtitle = "Data source: Worldometer",
    fill = "Years"
  ) +
  # guides(fill=guide_legend(title="Years")) +
  hrbrthemes::theme_ipsum_rc(grid="") +
  theme_enhance_wtg() +
  theme(
    legend.position = "bottom",
    legend.key.width = unit(2, "cm"),
    legend.text=element_text(size=11),
    legend.title=element_text(size=14)
  )

Dumbbell #

You might notice there are two entire columns of data we haven’t used yet – life expectancies by gender.

We shan’t let that data go to waste.

Let’s put together a dumbbell plot, ordered by female life expectancy (mainly because female populations uniformly live longer than male populations based on the data), while also showing the gap by country with male life expectancy.

df_long <- df %>%
  distinct(country, .keep_all = TRUE) %>%  # raw data has duplicates for some reason
  select(-c(index)) %>%
  mutate(diff = life_expectancy_female - life_expectancy_male) %>%
  pivot_longer(cols = c(life_expectancy_female, life_expectancy_male)) %>%
  mutate(name = gsub("life_expectancy_","", name)) %>%
  rename(gender = name, years = value)

# DT::datatable(df_long)
head(df_long)

## # A tibble: 6 × 5
##   country   life_expectancy  diff gender years
##   <chr>               <dbl> <dbl> <chr>  <dbl>
## 1 Hong Kong            85.8  5.66 female  88.7
## 2 Hong Kong            85.8  5.66 male    83  
## 3 Macao                85.5  5.23 female  88.1
## 4 Macao                85.5  5.23 male    82.9
## 5 Japan                85.0  6.06 female  88.0
## 6 Japan                85.0  6.06 male    81.9

Now let’s separate them into two separate dataframes by gender, and also add a difference dataframe that we’ll use later for labeling purposes.

df_long

## # A tibble: 400 × 5
##    country     life_expectancy  diff gender years
##    <chr>                 <dbl> <dbl> <chr>  <dbl>
##  1 Hong Kong              85.8  5.66 female  88.7
##  2 Hong Kong              85.8  5.66 male    83  
##  3 Macao                  85.5  5.23 female  88.1
##  4 Macao                  85.5  5.23 male    82.9
##  5 Japan                  85.0  6.06 female  88.0
##  6 Japan                  85.0  6.06 male    81.9
##  7 Switzerland            84.4  3.42 female  86.0
##  8 Switzerland            84.4  3.42 male    82.6
##  9 Singapore              84.3  4.29 female  86.4
## 10 Singapore              84.3  4.29 male    82.1
## # ℹ 390 more rows

male <- df_long %>% filter(gender == "male")
female <- df_long %>% filter(gender == "female")

df_diff <- df_long %>%
  filter(gender == "male") %>% # either gender choice works here, it just needs to be one of the two
  mutate(x_pos = years + diff/2) # this defines the horizontal position of the label

head(df_diff)

## # A tibble: 6 × 6
##   country     life_expectancy  diff gender years x_pos
##   <chr>                 <dbl> <dbl> <chr>  <dbl> <dbl>
## 1 Hong Kong              85.8  5.66 male    83    85.8
## 2 Macao                  85.5  5.23 male    82.9  85.5
## 3 Japan                  85.0  6.06 male    81.9  84.9
## 4 Switzerland            84.4  3.42 male    82.6  84.3
## 5 Singapore              84.3  4.29 male    82.1  84.3
## 6 Italy                  84.2  3.98 male    82.2  84.1

The data is at this point prepped for a basic dumbbell plot. Points will be used to represent each gender, and segments will be used to link the genders by country.

col_female = "#f4b41a"
col_male = "#210070"
  
ggplot(df_long) +
  # add the segments by country by gender
  geom_segment(
    data = male,
    aes(
      x = years, 
      y = reorder(country, female$years),
      xend = female$years,
      yend = female$country
    ),
    color = "#aeb6bf",
    size = 2, # sized to match point size
    alpha = .7
  ) +
  # add the points by gender by country
  geom_point(
    aes(
      x = years,
      y = country,
      color = gender
    )
  ) +
  # add labels to the segments reflect difference
  geom_text(
    data = df_diff,
    aes(
      label = sprintf("%s", round(diff, digits = 1)),
      x = x_pos,
      y = country
    ),
    size = 2,
    color = "grey20"
  ) +
  scale_color_manual(values = c(female = col_female, male = col_male)) +
  scale_x_continuous(breaks = seq(0,100, 5), expand = c(0,0), limits = c(50,90)) +
  theme_ipsum_rc() +
  theme(
    axis.text.y = element_text(size = 8),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    legend.position = "none",
    plot.title = ggtext::element_markdown(size = 16), # element_markdown allows for fancier formatting
    plot.caption = ggtext::element_markdown(lineheight = 1.5, hjust = 0)
  ) +
  labs(
    x = "Years",
    y = NULL,
    title = sprintf("2023 Life Expectancy Around the World for <span style = 'color: %s;'>**Females**</span> and <span style = 'color: %s;'>**Males**</span>", col_female, col_male),
    subtitle = "Segment label represents difference in life expectancy between genders in years.",
    caption = "Plot by Kwan Lin | __kwanlin.com__<br>Data from Worldometer | worldometers.info/demographics/life-expectancy"
  )

I’m still trying to figure out what are good color pairs to show discrete groups. I found this paired color guide that I’ve been experimenting with.

Update #

I floated the dumbbell plot around, and received some excellent constructive feedback. Some of the major points were: * There’s just too much going on * It’s difficult to line the dumbbells up to the countries * It’s hard to line values up with the x-axis labels because of how tall the plot is

Here’s a revised dumbbell plot addressing much of that feedback. Rather than showing everything, it shows a subset of the overall data, with several changes to the labeling and proportions of elements.

First, let’s filter the data to cases where the life expectancy between genders differs by at least 5 years.

top_diff <- df_diff %>% arrange(desc(diff)) %>% head(20)

male_sub <- male %>% filter(country %in% top_diff$country)
female_sub <- female %>% filter(country %in% top_diff$country)
df_long_sub <- df_long %>% filter(country %in% top_diff$country)

ggplot(df_long_sub) +
  # add the segments by country by gender
  geom_segment(
    data = male_sub,
    aes(
      x = years, 
      y = reorder(country, female_sub$years),
      xend = female_sub$years,
      yend = female_sub$country,
    ),
    color = "#aeb6bf",
    size = 5, # sized to match point size
    alpha = .7
  ) +
  # add the points by gender by country
  geom_point(
    aes(
      x = years,
      y = country,
      color = gender
    ),
    size = 7
  ) +
  # add labels to the segments reflect difference
  geom_text(
    data = top_diff,
    aes(
      label = sprintf("%s", round(diff, digits = 1)),
      x = x_pos,
      y = country
    ),
    size = 3,
    color = "grey20"
  ) +
  # move country labels to left of female points
  geom_text(
    data = male_sub,
    aes(
      label = sprintf("%s", country),
      x = years - .5,
      y = country
    ),
    size = 4,
    hjust = 1
  ) +
  # add year label to male points
  geom_text(
    data = male_sub,
    aes(
      label = sprintf("%s", round(years, digits = 1)),
      x = years,
      y = country
    ),
    color = "white",
    size = 2.5
  ) +
  # add year label to female points
  geom_text(
    data = female_sub,
    aes(
      label = sprintf("%s", round(years, digits = 1)),
      x = years,
      y = country
    ),
    color = "grey20",
    size = 2.5
  ) +
  scale_color_manual(values = c(female = col_female, male = col_male)) +
  scale_x_continuous(breaks = seq(0,100, 5), expand = c(0,0), limits = c(53,85)) +
  theme_ipsum_rc(grid ="X") +
  theme(
    axis.text.x = element_text(size = 10),
    axis.text.y = element_blank(),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    legend.position = "none",
    plot.title = ggtext::element_markdown(size = 16), # element_markdown allows for fancier formatting
    plot.caption = ggtext::element_markdown(lineheight = 1.5, hjust = 0)
  ) +
  labs(
    x = "Life Expectancy in Years",
    y = NULL,
    title = sprintf("Top 20 Countries with the Most Significant Life Expectancy Differences Between <span style = 'color: %s;'>**Males**</span> and <span style = 'color: %s;'>**Females**</span> in 2023", col_male, col_female),
    subtitle = "Segment label represents difference in life expectancy between males and females, measured in years.",
    caption = "Plot by Kwan Lin | __kwanlin.com__<br>Data from Worldometer | worldometers.info/demographics/life-expectancy"
  )

Conclusion #

In this post, we went through a process of wrangling some embedded data from a website, and from that data, we were able to generate some very distinct views of that data.

After receiving many points of useful feedback, I went back and modified the dumbbell plot. If you’re following along, hopefully you find the process helpful.

If you have any questions or comments, feel free to drop me a note.