# install.packages(c("tidyverse", "this.path", "janitor"))
library(tidyverse)
library(this.path)
# Re-import and clean the ILINet data (same steps as intro)
ili <- read_csv(here("ILINet.csv"), skip = 1)
pop <- read_csv(here("population.csv"))Intermediate R Worksheet
Welcome back!
Use this worksheet to follow along with the live coding demonstration. This workshop picks up where the Intro to R workshop left off—if you attended that session, you’re in great shape.
Today, we’ll expand your R toolkit with some powerful techniques for working with strings, dates, and multiple columns at once. We’ll continue working with the same ILINet influenza surveillance data from last time, plus a dataset called starwars that comes bundled with the tidyverse.
By the end, you should be able to wrangle messy real-world data with confidence and produce publication-ready visualizations.
Setup & Recap
Rebuild your workspace from the intro workshop
We need to get back to where we left off. Copy and run the following setup code, which loads our libraries, imports our data, and rebuilds the variables we created in the intro workshop.
ili <- ili |>
select(
REGION, YEAR, WEEK, `%UNWEIGHTED ILI`, ILITOTAL, `NUM. OF PROVIDERS`,
`TOTAL PATIENTS`
) |>
janitor::clean_names()
ili <- ili |>
mutate(
wk = case_when(
year == 2024 ~ week - 39,
year == 2025 ~ week + 13
),
ili_per_provider = ilitotal / num_of_providers
) |>
filter(region %in% c("Michigan", "Ohio", "Wisconsin")) |>
rename(
ili_total = ilitotal,
pct_unweighted_ili = percent_unweighted_ili
)
# Load and join population data
ili <- left_join(ili, pop, join_by(region == state))Run glimpse(ili) in your console to confirm everything looks right. (glimpse() is a new friend.) You should see columns for region, year, week, wk, ili_total, ili_per_provider, population, and others.
glimpse(ili)Rows: 156
Columns: 10
$ region <chr> "Michigan", "Ohio", "Wisconsin", "Michigan", "Ohio"…
$ year <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
$ week <dbl> 40, 40, 40, 41, 41, 41, 42, 42, 42, 43, 43, 43, 44,…
$ pct_unweighted_ili <dbl> 0.964664, 1.802120, 1.170010, 1.108190, 2.177180, 1…
$ ili_total <dbl> 1460, 777, 356, 1696, 971, 396, 1848, 1095, 423, 20…
$ num_of_providers <dbl> 301, 93, 66, 302, 96, 65, 302, 94, 66, 302, 95, 66,…
$ total_patients <dbl> 151348, 43116, 30427, 153042, 44599, 29061, 150246,…
$ wk <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, …
$ ili_per_provider <dbl> 4.850498, 8.354839, 5.393939, 5.615894, 10.114583, …
$ population <dbl> 10077331, 11799448, 5893718, 10077331, 11799448, 58…
String Manipulation with stringr & glue
Working with text data is a fact of life. The stringr library makes all of this easier, and every single one of its functions starts with str_ so they’re easy to discover.
Let’s also load a fun practice dataset. starwars comes with dplyr (part of the tidyverse that we’ve already covered) and has some great messy string columns. (ili is a bit light on strings, otherwise we would use that.)
data(starwars)Take a look at it with glimpse(starwars). You’ll see columns like name, hair_color, skin_color, and homeworld—all character strings.
Basic string functions
Here are some of the most useful stringr functions. Try each of these in your console to see what they do.
# Changing case
str_to_lower("MICHIGAN")
str_to_upper("michigan")
str_to_title("hello world")
# How long is a string?
str_length("epidemiology")
# Extract a piece of a string
str_sub("epidemiology", start = 1, end = 3)
# Does a string contain a pattern?
str_detect("influenza-like illness", "influenza")
# Replace part of a string
str_replace("ILI Total Cases", "ILI", "Influenza-Like Illness")
# Combine strings
str_c("Week", "40", sep = " ")
str_c("state_", c("MI", "OH", "WI"))You can find the full list of stringr functions in the stringr cheat sheet or by typing ?stringr in your console.
Using stringr in a pipeline
These functions become really powerful inside filter() and mutate(). Try these out:
# Find characters whose names contain "Skywalker"
starwars |>
filter(name |> str_detect("Skywalker"))Clean up the hair_color column by making it title case (str_to_title())
Introducing glue
glue lets you embed R expressions directly inside strings using {}. For those familiar with a bit of base-R, think of it as a fancier, more readable paste0().
library(glue)
myname <- "Jack"
glue("My name is {myname}.") |> cat()My name is Jack.
Where glue really shines is inside mutate(). Let’s build a description column from the starwars data:
starwars |>
filter(!is.na(homeworld)) |>
mutate(
description = glue("{name} is {height}cm tall and comes from {homeworld}.")
) |>
select(name, description) |>
slice_head(n = 5)# A tibble: 5 × 2
name description
<chr> <glue>
1 Luke Skywalker Luke Skywalker is 172cm tall and comes from Tatooine.
2 C-3PO C-3PO is 167cm tall and comes from Tatooine.
3 R2-D2 R2-D2 is 96cm tall and comes from Naboo.
4 Darth Vader Darth Vader is 202cm tall and comes from Tatooine.
5 Leia Organa Leia Organa is 150cm tall and comes from Alderaan.
And here’s a practical example with our ILI data—creating formatted labels:
ili |>
distinct(region) |>
mutate(label = glue("State: {str_to_upper(region)}"))# A tibble: 3 × 2
region label
<chr> <glue>
1 Michigan State: MICHIGAN
2 Ohio State: OHIO
3 Wisconsin State: WISCONSIN
Factors with forcats
Factors are character values with a defined order. Why do we care? Because ggplot2 uses factor order to determine the order of bars, legend entries, facets (we’ll get there), and more. If you’ve ever been frustrated by a bar chart that sorts alphabetically when you wanted it sorted by value, then factors are your answer.
forcats is the tidyverse library for working with factors. Every function starts with fct_. (You may be noticing a pattern with tidyverse function naming!)
The problem: unordered characters
Let’s count how many starwars characters come from each homeworld then plot it:
sw_homeworlds <- starwars |>
filter(!is.na(homeworld)) |>
count(homeworld, sort = TRUE) |>
slice_head(n = 10)
# Without factors, ggplot sorts alphabetically
sw_homeworlds |>
ggplot(aes(x = homeworld, y = n)) +
geom_col() +
coord_flip()That alphabetical ordering isn’t very informative. Let’s fix it.
Ordering factors
There are several ways to reorder factors depending on what you need:
fct_infreq(): order by how often each level appearsfct_reorder(): order by a summary of another variablefct_relevel(): specify the exact order manuallyfct_lump_n(): collapse rare levels into “Other”
Try using fct_reorder() to make the homeworld bar chart sort by count. You’ll want to place it inside a mutate() before your ggplot() call. (Try it yourself before looking at how we did it!)
Code
sw_homeworlds |>
mutate(homeworld = fct_reorder(homeworld, n)) |>
ggplot(aes(x = homeworld, y = n)) +
geom_col() +
coord_flip() +
labs(title = "Top 10 homeworlds in starwars")Chaining fct_*() functions
Here’s a nice trick: you can pipe multiple fct_*() functions together to get exactly the ordering you want.
starwars |>
filter(!is.na(homeworld)) |>
mutate(
homeworld = homeworld |>
fct_infreq() |>
fct_lump_n(n = 5) |>
fct_rev()
) |>
ggplot(aes(y = homeworld)) +
geom_col(stat = "count")We’ll come back to factors at the end of this workshop when we make a bar chart with fill= mapping. Factor ordering will control which state’s bar appears where!
Dates and Times with lubridate
lubridate is the tidyverse library for dates and times. If you’ve ever tried to do date arithmetic in Excel and gotten frustrated, you’ll appreciate what this library can do.
The problem: our data has no dates!
Look at our ili data. We have year and week columns, but no actual date. We want to construct real date values so we can do proper time-series work.
Recall from the intro workshop: MMWR (Morbidity and Mortality Weekly Report) weeks start on Sunday and end on Saturday. So for each year-week pair, we want to find the corresponding Saturday.
Let’s build up to this step by step. First, a look at how lubridate lets you set parts of a date:
start_date <- today()
year(start_date) <- ili$year[1]
wday(start_date) <- 7 # Saturday
week(start_date) <- ili$week[1]
start_date[1] "2024-10-05"
That works for one date, but we need to do this for every row. Time to write our very own function! (We’ll talk more about functions in the next section, but let’s get a sneak preview here.)
# A function for finding all the Saturdays we want
find_saturday <- function(year, week) {
# Start with an arbitrary day
date <- today()
# Reassign the year, then day of the week, then week
year(date) <- year
wday(date) <- 7
week(date) <- week
return(date)
}Test it: find_saturday(2024, 40) should give you a Saturday in early October 2024.
Now add the date column to our data:
ili <- ili |>
mutate(end_of_week = find_saturday(year, week))Run View(ili) to see the new end_of_week column.
Useful lubridate “accessors”
Once you have proper dates, lubridate makes it easy to pull out components:
ili |>
mutate(
the_year = year(end_of_week),
the_month = month(end_of_week),
the_day = day(end_of_week),
day_name = wday(end_of_week, label = TRUE, abbr = FALSE)
) |>
select(end_of_week, the_year, the_month, the_day, day_name) |>
slice_head(n = 5)# A tibble: 5 × 5
end_of_week the_year the_month the_day day_name
<date> <dbl> <dbl> <int> <ord>
1 2024-10-05 2024 10 5 Saturday
2 2024-10-05 2024 10 5 Saturday
3 2024-10-05 2024 10 5 Saturday
4 2024-10-12 2024 10 12 Saturday
5 2024-10-12 2024 10 12 Saturday
And floor_date() is great for rounding dates—for example, to group by month later:
ili |>
mutate(month_start = end_of_week |> floor_date(unit = "month")) |>
select(end_of_week, month_start) |>
slice_head(n = 10)# A tibble: 10 × 2
end_of_week month_start
<date> <date>
1 2024-10-05 2024-10-01
2 2024-10-05 2024-10-01
3 2024-10-05 2024-10-01
4 2024-10-12 2024-10-01
5 2024-10-12 2024-10-01
6 2024-10-12 2024-10-01
7 2024-10-19 2024-10-01
8 2024-10-19 2024-10-01
9 2024-10-19 2024-10-01
10 2024-10-26 2024-10-01
Periods, durations, and intervals
lubridate distinguishes between three types of time spans:
- Periods are human-readable spans: “2 months” or “1 year.” They don’t account for daylight saving time or leap years.
- Durations are exact numbers of seconds.
ddays(30)is exactly30 * 86400seconds. - Intervals are spans between two specific points in time.
Try these in your console:
# Periods
days(30)
months(3)
years(1) + months(6)
# Durations
ddays(30)
dweeks(1)
# Intervals
flu_season <- interval(ymd("2024-09-29"), ymd("2025-09-27"))
flu_season
as.period(flu_season)
as.duration(flu_season)
# Is a date within the flu season?
ymd("2025-01-15") %within% flu_season
ymd("2023-06-01") %within% flu_season
# How many weeks?
flu_season / weeks(1)When does the distinction between these matter? In short: periods are best for human-friendly calendar math (“add 1 month”), durations are best for precise scientific calculations, and intervals are best for checking whether dates fall within a range.
Formatting and parsing dates
format() (a base-R function) gives you full control over how dates display as strings. The %-codes specify what to include. Here’s a handy reference for all these codes.
today() |> format("%B %d, %Y") # "March 13, 2026" style
today() |> format("%m/%d/%y") # "03/13/26" style
today() |> format("%A, %b %d") # "Friday, Mar 13" styleGoing the other direction, lubridate parses dates from strings with ymd(), mdy(), dmy(), and friends:
ymd("2025-01-15")
mdy("01/15/2025")
mdy("January 15, 2025")The tidyverse also includes the hms library for working with times that don’t have a date attached (like “14:30:00”). We won’t cover it today, but it follows similar conventions to lubridate. Check out ?hms::hms if you’re curious.
Writing Your Own Functions
We actually already wrote a function! find_saturday() above is a real, working function that we used in our pipeline.
A function bundles up code so you can reuse it. The basic structure is:
my_function <- function(arg1, arg2) {
# do stuff
return(result)
}A practical example: rate per 100k
This is the sort of thing epidemiologists do all the time. Let’s make a function for it.
rate_per_100k <- function(count, population) {
rate <- (count / population) * 100000
return(rate)
}Test it with some simple values, then use it in a pipeline with our ILI data:
rate_per_100k(500, 10000000)[1] 5
ili |>
mutate(ili_rate = rate_per_100k(ili_total, population)) |>
select(region, end_of_week, ili_total, population, ili_rate) |>
slice_head(n = 10)# A tibble: 10 × 5
region end_of_week ili_total population ili_rate
<chr> <date> <dbl> <dbl> <dbl>
1 Michigan 2024-10-05 1460 10077331 14.5
2 Ohio 2024-10-05 777 11799448 6.59
3 Wisconsin 2024-10-05 356 5893718 6.04
4 Michigan 2024-10-12 1696 10077331 16.8
5 Ohio 2024-10-12 971 11799448 8.23
6 Wisconsin 2024-10-12 396 5893718 6.72
7 Michigan 2024-10-19 1848 10077331 18.3
8 Ohio 2024-10-19 1095 11799448 9.28
9 Wisconsin 2024-10-19 423 5893718 7.18
10 Michigan 2024-10-26 2053 10077331 20.4
(Keep in mind these aren’t real ILI totals—just totals from the ILINet provider network.)
Default argument values
What if we sometimes want to calculate per 10,000 instead of per 100,000? We can add a default value:
rate_per <- function(count, population, per = 100000) {
rate <- (count / population) * per
return(rate)
}
rate_per(500, 10000000) # uses default: per 100k[1] 5
rate_per(500, 10000000, per = 10000) # per 10k[1] 0.5
Lambda (anonymous) functions
Sometimes you don’t want to name a function—you just need it once. The shorthand syntax is \(x) <expression>:
# These do the same thing:
rate_per_100k(500, 10000000)
(\(count, pop) (count / pop) * 100000)(500, 10000000)This looks a bit odd on its own, but lambdas really shine inside across(), which is next!
You might see older tidyverse code that uses ~ .x syntax for anonymous functions (e.g., ~ .x + 1). The \(x) x + 1 shorthand is the modern, base-R equivalent. Both work, but we’ll use the newer syntax, since we like how x can be almost anything—e.g. col, day, or jedi.
Advanced mutate() with across(), tidyselect, and purrr
The problem: repetitive mutations
Imagine we want to round every numeric column in our dataset. We could write a separate round() call for each one, but that’s tedious and error-prone. across() lets us apply a function to multiple columns at once.
across() basics
across() takes two key arguments: which columns (.cols) and what function(s) (.fns). Let’s take an example where we’re “applying” the round() function “across” one column: pct_unweighted_ili
ili |>
mutate(across(c(pct_unweighted_ili), round)) |>
slice_head(n = 5)# A tibble: 5 × 11
region year week pct_unweighted_ili ili_total num_of_providers
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Michigan 2024 40 1 1460 301
2 Ohio 2024 40 2 777 93
3 Wisconsin 2024 40 1 356 66
4 Michigan 2024 41 1 1696 302
5 Ohio 2024 41 2 971 96
# ℹ 5 more variables: total_patients <dbl>, wk <dbl>, ili_per_provider <dbl>,
# population <dbl>, end_of_week <date>
tidyselect: powerful column selection
Instead of listing columns by name, we can use tidyselect’s selection helpers to pick columns programmatically. These work in select(), across(), and many other tidyverse functions.
Try some of these:
# Select columns where a condition is true
ili |> select(where(is.numeric))
# Select by name pattern
ili |> select(starts_with("ili"))
ili |> select(ends_with("providers"))
# Select from a character vector (doesn't error if a name is missing)
cols_we_want <- c("region", "ili_total", "nonexistent_column")
ili |> select(any_of(cols_we_want))
# Exclude columns with -
ili |> select(-population, -wk)
# A range of columns with :
ili |> select(year:week)
# Combine conditions with & and |
ili |> select(where(is.numeric) & !c(year, week))Now combine across() with where() to round all numeric columns in one shot:
ili |>
mutate(across(where(is.numeric), round)) |>
slice_head(n = 5)# A tibble: 5 × 11
region year week pct_unweighted_ili ili_total num_of_providers
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Michigan 2024 40 1 1460 301
2 Ohio 2024 40 2 777 93
3 Wisconsin 2024 40 1 356 66
4 Michigan 2024 41 1 1696 302
5 Ohio 2024 41 2 971 96
# ℹ 5 more variables: total_patients <dbl>, wk <dbl>, ili_per_provider <dbl>,
# population <dbl>, end_of_week <date>
Custom functions inside across()
Here’s where our rate_per_100k() function pays off. We can apply it to multiple count columns simultaneously, and use the .names argument to control the output column names:
ili |>
mutate(across(
.cols = c(ili_total, total_patients),
.fns = \(x) rate_per_100k(x, population),
.names = "{.col}_rate"
)) |>
select(
region, end_of_week, ili_total, ili_total_rate, total_patients,
total_patients_rate
) |>
slice_head(n = 5)# A tibble: 5 × 6
region end_of_week ili_total ili_total_rate total_patients total_patients_rate
<chr> <date> <dbl> <dbl> <dbl> <dbl>
1 Michi… 2024-10-05 1460 14.5 151348 1502.
2 Ohio 2024-10-05 777 6.59 43116 365.
3 Wisco… 2024-10-05 356 6.04 30427 516.
4 Michi… 2024-10-12 1696 16.8 153042 1519.
5 Ohio 2024-10-12 971 8.23 44599 378.
The {.col} placeholder in .names gets replaced with the original column name. You can also pass multiple functions as a named list:
Code
ili |>
mutate(across(
.cols = c(ili_total, total_patients),
.fns = list(
mean = \(x) mean(x, na.rm = TRUE),
max = \(x) max(x, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)) |>
select(contains("mean"), contains("max")) |>
slice_head(n = 3)# A tibble: 3 × 4
ili_total_mean total_patients_mean ili_total_max total_patients_max
<dbl> <dbl> <dbl> <dbl>
1 1908. 71672. 16349 164589
2 1908. 71672. 16349 164589
3 1908. 71672. 16349 164589
Like {.col} in the previous example, {.fn} is a special placeholder that just works in across()
if_any() and if_all() in filter()
These are like across() but for logical conditions inside filter():
# Keep rows where ANY numeric column exceeds 10,000
ili |>
filter(if_any(
where(is.numeric),
\(x) x > 10000
)) |>
select(region, end_of_week, ili_total, total_patients)# A tibble: 156 × 4
region end_of_week ili_total total_patients
<chr> <date> <dbl> <dbl>
1 Michigan 2024-10-05 1460 151348
2 Ohio 2024-10-05 777 43116
3 Wisconsin 2024-10-05 356 30427
4 Michigan 2024-10-12 1696 153042
5 Ohio 2024-10-12 971 44599
6 Wisconsin 2024-10-12 396 29061
7 Michigan 2024-10-19 1848 150246
8 Ohio 2024-10-19 1095 42216
9 Wisconsin 2024-10-19 423 29447
10 Michigan 2024-10-26 2053 158344
# ℹ 146 more rows
Try using if_all() to keep only rows where both ili_total and total_patients are above their respective medians.
Code
ili |>
filter(if_all(
c(ili_total, total_patients),
\(x) x > median(x)
)) |>
nrow()[1] 62
purrr for complex, row-wise operations
purrr is the tidyverse library for “functional programming”—applying functions systematically across data. If you’ve used for-loops in other languages, purrr is a more concise way to do the same kind of repetitive operations.
map_dbl() applies a function to each element of a vector and returns a numeric (double, aka dbl) vector:
map_dbl(c(1, 4, 9, 16), sqrt)[1] 1 2 3 4
map2_dbl() applies a function to pairs of elements from two vectors:
map2_dbl(
c(100, 200, 300), # .x
c(5000000, 8000000, 12e6), # .y
\(x, y) (x / y) * 100000 # function of .x and .y
)[1] 2.0 2.5 2.5
Here’s where it gets powerful: using map2_dbl() inside mutate() for complex derived columns that need conditional logic
ili |>
mutate(
custom_metric = map2_dbl(
ili_total,
total_patients,
\(cases, visits) if (visits > 0) cases / visits * 1000 else NA_real_
)
) |>
select(region, end_of_week, ili_total, total_patients, custom_metric) |>
slice_head(n = 10)# A tibble: 10 × 5
region end_of_week ili_total total_patients custom_metric
<chr> <date> <dbl> <dbl> <dbl>
1 Michigan 2024-10-05 1460 151348 9.65
2 Ohio 2024-10-05 777 43116 18.0
3 Wisconsin 2024-10-05 356 30427 11.7
4 Michigan 2024-10-12 1696 153042 11.1
5 Ohio 2024-10-12 971 44599 21.8
6 Wisconsin 2024-10-12 396 29061 13.6
7 Michigan 2024-10-19 1848 150246 12.3
8 Ohio 2024-10-19 1095 42216 25.9
9 Wisconsin 2024-10-19 423 29447 14.4
10 Michigan 2024-10-26 2053 158344 13.0
Admittedly, this rarely comes up, but we wanted you to get your money’s worth here!
purrr also has pmap_dbl() for when you need to apply a function across three or more columns at once. We won’t cover it today, but the concept is the same: you pass a list of vectors and a function that takes that many arguments.
Grouping & Aggregating with .by=
In the intro workshop, we used group_by() |> summarize() to calculate grouped summaries. That approach works great, but it has a subtle gotcha: group_by() can leave residual grouping on your data that causes surprises later. The .by= argument is a cleaner alternative.
Side by side: group_by() vs .by=
These produce the same result:
# The old way
ili |>
group_by(region) |>
summarize(
total_ili = sum(ili_total),
mean_rate = mean(ili_per_provider)
)# A tibble: 3 × 3
region total_ili mean_rate
<chr> <dbl> <dbl>
1 Michigan 192343 12.4
2 Ohio 76590 16.0
3 Wisconsin 28697 8.37
# The new way
ili |>
summarize(
total_ili = sum(ili_total),
mean_rate = mean(ili_per_provider),
.by = region
)# A tibble: 3 × 3
region total_ili mean_rate
<chr> <dbl> <dbl>
1 Michigan 192343 12.4
2 Ohio 76590 16.0
3 Wisconsin 28697 8.37
The .by= result is identical except for it never leaves residual grouping. No need to remember to ungroup()!
.by= in mutate(): grouped mutates
.by= also works in mutate(), which is powerful for adding summary columns without collapsing your data. Try computing each state’s weekly ILI as a percentage of the 3-state total:
ili |>
mutate(
week_total = sum(ili_total),
pct_of_total = ili_total / week_total * 100,
.by = wk
) |>
select(region, wk, ili_total, week_total, pct_of_total) |>
slice_head(n = 9)# A tibble: 9 × 5
region wk ili_total week_total pct_of_total
<chr> <dbl> <dbl> <dbl> <dbl>
1 Michigan 1 1460 2593 56.3
2 Ohio 1 777 2593 30.0
3 Wisconsin 1 356 2593 13.7
4 Michigan 2 1696 3063 55.4
5 Ohio 2 971 3063 31.7
6 Wisconsin 2 396 3063 12.9
7 Michigan 3 1848 3366 54.9
8 Ohio 3 1095 3366 32.5
9 Wisconsin 3 423 3366 12.6
Multiple grouping variables
Pass a vector of column names to .by= for multiple grouping variables:
ili |>
mutate(month = month(end_of_week, label = TRUE)) |>
summarize(
avg_ili = mean(ili_total),
.by = c(region, month)
) |>
slice_head(n = 12)# A tibble: 12 × 3
region month avg_ili
<chr> <ord> <dbl>
1 Michigan Oct 1764.
2 Ohio Oct 1021.
3 Wisconsin Oct 412.
4 Michigan Nov 2291.
5 Ohio Nov 1222.
6 Wisconsin Nov 473.
7 Michigan Dec 4015.
8 Ohio Dec 1856.
9 Wisconsin Dec 711
10 Michigan Jan 8125.
11 Ohio Jan 2549.
12 Wisconsin Jan 1146.
.by= in other verbs
.by= works in filter(), slice(), and more. Here’s a handy one: find the peak ILI week for each state.
ili |>
slice_max(ili_total, n = 1, by = region)# A tibble: 3 × 11
region year week pct_unweighted_ili ili_total num_of_providers
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Michigan 2025 7 10.6 16349 297
2 Ohio 2025 6 13.4 6583 95
3 Wisconsin 2025 6 5.09 1722 66
# ℹ 5 more variables: total_patients <dbl>, wk <dbl>, ili_per_provider <dbl>,
# population <dbl>, end_of_week <date>
Did you notice by = above instead of .by =? Some tidyverse functions like slice_max() use by without the leading dot. It’s a small (annoying!) inconsistency in dplyr, but both do the same thing conceptually. When in doubt, check the function’s documentation with ?.
Pivoting & a ggplot Payoff
Personally, I hate making pivot tables in Excel. Pivoting in R can still be annoying, but I think tidyverse actually has a pretty nice system for it! pivot_longer() reshapes “wide” data into “long” data. This is one of the most common data-reshaping tasks in R, and it’s essential for making certain kinds of plots in ggplot2.
Why pivot?
Many people think about data in Excel format: weeks as rows, one column per state. Let’s build a table like that:
ili_wide <- ili |>
select(wk, region, ili_per_provider) |>
pivot_wider(
names_from = region,
values_from = ili_per_provider
)
ili_wide |> slice_head(n = 5)# A tibble: 5 × 4
wk Michigan Ohio Wisconsin
<dbl> <dbl> <dbl> <dbl>
1 1 4.85 8.35 5.39
2 2 5.62 10.1 6.09
3 3 6.12 11.6 6.41
4 4 6.80 13.1 7.15
5 5 7.24 12.5 7.02
This is fine for a spreadsheet, but ggplot2 wants “long” format—one row per observation, with a column that identifies the group. pivot_longer() does this conversion:
ili_long <- ili_wide |>
pivot_longer(
cols = !wk,
names_to = "region",
values_to = "ili_per_provider"
)
ili_long |> slice_head(n = 10)# A tibble: 10 × 3
wk region ili_per_provider
<dbl> <chr> <dbl>
1 1 Michigan 4.85
2 1 Ohio 8.35
3 1 Wisconsin 5.39
4 2 Michigan 5.62
5 2 Ohio 10.1
6 2 Wisconsin 6.09
7 3 Michigan 6.12
8 3 Ohio 11.6
9 3 Wisconsin 6.41
10 4 Michigan 6.80
The payoff: a monthly bar chart with fill=
Now let’s bring together several things we’ve learned today—floor_date(), .by=, fct_relevel(), and pivoting—to build a polished stacked bar chart.
First, summarize ILI by month and state:
ili_monthly <- ili |>
mutate(month = floor_date(end_of_week, "month")) |>
summarize(
total_ili = sum(ili_total),
.by = c(region, month)
)Now build the plot. Use fct_relevel() to control the stacking order and scale_fill_manual() to set the colors:
ili_monthly |>
mutate(
region = region |> fct_relevel("Michigan", "Ohio", "Wisconsin")
) |>
ggplot(aes(x = month, y = total_ili, fill = region)) +
geom_col() +
scale_fill_manual(values = c(
"Michigan" = "blue",
"Ohio" = "red",
"Wisconsin" = "green"
)) +
labs(
x = "Month",
y = "Total ILI cases",
fill = "State",
title = "Monthly ILI cases by state, 2024-2025 season"
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.minor = element_blank()
)Try changing geom_col() to geom_col(position = "dodge") to get a side-by-side view instead of stacked. Which do you find more useful for comparing states?
One more pivot: comparing metrics side by side
What if we want to compare two different ILI measures on the same plot? We can pivot two columns into long format, clean up the labels with replace_values() and fct_relevel(), and use facet_wrap():
ili |>
select(region, end_of_week, ili_per_provider, pct_unweighted_ili) |>
pivot_longer(
cols = c(ili_per_provider, pct_unweighted_ili),
names_to = "metric",
values_to = "value"
) |>
mutate(
metric = metric |>
replace_values(
"ili_per_provider" ~ "ILI per provider",
"pct_unweighted_ili" ~ "% unweighted ILI"
) |>
fct_relevel("ILI per provider", "% unweighted ILI")
) |>
ggplot(aes(x = end_of_week, y = value, color = region)) +
geom_line() +
facet_wrap(~ metric, scales = "free_y", ncol = 1) +
scale_color_manual(values = c(
"Michigan" = "blue",
"Ohio" = "red",
"Wisconsin" = "green"
)) +
labs(
x = "Week ending",
y = NULL,
color = "State",
title = "Two measures of ILI burden, 2024-2025 season"
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.minor = element_blank()
)This final plot brings together string manipulation (replace_values), factors (fct_relevel), dates (the end_of_week axis), pivoting, and ggplot2 layering. Not bad for two hours!
Thank you for following along!
Don’t forget that MI-Support offers R office hours if you’d ever like to continue the conversation about R or discuss a project (free of charge!) to integrate it into your workflow.