Intermediate R Worksheet

Author

MI-Support

Published

March 13, 2026

Welcome back!

Use this worksheet to follow along with the live coding demonstration. This workshop picks up where the Intro to R workshop left off—if you attended that session, you’re in great shape.

Today, we’ll expand your R toolkit with some powerful techniques for working with strings, dates, and multiple columns at once. We’ll continue working with the same ILINet influenza surveillance data from last time, plus a dataset called starwars that comes bundled with the tidyverse.

By the end, you should be able to wrangle messy real-world data with confidence and produce publication-ready visualizations.

Setup & Recap

Rebuild your workspace from the intro workshop

We need to get back to where we left off. Copy and run the following setup code, which loads our libraries, imports our data, and rebuilds the variables we created in the intro workshop.

# install.packages(c("tidyverse", "this.path", "janitor"))
library(tidyverse)
library(this.path)

# Re-import and clean the ILINet data (same steps as intro)
ili <- read_csv(here("ILINet.csv"), skip = 1)
pop <- read_csv(here("population.csv"))

ili <- ili |>
  select(
    REGION, YEAR, WEEK, `%UNWEIGHTED ILI`, ILITOTAL, `NUM. OF PROVIDERS`,
    `TOTAL PATIENTS`
  ) |>
  janitor::clean_names()

ili <- ili |>
  mutate(
    wk = case_when(
      year == 2024 ~ week - 39,
      year == 2025 ~ week + 13
    ),
    ili_per_provider = ilitotal / num_of_providers
  ) |>
  filter(region %in% c("Michigan", "Ohio", "Wisconsin")) |>
  rename(
    ili_total = ilitotal,
    pct_unweighted_ili = percent_unweighted_ili
  )

# Load and join population data
ili <- left_join(ili, pop, join_by(region == state))

Run glimpse(ili) in your console to confirm everything looks right. (glimpse() is a new friend.) You should see columns for region, year, week, wk, ili_total, ili_per_provider, population, and others.

glimpse(ili)

Rows: 156
Columns: 10
$ region             <chr> "Michigan", "Ohio", "Wisconsin", "Michigan", "Ohio"…
$ year               <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
$ week               <dbl> 40, 40, 40, 41, 41, 41, 42, 42, 42, 43, 43, 43, 44,…
$ pct_unweighted_ili <dbl> 0.964664, 1.802120, 1.170010, 1.108190, 2.177180, 1…
$ ili_total          <dbl> 1460, 777, 356, 1696, 971, 396, 1848, 1095, 423, 20…
$ num_of_providers   <dbl> 301, 93, 66, 302, 96, 65, 302, 94, 66, 302, 95, 66,…
$ total_patients     <dbl> 151348, 43116, 30427, 153042, 44599, 29061, 150246,…
$ wk                 <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, …
$ ili_per_provider   <dbl> 4.850498, 8.354839, 5.393939, 5.615894, 10.114583, …
$ population         <dbl> 10077331, 11799448, 5893718, 10077331, 11799448, 58…

String Manipulation with `stringr` & `glue`

Working with text data is a fact of life. The stringr library makes all of this easier, and every single one of its functions starts with str_ so they’re easy to discover.

Let’s also load a fun practice dataset. starwars comes with dplyr (part of the tidyverse that we’ve already covered) and has some great messy string columns. (ili is a bit light on strings, otherwise we would use that.)

data(starwars)

Take a look at it with glimpse(starwars). You’ll see columns like name, hair_color, skin_color, and homeworld—all character strings.

Basic string functions

Here are some of the most useful stringr functions. Try each of these in your console to see what they do.

# Changing case
str_to_lower("MICHIGAN")
str_to_upper("michigan")
str_to_title("hello world")

# How long is a string?
str_length("epidemiology")

# Extract a piece of a string
str_sub("epidemiology", start = 1, end = 3)

# Does a string contain a pattern?
str_detect("influenza-like illness", "influenza")

# Replace part of a string
str_replace("ILI Total Cases", "ILI", "Influenza-Like Illness")

# Combine strings
str_c("Week", "40", sep = " ")
str_c("state_", c("MI", "OH", "WI"))

Tip

You can find the full list of stringr functions in the stringr cheat sheet or by typing ?stringr in your console.

Using `stringr` in a pipeline

These functions become really powerful inside filter() and mutate(). Try these out:

# Find characters whose names contain "Skywalker"
starwars |>
  filter(name |> str_detect("Skywalker"))

Clean up the hair_color column by making it title case (str_to_title())

Introducing `glue`

glue lets you embed R expressions directly inside strings using {}. For those familiar with a bit of base-R, think of it as a fancier, more readable paste0().

library(glue)

myname <- "Jack"
glue("My name is {myname}.") |> cat()

My name is Jack.

Where glue really shines is inside mutate(). Let’s build a description column from the starwars data:

starwars |>
  filter(!is.na(homeworld)) |>
  mutate(
    description = glue("{name} is {height}cm tall and comes from {homeworld}.")
  ) |>
  select(name, description) |>
  slice_head(n = 5)

# A tibble: 5 × 2
  name           description                                          
  <chr>          <glue>                                               
1 Luke Skywalker Luke Skywalker is 172cm tall and comes from Tatooine.
2 C-3PO          C-3PO is 167cm tall and comes from Tatooine.         
3 R2-D2          R2-D2 is 96cm tall and comes from Naboo.             
4 Darth Vader    Darth Vader is 202cm tall and comes from Tatooine.   
5 Leia Organa    Leia Organa is 150cm tall and comes from Alderaan.

And here’s a practical example with our ILI data—creating formatted labels:

ili |>
  distinct(region) |>
  mutate(label = glue("State: {str_to_upper(region)}"))

# A tibble: 3 × 2
  region    label           
  <chr>     <glue>          
1 Michigan  State: MICHIGAN 
2 Ohio      State: OHIO     
3 Wisconsin State: WISCONSIN

Factors with `forcats`

Factors are character values with a defined order. Why do we care? Because ggplot2 uses factor order to determine the order of bars, legend entries, facets (we’ll get there), and more. If you’ve ever been frustrated by a bar chart that sorts alphabetically when you wanted it sorted by value, then factors are your answer.

forcats is the tidyverse library for working with factors. Every function starts with fct_. (You may be noticing a pattern with tidyverse function naming!)

The problem: unordered characters

Let’s count how many starwars characters come from each homeworld then plot it:

sw_homeworlds <- starwars |>
  filter(!is.na(homeworld)) |>
  count(homeworld, sort = TRUE) |>
  slice_head(n = 10)

# Without factors, ggplot sorts alphabetically
sw_homeworlds |>
  ggplot(aes(x = homeworld, y = n)) +
  geom_col() +
  coord_flip()

That alphabetical ordering isn’t very informative. Let’s fix it.

Ordering factors

There are several ways to reorder factors depending on what you need:

fct_infreq(): order by how often each level appears
fct_reorder(): order by a summary of another variable
fct_relevel(): specify the exact order manually
fct_lump_n(): collapse rare levels into “Other”

Try using fct_reorder() to make the homeworld bar chart sort by count. You’ll want to place it inside a mutate() before your ggplot() call. (Try it yourself before looking at how we did it!)

Code

sw_homeworlds |>
  mutate(homeworld = fct_reorder(homeworld, n)) |>
  ggplot(aes(x = homeworld, y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 homeworlds in starwars")

Chaining `fct_*()` functions

Here’s a nice trick: you can pipe multiple fct_*() functions together to get exactly the ordering you want.

starwars |>
  filter(!is.na(homeworld)) |>
  mutate(
    homeworld = homeworld |>
      fct_infreq() |>
      fct_lump_n(n = 5) |>
      fct_rev()
  ) |>
  ggplot(aes(y = homeworld)) +
  geom_col(stat = "count")

Note

We’ll come back to factors at the end of this workshop when we make a bar chart with fill= mapping. Factor ordering will control which state’s bar appears where!

Dates and Times with `lubridate`

lubridate is the tidyverse library for dates and times. If you’ve ever tried to do date arithmetic in Excel and gotten frustrated, you’ll appreciate what this library can do.

The problem: our data has no dates!

Look at our ili data. We have year and week columns, but no actual date. We want to construct real date values so we can do proper time-series work.

Recall from the intro workshop: MMWR (Morbidity and Mortality Weekly Report) weeks start on Sunday and end on Saturday. So for each year-week pair, we want to find the corresponding Saturday.

Let’s build up to this step by step. First, a look at how lubridate lets you set parts of a date:

start_date <- today()
year(start_date) <- ili$year[1]
wday(start_date) <- 7   # Saturday
week(start_date) <- ili$week[1]
start_date

[1] "2024-10-05"

That works for one date, but we need to do this for every row. Time to write our very own function! (We’ll talk more about functions in the next section, but let’s get a sneak preview here.)

# A function for finding all the Saturdays we want
find_saturday <- function(year, week) {
  # Start with an arbitrary day
  date <- today()
  
  # Reassign the year, then day of the week, then week
  year(date) <- year
  wday(date) <- 7
  week(date) <- week
  
  return(date)
}

Test it: find_saturday(2024, 40) should give you a Saturday in early October 2024.

Now add the date column to our data:

ili <- ili |>
  mutate(end_of_week = find_saturday(year, week))

Run View(ili) to see the new end_of_week column.

Useful lubridate “accessors”

Once you have proper dates, lubridate makes it easy to pull out components:

ili |>
  mutate(
    the_year  = year(end_of_week),
    the_month = month(end_of_week),
    the_day   = day(end_of_week),
    day_name  = wday(end_of_week, label = TRUE, abbr = FALSE)
  ) |>
  select(end_of_week, the_year, the_month, the_day, day_name) |>
  slice_head(n = 5)

# A tibble: 5 × 5
  end_of_week the_year the_month the_day day_name
  <date>         <dbl>     <dbl>   <int> <ord>   
1 2024-10-05      2024        10       5 Saturday
2 2024-10-05      2024        10       5 Saturday
3 2024-10-05      2024        10       5 Saturday
4 2024-10-12      2024        10      12 Saturday
5 2024-10-12      2024        10      12 Saturday

And floor_date() is great for rounding dates—for example, to group by month later:

ili |>
  mutate(month_start = end_of_week |> floor_date(unit = "month")) |>
  select(end_of_week, month_start) |>
  slice_head(n = 10)

# A tibble: 10 × 2
   end_of_week month_start
   <date>      <date>     
 1 2024-10-05  2024-10-01 
 2 2024-10-05  2024-10-01 
 3 2024-10-05  2024-10-01 
 4 2024-10-12  2024-10-01 
 5 2024-10-12  2024-10-01 
 6 2024-10-12  2024-10-01 
 7 2024-10-19  2024-10-01 
 8 2024-10-19  2024-10-01 
 9 2024-10-19  2024-10-01 
10 2024-10-26  2024-10-01

Periods, durations, and intervals

lubridate distinguishes between three types of time spans:

Periods are human-readable spans: “2 months” or “1 year.” They don’t account for daylight saving time or leap years.
Durations are exact numbers of seconds. ddays(30) is exactly 30 * 86400 seconds.
Intervals are spans between two specific points in time.

Try these in your console:

# Periods
days(30)
months(3)
years(1) + months(6)

# Durations
ddays(30)
dweeks(1)

# Intervals
flu_season <- interval(ymd("2024-09-29"), ymd("2025-09-27"))
flu_season

as.period(flu_season)
as.duration(flu_season)

# Is a date within the flu season?
ymd("2025-01-15") %within% flu_season
ymd("2023-06-01") %within% flu_season

# How many weeks?
flu_season / weeks(1)

When does the distinction between these matter? In short: periods are best for human-friendly calendar math (“add 1 month”), durations are best for precise scientific calculations, and intervals are best for checking whether dates fall within a range.

Formatting and parsing dates

format() (a base-R function) gives you full control over how dates display as strings. The %-codes specify what to include. Here’s a handy reference for all these codes.

today() |> format("%B %d, %Y")    # "March 13, 2026" style
today() |> format("%m/%d/%y")     # "03/13/26" style
today() |> format("%A, %b %d")    # "Friday, Mar 13" style

Going the other direction, lubridate parses dates from strings with ymd(), mdy(), dmy(), and friends:

ymd("2025-01-15")
mdy("01/15/2025")
mdy("January 15, 2025")

Note

The tidyverse also includes the hms library for working with times that don’t have a date attached (like “14:30:00”). We won’t cover it today, but it follows similar conventions to lubridate. Check out ?hms::hms if you’re curious.

Writing Your Own Functions

We actually already wrote a function! find_saturday() above is a real, working function that we used in our pipeline.

A function bundles up code so you can reuse it. The basic structure is:

my_function <- function(arg1, arg2) {
  # do stuff
  return(result)
}

A practical example: rate per 100k

This is the sort of thing epidemiologists do all the time. Let’s make a function for it.

rate_per_100k <- function(count, population) {
  rate <- (count / population) * 100000
  return(rate)
}

Test it with some simple values, then use it in a pipeline with our ILI data:

rate_per_100k(500, 10000000)

[1] 5

ili |>
  mutate(ili_rate = rate_per_100k(ili_total, population)) |>
  select(region, end_of_week, ili_total, population, ili_rate) |>
  slice_head(n = 10)

# A tibble: 10 × 5
   region    end_of_week ili_total population ili_rate
   <chr>     <date>          <dbl>      <dbl>    <dbl>
 1 Michigan  2024-10-05       1460   10077331    14.5 
 2 Ohio      2024-10-05        777   11799448     6.59
 3 Wisconsin 2024-10-05        356    5893718     6.04
 4 Michigan  2024-10-12       1696   10077331    16.8 
 5 Ohio      2024-10-12        971   11799448     8.23
 6 Wisconsin 2024-10-12        396    5893718     6.72
 7 Michigan  2024-10-19       1848   10077331    18.3 
 8 Ohio      2024-10-19       1095   11799448     9.28
 9 Wisconsin 2024-10-19        423    5893718     7.18
10 Michigan  2024-10-26       2053   10077331    20.4

(Keep in mind these aren’t real ILI totals—just totals from the ILINet provider network.)

Default argument values

What if we sometimes want to calculate per 10,000 instead of per 100,000? We can add a default value:

rate_per <- function(count, population, per = 100000) {
  rate <- (count / population) * per
  return(rate)
}

rate_per(500, 10000000)               # uses default: per 100k

[1] 5

rate_per(500, 10000000, per = 10000)  # per 10k

[1] 0.5

Lambda (anonymous) functions

Sometimes you don’t want to name a function—you just need it once. The shorthand syntax is \(x) <expression>:

# These do the same thing:
rate_per_100k(500, 10000000)
(\(count, pop) (count / pop) * 100000)(500, 10000000)

This looks a bit odd on its own, but lambdas really shine inside across(), which is next!

Tip

You might see older tidyverse code that uses ~ .x syntax for anonymous functions (e.g., ~ .x + 1). The \(x) x + 1 shorthand is the modern, base-R equivalent. Both work, but we’ll use the newer syntax, since we like how x can be almost anything—e.g. col, day, or jedi.

Advanced `mutate()` with `across()`, tidyselect, and `purrr`

The problem: repetitive mutations

Imagine we want to round every numeric column in our dataset. We could write a separate round() call for each one, but that’s tedious and error-prone. across() lets us apply a function to multiple columns at once.

`across()` basics

across() takes two key arguments: which columns (.cols) and what function(s) (.fns). Let’s take an example where we’re “applying” the round() function “across” one column: pct_unweighted_ili

ili |>
  mutate(across(c(pct_unweighted_ili), round)) |>
  slice_head(n = 5)

# A tibble: 5 × 11
  region     year  week pct_unweighted_ili ili_total num_of_providers
  <chr>     <dbl> <dbl>              <dbl>     <dbl>            <dbl>
1 Michigan   2024    40                  1      1460              301
2 Ohio       2024    40                  2       777               93
3 Wisconsin  2024    40                  1       356               66
4 Michigan   2024    41                  1      1696              302
5 Ohio       2024    41                  2       971               96
# ℹ 5 more variables: total_patients <dbl>, wk <dbl>, ili_per_provider <dbl>,
#   population <dbl>, end_of_week <date>

`tidyselect`: powerful column selection

Instead of listing columns by name, we can use tidyselect’s selection helpers to pick columns programmatically. These work in select(), across(), and many other tidyverse functions.

Try some of these:

# Select columns where a condition is true
ili |> select(where(is.numeric))

# Select by name pattern
ili |> select(starts_with("ili"))
ili |> select(ends_with("providers"))

# Select from a character vector (doesn't error if a name is missing)
cols_we_want <- c("region", "ili_total", "nonexistent_column")
ili |> select(any_of(cols_we_want))

# Exclude columns with -
ili |> select(-population, -wk)

# A range of columns with :
ili |> select(year:week)

# Combine conditions with & and |
ili |> select(where(is.numeric) & !c(year, week))

Now combine across() with where() to round all numeric columns in one shot:

ili |>
  mutate(across(where(is.numeric), round)) |>
  slice_head(n = 5)

# A tibble: 5 × 11
  region     year  week pct_unweighted_ili ili_total num_of_providers
  <chr>     <dbl> <dbl>              <dbl>     <dbl>            <dbl>
1 Michigan   2024    40                  1      1460              301
2 Ohio       2024    40                  2       777               93
3 Wisconsin  2024    40                  1       356               66
4 Michigan   2024    41                  1      1696              302
5 Ohio       2024    41                  2       971               96
# ℹ 5 more variables: total_patients <dbl>, wk <dbl>, ili_per_provider <dbl>,
#   population <dbl>, end_of_week <date>

Custom functions inside `across()`

Here’s where our rate_per_100k() function pays off. We can apply it to multiple count columns simultaneously, and use the .names argument to control the output column names:

ili |>
  mutate(across(
    .cols = c(ili_total, total_patients),
    .fns  = \(x) rate_per_100k(x, population),
    .names = "{.col}_rate"
  )) |>
  select(
    region, end_of_week, ili_total, ili_total_rate, total_patients,
    total_patients_rate
  ) |>
  slice_head(n = 5)

# A tibble: 5 × 6
  region end_of_week ili_total ili_total_rate total_patients total_patients_rate
  <chr>  <date>          <dbl>          <dbl>          <dbl>               <dbl>
1 Michi… 2024-10-05       1460          14.5          151348               1502.
2 Ohio   2024-10-05        777           6.59          43116                365.
3 Wisco… 2024-10-05        356           6.04          30427                516.
4 Michi… 2024-10-12       1696          16.8          153042               1519.
5 Ohio   2024-10-12        971           8.23          44599                378.

The {.col} placeholder in .names gets replaced with the original column name. You can also pass multiple functions as a named list:

Code

ili |>
  mutate(across(
    .cols = c(ili_total, total_patients),
    .fns  = list(
      mean = \(x) mean(x, na.rm = TRUE),
      max  = \(x) max(x, na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  )) |>
  select(contains("mean"), contains("max")) |>
  slice_head(n = 3)

# A tibble: 3 × 4
  ili_total_mean total_patients_mean ili_total_max total_patients_max
           <dbl>               <dbl>         <dbl>              <dbl>
1          1908.              71672.         16349             164589
2          1908.              71672.         16349             164589
3          1908.              71672.         16349             164589

Like {.col} in the previous example, {.fn} is a special placeholder that just works in across()

`if_any()` and `if_all()` in `filter()`

These are like across() but for logical conditions inside filter():

# Keep rows where ANY numeric column exceeds 10,000
ili |>
  filter(if_any(
    where(is.numeric),
    \(x) x > 10000
  )) |>
  select(region, end_of_week, ili_total, total_patients)

# A tibble: 156 × 4
   region    end_of_week ili_total total_patients
   <chr>     <date>          <dbl>          <dbl>
 1 Michigan  2024-10-05       1460         151348
 2 Ohio      2024-10-05        777          43116
 3 Wisconsin 2024-10-05        356          30427
 4 Michigan  2024-10-12       1696         153042
 5 Ohio      2024-10-12        971          44599
 6 Wisconsin 2024-10-12        396          29061
 7 Michigan  2024-10-19       1848         150246
 8 Ohio      2024-10-19       1095          42216
 9 Wisconsin 2024-10-19        423          29447
10 Michigan  2024-10-26       2053         158344
# ℹ 146 more rows

Try using if_all() to keep only rows where both ili_total and total_patients are above their respective medians.

Code

ili |>
  filter(if_all(
    c(ili_total, total_patients),
    \(x) x > median(x)
  )) |>
  nrow()

[1] 62

`purrr` for complex, row-wise operations

purrr is the tidyverse library for “functional programming”—applying functions systematically across data. If you’ve used for-loops in other languages, purrr is a more concise way to do the same kind of repetitive operations.

map_dbl() applies a function to each element of a vector and returns a numeric (double, aka dbl) vector:

map_dbl(c(1, 4, 9, 16), sqrt)

[1] 1 2 3 4

map2_dbl() applies a function to pairs of elements from two vectors:

map2_dbl(
  c(100, 200, 300),           # .x
  c(5000000, 8000000, 12e6),  # .y
  \(x, y) (x / y) * 100000   # function of .x and .y
)

[1] 2.0 2.5 2.5

Here’s where it gets powerful: using map2_dbl() inside mutate() for complex derived columns that need conditional logic

ili |>
  mutate(
    custom_metric = map2_dbl(
      ili_total,
      total_patients,
      \(cases, visits) if (visits > 0) cases / visits * 1000 else NA_real_
    )
  ) |>
  select(region, end_of_week, ili_total, total_patients, custom_metric) |>
  slice_head(n = 10)

# A tibble: 10 × 5
   region    end_of_week ili_total total_patients custom_metric
   <chr>     <date>          <dbl>          <dbl>         <dbl>
 1 Michigan  2024-10-05       1460         151348          9.65
 2 Ohio      2024-10-05        777          43116         18.0 
 3 Wisconsin 2024-10-05        356          30427         11.7 
 4 Michigan  2024-10-12       1696         153042         11.1 
 5 Ohio      2024-10-12        971          44599         21.8 
 6 Wisconsin 2024-10-12        396          29061         13.6 
 7 Michigan  2024-10-19       1848         150246         12.3 
 8 Ohio      2024-10-19       1095          42216         25.9 
 9 Wisconsin 2024-10-19        423          29447         14.4 
10 Michigan  2024-10-26       2053         158344         13.0

Admittedly, this rarely comes up, but we wanted you to get your money’s worth here!

Note

purrr also has pmap_dbl() for when you need to apply a function across three or more columns at once. We won’t cover it today, but the concept is the same: you pass a list of vectors and a function that takes that many arguments.

Grouping & Aggregating with `.by=`

In the intro workshop, we used group_by() |> summarize() to calculate grouped summaries. That approach works great, but it has a subtle gotcha: group_by() can leave residual grouping on your data that causes surprises later. The .by= argument is a cleaner alternative.

Side by side: `group_by()` vs `.by=`

These produce the same result:

# The old way
ili |>
  group_by(region) |>
  summarize(
    total_ili = sum(ili_total),
    mean_rate = mean(ili_per_provider)
  )

# A tibble: 3 × 3
  region    total_ili mean_rate
  <chr>         <dbl>     <dbl>
1 Michigan     192343     12.4 
2 Ohio          76590     16.0 
3 Wisconsin     28697      8.37

# The new way
ili |>
  summarize(
    total_ili = sum(ili_total),
    mean_rate = mean(ili_per_provider),
    .by = region
  )

# A tibble: 3 × 3
  region    total_ili mean_rate
  <chr>         <dbl>     <dbl>
1 Michigan     192343     12.4 
2 Ohio          76590     16.0 
3 Wisconsin     28697      8.37

The .by= result is identical except for it never leaves residual grouping. No need to remember to ungroup()!

`.by=` in `mutate()`: grouped mutates

.by= also works in mutate(), which is powerful for adding summary columns without collapsing your data. Try computing each state’s weekly ILI as a percentage of the 3-state total:

ili |>
  mutate(
    week_total = sum(ili_total),
    pct_of_total = ili_total / week_total * 100,
    .by = wk
  ) |>
  select(region, wk, ili_total, week_total, pct_of_total) |>
  slice_head(n = 9)

# A tibble: 9 × 5
  region       wk ili_total week_total pct_of_total
  <chr>     <dbl>     <dbl>      <dbl>        <dbl>
1 Michigan      1      1460       2593         56.3
2 Ohio          1       777       2593         30.0
3 Wisconsin     1       356       2593         13.7
4 Michigan      2      1696       3063         55.4
5 Ohio          2       971       3063         31.7
6 Wisconsin     2       396       3063         12.9
7 Michigan      3      1848       3366         54.9
8 Ohio          3      1095       3366         32.5
9 Wisconsin     3       423       3366         12.6

Multiple grouping variables

Pass a vector of column names to .by= for multiple grouping variables:

ili |>
  mutate(month = month(end_of_week, label = TRUE)) |>
  summarize(
    avg_ili = mean(ili_total),
    .by = c(region, month)
  ) |>
  slice_head(n = 12)

# A tibble: 12 × 3
   region    month avg_ili
   <chr>     <ord>   <dbl>
 1 Michigan  Oct     1764.
 2 Ohio      Oct     1021.
 3 Wisconsin Oct      412.
 4 Michigan  Nov     2291.
 5 Ohio      Nov     1222.
 6 Wisconsin Nov      473.
 7 Michigan  Dec     4015.
 8 Ohio      Dec     1856.
 9 Wisconsin Dec      711 
10 Michigan  Jan     8125.
11 Ohio      Jan     2549.
12 Wisconsin Jan     1146.

`.by=` in other verbs

.by= works in filter(), slice(), and more. Here’s a handy one: find the peak ILI week for each state.

ili |>
  slice_max(ili_total, n = 1, by = region)

# A tibble: 3 × 11
  region     year  week pct_unweighted_ili ili_total num_of_providers
  <chr>     <dbl> <dbl>              <dbl>     <dbl>            <dbl>
1 Michigan   2025     7              10.6      16349              297
2 Ohio       2025     6              13.4       6583               95
3 Wisconsin  2025     6               5.09      1722               66
# ℹ 5 more variables: total_patients <dbl>, wk <dbl>, ili_per_provider <dbl>,
#   population <dbl>, end_of_week <date>

Blink and you’ll miss it!

Did you notice by = above instead of .by =? Some tidyverse functions like slice_max() use by without the leading dot. It’s a small (annoying!) inconsistency in dplyr, but both do the same thing conceptually. When in doubt, check the function’s documentation with ?.

Pivoting & a `ggplot` Payoff

Personally, I hate making pivot tables in Excel. Pivoting in R can still be annoying, but I think tidyverse actually has a pretty nice system for it! pivot_longer() reshapes “wide” data into “long” data. This is one of the most common data-reshaping tasks in R, and it’s essential for making certain kinds of plots in ggplot2.

Why pivot?

Many people think about data in Excel format: weeks as rows, one column per state. Let’s build a table like that:

ili_wide <- ili |>
  select(wk, region, ili_per_provider) |>
  pivot_wider(
    names_from  = region,
    values_from = ili_per_provider
  )

ili_wide |> slice_head(n = 5)

# A tibble: 5 × 4
     wk Michigan  Ohio Wisconsin
  <dbl>    <dbl> <dbl>     <dbl>
1     1     4.85  8.35      5.39
2     2     5.62 10.1       6.09
3     3     6.12 11.6       6.41
4     4     6.80 13.1       7.15
5     5     7.24 12.5       7.02

This is fine for a spreadsheet, but ggplot2 wants “long” format—one row per observation, with a column that identifies the group. pivot_longer() does this conversion:

ili_long <- ili_wide |>
  pivot_longer(
    cols      = !wk,
    names_to  = "region",
    values_to = "ili_per_provider"
  )

ili_long |> slice_head(n = 10)

# A tibble: 10 × 3
      wk region    ili_per_provider
   <dbl> <chr>                <dbl>
 1     1 Michigan              4.85
 2     1 Ohio                  8.35
 3     1 Wisconsin             5.39
 4     2 Michigan              5.62
 5     2 Ohio                 10.1 
 6     2 Wisconsin             6.09
 7     3 Michigan              6.12
 8     3 Ohio                 11.6 
 9     3 Wisconsin             6.41
10     4 Michigan              6.80

The payoff: a monthly bar chart with `fill=`

Now let’s bring together several things we’ve learned today—floor_date(), .by=, fct_relevel(), and pivoting—to build a polished stacked bar chart.

First, summarize ILI by month and state:

ili_monthly <- ili |>
  mutate(month = floor_date(end_of_week, "month")) |>
  summarize(
    total_ili = sum(ili_total),
    .by = c(region, month)
  )

Now build the plot. Use fct_relevel() to control the stacking order and scale_fill_manual() to set the colors:

ili_monthly |>
  mutate(
    region = region |> fct_relevel("Michigan", "Ohio", "Wisconsin")
  ) |>
  ggplot(aes(x = month, y = total_ili, fill = region)) +
  geom_col() +
  scale_fill_manual(values = c(
    "Michigan"  = "blue",
    "Ohio"      = "red",
    "Wisconsin" = "green"
  )) +
  labs(
    x = "Month",
    y = "Total ILI cases",
    fill = "State",
    title = "Monthly ILI cases by state, 2024-2025 season"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank()
  )

Try changing geom_col() to geom_col(position = "dodge") to get a side-by-side view instead of stacked. Which do you find more useful for comparing states?

One more pivot: comparing metrics side by side

What if we want to compare two different ILI measures on the same plot? We can pivot two columns into long format, clean up the labels with replace_values() and fct_relevel(), and use facet_wrap():

ili |>
  select(region, end_of_week, ili_per_provider, pct_unweighted_ili) |>
  pivot_longer(
    cols      = c(ili_per_provider, pct_unweighted_ili),
    names_to  = "metric",
    values_to = "value"
  ) |>
  mutate(
    metric = metric |>
      replace_values(
        "ili_per_provider" ~ "ILI per provider",
        "pct_unweighted_ili" ~ "% unweighted ILI"
      ) |>
      fct_relevel("ILI per provider", "% unweighted ILI")
  ) |>
  ggplot(aes(x = end_of_week, y = value, color = region)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y", ncol = 1) +
  scale_color_manual(values = c(
    "Michigan"  = "blue",
    "Ohio"      = "red",
    "Wisconsin" = "green"
  )) +
  labs(
    x = "Week ending",
    y = NULL,
    color = "State",
    title = "Two measures of ILI burden, 2024-2025 season"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    panel.grid.minor = element_blank()
  )

This final plot brings together string manipulation (replace_values), factors (fct_relevel), dates (the end_of_week axis), pivoting, and ggplot2 layering. Not bad for two hours!

Thank you for following along!

Don’t forget that MI-Support offers R office hours if you’d ever like to continue the conversation about R or discuss a project (free of charge!) to integrate it into your workflow.

Welcome back!

Setup & Recap

Rebuild your workspace from the intro workshop

String Manipulation with stringr & glue

Basic string functions

Using stringr in a pipeline

Introducing glue

Factors with forcats

The problem: unordered characters

Ordering factors

Chaining fct_*() functions

Dates and Times with lubridate

The problem: our data has no dates!

Useful lubridate “accessors”

Periods, durations, and intervals

Formatting and parsing dates

Writing Your Own Functions

A practical example: rate per 100k

Default argument values

Lambda (anonymous) functions

Advanced mutate() with across(), tidyselect, and purrr

The problem: repetitive mutations

across() basics

tidyselect: powerful column selection

Custom functions inside across()

if_any() and if_all() in filter()

purrr for complex, row-wise operations

Grouping & Aggregating with .by=

Side by side: group_by() vs .by=

.by= in mutate(): grouped mutates

Multiple grouping variables

.by= in other verbs

Pivoting & a ggplot Payoff

Why pivot?

The payoff: a monthly bar chart with fill=

One more pivot: comparing metrics side by side

Thank you for following along!

String Manipulation with `stringr` & `glue`

Using `stringr` in a pipeline

Introducing `glue`

Factors with `forcats`

Chaining `fct_*()` functions

Dates and Times with `lubridate`

Advanced `mutate()` with `across()`, tidyselect, and `purrr`

`across()` basics

`tidyselect`: powerful column selection

Custom functions inside `across()`

`if_any()` and `if_all()` in `filter()`

`purrr` for complex, row-wise operations

Grouping & Aggregating with `.by=`

Side by side: `group_by()` vs `.by=`

`.by=` in `mutate()`: grouped mutates

`.by=` in other verbs

Pivoting & a `ggplot` Payoff

The payoff: a monthly bar chart with `fill=`