Intro to R worksheet

Author

MI-Support

Published

March 13, 2026

Welcome!

Use this worksheet to follow along with the live coding demonstration led by Dr. Marisa Eisenberg.

For steps involving code, we will include some helpful functions and formatting tips, along with links to documentation where we can.

What are we doing here today?

Our goal is to take you from “I’ve never touched R before, :(” to “I can make a nice-looking graph in R! :)”

To do that, we’re going to work with real data on respiratory illness in the United States to make a graph that closely matches the formatting of a figure in Oakland County Health Division’s Respiratory Health Dashboard.

Screenshot of Oakland County Health Division's Respiratory Disease Dashboard, focusing on a time series line plot of four series

Oakland County’s dashboard

Image of the graph to be constructed in this workshop, focusing on a time series line plot of four series

The graph we’ll work on today

Along the way, we hope to cover everything you need for R to be a useful tool in your toolkit tomorrow! (Well… Monday.)

Getting started

Load the files you’ll need

Download resources from MICOM Hub’s workshop page. Extract any zipped files and relocate all of the files you downloaded or unzipped to a new folder on your computer.

Open a file to write code in

Open RStudio, then open a new R script file (‘File’ > ‘New File’ > ‘R Script’, or Ctrl+Shift+N).

Add a “comment block” to the top of your script file that tells future you what’s going on here. Comments are lines or phrases in your code preceded by this # symbol. Putting that symbol before text tells R to ignore whatever comes after it, until the next line. Go ahead and just copy-paste the following onto the first line of your new file.

# intro_R_workshop_20260313.R
#
# This script file contains the code I wrote during MI-Support's Intro to R
# workshop at Oakland County Health Division on Friday, March 13, 2026.
# For help on this stuff later, I can use MI-Support's R Office Hours link:
# https://calendly.com/jackjacobs-misupport/30min
#
# Author: {your name here} ({your@email.here})

When you save this script (Ctrl+S), we recommend you use the filename on that first line.

Install and “load in” the functions we’ll use in this exercise

In your console pane, type the following command, then press Enter.

install.packages(c("tidyverse","this.path","janitor"))

Write the following at or near the top of your script. With your cursor blinking on the end of this line, press Ctrl+Enter (Cmd+Enter for Mac users) to load the tidyverse libraries (a bunch of useful functions) into your R session.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Tip

See RStudio’s online documentation for instructions on how to easily and quickly execute only the code you want from the editor pane with the Ctrl+Enter keyboard shortcut.

Whoa!! What does all this output mean? It’s not essential for you to understand now, but tidyverse is not, itself, a single library. It’s actually a collection of these nine libraries. (Each “library” is just a group of related functions.)

Loading data into R

Tell your computer “where you are”

Save this script file in the same folder on your computer where you put the data from our website. (Before you work with data in R, you need to tell R where to grab it from!)

Once you’ve saved this script file, you’re ready to tell R where to look for other files you want to use. Run the following code to see how you can find this information later.

library(this.path)
this.dir()
[1] "C:/Users/jckjcbs/Documents/mi_support/workshop_Intro_R"

You should see similar output in your console.

What does this do? It automatically finds that long, annoying string of C:/Users/this/that/the_other folder names so you can focus on just the files you care about. To see how this.path can help you, check this out:

here("fake file name")
[1] "C:/Users/jckjcbs/Documents/mi_support/workshop_Intro_R/fake file name"

Any time you invoke some file name (in this example, fake file name), it automatically adds on the correct “filepath” for you!
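To make the idea concrete, here is a rough sketch of what that gluing step amounts to, written with base R’s file.path() instead of this.path. The folder name below is hypothetical, and this is only an illustration of the concept, not how here() is actually implemented.

```r
# A hypothetical folder, standing in for wherever your script is saved
script_folder <- "C:/Users/you/Documents/workshop_Intro_R"

# Glue the folder and the file name together with "/" in between
full_path <- file.path(script_folder, "fake file name")

full_path
```

The difference is that here() looks up the folder for you, so you never have to type it out.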

Load your dataframe

We’re going to work with data from the CDC’s FluView interactive dashboard. Our data tracks influenza-like illness (ILI) in seven states in the 2024-2025 respiratory season.

Look at the filename of the dataset we’re interested in right now: ILINet.csv. If you open the properties of this file (highlight the file and press Alt+Enter), you’ll see it’s a “Comma Separated Values File”, which is why its filename ends with .csv. This means we are going to use the read_csv() function (from the readr library) to load it in.

Here’s how we load a CSV into R and save it to a “variable” (we’ll get to that) called ili:

ili <- read_csv(here("ILINet.csv"))
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 365 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): PERCENTAGE OF VISITS FOR INFLUENZA-LIKE-ILLNESS REPORTED BY SENTINE...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Hmm… looks like we have some warnings. Let’s see what we got with RStudio’s special View() function. (You know it’s special because of the capital V in the front, which is unusual for an R function.)

View(ili)

When stuff goes wrong in R, you usually end up learning more about R

Oh no! This looks terrible! Using what you can see in RStudio and what you can see when you open the file in Excel, can you figure out what happened here?

Luckily, there’s a quick resolution. Let’s open “the docs” for this read_csv() function. Look on this page for how we might skip a line before asking R to look at our data. What can you find?

This problem is introducing a couple of new concepts, function documentation and function arguments, that will come up again and again.

  • Documentation exists for just about every single function in the tidyverse, and you can access it directly in your RStudio console (no internet connection needed!) with the ? shortcut.
    • Try running the following in your RStudio console: ?read_csv
  • Arguments are anything that you put in the parentheses of a function. You may know about functions from Excel! Excel “code” like =MIN(A1:A10) or =IF(C2="yes",D2,0) are functions, and the values separated by commas are the different “inputs” that you are “passing” to different “arguments”.

Excel is almost a coding language!

Similar to the example above, Excel may have already made you familiar with the concept of “variables” and how it’s a, let’s say, flexible concept. Look at A1:A10 in the example above. A1 might be a cell that contains the value 8. This is equivalent to the following R code assigning A1 as a variable with the value 8.

A1 <- 8

The same is true for the collection of values (in R, a “vector”) represented by A1:A10. This symbol is both 10 distinct, ordered values and something you can pass to Excel’s =MIN() function as a single thing—that’s a variable!
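Here is a minimal sketch of those Excel ideas translated into R. All of the values are made up, and the variable names (A1_to_A10, C2, D2) are just stand-ins for spreadsheet cells.

```r
# A single value stored in a variable, like cell A1
A1 <- 8

# A vector of 10 values, like the range A1:A10 (made-up numbers)
A1_to_A10 <- c(8, 3, 5, 9, 1, 7, 2, 6, 4, 10)

# Like =MIN(A1:A10): one input passed to one argument
smallest <- min(A1_to_A10)

# Like =IF(C2="yes",D2,0): three inputs passed to three arguments
C2 <- "yes"
D2 <- 25
by_position <- ifelse(C2 == "yes", D2, 0)

# In R you can also name the arguments you are passing inputs to
by_name <- ifelse(test = C2 == "yes", yes = D2, no = 0)

smallest     # 1
by_position  # 25
by_name      # 25
```

Naming arguments is optional, but it often makes code easier to read later.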

Fixing our data import

From here on out, this worksheet is going to give you less and less of the code you need to make this work. But I’ll give you some hints:

  • Check the documentation for read_csv(), and look for information about the skip= argument.
  • For the first argument you pass to this function, where you tell it what file to look for, remember to use the here() function.
  • Remember to “assign” the result to a variable called ili using the “assignment operator” <-.

To recap, the whole thing should look something like this:

ili <- read_csv(here(...), ...)

To see if your solution worked, run View(ili) in the console.
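If you’d like to see the skip= argument in action before trying it on the real file, here is a small sketch on a made-up CSV typed directly into the code. The text of toy_csv is invented for illustration; wrapping it in I() tells read_csv() to treat it as data rather than as a file name.

```r
library(tidyverse)

# A made-up CSV with one line of "junk" before the real header row
toy_csv <- I("A TITLE LINE THAT IS NOT DATA\nregion,week,ilitotal\nMichigan,40,100\nOhio,40,80\n")

# skip = 1 tells read_csv() to ignore the first line of the file
good <- read_csv(toy_csv, skip = 1, show_col_types = FALSE)

good
```

With the junk line skipped, the second line is read as the column names, and you get three columns instead of one.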

A bit of data cleaning

Okay! Hopefully you were able to see something like this in RStudio.

A view of the ili dataset with 15 columns and 364 entries, i.e. rows. This is meant to represent the outcome of the correct data loading procedure.

With a table of data loaded into our R session, let’s clean it up a bit.

Introducing the pipe operator: |> or %>%

Something you’ll see in almost any R code you’ll ever look at is the “pipe” operator. It has two variants for historical reasons that don’t really matter anymore. %>% is older and |> is newer. You can use whichever one you find easier to type. (I actually don’t type either! Instead I use the Ctrl+Shift+M shortcut, which inserts a pipe automatically. By default RStudio inserts %>% with this shortcut; to get |> instead, turn on the native pipe option under ‘Tools’ > ‘Global Options’ > ‘Code’.)

And when we say “you can use…”, we really mean “YOU WILL USE…” because this thing is so useful!

The pipe operator takes whatever is on the left side of it and “pipes it into” the first argument of the function on the right side of it. This maybe seems confusing or a little pointless, but it makes code soo much easier to read! So much so that the entire tidyverse was built up around this simple capability!
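Here is a tiny sketch of that idea using made-up numbers and only base R functions. Both versions compute the same thing; the piped one just reads left to right.

```r
# Made-up numbers, just for illustration
x <- c(1, 4, 9)

# Without the pipe: read from the inside out
nested <- sum(sqrt(x))

# With the pipe: x goes into sqrt(), and that result goes into sum()
piped <- x |> sqrt() |> sum()

nested  # 6
piped   # 6
```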

Here, paste this code into your script, run it, then View() your ili table again. We’ll explain what happened afterward.

# Let's clean this table up a bit.
# Our first pipeline: remove some columns and clean the names
ili <- ili |> 
  
  # Grab just the columns we care about
  select(
    REGION, YEAR, WEEK, `%UNWEIGHTED ILI`, ILITOTAL, `NUM. OF PROVIDERS`,
    `TOTAL PATIENTS`
  ) |> 
  
  # janitor is a nice little package
  # https://sfirke.github.io/janitor/articles/janitor.html
  janitor::clean_names()

A chunk of code written in this style is called a “pipeline”. (This isn’t exactly the same as the more general term “data pipeline”, but it’s more-or-less the same concept.) In this pipeline…

  1. We start with our ili table. The ili <- ili |> part means that whatever we’re about to do to ili is going to overwrite what we had in ili before.
  2. The select function allows us to grab only the columns we might want to use from our table, then “returns” (or “outputs”) the table with only the columns we indicated.
  3. The clean_names() function cleans up our column names so they’re easier to type.
    • From the often very helpful janitor package - check it out later!

All of this was done in a few lines of code with plenty of comments that explain what happened, plenty of whitespace to make everything easy to read, and enough new lines so that we don’t need to scroll side to side to see what’s happening in the code.

A quick preview of where we’re headed

We have pretty much everything we need to make a graph. Here, try this!

# Quick preview
ili |> 
  filter(region == "Michigan") |> 
  ggplot(aes(x = seq_along(ilitotal), y = ilitotal)) +
  geom_line()

Looks like an epidemic curve, right? We’ll get into it more later. Now please bear with us as we introduce more of the basics of R.

What is R?

We’ll walk you through some of the building blocks of the R language. These concepts will be extremely useful if you ever talk about R code with others.

Remember to use the ? shortcut in your console to see documentation for any of these functions. We encourage you to explore using these functions both with and without the pipe operator!

What is my dataset?

  • str()
  • The $ operator
  • nrow()
  • dim()
  • length()
  • unique()
  • n_distinct()
  • table()
  • colnames()
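As a reference for later, here is a sketch that runs each of these on a tiny made-up table (the toy name and its values are invented). Try swapping in ili to explore the real data.

```r
library(tidyverse)

# A tiny made-up table
toy <- tibble(
  region = c("Michigan", "Michigan", "Ohio", "Wisconsin"),
  week = c(40, 41, 40, 40),
  ilitotal = c(100, 120, 80, 60)
)

str(toy)                 # structure: each column's name, type, and first values
toy$region               # pull one column out as a vector
nrow(toy)                # number of rows: 4
dim(toy)                 # rows then columns: 4 3
length(toy)              # for a table, this counts columns: 3
unique(toy$region)       # each distinct value, once
n_distinct(toy$region)   # how many distinct values: 3
table(toy$region)        # how many rows have each value
colnames(toy)            # the column names

# The same functions work with the pipe
toy |> nrow()
```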

Backing up even further… what is R?

  • Printing to the console with cat()
  • Basic arithmetic
  • Variables and the assignment operator (<-)
  • Vectors (c())
    • Indexing with []
    • Using the : operator
    • is.na()
  • Numeric vectors
    • round()
    • mean(), sum()
  • Logical vectors (TRUE/FALSE)
    • Logical operators: ==, >, <=, etc.
    • Logical conjunctions (huh?): & and |
    • !
    • any() and all()
  • Character vectors
    • str_split_1()
    • paste0() and paste()
    • as.numeric() and as.character()
  • Date vectors
    • mdy() and lubridate, the tidyverse library that deals with dates and times (along with hms)
    • Intervals and the %--% operator
    • today()
    • as.period()
  • “Types” in general
    • class()
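Here is a compact sketch that touches most of the items in this list. All values are made up, and the variable names (cases, high, words, workshop, span) are just for illustration.

```r
library(tidyverse)

# Printing and arithmetic
cat("2 + 3 * 4 =", 2 + 3 * 4, "\n")

# Variables and vectors
cases <- c(12, 7, NA, 30)
cases[1]       # indexing: the first element (R counts from 1)
cases[2:3]     # the : operator builds the sequence 2, 3
is.na(cases)   # TRUE wherever a value is missing

# Numeric vectors
round(3.14159, digits = 2)
mean(cases, na.rm = TRUE)   # na.rm = TRUE ignores the missing value
sum(cases, na.rm = TRUE)

# Logical vectors
high <- cases > 10
!is.na(cases) & high        # not missing AND greater than 10
any(high, na.rm = TRUE)
all(high, na.rm = TRUE)

# Character vectors
words <- str_split_1("Michigan,Ohio,Wisconsin", ",")
paste0("wk", 1:3)           # no separator
paste("Week", 1:3)          # separated by a space
as.numeric("42")
as.character(42)

# Date vectors
workshop <- mdy("03/13/2026")
span <- mdy("01/01/2026") %--% workshop   # an interval between two dates
as.period(span)
today()

# Types in general
class(cases)
class(high)
class(words)
class(workshop)
```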

More basic data cleaning

Before we continue cleaning our ili dataset, we’re going to load in some data that might be useful later: the populations of the states in our data according to figures from the US Census Bureau’s 2020 Decennial Census. You can find this in the population.csv file that was included with the workshop materials. (To save time, we’ve cleaned this dataset a bit already.)

Load this data into a variable called pop using read_csv() and here().

Making new variables with mutate()

Recall, we want to make a line plot of ILI over time with our data. That means we will need some ILI quantity variable on the y-axis and some time variable on the x-axis. What variable(s) track(s) time in our data?

We’re going to want to make a new variable using the extremely useful case_when() function inside one of the main tidyverse “verbs,” mutate().
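Here is a sketch of how case_when() works inside mutate(), on a made-up table. The rule below is one possible way to build a week-of-season counter, assuming the season starts in week 40 and the first calendar year has 52 weeks; the numbers we use in the workshop may differ.

```r
library(tidyverse)

# A made-up table of calendar weeks
toy <- tibble(week = c(40, 52, 1, 39))

toy <- toy |>
  mutate(
    # case_when() checks each condition in order and uses the first match
    wk = case_when(
      week >= 40 ~ week - 39,  # weeks 40 to 52 become 1 to 13
      week < 40 ~ week + 13    # weeks 1 to 39 become 14 to 52
    )
  )

toy
```

Each line inside case_when() reads as “when this condition is TRUE, use this value”.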

What should we use as our ILI quantity variable? There’s not really a single right answer here. It just depends on what we want to look at. Can you identify pros or cons of different options? Can you imagine a new variable that’s based on the ones we have?

Selecting a subset of rows with filter()

In the graph we’re trying to make, we only consider data from Michigan, Ohio, and Wisconsin, but we have 7 states! The filter() function selects which rows to keep from your dataset based on some TRUE/FALSE condition that you enter as an argument.

Use filter() in your pipeline along with the %in% operator and c() to keep only rows where the value of the region variable is "Michigan", "Ohio", or "Wisconsin".
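Here is a sketch of %in% and filter() on a tiny made-up table. The table, its values, and the two extra state names are invented for illustration.

```r
library(tidyverse)

# A made-up table with five states
toy <- tibble(
  region = c("Michigan", "Indiana", "Ohio", "Illinois", "Wisconsin"),
  ilitotal = c(100, 90, 80, 70, 60)
)

keep <- c("Michigan", "Ohio", "Wisconsin")

# %in% asks, for each value on the left: is it anywhere in the vector on the right?
toy$region %in% keep   # TRUE FALSE TRUE FALSE TRUE

# filter() keeps only the rows where the condition is TRUE
toy_3 <- toy |>
  filter(region %in% keep)

toy_3
```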

Tip

The code we’re walking you through includes the %in% operator. From the example, I’m sure you can understand its usefulness! It’s not obvious, though, how to look up documentation for binary operators (e.g. +, /, ^, <-, ==, $, :) if you want to fully understand how they work. Here’s how to do it: for the %in% operator, run the following command in your console.

?`%in%`

The “backticks” (usually below your Esc button) let the console know that you want to treat the operator as its own thing, so to speak.

Real quick, let’s rename() some columns.

Renaming columns is easy. Check out the documentation for rename() and use it in your pipeline to make the following changes.

  • Change ilitotal to ili_total
  • Change percent_unweighted_ili to pct_unweighted_ili
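The pattern for rename() is new_name = old_name. Here is a sketch on a made-up table, using a different column than the ones above so you can still work those out yourself (n_providers is an invented name).

```r
library(tidyverse)

# A made-up table
toy <- tibble(
  region = c("Michigan", "Ohio"),
  num_of_providers = c(50, 40)
)

toy <- toy |>
  # New name on the left, old name on the right
  rename(n_providers = num_of_providers)

colnames(toy)
```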

Some more powerful data manipulation

Incorporating these states’ populations

Remember how we loaded in that state population dataset earlier? Now we want to use it to improve our analysis, but we need to be thoughtful about how it gets added to our data. We have 7 states’ populations in that dataset, but we only have 3 states left in our ILI data, and unlike in Excel we can’t just sort by state then copy-paste the right numbers. (We want to be more careful than that anyway!)

We need to join our datasets. That is, we need to bring columns from two datasets together based on a particular relationship between some shared column or columns. In this case, we want to add the relevant state population to each row in our ili dataset. Look into the inner_join() function to do this.

We encourage you to try to figure this out yourself! But you can unfold the code block below if you’d like to see how you can do this.

# We want to incorporate the scale of each state's population!
# We'll do something called a "join"
ili <- inner_join(
  ili, pop,
  by = c("region" = "state")
)

Note

You’ll see in dplyr’s join docs that you can use join_by() in the by = argument. This might seem intimidating, but we think it actually looks better!

ili <- inner_join(ili, pop, join_by(region == state))

join_by() is an interesting function in that it’s built only to be used within these dplyr::*_join() functions. Sometimes this happens: very useful functions have arguments complex enough to have their own special function or class of functions. Another example might be the family argument of glm(), R’s generalized linear model function, which takes helper functions like binomial() to specify a logistic “link”, though this is outside the scope of this workshop.
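To see what an inner join keeps and what it drops, here is a sketch with two tiny made-up tables. The table names, the extra state, and the population numbers are all invented.

```r
library(tidyverse)

# Made-up ILI counts for two states
toy_ili <- tibble(
  region = c("Michigan", "Ohio"),
  ili_total = c(100, 80)
)

# Made-up populations for three states
toy_pop <- tibble(
  state = c("Michigan", "Ohio", "Indiana"),
  population = c(10, 12, 7)
)

# Match rows where region (left table) equals state (right table)
joined <- inner_join(toy_ili, toy_pop, by = c("region" = "state"))

joined
```

The state with no match in toy_ili is dropped, and every remaining row gains a population column.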

Aggregating ili data with group_by() and summarize()

Recall the graph we want to make: there is a time series in it that represents the average of all 3 states. You may have already noticed that there isn’t any multi-state average in the raw data. How do we calculate this? We calculate it using group_by() and summarize(), two operations that very often go together.

We think it’s easier to visualize what happens here than to explain it, so, using the new time variable we created earlier (which we called wk), try running these code chunks and view the resulting dataset.

ili |> 
  summarize(sum_ili_total = sum(ili_total))
# A tibble: 1 × 1
  sum_ili_total
          <dbl>
1        297630

ili |> 
  group_by(wk) |> 
  summarize(sum_ili_total = sum(ili_total)) |> 
  slice_head(n = 5)
# A tibble: 5 × 2
     wk sum_ili_total
  <dbl>         <dbl>
1     1          2593
2     2          3063
3     3          3366
4     4          3765
5     5          3815

In a new pipeline that creates a new dataset called ili_summary, use these functions to create an average count of ILI cases per provider for each week in our data. Call this variable ili_per_provider. We think you’ll at least use group_by(), summarize(), and mean() functions for this.

Next, we “concatenate” the rows from ili and ili_summary

We generated ili_summary by aggregating data from ili, but we want to visualize them together, so we’ll want these two datasets to become one dataset. Should we join them?

You may have already guessed from the title: no! Our new dataset has new data in it. We’re not joining data across a consistent variable like region.

Check out the documentation for bind_rows(). As you can see, for this to work the way we want it to, we need the columns to match, i.e. have the same names and types. Rename and/or mutate the relevant columns in ili_summary to match those in ili. (Confused about which columns those are? Think about the variables we’ll be plotting in the graph we’re creating: not just X and Y, but color, too!)
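Here is a sketch of why the column names matter, using tiny made-up tables (all names and numbers are invented). When the names don’t match, bind_rows() still runs, but it fills the gaps with missing values.

```r
library(tidyverse)

# Made-up state-level rows
by_state <- tibble(
  region = c("Michigan", "Ohio"),
  wk = c(1, 1),
  ili_total = c(100, 80)
)

# A summary whose columns do NOT match by_state
summary_mismatched <- tibble(wk = 1, sum_ili_total = 180)

# A summary whose columns DO match by_state
summary_matched <- tibble(region = "2 states", wk = 1, ili_total = 180)

# Mismatched names: an extra column appears, with NA wherever data is missing
bind_rows(by_state, summary_mismatched)

# Matched names: the rows stack cleanly
stacked <- bind_rows(by_state, summary_matched)
stacked
```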

Once that’s done, concatenate these two datasets into a single one called ili_summary. To make things easier, only select the columns region, wk, ili_total, and ili_per_provider from the ili dataset. Lastly, sort the result by wk and region using arrange(). (Remember, you can do all the steps in this paragraph in a single pipeline! That’s how we’ll be doing it.)

# "Concatenate" our original table and our summary
ili_summary <- ili |> 
  # Limit our original table to the variables in our summary
  select(region, wk, ili_total, ili_per_provider) |> 
  
  # This stacks our tables on top of one another
  bind_rows(ili_summary) |> 
  
  # Sort our table using multiple variables
  arrange(wk, region)

Making graphs (finally!)

Before we dive into it, let’s take stock of the data we have in ili_summary.

str(ili_summary)
tibble [208 × 4] (S3: tbl_df/tbl/data.frame)
 $ region          : chr [1:208] "3 states" "Michigan" "Ohio" "Wisconsin" ...
 $ wk              : num [1:208] 1 1 1 1 2 2 2 2 3 3 ...
 $ ili_total       : num [1:208] 2593 1460 777 356 3063 ...
 $ ili_per_provider: num [1:208] 6.2 4.85 8.35 5.39 7.27 ...

We need to circle back to a question from earlier: which of our two ILI quantity variables should we use, ili_total or ili_per_provider? Consult our trusty ili dataset while thinking about it.

Making something quick for exploratory data analysis

For our first graph (or second, actually), we’ll walk through the code step-by-step.

# Let's look at the course of ILI in these 3 states in the 2024-2025 season
ili_summary |> 
  filter(region == "3 states") |> 
  ggplot(aes(x = wk, y = ili_total)) +
  geom_line()

You should recognize the first two lines of code, which use the |> operator and filter() on a character column.

The next function is a big one: ggplot(). ggplot2 is the tidyverse library for plotting. A weird thing I want to point out right away is that—again, for historical reasons—ggplot2 uses the + operator as its “pipe” instead of |> or %>%. (Even I’m not 100% sure why, but I know that ggplot2 is older than both of the pipe operators and tidyverse as a whole!)

The first argument in ggplot() is your data, but you don’t see it in the function above because we piped it in! The second argument is special: it always takes the aes() function. This function specifies which variables match which components of your visual. In this case, we specify x = wk and y = ili_total. Try running just the first 3 lines of code (don’t include the +) using the Ctrl+Enter trick. What do you see?

On the next line after the +, we add a geom_line() layer to tell ggplot2 how to draw our x and y variables on the plot: as a line connecting the points. That’s it for the basics! Now we’ll walk through how to make something like this pretty, stopping at each step along the way to see how each piece contributes to the whole thing!

Making our target figure

We’ll walk through this last part together, but we’ll be using the following functions, which we’ve roughly grouped into the relevant concepts.

  • Preparing character vectors for plotting
    • replace_values(): A new-ish function that is similar to case_when()
    • fct_relevel(): This function introduces a new data type, factors, implemented in tidyverse with the forcats library. Factors are a bit of a hybrid: they store character values along with a defined order (i.e. ranking) of those values. This specific function allows us to define that order manually.
      • We use factors in plotting to specify the order in which different data series are drawn. These will matter more in bar plots, for example.
  • Different aesthetic mappings in ggplot2
  • Layering different geom_*() functions
  • Specifying ggplot2 scales by hand with scale_*_manual()
    • This also introduces the concept of named vectors
  • Specifying plot labels with labs()
  • Themes! ggplot2 has several default themes to make your graph look close to how you want it in a number of different ways.
    • Modify small pieces of your graph, after you’ve already chosen a default theme, using theme() and specifying changes to particular components.
  • Saving your plot as an image with ggsave()
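As a reference for after the workshop, here is a rough sketch that strings most of these pieces together on made-up data. The numbers, colors, labels, theme, and file name are all placeholders chosen for illustration, not the choices we’ll make for the real figure, and it leaves out replace_values().

```r
library(tidyverse)

# Toy data with the same kinds of columns as ili_summary (made-up numbers)
toy <- tibble(
  region = rep(c("3 states", "Michigan", "Ohio", "Wisconsin"), each = 3),
  wk = rep(1:3, times = 4),
  ili_per_provider = c(6.2, 7.3, 8.0, 4.9, 5.5, 6.1, 8.4, 9.0, 9.8, 5.4, 6.0, 6.6)
) |>
  # Turn region into a factor and move "3 states" to the end of the order
  mutate(region = fct_relevel(region, "3 states", after = Inf))

# A named vector: each color is labeled with the region it belongs to
region_colors <- c(
  "Michigan" = "steelblue",
  "Ohio" = "firebrick",
  "Wisconsin" = "darkgreen",
  "3 states" = "black"
)

p <- ggplot(toy, aes(x = wk, y = ili_per_provider, color = region)) +
  geom_line() +                                # one layer: lines
  geom_point() +                               # a second layer: points
  scale_color_manual(values = region_colors) + # set colors by hand
  labs(
    title = "ILI per provider (toy data)",
    x = "Week of season",
    y = "ILI per provider",
    color = "Region"
  ) +
  theme_minimal() +                            # start from a built-in theme
  theme(legend.position = "bottom")            # then adjust one component

# Save the plot as an image (here, in a temporary folder)
out_file <- file.path(tempdir(), "toy_ili_plot.png")
ggsave(out_file, plot = p, width = 7, height = 4)
```

To save into your workshop folder instead, you could swap the temporary path for here("my_plot.png").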

At the end, you will hopefully have something that looks like this!

Thank you for following along!

Don’t forget that MI-Support offers R office hours if you’d ever like to continue the conversation about R or discuss a project (free of charge!) to integrate it into your workflow.




