Data files and data frames
Last chapter, we learned about vectors: sequences of numbers or strings. But if you’ve ever worked with data, you know that data usually doesn’t get emailed to you as a vector; it comes as a table or spreadsheet. Tables are also the most common way to work with data in R, and in this chapter, we’ll learn more about that.
The first thing we’ll learn is that, in R, tables are called data frames. There are many ways to create data frames, and one basic way is to stick some vectors into the data.frame() function. For example, data.frame(x, y, z) would create a data frame that includes the data from vectors x, y, and z as columns.
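For example, a minimal sketch with some made-up vectors:

```r
# Three vectors of the same length (made-up example data)
x <- c(1, 2, 3)
y <- c("a", "b", "c")
z <- c(TRUE, FALSE, TRUE)

# Combine them into a data frame with columns x, y, and z
data.frame(x, y, z)
```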
Replace the ______ placeholder with a data frame with columns quarter, revenue, and expenses, and assign it to the variable finances. Then, press ▶ Run Code.
It can take up to a minute for R to start up here, so please be patient.
But let’s have a look, just to be sure! (Remember that you can print out the contents of a variable by just writing its name.) Print out the contents of finances.
Compared to a spreadsheet, data frames are less flexible. They typically only have column names, no row names, and you certainly can’t color the cells. But the lack of flexibility also means that data frames are more predictable and easier to program with than spreadsheets.
One good way to think of data frames is as a collection of vectors, each vector being a column. You can “get out” or access the vectors in a data frame using the $ operator. For example, finances$quarter would be the vector c("Q1", "Q2", "Q3", "Q4").
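Assuming finances was created as in the exercise above, accessing its columns might look like this:

```r
finances$quarter   # c("Q1", "Q2", "Q3", "Q4")
finances$revenue   # the revenue column, as a plain numeric vector
```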
Add a new column profit to finances. Then replace the ______ placeholder and use the sum() function to sum up the profit for all quarters.
But now we’re at a crossroads. R is an old programming language (most likely older than you!), and over the years, people have come up with different ways to compute on and manipulate data in R. In this tutorial, we have to decide which way to go. And I’ve already made up my mind here.
On account of it being intuitive, powerful, and popular, we’re going to use the tidyverse packages to work with data in R. A package is, well, a package of new functions and functionality that can be added to R. The tidyverse is a collection of packages for data manipulation, visualization, etc., that work well together. To use a package, you pull it out of your library of installed packages using the library() function.
Start using the tidyverse by running library(tidyverse).
One of the many useful functions in the tidyverse is read_csv(). It can be used to read data frames from comma separated value (CSV) files, a simple text format for storing tabular data. Here are the first four lines from the CSV file hyderabad-sales-2023-june.csv:
date,day_of_week,temp_max,sold_ice_creams,sold_coffee
2023-06-01,Thursday,38.9,13,17
2023-06-02,Friday,40.6,22,21
2023-06-03,Saturday,40.8,37,19
The first row shows the column headers, and each following row holds the comma separated values. To read in a comma separated file, say data.csv, give it as a string argument like this: read_csv("data.csv").
Load hyderabad-sales-2023-june.csv by reading it in with read_csv().
The printout above informs us that hyderabad-sales-2023-june.csv was read in as A tibble: 30 × 5. The 30 makes sense, as there are 30 days in June, and there are 5 columns in the data frame, but what’s a tibble? That’s just the tidyverse’s version of R’s regular data frames, but for the most part, they can be used in the same way.
But we can’t use the data that we read in at all! We read it in with read_csv, we got a printout, but, as we didn’t assign it to a variable, we can’t do anything with it. Remember that you assign values to variable names using the arrow operator <-. For example, a_random_number <- runif(1) would put a random number between 0.0 and 1.0 “into” a_random_number.
Use read_csv() to read in hyderabad-sales-2023-june.csv again, but now assign it to the variable sales.
Three of the most common things to do when analyzing data are:
- Summarize the data. For example, to sum up a column.
- Filter the data. For example, we might only want to look at the rows stamped with “Saturday”.
- Group by some column (say the day of the week), and summarize each group.
The tidyverse has functions for all of these!
The function that helps you summarize is called 🥁🥁🥁 summarize! The first argument is the data frame to operate on, and every subsequent named argument defines a new summary. A bit abstract, maybe, but look at this:
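The code that produced the printout below isn’t shown here, but based on the description that follows, it was presumably something like:

```r
summarize(sales, avg_sold_ice_creams = mean(sold_ice_creams))
```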
# A tibble: 1 × 1
avg_sold_ice_creams
<dbl>
1 17.8
This takes the data frame sales and calculates the mean value of the column sold_ice_creams, assigning it to a new column named avg_sold_ice_creams. The result is a new data frame with the summary (here just the single value).
Use summarize() to calculate the total number of sold ice creams in sales.
Like many other tidyverse functions, summarize allows you to freely reference column names from the data frame. For example, if we just write:
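The failing code isn’t shown here, but judging from the text it was a bare call outside of summarize, something like:

```r
# sold_ice_creams exists only as a column inside the sales data frame,
# not as a variable of its own, so this errors
mean(sold_ice_creams)
```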
Then R will complain, and rightfully so, as sold_ice_creams is not an existing variable name. But when we write:
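Again reconstructing the missing code from the output below, the call was presumably:

```r
summarize(sales, n_sold_ice_creams = sum(sold_ice_creams))
```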
# A tibble: 1 × 1
n_sold_ice_creams
<dbl>
1 534
Then the summarize function knows to look among the columns in sales before complaining. Not all R functions are nice like this, but the tidyverse ones often are.
The tidyverse functions are also well suited to being combined using the pipe operator |>, so called because it takes data on the left and “pipes it in” as the first argument of the function on the right. Instead of writing sum(1, 2, 3), one can go:
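The original snippet isn’t shown here, but since the pipe puts the left-hand value in as the first argument, one reading is:

```r
1 |> sum(2, 3)  # equivalent to sum(1, 2, 3)
```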
Similarly, as summarize() takes the data frame as its first argument, these are two ways of writing the same thing:
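The two statements aren’t shown here, but following the pattern in the text they would be something like:

```r
# These two lines do the same thing:
summarize(sales, n_sold_ice_creams = sum(sold_ice_creams))
sales |> summarize(n_sold_ice_creams = sum(sold_ice_creams))
```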
Rewrite the statement above using the pipe operator (|>); it should still do the same thing.
How do I type a | (vertical bar)?
This can be a bit tricky, depending on your keyboard. If you can’t figure it out, try searching for something like:
How to type vertical bar on a French|Swedish|Italian Mac|Windows keyboard?
There’s not really any point in using |> for simple statements like the ones above, but it makes it much easier to compose complex data transformations. We’ll get to that soon!
For now, let’s learn how to filter out the rows we want using the 🥁🥁🥁 filter function, which takes a data frame as the first argument, then one or more logical expressions, and returns only those rows that match all the expressions. For example, days that were warmer than 40 °C:
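The filter call isn’t shown here, but matching the output below it would be:

```r
filter(sales, temp_max > 40)
```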
# A tibble: 3 × 5
date day_of_week temp_max sold_ice_creams sold_coffee
<date> <chr> <dbl> <dbl> <dbl>
1 2023-06-02 Friday 40.6 22 21
2 2023-06-03 Saturday 40.8 37 19
3 2023-06-04 Sunday 42.4 35 13
Or the data from the 1st of June, 2023:
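Reconstructing this call from the output below (the date column is parsed as a Date, but dplyr will happily compare it against a date string):

```r
filter(sales, date == "2023-06-01")
```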
# A tibble: 1 × 5
date day_of_week temp_max sold_ice_creams sold_coffee
<date> <chr> <dbl> <dbl> <dbl>
1 2023-06-01 Thursday 38.9 13 17
There are many operators and functions that can be used in a logical expression; here are the most common ones:

Logical operator | Meaning
---|---
== | Equal to (yes, it’s == and not =)
!= | Not equal to
> | Greater than
>= | Greater than or equal to
< | Less than
<= | Less than or equal to
Replace the ______ with a filter() expression that keeps only the Saturday sales data.
Tip: When using == to compare strings, uppercase and lowercase letters are not equal. "O_O" == "O_o" is FALSE.
Now comes the point of the pipe operator |>! Using it, we can combine, or chain together, several statements. For example, if we wanted to read in a CSV file and calculate the median number of doughnuts sold on Mondays, we could squish it all into one single statement:
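The squished-together version isn’t shown here, but nesting the calls from the piped example further down, it would read something like:

```r
summarize(filter(read_csv("new-york-sales-2025-april.csv"), day_of_week == "Monday"), median_sold_doughnuts = median(sold_doughnuts))
```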
But that’s pretty unreadable! An alternative is to do one step at a time, assigning each intermediate result to a variable:
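A step-by-step version, sketched to mirror the piped code further down:

```r
# One intermediate variable per step
ny_sales <- read_csv("new-york-sales-2025-april.csv")
monday_sales <- filter(ny_sales, day_of_week == "Monday")
summarize(monday_sales, median_sold_doughnuts = median(sold_doughnuts))
```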
That’s alright, I guess, but with the pipe operator |> we can simplify this even further!
ny_sales <- read_csv("new-york-sales-2025-april.csv")
ny_sales |>
filter(day_of_week == "Monday") |>
summarize(median_sold_doughnuts = median(sold_doughnuts))
# A tibble: 1 × 1
median_sold_doughnuts
<dbl>
1 142.
When using |>, it’s common to have one function per line, with a two-space indent on all but the first line.
Use |> to write a pipeline that calculates the total number of ice creams we’ve sold on Saturdays in the sales data.
That gave us the total number of sold ice creams for Saturdays, but what about all the other days? Enter group_by(). This is a function which doesn’t do anything by itself, but which modifies the behavior of the functions that follow it. For example, here’s code giving us the median number of sold doughnuts:
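The code isn’t shown here, but matching the output below it was presumably the ungrouped summary over the whole data frame:

```r
ny_sales |>
  summarize(median_sold_doughnuts = median(sold_doughnuts))
```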
# A tibble: 1 × 1
median_sold_doughnuts
<dbl>
1 140
But say we wanted to group the data by day of the week and calculate the median of each group? It’s easy with group_by():
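Reconstructing the grouped version that matches the output below:

```r
ny_sales |>
  group_by(day_of_week) |>
  summarize(median_sold_doughnuts = median(sold_doughnuts))
```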
# A tibble: 7 × 2
day_of_week median_sold_doughnuts
<chr> <dbl>
1 Friday 145
2 Monday 142.
3 Saturday 164
4 Sunday 168.
5 Thursday 134
6 Tuesday 135
7 Wednesday 130
It can be a bit hard to understand which functions group_by() can be used together with. But just the combo group_by(...) |> summarize(...) will take you far!
Replace the filter with a group_by that gives you the total sold ice creams for each day_of_week, not only Saturday.
We could stop here; we already have quite a nice summary of our ice cream sales. But let’s tack one more thing onto our data analysis pipeline.
It’s often nice to sort a table, and that can be achieved with the arrange() function which, just like summarize and filter, takes a data frame as the first argument and then one or more column names to sort by. For example, here’s the Hyderabadi sales data frame with the warmest days first:
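The arrange call isn’t shown here, but matching the output below it would be:

```r
sales |> arrange(desc(temp_max))
```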
# A tibble: 30 × 5
date day_of_week temp_max sold_ice_creams sold_coffee
<date> <chr> <dbl> <dbl> <dbl>
1 2023-06-04 Sunday 42.4 35 13
2 2023-06-03 Saturday 40.8 37 19
3 2023-06-02 Friday 40.6 22 21
4 2023-06-08 Thursday 40 18 18
5 2023-06-09 Friday 39.7 15 27
6 2023-06-10 Saturday 39.7 37 22
7 2023-06-07 Wednesday 39.6 19 25
8 2023-06-14 Wednesday 39.4 14 13
9 2023-06-17 Saturday 39.4 33 23
10 2023-06-11 Sunday 39.2 30 17
# ℹ 20 more rows
Here we wrapped temp_max in desc(temp_max) to sort in descending order; otherwise, the default is to sort in ascending order.
Use |> to add an arrange() at the end, sorting the result from most ice creams sold to least sold.
Great work! Now you know how to read data into a data frame and to summarize, filter, and arrange it to learn more about the data. Another important way to learn more about a dataset is to visualize it, which is what the next chapter is all about: 👉4. Visualization👈