Vectors and variables
In the last chapter, we saw an example of a vector: a sequence of data of the same type, for example, a sequence of numbers or a sequence of strings. When analyzing data, you almost never deal with single numbers, and the reason you need to analyze the data in the first place is likely that there are heaps of it! That’s why handling sequences of strings and numbers is central to data analysis and why vectors are core to R.
Let me show you how central vectors are to R.
A single number in R is not just a number. It’s actually a one-item vector containing that number. In R, even single numbers are vectors.
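One way to check this for yourself is sketched below. The value 42 is just an arbitrary example; is.vector() and length() are standard R functions.

```r
is.vector(42)   # TRUE: a single number is already a vector
length(42)      # 1: it holds exactly one item
```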
Sometimes you will want to create vectors longer than a single number. This can be done using the c() function, which combines many values. For example, c(2, 3, 5, 7) will create a numeric vector with all prime numbers between 1 and 10.
Create a vector using c() with at least 5 items. (By the way, I’ll check if any of them are prime numbers.)
There are many functions that help you create vectors in R. One shortcut is the colon operator where, say, 10:30 would create the vector 10, 11, 12, ..., 28, 29, 30.
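As a small sketch of how the colon operator behaves (the specific ranges here are only illustrative):

```r
10:30           # the whole numbers from 10 up to 30
length(10:30)   # 21
5:1             # 5 4 3 2 1, it counts down when the first number is larger
```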
Use the : operator to create the vector 1, 2, 3, ..., 98, 99, 100. (I’ll, again, figure out which are prime.)
Many functions in R are vectorized, that is, they work on single values as well as on vectors. For example, nchar("pizza") returns 5, the number of characters in "pizza". But nchar() also works on vectors of strings.
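Here is a minimal sketch of nchar() applied to a whole vector at once. The words are made up for illustration.

```r
# One call, one result per string (spaces count as characters)
nchar(c("pizza", "ice cream", "tea"))   # 5 9 3
```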
Use nchar() to count the number of characters in each word.
In particular, all math operators (+, *, etc.) are vectorized. That is, 1:3 * 5 would give you 5, 10, 15.
Create the vector c(15, 25, 35, 45, 55, 65, 75, 85, 95, 105) by only changing the numbers in 1:10 * 1 + 0 (leave 1:10 alone!)
When doing math with two vectors of the same length, the operation is applied to each corresponding pair of values. It’s easier than it sounds. For example, c(10, 20, 30) + c(1, 2, 3) gives 11, 22, 33 and 11:14 - 1:4 gives 10, 10, 10, 10.
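The same pairing works for the other math operators too. A small sketch, with made-up prices and quantities:

```r
# Hypothetical prices and quantities, multiplied pair by pair
c(2.5, 4, 1.5) * c(10, 3, 8)   # 25 12 12

# A single number is applied to every item of the vector
c(10, 20, 30) / 10             # 1 2 3
```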
Moving around the numbers directly, like you did above, can work, but it gets messy. It can be made more organized by assigning the values to variables. This needs explanation, but first, let’s look at an example:
pi <- 3.141593
Here we’re taking the value (3.141593) and, by using <-, the assignment operator, we’re assigning it to (“putting it into”) a variable named pi. Now, instead of writing 2 * 3.141593 * 5, we can write 2 * pi * 5. The assignment operator is made up of a < and a -, and is meant to look like a left-pointing arrow.
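To see how variables tidy up a calculation, here is a minimal sketch. The names radius and circumference are hypothetical, chosen only for this example.

```r
pi <- 3.141593                     # the value assigned in the text
radius <- 5                        # hypothetical radius
circumference <- 2 * pi * radius   # same as 2 * 3.141593 * 5
circumference                      # 31.41593
```

Once a value has a name, changing it in one place updates every calculation that uses it.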
Variables can be given both short and long names, but they can’t include spaces. Instead, it’s common to use underscores (_) to separate words in longer names.
Complete the code by replacing the ______ placeholder and assigning the result to the variable quarterly_revenue.
Variables need to be assigned before they can be used. This won’t work:
y <- x + 1 # won't work as x doesn't exist at this point!
x <- 1
However, variable names can be reused and “overwritten”. For example, this is okay:
x <- 10
x <- x + 1
x <- x + 1
x <- x + 1
What is the value of x? Write it in the box below and press ▶ Run Code
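Overwriting works because R evaluates the right-hand side first and only then stores the result under the name. A minimal sketch, using a made-up variable named count:

```r
count <- 100
count <- count * 2    # right side is 100 * 2, so count is now 200
count <- count - 50   # right side is 200 - 50, so count is now 150
count                 # 150
```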
Here’s some more sales data for you!
c(13, 22, 37, 35, 9, 16, 19, 18, 15, 37,
30, 12, 14, 14, 16, 11, 33, 31, 19, 17,
15, 7, 15, 23, 12, 5, 7, 9, 9, 14)
This is the number of sold ice cream cones at my cafe in Hyderabad, India for each day in June 2023. (As opposed to the Hyderabadi temperature data we looked at last chapter, this data is unfortunately made up.)
Assign this data to the variable sold_ice_creams.
Another thing you can do with a vector is to subset it using the square brackets operator ([]). For example, here’s how you would pick out the 1st value in sold_ice_creams:
sold_ice_creams[1]
[1] 13
You can also subset a range of values using the colon operator. For example, this would pick out the first three days of sales:
sold_ice_creams[1:3]
[1] 13 22 37
A subset of a vector can be used like any other vector. For example, this calculates the median sales for the first week in June:
median(sold_ice_creams[1:7])
[1] 19
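The same pattern works for other ranges and other summary functions. The sketch below repeats the sales data so that it runs on its own; the name second_week is a hypothetical helper, not something defined in this chapter.

```r
sold_ice_creams <- c(13, 22, 37, 35,  9, 16, 19, 18, 15, 37,
                     30, 12, 14, 14, 16, 11, 33, 31, 19, 17,
                     15,  7, 15, 23, 12,  5,  7,  9,  9, 14)

second_week <- sold_ice_creams[8:14]   # days 8 through 14
second_week                            # 18 15 37 30 12 14 14
median(second_week)                    # 15
max(second_week)                       # 37
mean(second_week)                      # 20
```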
Use the sum() function to calculate the total sales for the first week in sold_ice_creams.
As a last thing, let’s bring in the daily max temperature data from last chapter. Again, I’ve put that into the temp variable.
Now, the plot() function can make simple scatter plots that show two numeric vectors against each other. For example, here’s how one would plot age against height:
plot(x = age, y = height)
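To try this out without any existing data, here is a self-contained sketch. The age and height values are made up purely for illustration.

```r
# Made-up example data: ages in years and heights in centimeters
age    <- c(2, 4, 6, 8, 10)
height <- c(86, 102, 115, 128, 138)

# Each point pairs the i-th age with the i-th height,
# so both vectors must have the same length
plot(x = age, y = height)
```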
Let’s look at the relationship between the temperature and ice cream sales.
Make a scatter plot with temp on the x-axis and sold_ice_creams on the y-axis.
You’ve completed the chapter, great work!
So the plot above is correct because the values in the temp and sold_ice_creams vectors line up. But, rather than juggling several related vectors, wouldn’t it be better to stick them all into something like a spreadsheet or table?
Yes it would! And that’s what this next chapter is all about: 👉3. Data files and data frames👈