This reference post covers the basic mechanics of how the R language works and how to do simple things with it, before we get into more detailed use cases and examples. Other posts cover why you should choose R, use cases, and more advanced techniques & examples. Here, we summarize the major concepts in R in a few rules, and cover functions & packages, types of data (like character and numeric), and classes of objects (like vectors and dataframes).
I’ll provide some code examples below, so a few notes on formatting just for this blog: R code will be in “chunks” below. If I am referring to a concept generically without a specific instance, I’ll wrap it in <> (for example: <function_name> is a placeholder for any function, like sum()). If I execute any code (as opposed to merely showing it) with R, I will show R’s output beneath the chunk, in the line that begins with [R:].
# Inside code chunks, lines that begin with # are comments
# Comments are ignored by R, but vital for explaining code to humans!
2 + 2 # they can come at the end of a line as well
[R:] 4
First things first - we can see that R is just like a fancy calculator app that you can type commands in and execute:
# 2 to the third power
2 ^ 3
[R:] 8
# take the base-10 logarithm of 1,000
log10(1000)
[R:] 3
Getting a bit fancier, R can also do boolean and logic operations. These will come in handy when we talk about the logical type of data below. We evaluate whether an expression is either TRUE or FALSE, like the following:
# is 2 less than 1?
2 < 1
[R:] FALSE
# is 1 greater than or equal to 2 AND (&) 1 greater than 0?
1 >= 2 & 1 > 0
[R:] FALSE
# is 1 greater than or equal to 2 OR (|) 1 greater than 0?
1 >= 2 | 1 > 0
[R:] TRUE
Beyond calculations, I think it’s fair to encapsulate the essence of how we will be using R for data analysis in a few simple rules:
Everything is an object (which has names and values)
Run functions on objects
Objects (and data) come in different types (or classes)
Everything is an Object
R is an object-oriented programming language, which essentially means that you will almost always be:
Creating objects that have certain values or attributes, and
Running functions on those objects.
Think of objects as nouns and functions as verbs that act on the noun: open the box, add the lists together, visualize the data.
All objects have names and take on values. When you “call” an object by typing its name and executing the command, it displays the object’s value (and sometimes other attributes if it is a complex object).[1]
Some objects are built in: they come with a name and a value pre-assigned. For example, the mathematical constant pi, and all 26 letters of the English alphabet (lowercase):
pi
[R:] 3.141593
letters
[R:] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
Each object class has its own set of rules (or “methods”) for valid operations of functions on them. Some functions work only for certain classes of objects. You can also typically convert objects from one class to another.
Most R code is going to have the following syntax:
<do_this>(<to_this>)
<do_that>(<to_this>, <to_that>, <with_those>) # for functions with multiple arguments
Objects are created by assigning a value to a name (for the object) with the “assignment operator” (<-).[2] Mentally, you should read x <- y as “(object) x gets (value) y”.[3]
# Create an object named "my_object" that gets the value "test"
my_object <- "test"
# make another object by combining the numbers 1, 2, 3, and 4
my_object_2 <- c(1, 2, 3, 4)
You can also combine or concatenate things into a single object with the c() function, as in the previous example of my_object_2. Much more on this below, as this is actually an example of a vector object.
Objects will be overwritten if you assign the same object a different value.
# create an object named x with value 2
x <- 2
# call x to print its value
x
[R:] 2
# re-assign x with 7 as its value
x <- 7
# call x again, see that it changed value!
x
[R:] 7
This might be fine, but it’s more likely that you will want to keep original objects and create entirely new objects from them.
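As a minimal sketch of that habit (the object names here are made up for illustration), you can build a new object from an existing one and leave the original untouched:

```r
# a hypothetical starting object
original <- c(1, 2, 3)

# create a NEW object from it instead of overwriting original
doubled <- original * 2

original  # still 1 2 3
doubled   # 2 4 6
```

Both objects now exist side by side, so you can always go back to the original values.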
In any case, you’re going to be creating a lot of objects. What do you do with them? Run functions on them.
Run Functions on Objects
To fully explore both the syntax and power of functions will take multiple posts. For now, it’s important just to understand what functions do to objects and what they look like: functions take an object (or multiple) and perform some operation (or multiple) on it. If objects are nouns, functions are the verbs that operate on them.
Functions are easy to spot in R since, when you call them, their names are always followed by a pair of parentheses, (). Anything that isn’t an object in R is probably a function. Functions have arguments, into which you “pass” an object as an input. At least one argument is an object you are performing the operation on, and more advanced functions can have additional arguments. Arguments go inside the parentheses. The function performs at least one operation on an object (or multiple), and often returns an object (or multiple).
# take the sum of my_object_2 (from above)
sum(my_object_2)
[R:] 10
# make another object
some_decimals <- c(2.84, 23.102, 0.843)
# use the round() function on it
# note it has a second argument, the number of digits to round to
round(some_decimals, digits = 1)
[R:] 2.8 23.1 0.8
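To make the idea of arguments a little more concrete, here is a small sketch using the same round() call. It illustrates that arguments can be matched by position or by name, and that an optional argument falls back on a default when omitted. The object names by_position and by_name are made up for this example.

```r
some_decimals <- c(2.84, 23.102, 0.843)

# arguments can be matched by position...
by_position <- round(some_decimals, 1)

# ...or by name, which is usually easier to read
by_name <- round(some_decimals, digits = 1)

# both calls give the same result
identical(by_position, by_name)  # TRUE

# leaving out digits falls back on its default value of 0
round(some_decimals)  # 3 23 1
```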
Getting Help
R is full of great and accessible documentation. In general, if you want to know what a function does, the arguments it takes, and some examples, you can run the command ?<function_name> or help(<function_name>), like so:
?mean # alternatively: help(mean)
You can write your own functions, which will be the subject of (many) additional posts. Functions in R, and the paradigm of “functional programming” become a really powerful tool in your data analysis toolkit. But that’s for another time.
Packages/Libraries, and “base R”
One of the great things about R is that there is a large and robust community of users, who write their own functions (along with other objects) and bundle these together as packages. You can download and install packages from reputable sources (like CRAN [4]) automatically with install.packages("<the_package_name>").[5] As an example, install.packages("tidyverse"). You need only do this once (aside from occasional package updates).
Once you’ve downloaded a package, you can load it into your R session with the library() command, and put the package name inside the () (no quotes necessary), like so: library(tidyverse). Then you can start using functions defined by that package (along with other objects, such as datasets bundled in the package) that you otherwise wouldn’t have access to.
As an advanced tip: you can run a single function from a package (if it is installed) without loading the package by using the syntax <package_name>::<function_name>(). For example, I could run the drop_na() function, which is from the {tidyr} package (without loading the package), with tidyr::drop_na().
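Here is a sketch of the :: syntax that should run on any R installation. Rather than {tidyr}, which you may not have installed yet, it uses {tools}, a package that ships with R but is not loaded by default. That substitution is only for illustration.

```r
# call file_ext() from the tools package without running library(tools)
tools::file_ext("data.csv")  # "csv"

# check whether a package is installed before relying on it
requireNamespace("tools", quietly = TRUE)  # TRUE
```

Writing the package name explicitly also documents where a function comes from, which helps when two packages define functions with the same name.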
There are some valid arguments from computer programming and software development for minimizing the number of packages your analysis depends upon (“dependencies”), and for learning the ins and outs of R without packages, known as base R. However, my focus is on practical and “batteries-included” applications of R for data analysis. So, we will make extensive use of packages, particularly the {tidyverse}, and wherever possible, we will substitute simpler & more intuitive operations from packages instead of a “first principles” approach that would build from base R upwards.
Objects and Data Come in Classes/Types
Data Types
Data in R comes in several different types.[6] For our purposes, we can say that R essentially contains three major types of data: character, numeric, and logical. There are more, but it helps to place them into these categories for now.
In general, you can always check what class an object is by passing it into the relevant function, class().[7]
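As a quick sketch of checking and converting classes, the example below uses class() together with is.numeric() and as.numeric(). The object names are made up for illustration.

```r
# a number stored as text (note the quotes)
a_number_as_text <- "42"

class(a_number_as_text)       # "character"
is.numeric(a_number_as_text)  # FALSE

# convert it to a number so we can do math with it
converted <- as.numeric(a_number_as_text)

class(converted)  # "numeric"
converted + 1     # 43
```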
1. character
character data, also known as “strings” in computing, are sequences of alphanumeric characters. These are normally human-readable text of words or sentences that might be names, addresses, or categorical data that places an observation into a particular category (note that factors are a particularly useful type of data for dealing with precisely this).
The key is that each string is contained within quotation marks (either ‘single quotes’ or “double quotes”). Strings may contain spaces, noting again that the entire string is contained within quotes.
my_name <- "Ryan Safner"
my_name
[R:] "Ryan Safner"
class(my_name)
[R:] "character"
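A few base R functions work specifically on character data. The sketch below is only a small sample, not a full tour of string handling:

```r
my_name <- "Ryan Safner"

# count the characters in the string (the space counts too)
nchar(my_name)  # 11

# convert to upper case
toupper(my_name)  # "RYAN SAFNER"

# paste strings together, separated by a space by default
paste("Hello,", my_name)  # "Hello, Ryan Safner"
```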
2. numeric (integer and double)
Perhaps the most obvious types of data, when doing data analysis, are numeric data, since we are dealing with numbers. numeric data comes in two flavors: integer, for integers, and double, for numbers with decimal points.[8]
# make integer by adding "L" after a number
some_integers <- c(2L, 5L, 12L, 402L)
# call it
some_integers
[R:] 2 5 12 402
# check its class
class(some_integers)
[R:] "integer"
some_decimals <- c(0.3, 1.5, 23.12, 0.002)
# call it
some_decimals
[R:] 0.300 1.500 23.120 0.002
# check its class
class(some_decimals)
[R:] "numeric"
One of the important features of numeric data is that you can perform mathematical operations (all functions!) on them.
sum(some_integers) # calculate the sum
[R:] 421
mean(some_integers) # calculate the mean
[R:] 105.25
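The difference between integer and double occasionally matters. This short sketch shows how arithmetic can change the type of a result, and one of the “strange” floating-point surprises that doubles can produce. all.equal() is one common way to compare doubles with a small tolerance.

```r
# adding two integers gives an integer...
typeof(5L + 2L)  # "integer"

# ...but dividing them gives a double
typeof(5L / 2L)  # "double"

# doubles are stored in binary, so some decimals are only approximate
0.1 + 0.2 == 0.3  # FALSE

# compare doubles with a tolerance instead
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```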
3. logical
logical data are Boolean and only take on the value TRUE or FALSE.[9]
some_logicals <- c(TRUE, FALSE, TRUE)
# call it
some_logicals
[R:] TRUE FALSE TRUE
# check its class
class(some_logicals)
[R:] "logical"
These are particularly useful for evaluating conditional statements that involve logical operators such as:
> is greater than
< is less than
>= is greater than or equal to
<= is less than or equal to
== is equal to [10]
!= is not equal to
& AND
| OR
%in% is a member of a set
# create results with two elements, each the outcome of a conditional test:
# is 2 greater than 3?, and is 4 equal to 4?
results <- c(2 > 3, 4 == 4)
# call the results
results
[R:] FALSE TRUE
# check the class
class(results)
[R:] "logical"
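The operators & (AND), | (OR), and %in% work on whole vectors as well. Here is a small sketch, where the ages vector is invented for the example:

```r
ages <- c(15, 22, 37, 64)

# which ages are at least 18 AND under 65?
ages >= 18 & ages < 65  # FALSE TRUE TRUE TRUE

# which ages are under 18 OR over 60?
ages < 18 | ages > 60  # TRUE FALSE FALSE TRUE

# is 22 a member of the set of ages?
22 %in% ages  # TRUE

# %in% tests each element on its left-hand side
c(22, 50) %in% ages  # TRUE FALSE
```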
Other Data Types
There are more data types in R than the three major ones I’ve discussed above. Some, like complex for complex numbers, almost never come up in data analysis. Others do come up in different contexts, such as working with Date and POSIXct for date and time data; and factor when working with categorical data. In truth, these types of data are stored and behave as integer or double for operations in R.
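To see that storage for yourself, compare class() with typeof(), which reports the underlying storage type. This is a minimal sketch with made-up example values:

```r
# a factor is stored as integer codes that point to its levels
sizes <- factor(c("small", "large", "small"))
class(sizes)       # "factor"
typeof(sizes)      # "integer"
levels(sizes)      # "large" "small" (sorted alphabetically)
as.integer(sizes)  # 2 1 2

# a Date is stored as the number of days since 1970-01-01
a_date <- as.Date("1970-01-10")
class(a_date)       # "Date"
typeof(a_date)      # "double"
as.numeric(a_date)  # 9
```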
Object Classes and Data Structures
R contains many different classes of objects, some that you can create and some that come built in. The most basic, yet most critical (as all other objects are constructed from them), are vectors. We’ll discuss just a few others here: lists (combinations of vectors of different types), and dataframes (a friendlier type of list that is the backbone of data analysis).
For practical purposes, we could start right away with dataframes, since that is what we will be working with most of the time for data analysis, and they are the equivalent to a spreadsheet (with rows and columns) that is most familiar to anyone who has ever used Microsoft Excel. However, it is important to also understand vectors and the concept of vectorization in R, which unleashes its unique power as a programming language.
Vectors
A vector is simply a (1-dimensional) collection of elements of the same data type. We have already made several vectors above: my_object_2, some_integers, some_decimals, some_logicals, and results are all vectors. The giveaway is invoking the c() function, which makes a vector by concatenating or combining elements into a single object.
Vectors are “atomic” in R, meaning all elements must be the same data type (e.g. character or double). When you try to create a vector that has elements of different data types, R will “coerce” the elements to all be the same type according to coercion rules [11] (in this case, character):
# create a vector of different data types
vec_different_data <- c(1, "two", 3.5, "apple", FALSE)
# note it has coerced all elements to character strings
vec_different_data
[R:] "1" "two" "3.5" "apple" "FALSE"
# check the class to confirm
class(vec_different_data)
[R:] "character"
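The same coercion happens between the other types. This sketch walks up the hierarchy one step at a time, and shows a handy side effect: because TRUE and FALSE coerce to 1 and 0, you can count TRUE values with sum().

```r
# logical + integer -> integer
class(c(TRUE, 2L))  # "integer"

# integer + double -> numeric
class(c(2L, 3.5))  # "numeric"

# double + character -> character
class(c(3.5, "apple"))  # "character"

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(c(TRUE, FALSE, TRUE))  # 2
```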
One of the most powerful features of R, distinct from many other programming languages, is that R is built to be vectorized. This means that when a function is run on a vector, the function is applied to each element of the vector, similar to how operations such as scalar multiplication act on every element of a vector in linear algebra.
# create a vector
vec <- c(1,2,3)
# add a constant to a vector
2 + vec
[R:] 3 4 5
# multiply a vector by a constant
2 * vec
[R:] 2 4 6
# find the logarithm (base 10) of a vector
vec10 <- c(10,100,1000) # first create a new vector
log10(vec10)
[R:] 1 2 3
class(vec)
[R:] "numeric"
typeof(vec)
[R:] "double"
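Vectorization also applies when both sides of an operation are vectors of the same length, in which case R works element by element. A brief sketch, with vectors a and b made up for illustration:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

# add the vectors element by element
a + b  # 11 22 33

# multiply them element by element
a * b  # 10 40 90

# comparisons are vectorized too, returning a logical vector
a > 1  # FALSE TRUE TRUE

# as are most math functions
sqrt(c(4, 9, 16))  # 2 3 4
```

In many other languages, each of these would require writing an explicit loop over the elements.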
All other objects are essentially composed of one or more vectors.
Lists
Lists are powerful objects that are like vectors in that they are collections of elements, but they are recursive: each element can itself be another object (even another list)! Unlike vectors, lists are not atomic, meaning they can have elements that are not the same data type. Furthermore, each element can have a name. Make them with the list() function (instead of c()).
my_list <- list("a", 4, TRUE, "apple", list(2, FALSE, "orange"))
Let’s make a list where there are two named elements, cities and rates. Note that the lengths do not have to match (cities has 3 elements, rates has 4).
helpful_list <- list(
"cities" = c("Denver", "Washington, D.C.", "New York"),
"rates" = c(20, 42, 61, 80)
)
helpful_list
$cities
[1] "Denver" "Washington, D.C." "New York"
$rates
[1] 20 42 61 80
See that it prints each named element (which itself is a collection of elements) starting with $ and then the named element (along with each element’s contents):
helpful_list$cities
helpful_list$rates
We can select just the first element cities or the second rates by calling the list combined with the $ and element name, like so:
helpful_list$rates
[R:] 20 42 61 80
helpful_list$cities
[R:] "Denver" "Washington, D.C." "New York"
# get the names of each named element
names(helpful_list)
[R:] "cities" "rates"
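Going one small step beyond $, list elements can also be pulled out with double square brackets, either by name or by position. The sketch below rebuilds helpful_list so that it runs on its own:

```r
helpful_list <- list(
  cities = c("Denver", "Washington, D.C.", "New York"),
  rates = c(20, 42, 61, 80)
)

# extract an element by name (same result as helpful_list$rates)
helpful_list[["rates"]]  # 20 42 61 80

# extract an element by position
helpful_list[[1]]  # "Denver" "Washington, D.C." "New York"

# the list itself has 2 elements...
length(helpful_list)  # 2

# ...whose own lengths differ: cities has 3, rates has 4
lengths(helpful_list)

# an extracted element is an ordinary vector, so functions work on it
mean(helpful_list$rates)  # 50.75
```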
There’s much more to lists, but these attractive attributes bring us to the main event, dataframes:
Dataframes
A dataframe (referred to as data.frame in R code) is a special kind of list of vectors of the same length, but each vector may be a different data type. Dataframes are essentially spreadsheets, the format most familiar and useful for data analysis: each column is a variable and each row is an observation or case. We will make extensive use of dataframes, and of “tibbles,” a friendlier version of dataframes from the {tidyverse} set of packages.
Let’s return to our helpful_list example, but add an additional city, "Philadelphia", to ensure both columns cities and rates are the same length.
# make a dataframe called df
df <- data.frame(
"cities" = c("Denver", "Washington, D.C.", "New York", "Philadelphia"),
"rates" = c(20, 42, 61, 80)
)
# call it
df
[R:]
cities rates
1 Denver 20
2 Washington, D.C. 42
3 New York 61
4 Philadelphia 80
Notice that when we construct this, each column (cities and rates) is a vector with 4 elements (see the c()). When we call df, see that it prints as a series of columns and rows. Each column has a name, which we can find with names(). We can also ask specifically about the column names with colnames(), and even the row names with rownames() (these are just row numbers, and generally we don’t want to give these more specific names).
# get relevant names
names(df)
[R:] "cities" "rates"
colnames(df)
[R:] "cities" "rates"
rownames(df)
[R:] "1" "2" "3" "4"
# get more information about the structure of the dataframe
str(df)
[R:]
'data.frame': 4 obs. of 2 variables:
$ cities: chr "Denver" "Washington, D.C." "New York" "Philadelphia"
$ rates : num 20 42 61 80
We can check the structure of the dataframe with str(), which tells us lots of information:
- the type of object (data.frame)
- the dimensions: number of rows (obs.) and columns (variables)
- the names of each column vector (after $)
- the data type of each column vector (chr and num)
- the first few values (rows) for each column
Finally, see again that we can extract the columns from the dataframe using the column names and $, just like a list:
# extract the "cities" column vector
df$cities
[R:] "Denver" "Washington, D.C." "New York" "Philadelphia"
# extract the "rates" column vector
df$rates
[R:] 20 42 61 80
# check the class of the whole object df
class(df)
[R:] "data.frame"
# check the class of just the "cities" column vector
class(df$cities)
[R:] "character"
# check the class of just the "rates" column vector
class(df$rates)
[R:] "numeric"
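Because each column is a vector, everything above about vectors and logical tests carries over to dataframes. The sketch below rebuilds df so it runs on its own, and uses square-bracket subsetting, which this post has not covered. Treat it as a preview rather than the recommended workflow. The printed values assume R 4.0 or later, where text columns stay as character.

```r
df <- data.frame(
  cities = c("Denver", "Washington, D.C.", "New York", "Philadelphia"),
  rates = c(20, 42, 61, 80)
)

# dimensions of the dataframe
nrow(df)  # 4
ncol(df)  # 2

# a logical test on a column returns one TRUE/FALSE per row
df$rates > 40  # FALSE TRUE TRUE TRUE

# use that test to pick out the matching cities
df$cities[df$rates > 40]  # "Washington, D.C." "New York" "Philadelphia"

# columns of different lengths (3 cities, 4 rates) are rejected with an error
mismatch <- tryCatch(
  data.frame(
    cities = c("Denver", "Washington, D.C.", "New York"),
    rates = c(20, 42, 61, 80)
  ),
  error = function(e) "error: columns differ in length"
)
mismatch  # "error: columns differ in length"
```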
Next Steps
There’s much, much more to explore with each of these concepts described above. But now you should at least understand the basics of R – everything is an object, you run functions on objects, and there are different types and classes of data and objects.
Footnotes
[1] In other languages like Python, you would typically print() the object to see its value. You can do this in R as well with print() or cat() sometimes, if you really want to be explicit, but you generally can just type the object’s name.
[2] Yes, it’s two keystrokes, < and -. Apparently, a long time ago there were actual keyboards on computers that contained a <- key. There is a perennial debate within the R community about whether one should use <- or = for assignment. While there are valid arguments for =, and it’s common in other languages like Python, suffice it to say that it’s best if you use <- in R, as = has a different meaning within R functions.
[3] Note, it’s also valid code to write the reverse, y -> x; it’s just quite uncommon.
[4] There are ways to download packages from other sources, and indeed you will occasionally want to do this. We’ll save that for later.
[5] Yes, note the oddities of the syntax: it’s always plural packages even if you are installing a single package, and the package’s name must be in quotes.
[6] I’m playing a bit fast and loose with “class” versus “type” here. Going forward, I’ll refer to different formats (character, numeric, etc.) of data as “types”, and different data structures (vector, list, etc.) as having different classes. For some more advanced reading, see here and here.
[7] There are also ways to check whether something is a specific class (like is.numeric(), which returns TRUE or FALSE), and to change an object’s class (like as.numeric()).
[8] “Double” refers to floating-point arithmetic, i.e. the methods that computers use to store and manipulate numbers in binary; the name comes from the fact that values are stored in 64 bits (“double precision”). For the most part, we can ignore this, although from time to time particularly strange errors will rear their ugly head.
[9] Note, under the hood, R is actually storing these as integer data that takes on either 0 (for FALSE) or 1 (for TRUE)!
[10] In general, one equals sign = denotes assignment to an object or argument of a function (like <-, which we prefer), while two equals signs == is a logical test! So x = 5 is assigning the value 5 to object x, but x == 5 is evaluating or testing whether “x is equal to 5” is TRUE or FALSE.
[11] In general, the hierarchy is logical < integer < numeric < character such that data types are converted to the highest type contained in the object.