This lab will introduce you to statistical computing for the social sciences. We begin by meeting the statistical software that we will use in our exercises. We will use R, a statistical software package that is free and open source. There are many statistical packages available to social scientists, and a consideration of their relative merits is beyond the scope of the current exercise. Instead, we simply point out that R is widely used in quantitative social science.

To interact with statistical software, we issue commands in the form of code. Our software (henceforth, R) interprets those commands and performs whatever actions we have instructed.

We will access R using a utility called RStudio. In order to complete this lab, you will need both R and RStudio installed. Let’s begin by downloading both our statistical software (R) and the utility (RStudio). R can be downloaded here: https://mirrors.nics.utk.edu/cran/. Let me know if you have questions about the download process.

RStudio can be downloaded here: https://www.rstudio.com/products/rstudio/download/. Again, ask questions if necessary.

Hello, R

Our first task is to meet our statistical package, R. Double-click on the R application. The text you see at the top is automatically generated, and contains some basic information about the software and its approved uses. You can safely ignore this. Beneath the text, you see the > character, which is the prompt. This interface is called the R console. What you type at the console will be interpreted by R as instructions. When you type instructions and press Enter, you are using R in interactive mode. Interactive mode can be helpful for exploring data and testing portions of code.

Generally speaking, however, it is preferable to interact with your software in batch mode, using a code file. A code file is a set of instructions that can be repeated or modified by you at any time. Using a code file is an important part of generating reproducible, transparent research. A consideration of the principles of reproducibility and transparency is beyond the scope of the current exercise, but some informative materials on these topics can be accessed here: http://www.bitss.org/.
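To make the idea of a code file concrete, here is a minimal sketch. It writes a two-line code file to a temporary location and then asks R to run the whole file at once. The object script.path and the temporary file are assumptions made only for illustration; in practice you would write the code file yourself and save it wherever you like.

```r
# Write a two-line code file to a temporary location
# (script.path is a hypothetical name used only for this illustration).
script.path <- tempfile(fileext = ".R")
writeLines(c("my.name <- 'Micah'", "print(my.name)"), script.path)

# Run every line of the code file, in order, as if each had been typed at the console.
source(script.path)
```

Because the instructions live in a file, anyone can rerun them later and get the same result. The same file could also be run from a terminal with the Rscript command that ships with R.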

Let’s begin by entering a simple command at the console. Why don’t we start by introducing ourselves to R? Let’s generate an object (this is actually a vector, which we’ll define in a moment) that holds our names. Enter the following code:

my.name <- 'Micah'
my.name
## [1] "Micah"

We have just communicated with R using a function. Much more on these in a moment, but for now note that functions are the main way we interact with R.

Hello, RStudio

For convenience, in this course we’ll interact with the R software using a special utility called RStudio. RStudio is a convenience tool that can be downloaded for free from the web (see above). To get started with RStudio, find the application and open it. The RStudio environment has four components. For now, you should focus only on the Console, which should appear in the lower left corner. The console that appears here in RStudio is the same as the console you were working with just a moment ago.

A quick note on why we are using RStudio. First, RStudio will provide us with a very convenient shortcut for working with data. Essentially, we’ll be able to skip what would otherwise be a cumbersome step. Second, RStudio will allow us to create “Markdown” files, which will make it much easier for you to present your results in a convenient and neat format.

You should remember that nearly everything we do within RStudio can be recreated in R itself. However, we won’t cover that process in class.

Working with data

In this section we cover the basics of working with data. Let’s begin at the console in RStudio (can you find the console?). To begin, tell R your name again, as we did above.

my.name <- 'Micah'
my.name
## [1] "Micah"

Functions

For the most part, we communicate with R using functions. The basic concept of a function in statistical software derives from the mathematical definition of a function. As a refresher, a function in mathematics describes a mapping from one set to another. For a statistical package, a function takes in arguments and spits out values. For a given set of arguments, a function should generate only one set of values. Let’s look at an example of a function.

As a very simple example, imagine that you’ve decided you want the computer to print your name in lower-case letters. The following code will accomplish this:

tolower(my.name)
## [1] "micah"

tolower() is a function. Let’s examine its components. The argument we fed to the function was the object we defined to hold our names (my.name). The value that the function spits out is whatever string came in, with all of the letters converted to lower-case. R is quite sophisticated in the way it transforms arguments to values. See this example of strange input to the tolower() function.

tolower("4%7M89")
## [1] "4%7m89"
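As a further sketch of the argument-in, value-out pattern, the code below feeds the same object to two other base R functions, toupper() and nchar(). The object names upper.name and name.length are introduced here only for illustration.

```r
my.name <- 'Micah'

upper.name <- toupper(my.name)   # value: the string with every letter in upper-case
name.length <- nchar(my.name)    # value: the number of characters in the string

upper.name    # "MICAH"
name.length   # 5

# The same argument always produces the same value.
identical(tolower(my.name), tolower(my.name))   # TRUE
```

Notice that none of these functions changes my.name itself. Each one returns a new value, which we can either print or assign to a new object.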

Most of the time, functions in R appear as a function name, which may contain . or _ but no spaces, followed by parentheses. The arguments of the function are inserted in the parentheses. One very important exception to this is the assignment function. Assignment is accomplished using the following key combination: <-. The angle bracket points to the argument (the object name) and the dash is followed by the value, so the two characters together form an arrow pointing from the value to the name. We can read this as a sentence of the form, “R, please assign [VALUE] to the object named [ARGUMENT]”. Returning to the assignment of our name, above, the sentence would read, “R, please assign 'Micah' to the object named my.name”.
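One way to convince yourself that assignment really is a function is to write it in the ordinary name-then-parentheses form. The sketch below does this in two ways; the object names my.name.2 and my.name.3 are hypothetical and exist only for the comparison.

```r
my.name <- 'Micah'               # the usual way to write an assignment

`<-`(my.name.2, 'Micah')         # the same assignment, written like any other function call
assign('my.name.3', 'Micah')     # assign() takes the object name as a string

identical(my.name, my.name.2)    # TRUE
identical(my.name, my.name.3)    # TRUE
```

In everyday code you should stick with the arrow form. The other two forms are shown only to make the point that assignment takes an object name and a value, just as any function takes arguments.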

Now would be a good time to switch away from the console. For class assignments, we’re going to use a special kind of code file called a markdown file. To create your own markdown file, go to File –> New File –> R Markdown, and fill in the requested information. This will generate a new file with two components. A preamble at the top (contained within ---) gives RStudio information that it needs to format your document. Leave that section untouched. Below the closing --- is example text, which you can delete.

Your markdown file will contain two types of text. The first is normal text, which you will use to write up your answers. The second is the code snippet, a piece of code that you will send directly to R.

For code snippets, follow these rules:

  1. Start each code snippet on a new line.
  2. At the beginning of the code snippet, write ```{r}
  3. Move to a new line and begin writing your code.
  4. When you have finished, move to a new line and write ```

Vectors

The next important concept to consider is that of a vector. R stores most of the objects in its memory as vectors. Think of a vector as a suitcase or a filebox. It is a container that is used to store information. One of the first questions we should ask about a vector is, what type of information is it storing? The available types are:

Logical

R can store a vector that tells us whether a condition is true or false. Let’s take a simple example. Consider a set of 100 random numbers drawn from a normal distribution. Here is code to make the numbers, along with a histogram to show what they look like:

random.nums <- rnorm(100)
hist(random.nums)

Just as we would expect, some of these numbers are greater than 0, and some are less than zero. What if we wanted to know which of those conditions was true for each of the random numbers? We could generate a new vector that contains the value TRUE if the corresponding random number is strictly greater than 0, and FALSE if the number is less than or equal to 0. Here’s how:

heads.or.tails <- random.nums > 0
barplot(table(heads.or.tails))

The first line of code assigns a new vector that answers, for each element of random.nums, the question: is this element strictly greater than 0? The second line counts the TRUE and FALSE values with table() and draws those counts as a bar plot.

  1. As an exercise, can you explain the name I chose for this logical vector? Is there something not quite right about this name? Hint: You may need to come back to this question after exploring the rest of the exercise.
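Because random.nums changes every time it is generated, it can help to see the same comparison on a handful of fixed numbers. The following is a minimal sketch; small.nums and is.positive are hypothetical objects made up for this illustration.

```r
small.nums <- c(-1.2, 0.5, 0, 2.3)   # four made-up numbers

is.positive <- small.nums > 0        # one TRUE or FALSE per element
is.positive                          # FALSE  TRUE FALSE  TRUE
class(is.positive)                   # "logical"

# In arithmetic, R treats TRUE as 1 and FALSE as 0.
sum(is.positive)                     # 2, the number of TRUE values
mean(is.positive)                    # 0.5, the share of TRUE values
```

Note that 0 itself produces FALSE, because the comparison is strictly greater than.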

Numeric

The random.nums vector illustrates another data type, numeric. Numeric variables can take any of the values that a number can take. They can be integers, as in the following:

integer.sequence <- 1:100
ages <- c(22,22,23,38,21,21,22,21,21,19,23,20,21)
ages
##  [1] 22 22 23 38 21 21 22 21 21 19 23 20 21
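Numeric vectors do not have to hold whole numbers. The sketch below adds a vector with decimals (heights is a hypothetical vector, not class data) and uses class() to ask R how each vector is stored.

```r
integer.sequence <- 1:100
ages <- c(22,22,23,38,21,21,22,21,21,19,23,20,21)
heights <- c(1.62, 1.75, 1.80)   # made-up values with decimals

class(integer.sequence)          # "integer"
class(ages)                      # "numeric"
class(heights)                   # "numeric"

# Both kinds count as numeric data, so the usual summaries work on them.
is.numeric(integer.sequence)     # TRUE
range(ages)                      # 19 38
mean(ages)                       # roughly 22.6
```

R distinguishes between integers created with the : operator and numbers typed in with c(), but for our purposes both behave as numeric data.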

Character

Character vectors hold text, which can include letters (upper- and lower-case), numerals, and special characters. my.name is an example of a character vector.
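A short sketch may help show what it means for numerals to be stored as characters. The objects some.initials and age.as.text are hypothetical and used only for illustration.

```r
some.initials <- c('ea', 'bb', 'mgr')   # a character vector with three elements
class(some.initials)                    # "character"
nchar(some.initials)                    # 2 2 3

# A numeral inside quotes is text, not a number.
age.as.text <- '38'
class(age.as.text)                      # "character"

# To do arithmetic, convert the text to a number first.
as.numeric(age.as.text) + 1             # 39
```

Trying to add 1 to age.as.text directly would produce an error, because R will not do arithmetic on text.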

Factor

The final vector type is the factor variable. Factors are very useful for capturing data that are “categorical” (more on data types in a moment). Before we generate this vector, we need a brief detour to understand indexing.

As you’ve seen, a vector can have many elements. We have a lot of control over which of these elements we want to access. This control is exercised using indexes, which we manipulate by inserting square brackets ([]) after the vector. We can simply tell R that we want the first element of a vector, or we could ask for the last element, or we could ask for the 10th through the 20th elements as in:

random.nums[1]
## [1] -0.23972
random.nums[length(random.nums)]
## [1] 1.771364
random.nums[10:20]
##  [1] -0.67453403 -0.70092380 -0.27307538  0.29678144 -0.19411653
##  [6]  0.03366124  0.32857786 -0.22781306  0.11555571 -1.37113337
## [11] -0.88999678
  1. What function did I use to extract the last element of the vector? Why does this make sense?
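Indexes can do a little more than pick out a single position or a range. The sketch below uses a small made-up vector (scores is hypothetical) so that you can check each result by eye.

```r
scores <- c(10, 20, 30, 40, 50)   # five made-up numbers

scores[2:4]       # 20 30 40, the second through fourth elements
scores[c(1, 3)]   # 10 30, any set of positions built with c()
scores[-1]        # 20 30 40 50, a negative index drops that position
```

In each case the thing inside the square brackets is itself a vector of positions, which is why functions that return numbers can be used there.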

A more interesting example would be to use our ages. Let’s extract Micah’s age from the vector of ages.

ages[4]
## [1] 38
max(ages) == ages[4]
## [1] TRUE
ages[ages == 38]
## [1] 38

An important note. If we want to extract the elements of a vector that exactly meet a certain criterion, we use == (two equals signs), as in:

integer.sequence[integer.sequence == 10]
## [1] 10

We can also index by sending a logical rule to R. A simple example would be, extract only those elements that are greater than 0. Here’s how we would do that using random.nums:

random.nums[random.nums > 0]
##  [1] 0.017640663 0.178831339 0.883721199 1.612444199 1.783816695
##  [6] 1.976476408 0.296781442 0.033661245 0.328577865 0.115555715
## [11] 0.006724945 0.520121650 1.560186782 1.302264053 1.677628935
## [16] 0.409808728 1.316421041 0.586584139 0.391991913 1.662192902
## [21] 0.809065367 0.835570974 1.977673154 1.164899630 0.314810709
## [26] 0.203719521 0.094814899 0.907872866 0.948251652 2.003472590
## [31] 1.448485541 0.415966005 1.489619250 0.080196310 0.608785852
## [36] 0.018573542 0.944899998 0.005795195 2.338335893 0.729874233
## [41] 0.223564984 1.487660305 1.034981603 0.083116285 1.027511733
## [46] 0.871623275 1.524873528 0.281373399 0.883185755 1.771363737
  1. Extract the elements of integer.sequence greater than 10.
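The same logical indexing works on any vector. As a sketch, here it is applied to the ages vector from earlier, including one rule that combines two conditions with the & operator.

```r
ages <- c(22,22,23,38,21,21,22,21,21,19,23,20,21)

ages[ages > 21]               # 22 22 23 38 22 23
ages[ages > 20 & ages < 23]   # only the ages that satisfy both conditions

# Related helpers: where is a match, and how many matches are there?
which(ages == 38)             # 4, the position of the matching element
sum(ages == 21)               # 5, the number of elements equal to 21
```

The rule inside the brackets is just a logical vector, so anything that produces TRUE and FALSE values can be used to index.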

Now that we know something about indexes, we can build a vector of the factor type using one of the vectors we’ve already made.

unfair.coin <- rep(NA, 100)
unfair.coin[random.nums > 0] <- 'heads'
unfair.coin[random.nums <= 0] <- 'tails'
unfair.coin <- as.factor(unfair.coin)
table(unfair.coin)
## unfair.coin
## heads tails 
##    50    50
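To see what as.factor() changes, it can help to look at a tiny example. In the sketch below, coin.text and coin.factor are hypothetical objects; the point is that a factor keeps a fixed list of categories (its levels) and records each element as one of them.

```r
coin.text <- c('heads', 'tails', 'tails', 'heads', 'tails')   # plain character vector
coin.factor <- as.factor(coin.text)                           # the same data as a factor

class(coin.text)          # "character"
class(coin.factor)        # "factor"

levels(coin.factor)       # "heads" "tails", the categories in alphabetical order
as.integer(coin.factor)   # 1 2 2 1 2, each element stored as a level number
table(coin.factor)        # counts per category: heads 2, tails 3
```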

First exercise - building a data set

Let’s build a data set together. We’ll begin by generating a vector to hold our names. What type of vector will it be? Character! As we did for the ages vector, we’ll use the c() function to put all of our initials into a single vector.

initials <- c('ea', 'bb', 'as', 'mgr', 'ab', 'jt', 'bg', 'eb', 'dw', 'sg', 'fh', 're', 'mb')
length(initials)
## [1] 13

Now we have two vectors, one containing our ages and the other containing our initials. IMPORTANT: We happen to know that the vectors both follow the same order. This matters because our next step is to put these two vectors together into a data.frame.

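When R looks at vectors, it sees them as columns. This convention is arbitrary, but it matters: to put vectors together, we use the cbind() function, which is short for column bind. IMPORTANT: we need to tell R that this new object should be stored as a data.frame. We do that using the as.data.frame() function. Be aware that cbind() combines a numeric vector and a character vector into a single character matrix, so the ages in the resulting data.frame are stored as text rather than as numbers.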

dat <- as.data.frame(cbind(ages, initials))
dat
##    ages initials
## 1    22       ea
## 2    22       bb
## 3    23       as
## 4    38      mgr
## 5    21       ab
## 6    21       jt
## 7    22       bg
## 8    21       eb
## 9    21       dw
## 10   19       sg
## 11   23       fh
## 12   20       re
## 13   21       mb
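One thing to watch for when binding a numeric vector to a character vector: cbind() produces a matrix, and a matrix can hold only one type, so the numbers are converted to text. The sketch below compares that approach with the base R function data.frame(), which is offered here as an alternative rather than as the method used in this lab. The three-element vectors and the names via.cbind and via.data.frame are made up for illustration.

```r
ages <- c(22, 38, 21)               # shortened, illustrative versions of our vectors
initials <- c('ea', 'mgr', 'ab')

via.cbind <- as.data.frame(cbind(ages, initials))   # the approach used above
via.data.frame <- data.frame(ages, initials)        # alternative: build the data.frame directly

is.numeric(via.cbind$ages)        # FALSE, the ages were converted to text
is.numeric(via.data.frame$ages)   # TRUE, each column keeps its own type

mean(via.data.frame$ages)         # 27
```

The two data sets print identically, so the difference only shows up when you try to compute with the ages column.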

Now, let’s generate another variable for our data set. This time, we can insert the vector directly into the data set. Our variable will be called viewing.pref.

dat$viewing.pref <- c('t', 'm', 't', 't', 't', 'm', 't', 'm', 't', 't', 't', 't', 't')
dat
##    ages initials viewing.pref
## 1    22       ea            t
## 2    22       bb            m
## 3    23       as            t
## 4    38      mgr            t
## 5    21       ab            t
## 6    21       jt            m
## 7    22       bg            t
## 8    21       eb            m
## 9    21       dw            t
## 10   19       sg            t
## 11   23       fh            t
## 12   20       re            t
## 13   21       mb            t
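Once the data set exists, the $ operator and the indexing rules from earlier let us ask simple questions of it. The following sketch rebuilds dat exactly as above and then summarizes it; the object name m.rows is hypothetical.

```r
ages <- c(22,22,23,38,21,21,22,21,21,19,23,20,21)
initials <- c('ea', 'bb', 'as', 'mgr', 'ab', 'jt', 'bg', 'eb', 'dw', 'sg', 'fh', 're', 'mb')
dat <- as.data.frame(cbind(ages, initials))
dat$viewing.pref <- c('t', 'm', 't', 't', 't', 'm', 't', 'm', 't', 't', 't', 't', 't')

nrow(dat)                  # 13 rows, one per person
ncol(dat)                  # 3 columns, one per variable
table(dat$viewing.pref)    # counts per category: m 3, t 10

# Keep only the rows where viewing.pref equals 'm'.
# The comma inside the brackets separates the rows we want from the columns we want.
m.rows <- dat[dat$viewing.pref == 'm', ]
m.rows
```

Leaving the space after the comma empty tells R to keep every column for the selected rows.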