Interlude: Loading Data into R

Before we can analyse data using R, we have to “import” the data into R. How exactly this is done depends on how the data is stored, and more specifically on which file format is used. Here we consider two commonly used formats: comma-separated values (.csv) and Microsoft Excel files (.xls or .xlsx).

Importing CSV Files

The read.csv() command can be used to import .csv files into R: if we use the command

  x <- read.csv("file.csv")

then the contents of the file file.csv will be stored in the variable x. Optional arguments to read.csv() can be used to specify whether or not the file includes column names, and allow you to deal with variations in how the data is stored in the file. Some of these options are explained below.

The most important things to know about the function read.csv() are:

  • The filename can denote either a file on the local computer or a file available for download from the internet. If you want R to read the file directly from the internet, replace the file name with the web address (starting with http:// or https://).

  • If the input file is on the local computer, you may need to change R’s current directory to the directory where the file is stored before calling read.csv(). In RStudio you can use the menu “Session > Set Working Directory > Choose Directory …” to do this. If you are not sure what the filename for a given input file is, you can use the command file.choose() in place of the file name, to choose a file interactively.
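
    For example, you can first change the working directory and then pick the file interactively. The directory path in this sketch is only a placeholder:

      setwd("C:/my-data")            # hypothetical folder; replace with your own path
      x <- read.csv(file.choose())   # opens a file dialog to select the .csv file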

  • By default, R uses the first row of the .csv file to set the column names of the data frame and assumes that the actual data starts in row 2 of the .csv file. If the file does not contain column names, you can use the header=FALSE option with read.csv() to tell R that the column names are not included in the data:

      x <- read.csv("file.csv", header=FALSE)

    You can see whether this option is needed by opening the file in Excel and checking whether the first row contains headers or not. Alternatively, you can inspect the column names and the contents of the first data row in R to see whether everything looks right after importing the data.
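
    If the file has no header row, R uses the default column names V1, V2, and so on. You can assign more descriptive names yourself; the names in this sketch are purely illustrative:

      x <- read.csv("file.csv", header=FALSE)
      names(x) <- c("gender", "height")   # hypothetical names, one entry per column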

  • R uses a special data type called Factor to represent categorical data. If the .csv file contains such data, the option stringsAsFactors=TRUE can be used to automatically convert text to factors. (This option used to be the default for R versions before 4.0.0.)
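
    For example:

      x <- read.csv("file.csv", stringsAsFactors=TRUE)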

  • Sometimes, the columns in a .csv file are separated not by a comma, but using a semicolon instead. In this case you need to use the option sep=";" when you import the data:

      x <- read.csv("file.csv", sep=";")

  • Missing values should be indicated by empty cells in the .csv file; they appear as the special value NA in the data frame. If the .csv file uses a different encoding for missing values, the na.strings option can be used to tell read.csv() which cell values should be interpreted as missing. For example, read.csv("file.csv", na.strings=c("", "-")) can be used for a file where missing values are indicated by either empty cells or cells containing a hyphen.
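
    After importing, you can check whether the missing values were recognised; the column name height in this sketch is only an example:

      x <- read.csv("file.csv", na.strings=c("", "-"))
      sum(is.na(x$height))   # number of missing values in the (hypothetical) height column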

Further details about the function read.csv() can be found using the command help(read.csv) in R.

Example 6.2 In the 2016 version of the MATH1712 module, I ran a small questionnaire in the first lecture. The following R command can be used to load the data from this questionnaire into R:

  x <- read.csv("https://teaching.seehuhn.de/rintro/Q2016.csv")

The variable x now contains the questionnaire results. Instead of reading the data directly from the internet, you can first download the file using a web browser and then import it from your computer:

  x <- read.csv("Q2016.csv")

Both approaches give the same result.

Importing Microsoft Excel Files

The easiest way to import Excel files into R is to first convert these files to .csv format. To do this:

  • Open the file with the data in Excel.

  • Open a new, empty file (choosing “Blank workbook” in Excel).

  • Copy and paste the relevant cells into the empty file. It is important to copy only the required data and to leave out any explanatory text and empty rows/columns. The data must form a tidy rectangle, with one sample per row and one variate per column. Optionally, you can put column headers into the first row.

  • Save the new file in .csv format in a folder where you will find it again.

  • Read the resulting .csv into R as explained above.

Alternatively, you can try the read_excel() function from the readxl package. (You may need to install this R package first.) You can learn how to use this function at https://readxl.tidyverse.org/ or using the command help(read_excel). Note that read_excel() returns a “tibble” instead of a data frame, so some additional knowledge is required to use the result.
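
The following sketch shows one possible way to use it; the file name is a placeholder, and as.data.frame() converts the tibble into an ordinary data frame:

  library(readxl)                # install.packages("readxl") first, if necessary
  y <- read_excel("file.xlsx")   # reads the first sheet and returns a tibble
  x <- as.data.frame(y)          # convert the tibble to an ordinary data frame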

Checking the Imported Data

The following commands can be used to get a quick overview of the data:

  • dim(x) gives the number of rows and columns of the data frame. Similarly, nrow(x) shows the number of rows and ncol(x) shows the number of columns in the data frame.

  • str(x) shows the structure of any R object. This command is often an excellent way to understand the nature of any problems one may encounter while importing data into R.

  • summary(x) prints, for each variate, the values of various summary statistics. For variates with numeric values, these are the minimum, first quartile, median, mean, third quartile, maximum, and the number of missing values. For attribute data this gives the counts for each observed value.

  • head(x) shows the first few rows of the data.

Every time you import a dataset into R, you should use some of these commands to check that everything went well. If you discover problems, you should either fix the data file (e.g. using Microsoft Excel) or use the correct options for the read.csv() command.

Example 6.3 Continuing with the dataset from the example above, we can try the following commands. We first check that the data has plausible dimensions:

dim(x)
[1] 220   5

This tells us that the data has \(n=220\) rows and \(p=5\) columns, as expected. Next, we get some details about the five columns of the data frame:

str(x)
'data.frame':   220 obs. of  5 variables:
 $ gender     : chr  "M" "M" "M" "M" ...
 $ height     : num  180 183 183 182 164 ...
 $ handedness : chr  "R" "R" "R" "R" ...
 $ travel.time: int  20 25 20 12 5 20 15 10 11 7 ...
 $ R.skills   : chr  "basic" "basic" "basic" "basic" ...

This shows, for each column, the name, the “data type” and the first few values. The “data type” is a good indicator for detecting problems; it should read int for integer-valued numeric data, num for all other numeric data, and Factor for attribute data. For attribute data, the range of observed values is also shown. For our dataset, we see that the fields $gender, $handedness and $R.skills are represented as text data (type chr) instead of categorical data. To fix this, we re-import the data using the stringsAsFactors=TRUE option:

x <- read.csv("data/Q2016.csv", stringsAsFactors=TRUE)
str(x)
'data.frame':   220 obs. of  5 variables:
 $ gender     : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 2 ...
 $ height     : num  180 183 183 182 164 ...
 $ handedness : Factor w/ 3 levels "L","R","both": 2 2 2 2 2 2 2 2 2 2 ...
 $ travel.time: int  20 25 20 12 5 20 15 10 11 7 ...
 $ R.skills   : Factor w/ 4 levels "basic","good",..: 1 1 1 1 1 1 1 3 1 1 ...

Now the output is as expected, and the correct “levels” are shown for the categorical variables.

Finally, we can print summary statistics for all columns:

summary(x)
 gender      height      handedness  travel.time       R.skills  
 F:102   Min.   :147.0   L   : 18   Min.   :  0.00   basic :157  
 M:118   1st Qu.:167.0   R   :201   1st Qu.:  5.00   good  :  4  
         Median :175.0   both:  1   Median : 15.00   medium: 28  
         Mean   :173.7              Mean   : 19.15   none  : 31  
         3rd Qu.:180.3              3rd Qu.: 25.00               
         Max.   :195.6              Max.   :150.00               
         NA's   :23                                              

None of these commands showed any remaining problems, so we are now ready to use the data.

Common Problems

There are many things which can go wrong when importing data. Some commonly encountered problems are the following:

  • If the data file contains no line with headers, but the header=FALSE option for read.csv() has been omitted, the first row of data will be used in place of the column names. This can, for example, be seen in the str() output. The solution to this problem is to correctly use the header=FALSE option.
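
    For example, inspecting the column names after the import reveals the problem, and re-reading the file with header=FALSE fixes it:

      names(x)                                  # column names look like data values
      x <- read.csv("file.csv", header=FALSE)   # re-import without using the first row as names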

  • If the data in a .csv file is not separated by commas but by some other character like a semicolon, R will be unable to separate the columns. When this is the case, the imported data will appear to have only one column, where each entry appears as a garbled version of the data for the whole row. The solution to this problem is to use the sep=... option.
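
    A quick check is to count the columns after the import; if the separator was not recognised, re-read the file with the correct sep value:

      x <- read.csv("file.csv")
      ncol(x)                              # 1 instead of the expected number of columns
      x <- read.csv("file.csv", sep=";")   # re-import using the correct separator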

  • If attribute values are encoded inconsistently, e.g. if a mix of m and M is used to encode the gender “male”, this will be visible in the str() output. One solution to this problem is to fix the .csv file using Microsoft Excel, before trying to import it into R again.
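
    Alternatively, the values can be recoded in R after the import. This sketch assumes the inconsistent values are in a (hypothetical) column gender which has been read as a factor:

      x$gender[x$gender == "m"] <- "M"   # replace lower-case "m" by "M"
      x$gender <- droplevels(x$gender)   # drop the now-unused level "m"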

  • If a numeric column in the input file contains one or more entries which are neither numbers nor empty, R will interpret the whole column as attribute data. This problem can be detected in the str() output, when a numeric column is listed as having data type Factor (or chr) instead of num or int. One solution to this problem is to use Microsoft Excel to remove or fix the offending entries.
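
    As an alternative to editing the file, the column can be converted back to numbers in R. The column name weight in this sketch is hypothetical; any entries which are not valid numbers become NA:

      x$weight <- as.numeric(as.character(x$weight))   # non-numeric entries become NA, with a warning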