Interlude: Loading Data into R
Before we can analyse data using R, we have to “import” the data
into R. How exactly this is done depends on how the data is stored,
and more specifically on which file format is used. Here we consider
two commonly used formats: comma-separated values (.csv
)
and Microsoft Excel files (.xls
or .xlsx
).
Importing CSV Files
The read.csv()
command can be used to import .csv
files
into R: if we use the command
then the contents of the file file.csv
will be stored in the variable x
.
Optional arguments to read.csv()
can be used to specify whether or not the
file includes column names, and allow to deal with variations of how the data
may be stored in the file. Some of these are explained below.
The most important things to
know about the function read.csv()
are:
The filename can either denote a file on the local computer, or a file available for download from the internet. If you want R to read the file directly from the internet, replace the file name with the web address (starting with
http://
orhttps://
).If the input file is on the local computer, you may need to change R’s current directory to the directory where the file is stored before calling
read.csv()
. In RStudio you can use the menu “Session > Set Working Directory > Choose Directory …” to do this. If you are not sure what the filename for a given input file is, you can use the commandfile.choose()
in place of the file name, to choose a file interactively.By default, R uses the first row of the
.csv
file to set the column names of the data frame and assumes that the actual data starts in row 2 of the.csv
file. If the file does not contain column names, you can use theheader=FALSE
option withread.csv()
to tell R that the column names are not included in the data:You can see whether this option is needed by opening the file in Excel and looking whether the first row contains headers or not. Alternatively you can inspect the column names and the contents of the first data row in R to see whether everything looks right after importing the data.
R uses a special data type called
Factor
to represent categorical data. If the.csv
file contains such data, the optionstringsAsFactors=TRUE
can be used to automatically convert text to factors. (This option used to be the default for R versions before 4.0.0.)Sometimes, the columns in a
.csv
file are separated not by a comma, but using a semicolon instead. In this case you need to use the optionsep=";"
when you import the data:Missing values should be represented by empty cells in the
.csv
file and are represented as specialNA
values in the data frame. If the.csv
file uses a different encoding for missing values, thena.strings
option can be used to tellread.csv()
which cell values should be interpreted as missing values. For example,read.csv("file.csv", na.strings=c("", "-"))
can be used for a file where missing values are indicated by either empty cells or cells containing a hyphen.
Further details about the function read.csv()
can be found using the
command help(read.csv)
in R.
Example 6.2 In the 2016 version of the MATH1712 module, I performed a small questionnaire in the first lecture. The following R command can be used to load the data from the questionnaire into R
The variable x
now contains the questionnaire results.
Instead of
directly downloading the data from the internet into R, you can
alternatively first download the data using a web browser, and then
import the data directly from your computer:
Both approaches give the same result.
Importing Microsoft Excel Files
The easiest way to import Excel files into R is to first convert these
files to .csv
format. To do this:
Open the file with the data in Excel.
Open a new, empty file (choosing “Blank workbook” in Excel).
Copy and paste the relevant cells into the empty file. It is important to just copy the required data and to leave out any explanatory text and and empty rows/columns. The data must form a tidy rectangle, with one sample per row and one variate per column. Optionally, you can put column headers into the first row.
Save the new file in
.csv
format in a folder where you will find it again.Read the resulting
.csv
into R as explained above.
Alternatively, you can try the read_excel()
function from the readxl
package. (You may need to install this R package first.) You can learn how to
use this function at
https://readxl.tidyverse.org/ or using the
command help(read_excel)
.
Note that these functions return a “tibble”
instead of a data frame, so some additional knowledge is required to use the
result.
Checking the Imported Data
The following commands can be used to get a quick overview over the data:
dim(x)
gives the number of rows and columns of the data frame. Similarly,nrow(x)
shows the number of rows andncol(x)
shows the number of columns in the data frame.str(x)
shows the structure of any R object. This command is often an excellent way to understand the nature of any problems one may encounter while importing data into R.summary(x)
prints, for each variate, the values of various summary statistics. For variates with numeric values, these are the minimum, first quartile, median, mean, third quartile, maximum, and the number of missing values. For attribute data this gives the counts for each observed value.head(x)
shows the first few rows of the data.
Every time you import a dataset into R, you should use some
of these commands to check that everything went well.
In case you discover problems, you should either fix the data file
(e.g. using Microsoft Excel) or by using the correct options
for the read.csv()
command.
Example 6.3 Continuing with the dataset from example above, we can try the following commands: We first check that the data has plausible dimensions:
[1] 220 5
This tells us that the data has \(n=220\) rows and \(p=5\) columns, as expected. Next, we get some details about the five columns of the data frame:
'data.frame': 220 obs. of 5 variables:
$ gender : chr "M" "M" "M" "M" ...
$ height : num 180 183 183 182 164 ...
$ handedness : chr "R" "R" "R" "R" ...
$ travel.time: int 20 25 20 12 5 20 15 10 11 7 ...
$ R.skills : chr "basic" "basic" "basic" "basic" ...
This shows, for each column, the name, the “data type” and the
first few values. The “data type” is a good indicator to detect
problems; it should read int
for integer valued numeric data,
num
for all other numeric data, and Factor
for
attribute data. For attribute data, the range of observed values is
also shown. For our dataset, we see that the fields $gender
,
$handedness
and $R.skills
are represented as text data (type chr
)
instead of categorical data. To fix this, we re-import the data
using the stringsAsFactors=TRUE
option:
'data.frame': 220 obs. of 5 variables:
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 2 ...
$ height : num 180 183 183 182 164 ...
$ handedness : Factor w/ 3 levels "L","R","both": 2 2 2 2 2 2 2 2 2 2 ...
$ travel.time: int 20 25 20 12 5 20 15 10 11 7 ...
$ R.skills : Factor w/ 4 levels "basic","good",..: 1 1 1 1 1 1 1 3 1 1 ...
Now the output is as expected, and the correct “levels” are shown for the categorical variables.
Finally, we can print summary statistics for all columns:
gender height handedness travel.time R.skills
F:102 Min. :147.0 L : 18 Min. : 0.00 basic :157
M:118 1st Qu.:167.0 R :201 1st Qu.: 5.00 good : 4
Median :175.0 both: 1 Median : 15.00 medium: 28
Mean :173.7 Mean : 19.15 none : 31
3rd Qu.:180.3 3rd Qu.: 25.00
Max. :195.6 Max. :150.00
NA's :23
None of these commands showed any remaining problems, so we are now ready to use the data.
Common Problems
There are many things which can go wrong when importing data. Some commonly encountered problems are the following:
If the data file contains no line with headers, but the
header=FALSE
option forread.csv()
has been omitted, the first row of data will be used in place of the column names. This can, for example, be seen in thestr()
output. The solution to this problem is to correctly use theheader=FALSE
option.If the data in a
.csv
file is not separated by commas but by some other character like a semicolon, R will be unable to separate the columns. When this is the case, the imported data will appear to only have one column, where each entry shows as some garbled version of the data for the whole row. The solution to this problem is to use thesep=...
option.If attribute values are encoded inconsistently, e.g. if a mix of
m
andM
is used to encode the gender “male”, this will be visible in thestr()
output. One solution to this problem is to fix the.csv
file using Microsoft Excel, before trying to import it into R again.If a numeric column in the input file contains one or more entries which are neither numbers nor empty, R will interpret the whole column as attribute data. This problem can be detected in the
str()
output, when a numeric column is listed as having data typeFactor
. One solution to this problem is to use Microsoft Excel to remove or fix the offending entry.