Before we can analyse data using R, we have to “import” the data
into R. How exactly this is done depends on how the data is stored,
and more specifically on which file format is used. Here we consider
two commonly used formats: comma-separated values (
and Microsoft Excel files (
read.csv() command can be used to import
into R: if we use the command
then the contents of the file
file.csv will be stored in the variable
Optional arguments to
read.csv() can be used to specify whether or not the
file includes column names, and allow to deal with variations of how the data
may be stored in the file. Some of these are explained below.
The most important things to
know about the function
The filename can either denote a file on the local computer, or a file available for download from the internet. If you want R to read the file directly from the internet, replace the file name with the web address (starting with
If the input file is on the local computer, you may need to change R’s current directory to the directory where the file is stored before calling
read.csv(). In RStudio you can use the menu “Session > Set Working Directory > Choose Directory …” to do this. If you are not sure what the filename for a given input file is, you can use the command
file.choose()in place of the file name, to choose a file interactively.
By default, R uses the first row of the
.csvfile to set the column names of the data frame and assumes that the actual data starts in row 2 of the
.csvfile. If the file does not contain column names, you can use the
read.csv()to tell R that the column names are not included in the data:
You can see whether this option is needed by opening the file in Excel and looking whether the first row contains headers or not. Alternatively you can inspect the column names and the contents of the first data row in R to see whether everything looks right after importing the data.
R uses a special data type called
Factorto represent categorical data. If the
.csvfile contains such data, the option
stringsAsFactors=TRUEcan be used to automatically convert text to factors. (This option used to be the default for R versions before 4.0.0.)
Sometimes, the columns in a
.csvfile are separated not by a comma, but using a semicolon instead. In this case you need to use the option
sep=";"when you import the data:
Missing values should be represented by empty cells in the
.csvfile and are represented as special
NAvalues in the data frame. If the
.csvfile uses a different encoding for missing values, the
na.stringsoption can be used to tell
read.csv()which cell values should be interpreted as missing values. For example,
read.csv("file.csv", na.strings=c("", "-"))can be used for a file where missing values are indicated by either empty cells or cells containing a hyphen.
Further details about the function
read.csv() can be found using the
help(read.csv) in R.
Example 6.2 In the 2016 version of the MATH1712 module, I performed a small questionnaire in the first lecture. The following R command can be used to load the data from the questionnaire into R
x now contains the questionnaire results.
directly downloading the data from the internet into R, you can
alternatively first download the data using a web browser, and then
import the data directly from your computer:
Both approaches give the same result.
The easiest way to import Excel files into R is to first convert these
.csv format. To do this:
Open the file with the data in Excel.
Open a new, empty file (choosing “Blank workbook” in Excel).
Copy and paste the relevant cells into the empty file. It is important to just copy the required data and to leave out any explanatory text and and empty rows/columns. The data must form a tidy rectangle, with one sample per row and one variate per column. Optionally, you can put column headers into the first row.
Save the new file in
.csvformat in a folder where you will find it again.
Read the resulting
.csvinto R as explained above.
Alternatively, you can try the
read_excel() function from the
package. (You may need to install this R package first.) You can learn how to
use this function at
https://readxl.tidyverse.org/ or using the
Note that these functions return a “tibble”
instead of a data frame, so some additional knowledge is required to use the
The following commands can be used to get a quick overview over the data:
dim(x)gives the number of rows and columns of the data frame. Similarly,
nrow(x)shows the number of rows and
ncol(x)shows the number of columns in the data frame.
str(x)shows the structure of any R object. This command is often an excellent way to understand the nature of any problems one may encounter while importing data into R.
summary(x)prints, for each variate, the values of various summary statistics. For variates with numeric values, these are the minimum, first quartile, median, mean, third quartile, maximum, and the number of missing values. For attribute data this gives the counts for each observed value.
head(x)shows the first few rows of the data.
Every time you import a dataset into R, you should use some
of these commands to check that everything went well.
In case you discover problems, you should either fix the data file
(e.g. using Microsoft Excel) or by using the correct options
Example 6.3 Continuing with the dataset from example above, we can try the following commands: We first check that the data has plausible dimensions:
 220 5
This tells us that the data has \(n=220\) rows and \(p=5\) columns, as expected. Next, we get some details about the five columns of the data frame:
'data.frame': 220 obs. of 5 variables: $ gender : chr "M" "M" "M" "M" ... $ height : num 180 183 183 182 164 ... $ handedness : chr "R" "R" "R" "R" ... $ travel.time: int 20 25 20 12 5 20 15 10 11 7 ... $ R.skills : chr "basic" "basic" "basic" "basic" ...
This shows, for each column, the name, the “data type” and the
first few values. The “data type” is a good indicator to detect
problems; it should read
int for integer valued numeric data,
num for all other numeric data, and
attribute data. For attribute data, the range of observed values is
also shown. For our dataset, we see that the fields
$R.skills are represented as text data (type
instead of categorical data. To fix this, we re-import the data
'data.frame': 220 obs. of 5 variables: $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 2 ... $ height : num 180 183 183 182 164 ... $ handedness : Factor w/ 3 levels "L","R","both": 2 2 2 2 2 2 2 2 2 2 ... $ travel.time: int 20 25 20 12 5 20 15 10 11 7 ... $ R.skills : Factor w/ 4 levels "basic","good",..: 1 1 1 1 1 1 1 3 1 1 ...
Now the output is as expected, and the correct “levels” are shown for the categorical variables.
Finally, we can print summary statistics for all columns:
gender height handedness travel.time R.skills F:102 Min. :147.0 L : 18 Min. : 0.00 basic :157 M:118 1st Qu.:167.0 R :201 1st Qu.: 5.00 good : 4 Median :175.0 both: 1 Median : 15.00 medium: 28 Mean :173.7 Mean : 19.15 none : 31 3rd Qu.:180.3 3rd Qu.: 25.00 Max. :195.6 Max. :150.00 NA's :23
None of these commands showed any remaining problems, so we are now ready to use the data.
There are many things which can go wrong when importing data. Some commonly encountered problems are the following:
If the data file contains no line with headers, but the
read.csv()has been omitted, the first row of data will be used in place of the column names. This can, for example, be seen in the
str()output. The solution to this problem is to correctly use the
If the data in a
.csvfile is not separated by commas but by some other character like a semicolon, R will be unable to separate the columns. When this is the case, the imported data will appear to only have one column, where each entry shows as some garbled version of the data for the whole row. The solution to this problem is to use the
If attribute values are encoded inconsistently, e.g. if a mix of
Mis used to encode the gender “male”, this will be visible in the
str()output. One solution to this problem is to fix the
.csvfile using Microsoft Excel, before trying to import it into R again.
If a numeric column in the input file contains one or more entries which are neither numbers nor empty, R will interpret the whole column as attribute data. This problem can be detected in the
str()output, when a numeric column is listed as having data type
Factor. One solution to this problem is to use Microsoft Excel to remove or fix the offending entry.