The key problem that readr solves is parsing a flat file into a tibble. Parsing is the process of taking a text file and turning it into a rectangular tibble where each column has the appropriate type. Parsing takes place in three basic stages:
The flat file is parsed into a rectangular matrix of strings.
The type of each column is determined.
Each column of strings is parsed into a vector of a more specific type.
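The three stages can be imitated by hand with public readr functions. This is only a sketch of the idea, not readr's internal implementation; the inline data and the variable names (raw, strings, guesses, x) are made up for illustration.

```r
library(readr)

# Hypothetical inline data; I() marks the string as literal data, not a path.
raw <- I("x,y\n1,a\n2,b\n3,c\n")

# Stage 1: read every field as a string (a rectangle of character columns).
strings <- read_csv(raw, col_types = cols(.default = col_character()))

# Stage 2: guess a type for each column from its strings.
guesses <- vapply(strings, guess_parser, character(1))

# Stage 3: parse a column of strings into a more specific vector.
x <- parse_double(strings$x)
```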
It’s easiest to learn how this works in the opposite order. Below, you’ll learn how the:
Vector parsers turn a character vector into a more specific type.
Column specification describes the type of each column and the strategy readr uses to guess types so you don’t need to supply them all.
Rectangular parsers turn a flat file into a matrix of rows and columns.
Each parse_*() is coupled with a col_*() function, which will be used in the process of parsing a complete tibble.
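As a small illustration of that pairing, the sketch below applies the same integer rule once to a vector and once to a column. The inline data and the column name id are hypothetical.

```r
library(readr)

# Vector parser: character vector in, integer vector out.
v <- parse_integer(c("1", "2", "3"))

# Matching column collector: the same rule applied to a column while
# reading a whole (made-up) file.
df <- read_csv(I("id\n1\n2\n3\n"), col_types = cols(id = col_integer()))

# Both routes should give the same integers.
all(v == df$id)
```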
It’s easiest to learn the vector parsers using the parse_*() functions. These all take a character vector and some options. They return a new vector the same length as the old, along with an attribute describing any problems.
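A minimal sketch of that contract, using an input chosen so that one element fails to parse:

```r
library(readr)

# "abc" is not an integer, so it becomes NA and is recorded as a problem.
# suppressWarnings() only hides the console warning; the problems remain attached.
x <- suppressWarnings(parse_integer(c("1", "abc", "3")))

length(x)    # same length as the input
is.na(x[2])  # the failed value is NA
problems(x)  # tibble with one row per parsing failure
```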
parse_integer() and parse_double() are strict: the input string must be a single number with no leading or trailing characters.
parse_number() is more flexible: it ignores non-numeric prefixes and suffixes, and knows how to deal with grouping marks. This makes it suitable for reading currencies and percentages:
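The contrast between the strict and flexible parsers can be seen on a few example strings (the variable names are only for illustration):

```r
library(readr)

# Strict: plain numbers parse cleanly.
plain <- parse_double(c("1.5", "-2", "1e3"))

# Strict: a currency string is rejected, giving NA plus a recorded problem.
strict <- suppressWarnings(parse_double("$1,234.5"))

# Flexible: prefixes, suffixes and grouping marks are ignored.
pct <- parse_number(c("0%", "10%", "150%"))
#> [1]   0  10 150
money <- parse_number(c("$1,234.5", "$12.45"))
#> [1] 1234.50   12.45
```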
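readr supports three types of date/time data: dates (days since 1970-01-01), times (seconds since midnight), and date-times (seconds since midnight 1970-01-01).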
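A short sketch of one parser per type, relying on the default formats; the input strings are arbitrary examples.

```r
library(readr)

d  <- parse_date("2010-10-01")            # Date
tm <- parse_time("21:45")                 # hms (a difftime in seconds)
dt <- parse_datetime("2010-10-01 21:45")  # POSIXct, UTC by default

class(d); class(tm); class(dt)
as.numeric(d)   # days since 1970-01-01
as.numeric(tm)  # seconds since midnight
```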
Each function takes a format argument which describes the format of the string. If not specified, it uses a default value:
parse_time() uses the time_format specified by the locale(). The default value is %AT, which uses an automatic time parser that recognises times of the form H:M optionally followed by seconds and am/pm.
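For instance, the default time parser should accept each of the shapes described above. A small sketch with arbitrary inputs:

```r
library(readr)

# H:M, H:M:S, and H:M with am/pm, all parsed with the default time_format.
times <- parse_time(c("10:15", "20:10:01", "1:00pm"))

as.numeric(times)  # seconds since midnight for each element
```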
In most cases, you will need to supply a format, as documented in the help for parse_datetime().
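The sketch below shows why an explicit format matters: the same string is a different date depending on whether the day or the month comes first. The inputs are made-up examples.

```r
library(readr)

dt <- parse_datetime("2010/01/01 12:00", format = "%Y/%m/%d %H:%M")

# Same text, two readings:
day_first   <- parse_date("01/02/15", format = "%d/%m/%y")  # 1 February 2015
month_first <- parse_date("01/02/15", format = "%m/%d/%y")  # 2 January 2015
```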
When reading a column that has a known set of values, you can read directly into a factor.
parse_factor() will generate a warning if a value is not in the supplied levels.
```r
parse_factor(c("a", "b", "a"), levels = c("a", "b", "c"))
#> [1] a b a
#> Levels: a b c
parse_factor(c("a", "b", "d"), levels = c("a", "b", "c"))
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  -- value in level set      d
#> [1] a    b    <NA>
#> attr(,"problems")
#> # A tibble: 1 x 4
#>     row   col expected           actual
#>   <int> <int> <chr>              <chr>
#> 1     3    NA value in level set d
#> Levels: a b c
```
It would be tedious if you had to specify the type of every column when reading a file. Instead, readr uses some heuristics to guess the type of each column. You can access these results yourself using guess_parser().
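A few quick calls show what the heuristics return for simple inputs (the inputs are arbitrary examples):

```r
library(readr)

g_chr <- guess_parser(c("a", "b", "c"))    #> "character"
g_lgl <- guess_parser(c("TRUE", "FALSE"))  #> "logical"
g_dbl <- guess_parser(c("1.5", "2.25"))    #> "double"
g_dat <- guess_parser("2010-10-10")        #> "date"
```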
The guessing policies are described in the documentation for the individual functions. Guesses are fairly strict. For example, we don’t guess that currencies are numbers, even though we can parse them:
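The currency case looks like this: the guesser falls back to character, while the flexible number parser still extracts a value.

```r
library(readr)

guessed <- guess_parser("$1,234")  #> "character"
parsed  <- parse_number("$1,234")  #> 1234
```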
For bigger files, you can often make the specification simpler by changing the default column type using cols_condense():
```r
mtcars_spec <- spec_csv(readr_example("mtcars.csv"))
#> Parsed with column specification:
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
mtcars_spec
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )
cols_condense(mtcars_spec)
#> cols(
#>   .default = col_double()
#> )
```
By default, readr only looks at the first 1000 rows. This keeps file parsing speedy, but can generate incorrect guesses. For example, in challenge.csv the column types change in row 1001, so readr guesses the wrong types. One way to resolve the problem is to increase the number of rows used for guessing with the guess_max argument.
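A sketch of that fix on the bundled example file. The name df_more is made up here, and the exact guesses may differ between readr versions, so inspect the specification rather than assuming it.

```r
library(readr)

# Let readr look one row past the point where the column types change.
df_more <- suppressMessages(
  read_csv(readr_example("challenge.csv"), guess_max = 1001)
)

spec(df_more)      # column types chosen with the larger guessing window
problems(df_more)  # parsing failures, if any remain
```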
Another way is to manually specify the col_types, as described below.
readr comes with five parsers for rectangular file formats:
read_csv() and read_csv2() for csv files
read_tsv() for tab-separated files
read_fwf() for fixed-width files
read_log() for web log files
Each of these functions first calls spec_xxx() (as described above), and then parses the file according to that column specification:
```r
df1 <- read_csv(readr_example("challenge.csv"))
#> Parsed with column specification:
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
#> Warning: 1000 parsing failures.
#>  row col           expected     actual                              file
#> 1001   y 1/0/T/F/TRUE/FALSE 2015-01-16 '.../readr/extdata/challenge.csv'
#> 1002   y 1/0/T/F/TRUE/FALSE 2018-05-18 '.../readr/extdata/challenge.csv'
#> 1003   y 1/0/T/F/TRUE/FALSE 2015-09-05 '.../readr/extdata/challenge.csv'
#> 1004   y 1/0/T/F/TRUE/FALSE 2012-11-28 '.../readr/extdata/challenge.csv'
#> 1005   y 1/0/T/F/TRUE/FALSE 2020-01-13 '.../readr/extdata/challenge.csv'
#> .... ... .................. .......... ..................................................................
#> See problems(...) for more details.
```
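The two-step behaviour described above can be approximated by hand: compute the specification first, then pass it back in as col_types. This is a sketch on made-up inline data, with the hypothetical names csv and guessed_spec, and is not a description of readr's internals.

```r
library(readr)

# Hypothetical inline data; I() marks it as literal data rather than a path.
csv <- I("x,y\n1,a\n2,b\n")

# Step 1: work out the column specification without keeping the data.
guessed_spec <- suppressMessages(spec_csv(csv))

# Step 2: parse the file according to that specification.
df <- read_csv(csv, col_types = guessed_spec)
```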
The rectangular parsing functions almost always succeed; they’ll only fail if the format is severely messed up. Rather than stopping on individual parsing failures, readr records them in a data frame of problems. The first few will be printed out, and you can access them all with problems():
```r
problems(df1)
#> # A tibble: 1,000 x 5
#>      row col   expected           actual     file
#>    <int> <chr> <chr>              <chr>      <chr>
#>  1  1001 y     1/0/T/F/TRUE/FALSE 2015-01-16 '/Users/jhester/Library/R/3.…
#>  2  1002 y     1/0/T/F/TRUE/FALSE 2018-05-18 '/Users/jhester/Library/R/3.…
#>  3  1003 y     1/0/T/F/TRUE/FALSE 2015-09-05 '/Users/jhester/Library/R/3.…
#>  4  1004 y     1/0/T/F/TRUE/FALSE 2012-11-28 '/Users/jhester/Library/R/3.…
#>  5  1005 y     1/0/T/F/TRUE/FALSE 2020-01-13 '/Users/jhester/Library/R/3.…
#>  6  1006 y     1/0/T/F/TRUE/FALSE 2016-04-17 '/Users/jhester/Library/R/3.…
#>  7  1007 y     1/0/T/F/TRUE/FALSE 2011-05-14 '/Users/jhester/Library/R/3.…
#>  8  1008 y     1/0/T/F/TRUE/FALSE 2020-07-18 '/Users/jhester/Library/R/3.…
#>  9  1009 y     1/0/T/F/TRUE/FALSE 2011-04-30 '/Users/jhester/Library/R/3.…
#> 10  1010 y     1/0/T/F/TRUE/FALSE 2010-05-11 '/Users/jhester/Library/R/3.…
#> # ... with 990 more rows
```
You’ve already seen one way of handling bad guesses: increasing the number of rows used to guess the type of each column.
Another approach is to manually supply the column specification.
In the previous examples, you may have noticed that readr printed the column specification that it used to parse the file.
You can also access it after the fact using spec().
(This also allows you to access the full column specification if you’re reading a very wide file. By default, readr will only print the specification of the first 20 columns.)
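A minimal sketch of retrieving the specification from an already-parsed tibble; the inline data is made up.

```r
library(readr)

df <- suppressMessages(read_csv(I("x,y\n1,a\n2,b\n")))

# The specification used for parsing travels with the tibble.
s <- spec(df)
s
```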
If you want to manually specify the column types, you can start by copying and pasting this code, and then tweaking it to fix the parsing problems.
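As a sketch of that workflow, the made-up data below has a date column that starts with missing values, a situation where guessing from early rows alone could go wrong. Supplying the types explicitly removes the guesswork. The data and variable names are hypothetical.

```r
library(readr)

# Hypothetical data: y is missing at first, then holds dates.
csv <- I("x,y\n404,NA\n4172,NA\n0.5,2015-01-16\n")

df <- read_csv(csv, col_types = cols(
  x = col_double(),
  y = col_date()   # default format: ISO8601-style dates
))
```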
In general, it’s good practice to supply an explicit column specification. It is more work, but it ensures that you get warnings if the data changes in unexpected ways. To be really strict, you can use stop_for_problems(df3). This will throw an error if there are any parsing problems, forcing you to fix those problems before proceeding with the analysis.
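The strict mode can be sketched on deliberately bad input. The data frame bad and the messages in the tryCatch() handler are made up for illustration.

```r
library(readr)

# "abc" cannot be parsed as an integer, so this read records one problem.
bad <- suppressWarnings(
  read_csv(I("x\n1\nabc\n3\n"), col_types = cols(x = col_integer()))
)

# stop_for_problems() turns recorded problems into an error.
result <- tryCatch(
  {
    stop_for_problems(bad)
    "no problems"
  },
  error = function(e) "parsing problems found"
)
result
```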
The output of all these functions is a tibble. Note that characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE) and column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Row names are never set.
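These properties are easy to check on a small example. The column names below are deliberately not valid R identifiers, and the data is made up.

```r
library(readr)

df <- suppressMessages(read_csv(I("my col,2nd\na,1\nb,2\n")))

inherits(df, "tbl_df")       # the result is a tibble
is.character(df[["my col"]]) # strings stay character, not factor
names(df)                    # names are kept exactly as written in the file
```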