I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:
Web log files with
You can install it by running:
Compared to the equivalent base functions, readr functions are around 10x faster. They’re also easier to use because they’re more consistent, they produce data frames that are easier to work with (no more stringsAsFactors = FALSE!), they have a more flexible column specification, and any parsing problems are recorded in a data frame. Each of these features is described in more detail below.
All readr functions work the same way. There are four important arguments:
file gives the file to read: a URL or local path. A local path can point to a zipped, bzipped, xzipped, or gzipped file; it’ll be automatically uncompressed in memory before reading. You can also pass in a connection or a raw vector.
For small examples, you can also supply literal data: if file contains a new line, then the data will be read directly from the string. Thanks to data.table for this great idea!
col_names: describes the column names (equivalent to header in base R). It has three possible values:
TRUE will use the first row of data as column names.
FALSE will number the columns sequentially.
A character vector will be used as the column names.
col_types: overrides the default column types (equivalent to colClasses in base R). More on that below.
progress: By default, readr will display a progress bar if the estimated loading time is greater than 5 seconds. Use progress = FALSE to suppress the progress indicator.
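Here’s a minimal sketch of these arguments in action. The inline CSV strings and the column names id and label are made up for illustration; because each string contains a new line, readr should treat it as literal data rather than as a path.

```r
library(readr)

# Literal data: the string contains a new line, so it is parsed directly.
with_header <- read_csv("x,y\n1,a\n2,b", progress = FALSE)

# col_names = FALSE: the first row is treated as data and the columns are
# numbered sequentially.
no_header <- read_csv("1,a\n2,b", col_names = FALSE, progress = FALSE)

# col_names as a character vector: supply your own names
# (id and label are hypothetical).
named <- read_csv("1,a\n2,b", col_names = c("id", "label"), progress = FALSE)

names(with_header)
names(no_header)
names(named)
```

The same calls should work unchanged if the first argument is a path or URL instead of a literal string.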
The output has been designed to make your life easier:
Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE!).
Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Use backticks to refer to variables with unusual names.
The output has class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).
Row names are never set.
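To make those points concrete, here’s a small sketch. The column name Income ($000) and the data are invented for the example, and the exact class vector you see may differ between readr versions.

```r
library(readr)

df <- read_csv("Income ($000),name\n10,a\n20,b")

names(df)            # column names are kept exactly as written in the data
df$`Income ($000)`   # backticks are needed: the name is not a valid R identifier
class(df$name)       # strings stay character vectors, not factors
class(df)            # a data frame that also carries the tbl classes
```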
Readr heuristically inspects the first 100 rows to guess the type of each column. This is not perfect, but it’s fast and it’s a reasonable start. Readr can automatically detect these column types:
col_logical() [l], contains only
col_euro_double() [e], “Euro” doubles that use , as the decimal separator.
col_date() [D], Y-m-d dates.
col_datetime() [T], ISO8601 date times.
col_character() [c], everything else.
You can manually specify other column types:
col_skip() [_], don’t import this column.
col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, . (this is useful for parsing currency data).
col_factor(levels, ordered), parse a fixed set of known values into a (optionally ordered) factor.
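As a rough sketch of two of these manual types, the example below skips one column and reads another as an ordered factor. The data, the column names id and size, and the level order are all assumptions made for illustration; the specifications are passed through the col_types argument.

```r
library(readr)

csv <- "id,size\n1,small\n2,large\n3,small"

df <- read_csv(
  csv,
  col_types = list(
    id = col_skip(),                               # don't import this column
    size = col_factor(levels = c("small", "large"), ordered = TRUE)
  )
)

names(df)
levels(df$size)
```

Values that are not in levels can't be parsed into the factor, so it's worth checking the result on real data.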
There are two ways to override the default choices with the col_types argument:
Use a compact string: "dc__d". Each letter corresponds to a column, so this specification means: read the first column as double, the second as character, skip the next two, and read the last column as a double. (There’s no way to use this form with column types that need parameters.)
Use a (named) list of col objects:
Any omitted columns will be parsed automatically, so the previous call is equivalent to:
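Here’s a sketch of the three spellings side by side, assuming a made-up five-column file with columns a to e. All three calls should give the same result, because the columns left out of the last list are guessed automatically.

```r
library(readr)

# Hypothetical data: five columns, two of which we don't want.
csv <- "a,b,c,d,e\n1,x,skip,skip,2.5\n2,y,skip,skip,3.5"

# Compact string: double, character, skip, skip, double.
by_string <- read_csv(csv, col_types = "dc__d")

# Named list of col objects, spelling out every column.
by_list <- read_csv(csv, col_types = list(
  a = col_double(),
  b = col_character(),
  c = col_skip(),
  d = col_skip(),
  e = col_double()
))

# Named list that only mentions the columns to skip; the rest are guessed.
partial <- read_csv(csv, col_types = list(c = col_skip(), d = col_skip()))

names(by_string)
```

The compact string is handy for quick interactive work; the named list is longer but doesn't depend on column order.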
One of the most helpful features of readr is its ability to import dates and date times. It can automatically recognise the following formats:
Dates in year-month-day form: 2010/10/15 (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?
Date times in ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 etc. I don’t support every possible variant yet, so please let me know if it doesn’t work for your data (more details in
If your dates are in another format, don’t despair. You can use col_datetime() to explicitly specify a format string. Readr implements its own strptime() equivalent which supports the following format strings:
Year: %Y (4 digits), %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
Month: %m (2 digits), %b (abbreviated name in current locale), %B (full name in current locale).
Day: %d (2 digits), %e (optional leading space).
Seconds: %S (integer seconds), %OS (partial seconds).
Time zone: %Z (as name), %z (as offset from UTC).
Non-digits: %. skips one non-digit character, %* skips any number of non-digit characters.
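Here’s a minimal sketch of both routes, using invented data and the hypothetical column names day and when. The first call relies on automatic detection of Y-m-d dates; the second assumes the dates are day/month/year and says so with an explicit format string.

```r
library(readr)

# Y-m-d dates are recognised without any help.
auto <- read_csv("id,day\n1,2010-10-15\n2,2011-01-02")

# d/m/y is ambiguous, so spell the format out with col_datetime().
explicit <- read_csv(
  "id,when\n1,02/01/2015\n2,15/03/2015",
  col_types = list(when = col_datetime("%d/%m/%Y"))
)

class(auto$day)
format(explicit$when, "%Y-%m-%d")
```

Swapping the format to "%m/%d/%Y" would read the first value as the 1st of February instead, and the second value would then fail to parse, which is exactly why readr won’t guess here.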
If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:
```r
csv <- "x,y
1,a
b,2
"

df <- read_csv(csv, col_types = "ii")
#> Warning: 2 parsing failures.
#> row col   expected actual         file
#>   1   y an integer      a literal data
#>   2   x an integer      b literal data

problems(df)
#> # A tibble: 2 x 5
#>     row col   expected   actual file
#>   <int> <chr> <chr>      <chr>  <chr>
#> 1     1 y     an integer a      literal data
#> 2     2 x     an integer b      literal data

df
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <int>
#> 1     1    NA
#> 2    NA     2
```
Readr also provides a handful of other useful functions:
read_lines() works the same way as readLines(), but is a lot faster.
read_file() reads a complete file into a string.
type_convert() attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers. It uses the same rules as the