Column specifications are now coloured when printed. This makes it easy to see at a glance when a column is input as a different type than the rest. Colouring can be disabled by setting
options(crayon.enabled = FALSE).
Fix for compilation using custom architectures on macOS (#919)
Fix for valgrind errors (#941)
readr’s blank line skipping has been modified to be more consistent and to avoid edge cases that affected the behavior in 1.2.0. The skip parameter now behaves more like it did before readr 1.2.0, but in addition the parameter
skip_empty_rows can be used to control whether fully blank lines are skipped. (#923)
readr 1.3.0 returns results with a
spec_tbl_df subclass. This differs from a regular tibble only in that the
spec attribute (which holds the column specification) is lost as soon as the object is subset (and a normal
tbl_df object is returned).
Previously, tbl_df objects lost their attributes once they were subset. However, recent versions of tibble retain the attributes when subsetting, so the
spec_tbl_df subclass is needed to ensure the previous behavior.
This should only break compatibility if you are explicitly checking the class of the returned object. A way to get backwards compatible behavior is to call subset with no arguments on your object, e.g.
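A minimal sketch of this, assuming readr 1.3.0 or later and a small inline CSV string invented for illustration: subsetting with empty brackets returns a plain tbl_df without the subclass.

```r
library(readr)

# Inline CSV used purely for illustration.
df <- read_csv(
  "x,y\n1,a\n2,b\n",
  col_types = cols(x = col_double(), y = col_character())
)

# The freshly read object carries the spec_tbl_df subclass.
has_subclass <- inherits(df, "spec_tbl_df")

# Subsetting with no arguments demotes it to a regular tbl_df.
plain <- df[]
demoted <- inherits(plain, "tbl_df") && !inherits(plain, "spec_tbl_df")
```

Code that checks the class of the result can then keep testing for tbl_df as before.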
hms objects with NA values are now written without whitespace padding (#930).
read_*() functions now return
spec_tbl_df objects, which differ from regular
tbl_df objects only in that the
spec attribute is removed (and they are demoted to regular
tbl_df objects) as soon as they are subset (#934).
write_csv2() now properly respects the
readr functions no longer guess that columns are of type integer; instead, these columns are guessed as numeric. Because R uses 32-bit integers and 64-bit doubles, all integers can be stored in doubles, guaranteeing no loss of information. This change was made to remove errors when numeric columns were incorrectly guessed as integers. If you know a certain column is an integer and would like to read it as such, you can do so by specifying the column type explicitly with the
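To illustrate the difference between guessed and explicit column types, here is a small sketch. The inline CSV and its column names are made up for the example, and cols() with col_integer() is one way to request integers explicitly.

```r
library(readr)

# Inline CSV invented for this example.
csv <- "id,score\n1,10\n2,20\n"

# With guessing, whole numbers are read as doubles.
guessed <- suppressMessages(read_csv(csv))

# With an explicit specification, the id column is read as integer.
typed <- read_csv(csv, col_types = cols(id = col_integer(), score = col_double()))

guessed_is_double <- is.double(guessed$id)
typed_is_integer <- is.integer(typed$id)
```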
readr now always skips blank lines automatically when parsing, which may change the number of lines you need to pass to the
skip parameter. For instance, if your file had one blank line followed by two more lines you want to skip, you would previously pass
skip = 3; now you only need to pass
skip = 2.
There is now a family of
melt_*() functions in readr. These functions store data in ‘long’ or ‘melted’ form, where each row corresponds to a single value in the dataset. This form is useful when your data is ragged and not rectangular.
data <- "a,b,c
1,2
w,x,y,z"

readr::melt_csv(data)
#> # A tibble: 9 x 4
#>     row   col data_type value
#>   <dbl> <dbl> <chr>     <chr>
#> 1     1     1 character a
#> 2     1     2 character b
#> 3     1     3 character c
#> 4     2     1 integer   1
#> 5     2     2 integer   2
#> 6     3     1 character w
#> 7     3     2 character x
#> 8     3     3 character y
#> 9     3     4 character z
Thanks to Duncan Garmonsway (@nacnudus) for great work on the idea and implementation of the
readr 1.2.0 changes how R connections are parsed by readr. In previous versions of readr the connections were read into an in-memory raw vector, then passed to the readr functions. This made reading connections from small to medium datasets fast, but also meant that the dataset had to fit into memory at least twice (once for the raw data, once for the parsed data). It also meant that reading could not begin until the full vector was read through the connection.
Now we instead write the connection to a temporary file (in the R temporary directory), then parse that temporary file. This means connections may take a little longer to be read, but also means they will no longer need to fit into memory. It also allows the use of the chunked readers to process the data in parts.
Future improvements to readr would allow it to parse data from connections in a streaming fashion, which would avoid many of the drawbacks of either method.
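As a rough sketch of reading a connection in parts, the example below writes a small temporary file, opens it as a file() connection, and passes it to read_csv_chunked(). The file contents and the row-counting callback are hypothetical and exist only to show the shape of the call.

```r
library(readr)

# Write a small example file, then read it back through a connection.
path <- tempfile(fileext = ".csv")
write_csv(data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE), path)

# Hypothetical callback: record the starting position and size of each chunk.
count_rows <- DataFrameCallback$new(function(chunk, pos) {
  data.frame(first_row = pos, n = nrow(chunk))
})

# Process the connection two rows at a time.
chunk_sizes <- read_csv_chunked(
  file(path),
  callback = count_rows,
  chunk_size = 2,
  col_types = cols(x = col_integer(), y = col_character())
)

total_rows <- sum(chunk_sizes$n)
```

Each chunk is handed to the callback as a data frame, so only one chunk of parsed data needs to be held at a time.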
melt_*() functions added for reading ragged data (#760, @nacnudus).
AccumulateCallback R6 class added to provide an example of accumulating values in a single result (#689, @blakeboswell).
read_fwf() can now accept overlapping field specifications (#692, @gergness)
type_convert() now allows character column specifications and also silently skips non-character columns (#369, #699)
trim_ws argument to control whether the fields should be trimmed before parsing (#636, #735).
parse_number() now parses numbers in scientific notation using
write_excel_csv2() function to allow writing csv files with comma as a decimal separator and semicolon as a column separator (#753, @olgamie).
read_*() functions now support reading from the clipboard by using
sep argument, to specify the line separator (#665).
sftp as a URL protocol (#707, @jdeboer).
parse_date*() accepts %a for local day of week (#763, @tigertoes).
write_csv2() added to complement
write_excel_csv2() and allow writing csv files readable by
as.col_spec() is now exported (#517).
write*() functions gain a
quote_escape argument to control how quotes are escaped in the output (#854).
read*() functions now have a more informative error when trying to read a remote bz2 file (#891).
spec_table2() function added to correspond to
levels = NULL by default (#862, @mikmart).
"f" can now be used as a shortcode for
read_delim() and friends (#810, @mikmart).
standardise_path() now uses a case-insensitive comparison for the file extensions (#794).
parse_guess() now guesses logical types when given (lowercase) ‘true’ and ‘false’ inputs (#818).
read_*() functions no longer print a progress bar when running inside an RStudio notebook chunk (#793)
read_table2() now skips comments anywhere in the file (#908).
parse_factor() now handles the case of empty strings separately, so you can have a factor level that is an empty string (#864).
read_delim() now correctly reads quoted headers with embedded newlines (#784).
fwf_positions() now always returns
col_names as a character (#797).
format_*() now explicitly marks its output encoding as UTF-8 (#697).
read_delim() now ignores whitespace between the delimiter and quoted fields (#668).
read_table2() now properly ignores blank lines at the end of a file like
read_table() now skip blank lines at the start of a file (#680, #747).
guess_parser() now guesses a logical type for columns which are all missing. This is useful when binding multiple files together where some files have missing columns. (#662).
read_*() now converts string
files to UTF-8 before parsing, which is convenient for non-UTF-8 platforms in most cases (#730, @yutannihilation).
write_csv() writes integers up to 10^15 without scientific notation (#765, @zeehio)
read_*() no longer throws a “length of NULL cannot be changed” warning when trying to resize a skipped column (#750, #833).
read_*() now handles non-ASCII paths properly with R >=3.5.0 on Windows (#838, @yutannihilation).
trim_ws parameter now trims both spaces and tabs (#767)
locale(tz = "") after loading a timezone due to incomplete reinitialization of the global locale.
include_na argument, to include
NA in the factor levels (#541).
parse_factor() can now accept
levels = NULL, which allows one to generate factor levels based on the data (like stringsAsFactors = TRUE) (#497).
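A short sketch of the two parse_factor() options mentioned above, using a made-up input vector: with levels = NULL the levels are taken from the data in order of appearance, and include_na controls whether NA becomes a level.

```r
library(readr)

# Example input invented for illustration; it contains one missing value.
x <- c("b", "a", "b", NA)

# Levels are discovered from the data; NA is left out of the levels.
f <- parse_factor(x, levels = NULL, include_na = FALSE)

# Same data, but NA is added as an explicit level.
f_na <- parse_factor(x, levels = NULL, include_na = TRUE)

data_levels <- levels(f)
na_is_level <- anyNA(levels(f_na))
```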
parse_numeric() now returns the full string if it contains no numbers (#548).
parse_time() now correctly handles 12 AM/PM (#579).
problems() now returns the file path in addition to the location of the error in the file (#581).
read_csv2() gives a message if it updates the default locale (#443, @krlmlr).
read_delim() now signals an error if given an empty delimiter (#557).
write_*() functions writing whole number doubles no longer write a trailing
fwf_cols() allows for specifying the
read_fwf() with named arguments of either column positions or widths (#616, @jrnold).
n argument to control how many lines are read for whitespace to determine column structure (#518, @Yeedle).
read_fwf() gives error message if specifications have overlapping columns (#534, @gergness)
read_table() can now handle
read_table() can now handle files with many lines of leading comments (#563).
read_table2() which allows any number of whitespace characters as delimiters, a more exact replacement for
parse_numeric() have been removed.
guess_encoding() returns a tibble, and works better with lists of raw vectors (as returned by
ListCallback R6 class to provide a more flexible return type for callback functions (#568, @mmuurr)
tibble::as.tibble() now used to construct tibbles (#538).
quote argument (#631, @noamross)
parse_factor() now converts data to UTF-8 based on the supplied locale (#615).
read_*() functions with the
guess_max argument now throw errors on inappropriate inputs (#588).
read_*_chunked() functions now properly end the stream if
FALSE is returned from the callback.
read_fwf() when columns are skipped using
col_types now report the correct column name (#573, @cb4ds).
spec() declarations that are long now print properly (#597).
read_table() does not print
guess_encoding() now returns a tibble for all ASCII input as well (#641).
The process by which readr guesses the types of columns has received a substantial overhaul to make it easier to fix problems when the initial guesses aren’t correct, and to make it easier to generate reproducible code. Now column specifications are printed by default when you read from a file:
And you can extract those values after the fact with
This makes it easier to quickly identify parsing problems and fix them (#314). If the column specification is long, the new
cols_condense() is used to condense the spec by identifying the most common type and setting it as the default. This is particularly useful when only a handful of columns have a different type (#466).
Once you have figured out the correct column types for a file, it’s often useful to make the parsing strict. You can do this either by copying and pasting the printed output, or for very long specs, saving the spec to disk with
write_rds(). In production scripts, combine this with
stop_for_problems() (#465): if the input data changes form, you’ll fail fast with an error.
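The strict-parsing workflow described above might look like the following sketch. The column specification, the inline CSV strings, and the temporary file path are all invented for illustration.

```r
library(readr)

# A column specification, as it might be copied from the printed output.
strict_spec <- cols(x = col_integer(), y = col_character())

# For long specs, the specification object can be saved to disk and reloaded.
spec_path <- tempfile(fileext = ".rds")
write_rds(strict_spec, spec_path)
strict_spec <- read_rds(spec_path)

# Data that matches the spec passes the check silently.
good <- read_csv("x,y\n1,a\n2,b\n", col_types = strict_spec)
stop_for_problems(good)

# Data that has changed form records parsing problems, and the check errors.
bad <- suppressWarnings(read_csv("x,y\n1,a\noops,b\n", col_types = strict_spec))
failed_fast <- inherits(try(stop_for_problems(bad), silent = TRUE), "try-error")
```

In a production script the try() wrapper would be dropped, so the error stops the run.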
You can now also adjust the number of rows that readr uses to guess the column types with
extdata/challenge.csv which is carefully created to cause problems with the default column type guessing heuristics.
Single ‘-’ or ‘.’ are now parsed as characters, not numbers (#297).
Numbers followed by a single trailing character are parsed as character, not numbers (#316).
We now guess at times using the
time_format specified in the
We have made a number of improvements to the reification of the
col_names and the actual data:
Missing column names are now given a default name (
X7 etc) (#318). Duplicated column names are now deduplicated. Both changes generate a warning; to suppress it supply an explicit
skip = 1 if there’s an existing ill-formed header).
col_types() accepts a named list as input (#401).
The date time parsers recognise three new format strings:
%I for 12 hour time format (#340).
%AT are “automatic” date and time parsers. They are both slightly less flexible than previous defaults. The automatic date parser requires a four digit year, and only accepts
/ as separators (#442). The flexible time parser now requires colons between hours and minutes and optional seconds (#424).
%Y are now strict and require 2 or 4 characters respectively.
Date and time parsing functions received a number of small enhancements:
parse_number() is slightly more flexible - it now parses numbers up to the first ill-formed character. For example
parse_number("...3...") now return -3 and 3 respectively. We also fixed a major bug where parsing negative numbers yielded positive values (#308).
parse_logical() now accepts
1 as well as lowercase
read_*() functions gain a
quoted_na argument to control whether missing values within quotes are treated as missing values or as strings (#295).
Experimental support for chunked reading and writing (
read_*_chunked()) functions. The API is unstable and subject to change in the future (#427).
Printing double values now uses an implementation of the grisu3 algorithm which speeds up writing of large numeric data frames by ~10X. (#432) ‘.0’ is appended to whole number doubles, to ensure they will be read as doubles as well. (#483)
read_*() can read into long vectors, substantially increasing the number of rows you can read (#309).
read_fwf() received a number of improvements:
You can now read fixed width files with ragged final columns, by setting the final end position in
fwf_positions() or final width in
NA (#353, @ghaarsma).
fwf_empty() does this automatically.
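As a small sketch of reading a ragged final column, the example below sets the last end position to NA in fwf_positions(). The two-line input and the column names id and val are made up for the example.

```r
library(readr)

# Two fixed-width records; the final field is 3 characters on the first
# line and 4 characters on the second.
fwf_text <- "ab123\ncd4567\n"

ragged <- read_fwf(
  fwf_text,
  col_positions = fwf_positions(
    start = c(1, 3),
    end = c(2, NA),  # NA lets the last column run to the end of each line
    col_names = c("id", "val")
  ),
  col_types = "cd"
)
```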
readr_example() makes it easy to access example files bundled with readr.
Doubles are parsed with
boost::spirit::qi::long_double to work around a bug in the spirit library when parsing large numbers (#412).
Fix bug when detecting column types for single row files without headers (#333).
readr now has a strategy for dealing with settings that vary from place to place: locales. The default locale is still US centric (because R itself is), but you can now easily override the default timezone, decimal separator, grouping mark, day & month names, date format, and encoding. This has led to a number of changes:
locale() controls all the input settings that vary from place-to-place.
parse_euro_double() have been deprecated. Use the
decimal_mark parameter to
The default encoding is now UTF-8. To load files that are not in UTF-8, set the
encoding parameter of the
locale() (#40). New
guess_encoding() function uses stringi to help you figure out the encoding of a file.
%b use the month names (full and abbreviated) defined in the locale (#242). They also inherit the tz from the locale, rather than using an explicit
vignette("locales") for more details.
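A brief sketch of how a locale changes parsing, using example values chosen for illustration: a comma decimal mark, a period grouping mark, and French month names.

```r
library(readr)

# Decimal comma instead of decimal point.
comma_decimal <- parse_double("1,23", locale = locale(decimal_mark = ","))

# Period as grouping mark together with a decimal comma.
grouped <- parse_number(
  "1.234,56",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

# Month names are taken from the locale; here, French.
french_date <- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```

The same locale object can be passed to the read_*() functions so that a whole file is parsed consistently.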
cols() lets you pick the default column type for columns not otherwise explicitly named (#148). You can refer to parsers either with their full name (e.g.
col_character()) or their one letter abbreviation (e.g.
read_fwf() is now much more careful with new lines. If a line is too short, you’ll get a warning instead of a silent mistake (#166, #254). Additionally, the last column can now be ragged: the width of the last field is silently extended until it hits the next line break (#146). This appears to be a common feature of “fixed” width files in the wild.
comment argument allows you to ignore comments (#68).
trim_ws argument controls whether leading and trailing whitespace is removed. It defaults to
Specifying the wrong number of column names, or having rows with an unexpected number of columns, generates a warning, rather than an error (#189).
Multiple NA values can be specified by passing a character vector to
na (#125). The default has been changed to
na = c("", "NA"). Specifying
na = "" now works as expected with character columns (#114).
vignette("column-types") which describes how the defaults work and how to override them (#122).
col_time() allows you to parse times (hours, minutes, seconds) into number of seconds since midnight. If the format is omitted, it uses a flexible parser that looks for hours, then optional colon, then minutes, then optional colon, then optional seconds, then optional am/pm (#249).
parse_datetime() no longer incorrectly reads partial dates (e.g. 19, 1900, 1900-01) (#136). These triggered common false positives and after re-reading the ISO8601 spec, I believe they actually refer to periods of time, and should not be translated into a specific instant (#228).
“%.” now requires a non-digit. New “%+” skips one or more non-digits.
You can now use
%p to refer to AM/PM (and am/pm) (#126).
%B formats (month and abbreviated month name) ignore case when matching (#219).
parse_number() is a somewhat flexible numeric parser designed to read currencies and percentages. It only reads the first number from a string (using the grouping mark defined by the locale).
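For instance, with inputs invented for illustration and the default locale (comma as grouping mark), parse_number() might be used like this:

```r
library(readr)

# Currency symbol and grouping mark are ignored.
price <- parse_number("$1,234.56")

# A trailing percent sign is dropped.
percent <- parse_number("45%")

# Only the first number in the string is read.
first_only <- parse_number("12 apples and 7 pears")
```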
parse_numeric() has been deprecated because the name is confusing - it’s a flexible number parser, not a parser of “numerics”, as R collectively calls doubles and integers. Use
As well as improvements to the parser, I’ve also made a number of tweaks to the heuristics that readr uses to guess column types:
Bumped up row inspection for column typing guessing from 100 to 1000.
A column is guessed as
col_number() only if it parses as a regular number when you ignore the grouping marks.
Now use R’s platform independent
iconv wrapper, thanks to BDR (#149).
Pathological zero row inputs (due to empty input,
n_max) now return zero row data frames (#119).
col_types specification now understands
? (guess) and
- (skip) (#188).
parse_*() gains a
na argument that allows you to specify which values should be converted to missing.
problems() now reports column names rather than column numbers (#143). Whenever there is a problem, the first five problems are printed out in a warning message, so you can more easily see what’s wrong.
read_*() can read from a remote gz compressed file (#163).
read_lines() gains a progress bar. It now also correctly checks for interrupts every 500,000 lines so you can interrupt long running jobs. It also correctly estimates the number of lines in the file, considerably speeding up the reading of large files (60s -> 15s for a 1.5 Gb file).
read_lines_raw() allows you to read a file into a list of raw vectors, one element for each line.
trim_ws arguments, and removes missing values before determining column types.
Quotes are only used when they’re needed (#116): when the string contains a quote, the delimiter, a new line or NA.
na argument that specifies how missing values should be written (#187)
POSIXt vectors are saved in an ISO8601 compatible format (#134).
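The writing behavior listed above can be inspected without touching the disk by using format_csv(), which returns the output as a string. This is only a sketch, and the data frames are invented for the example.

```r
library(readr)

# One value needs quoting (it contains the delimiter), the other does not.
df <- data.frame(x = c("a", "b,c"), stringsAsFactors = FALSE)
out <- format_csv(df)

# Date-times are written in an ISO8601 compatible form.
stamp <- format_csv(
  data.frame(t = as.POSIXct("2015-01-01 12:00:00", tz = "UTC"))
)

quoted_when_needed <- grepl('"b,c"', out, fixed = TRUE)
plain_left_unquoted <- !grepl('"a"', out, fixed = TRUE)
```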