This vignette provides an overview of column type specification with readr. Currently it focuses on how automatic guessing works, but over time we expect to cover more topics.

library(readr)

## Automatic guessing

If you don’t explicitly specify column types with the col_types argument, readr will attempt to guess them using some simple heuristics. By default, it inspects 1000 values, evenly spaced from the first to the last row. This heuristic is designed to always be fast (no matter how large your file is) and, in our experience, does a good job in most cases.
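To make "evenly spaced" concrete, here is a minimal base R sketch comparing which row indices two sampling strategies would inspect. It is only an illustration and not readr's actual implementation; the variable names (n_rows, n_guess, first_rows, spread_rows) are hypothetical.

```r
# Illustration only: which rows does each sampling strategy look at?
n_rows  <- 100000  # rows in a hypothetical file
n_guess <- 1000    # number of rows used for guessing

# Strategy A: only the first n_guess rows.
first_rows <- seq_len(min(n_guess, n_rows))

# Strategy B: n_guess rows spread from the first row to the last row.
spread_rows <- unique(round(seq(1, n_rows, length.out = min(n_guess, n_rows))))

# Is the final row of the file among the inspected rows?
n_rows %in% first_rows
n_rows %in% spread_rows
```

Under these assumptions, only the spread-out strategy includes the last row, which is why it is less likely to be fooled by a file whose interesting values appear late.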

If needed, you can request that readr use more rows by supplying the guess_max argument. You can even supply guess_max = Inf to use every row to guess the column types. You might wonder why this isn’t the default. That’s because it’s slow: readr has to look at every value twice, once to determine the column type and once to parse the value. In most cases, you’re best off supplying col_types yourself.
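The two options can be compared side by side. The sketch below builds a small hypothetical file (the name path and its contents are made up for illustration) in which x is empty until the last row, then reads it once with guess_max = Inf and once with explicit col_types.

```r
library(readr)

# Hypothetical file: x is empty for 5000 rows, then holds a single number.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", rep(",y", 5000), "2,y"), path)

# Option 1: let readr use every row when guessing the types.
guessed <- read_csv(path, guess_max = Inf, show_col_types = FALSE)

# Option 2: skip guessing entirely by stating the types up front.
declared <- read_csv(
  path,
  col_types = cols(x = col_double(), y = col_character())
)

class(guessed$x)
class(declared$x)
```

Both calls should report a numeric x here. The second avoids the extra pass over the data and keeps working if the file's contents change in ways that would mislead the guesser.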

### Legacy behavior

Column type guessing was substantially worse in the first edition of readr (that is, prior to v2.0.0) because it always looked at the first 1000 rows. Through some application of Murphy’s Law, many real CSV files appear to have lots of empty values at the start, followed by more “excitement” later in the file. Let’s demonstrate the problem with a slightly tricky file: the column x is mostly empty, but has some numeric data at the very end, in row 1001.

tricky_dat <- tibble::tibble(
  x = rep(c("", "2"), c(1000, 1)),
  y = "y"
)
tfile <- tempfile("tricky-column-type-guessing-", fileext = ".csv")
write_csv(tricky_dat, tfile)

The first edition parser doesn’t guess the right type for x, so the 2 becomes an NA:

df <- with_edition(1, read_csv(tfile))
#>
#> ── Column specification ──────────────────────────────────────────────────
#> cols(
#>   x = col_logical(),
#>   y = col_character()
#> )
#> Warning: 1 parsing failure.
#>  row col           expected actual                                                           file
#> 1001   x 1/0/T/F/TRUE/FALSE      2 '/tmp/RtmpS9Qlhn/tricky-column-type-guessing-21d443a5efd0.csv'
tail(df)
#> # A tibble: 6 × 2
#>   x     y
#>   <lgl> <chr>
#> 1 NA    y
#> 2 NA    y
#> 3 NA    y
#> 4 NA    y
#> 5 NA    y
#> 6 NA    y

For this specific case, we can fix the problem by marginally increasing guess_max:

df <- with_edition(1, read_csv(tfile, guess_max = 1001))
#>
#> ── Column specification ──────────────────────────────────────────────────
#> cols(
#>   x = col_double(),
#>   y = col_character()
#> )
tail(df)
#> # A tibble: 6 × 2
#>       x y
#>   <dbl> <chr>
#> 1    NA y
#> 2    NA y
#> 3    NA y
#> 4    NA y
#> 5    NA y
#> 6     2 y

With the legacy parser, unlike with the second edition, we don’t recommend using guess_max = Inf, because the engine pre-allocates a large amount of memory in the face of this uncertainty. This means that reading with guess_max = Inf can be extremely slow and might even crash your R session. Instead, specify col_types:

df <- with_edition(1, read_csv(tfile, col_types = list(x = col_double())))
tail(df)
#> # A tibble: 6 × 2
#>       x y
#>   <dbl> <chr>
#> 1    NA y
#> 2    NA y
#> 3    NA y
#> 4    NA y
#> 5    NA y
#> 6     2 y
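A column specification can be written in a few equivalent ways. The sketch below uses the default edition and a small made-up inline dataset (csv is a hypothetical name) to show a full cols() specification, the compact string form, and a cols() call with a .default type.

```r
library(readr)

# Hypothetical inline data; I() marks the string as literal data (readr >= 2.0).
csv <- I("x,y\n,a\n2,b\n")

# Full specification: every column named explicitly.
full <- read_csv(csv, col_types = cols(x = col_double(), y = col_character()))

# Compact string: one letter per column, in order ("d" = double, "c" = character).
compact <- read_csv(csv, col_types = "dc")

# Name only the columns you care about and set a default for the rest.
partial <- read_csv(
  csv,
  col_types = cols(x = col_double(), .default = col_character())
)

sapply(full, class)
```

The compact form is convenient for narrow files, while naming columns in cols() is easier to maintain when column order may change.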