Base R equivalents of tidyverse verbs

Did you know that common tidyverse verbs like filter, select, and mutate have base R equivalents? In this post, I’ll show how many of the data wrangling tasks frequently done with the tidyverse can be done using base R functions that don’t require you to load any additional packages.

For the first few examples, we’ll be using the airquality dataset:

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Filtering

In the tidyverse, filtering is done using the filter function:

library(tidyverse)
airquality |> filter(Temp > 92)

  Ozone Solar.R Wind Temp Month Day
1    NA     259 10.9   93     6  11
2    76     203  9.7   97     8  28
3   118     225  2.3   94     8  29
4    84     237  6.3   96     8  30
5    85     188  6.3   94     8  31
6    73     183  2.8   93     9   3
7    91     189  4.6   93     9   4

The base equivalent to filter is subset. In most cases, it’s a direct replacement for filter:

airquality |> subset(Temp > 92)

    Ozone Solar.R Wind Temp Month Day
42     NA     259 10.9   93     6  11
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
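
One reason subset can stand in for filter is that both drop rows where the condition evaluates to NA. Plain bracket indexing does not, which the following sketch illustrates (the object names are made up for illustration):

```r
# Ozone contains missing values, so Ozone > 100 is NA for some rows
with_subset  <- subset(airquality, Ozone > 100)
with_bracket <- airquality[airquality$Ozone > 100, ]

# subset() drops the rows where the condition is NA...
nrow(with_subset)

# ...while bracket indexing keeps them, as rows filled with NA
nrow(with_bracket)

# Wrapping the condition in which() removes the NA rows again
with_which <- airquality[which(airquality$Ozone > 100), ]
nrow(with_which) == nrow(with_subset)
```

Also note that subset keeps the original row names (42, 120, ...), whereas filter renumbers the rows from 1.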

When filtering on multiple conditions with filter, we can separate them using either a comma (,) or an ampersand (&):

library(tidyverse)
airquality |> filter(Temp > 92,
                     Wind < 7)

  Ozone Solar.R Wind Temp Month Day
1   118     225  2.3   94     8  29
2    84     237  6.3   96     8  30
3    85     188  6.3   94     8  31
4    73     183  2.8   93     9   3
5    91     189  4.6   93     9   4

When using subset, we always use an ampersand:

airquality |> subset(Temp > 92 &
                     Wind < 7)

    Ozone Solar.R Wind Temp Month Day
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4

To only keep unique rows, we can use the tidyverse function distinct:

airquality |> distinct() |> nrow()

[1] 153

…or the base function unique:

airquality |> unique() |> nrow()

[1] 153

distinct is better for more advanced use cases, but in the example above, both approaches work equally well.
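
As a sketch of one such use case: distinct(Month, .keep_all = TRUE) keeps the first row for each month. Assuming that is what we are after, the base function duplicated gets us most of the way there (the object names below are just for illustration):

```r
# Keep the first row for each value of Month
# (roughly what distinct(Month, .keep_all = TRUE) does)
first_per_month <- airquality[!duplicated(airquality$Month), ]
first_per_month

# Unique combinations of a subset of the columns
# (roughly what distinct(Month, Day) does)
month_day <- unique(airquality[c("Month", "Day")])
nrow(month_day)
```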

Subsetting

When using tidyverse syntax, we can use select to select a subset of the variables in our data:

airquality |> select(Wind, Temp) |> names()

[1] "Wind" "Temp"

airquality |> select(Wind:Day) |> names()

[1] "Wind"  "Temp"  "Month" "Day"

airquality |> select(starts_with("W")) |> names()

[1] "Wind"

Using base R, we can again use subset:

airquality |> subset(select = c(Wind, Temp)) |> names()

[1] "Wind" "Temp"

airquality |> subset(select = Wind:Day) |> names()

[1] "Wind"  "Temp"  "Month" "Day"

airquality |> subset(select = grepl("^W", names(airquality))) |> names()

[1] "Wind"
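
The select argument of subset also accepts negative indices for dropping columns, much like select(-Ozone, -Solar.R). A short sketch, along with a bracket-based alternative to the grepl call above that uses startsWith:

```r
# Drop columns by name, like select(-Ozone, -Solar.R)
dropped <- airquality |> subset(select = -c(Ozone, Solar.R)) |> names()
dropped

# Columns starting with "W", using plain bracket indexing
w_cols <- airquality[startsWith(names(airquality), "W")] |> names()
w_cols
```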

Because subset is used both for filtering and subsetting, we can do both simultaneously:

airquality |> subset(Temp > 92 &
                     Wind < 7,
                     select = Wind:Day)

    Wind Temp Month Day
121  2.3   94     8  29
122  6.3   96     8  30
123  6.3   94     8  31
126  2.8   93     9   3
127  4.6   93     9   4

I quite like this, as I often want to do both things at once.

Modifying variables

What about modifying variables and creating new ones? Let’s create a new variable containing the temperature measurements on the Celsius scale, and replace the numeric Month variable with month names. A tidyverse approach can look like this:

library(glue)
airquality |>
    mutate(TempC = (Temp-32)/1.8,
           Month = month(ymd(glue("2025-0{mon}-01", mon = Month)), label = TRUE)) |>
    head()

  Ozone Solar.R Wind Temp Month Day    TempC
1    41     190  7.4   67   May   1 19.44444
2    36     118  8.0   72   May   2 22.22222
3    12     149 12.6   74   May   3 23.33333
4    18     313 11.5   62   May   4 16.66667
5    NA      NA 14.3   56   May   5 13.33333
6    28      NA 14.9   66   May   6 18.88889

In base R, we instead use transform (and because we know about base features, we’ll also use month.name for retrieving the month names):

airquality |>
    transform(TempC = (Temp-32)/1.8,
              Month = month.name[Month]) |>
    head()

  Ozone Solar.R Wind Temp Month Day    TempC
1    41     190  7.4   67   May   1 19.44444
2    36     118  8.0   72   May   2 22.22222
3    12     149 12.6   74   May   3 23.33333
4    18     313 11.5   62   May   4 16.66667
5    NA      NA 14.3   56   May   5 13.33333
6    28      NA 14.9   66   May   6 18.88889

No advantage to using mutate here. Indeed, using the base R constant month.name (which you can of course also use with mutate) makes the code more succinct.
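
One difference is worth knowing about, though: mutate lets later expressions refer to variables created earlier in the same call, whereas transform evaluates all its arguments against the original data frame. A minimal sketch of a workaround, chaining two transform calls (TempK is a made-up variable for illustration):

```r
# This would not work as intended: TempC doesn't exist in the data yet
# when TempK is computed, so R throws an error (or silently picks up an
# unrelated TempC object if one happens to exist in the workspace):
# airquality |> transform(TempC = (Temp - 32) / 1.8, TempK = TempC + 273.15)

# Chaining two transform() calls works
aq <- airquality |>
    transform(TempC = (Temp - 32) / 1.8) |>
    transform(TempK = TempC + 273.15)

head(aq[c("Temp", "TempC", "TempK")])
```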

Where mutate really starts to shine is in more advanced use cases, for instance when we want to apply the same transformation to several variables. We may, for instance, want to take the logarithm of all numeric variables.

The solution recommended by the tidyverse developers uses across and where:

airquality |>
    mutate(TempC = (Temp-32)/1.8,
           Month = month(ymd(glue("2025-0{mon}-01", mon = Month)), label = TRUE)) |>
    mutate(across(where(is.numeric), log)) |> 
    head()

     Ozone  Solar.R     Wind     Temp Month       Day    TempC
1 3.713572 5.247024 2.001480 4.204693   May 0.0000000 2.967561
2 3.583519 4.770685 2.079442 4.276666   May 0.6931472 3.101093
3 2.484907 5.003946 2.533697 4.304065   May 1.0986123 3.149883
4 2.890372 5.746203 2.442347 4.127134   May 1.3862944 2.813411
5       NA       NA 2.660260 4.025352   May 1.6094379 2.590267
6 3.332205       NA 2.701361 4.189655   May 1.7917595 2.938574

I prefer mutate_if, which I think is a lot cleaner:

airquality |>
    mutate(TempC = (Temp-32)/1.8,
           Month = month(ymd(glue("2025-0{mon}-01", mon = Month)), label = TRUE)) |>
    mutate_if(is.numeric, log) |> 
    head()

     Ozone  Solar.R     Wind     Temp Month       Day    TempC
1 3.713572 5.247024 2.001480 4.204693   May 0.0000000 2.967561
2 3.583519 4.770685 2.079442 4.276666   May 0.6931472 3.101093
3 2.484907 5.003946 2.533697 4.304065   May 1.0986123 3.149883
4 2.890372 5.746203 2.442347 4.127134   May 1.3862944 2.813411
5       NA       NA 2.660260 4.025352   May 1.6094379 2.590267
6 3.332205       NA 2.701361 4.189655   May 1.7917595 2.938574

In base R, we could for instance use lapply and a custom function. As this creates a list, we’d then have to add a step to turn it into a data frame again:

airquality |>
    transform(TempC = (Temp-32)/1.8,
              Month = month.name[Month]) |>
    lapply(\(x) if(is.numeric(x)) log(x) else x) |> 
    as.data.frame() |> 
    head()

     Ozone  Solar.R     Wind     Temp Month       Day    TempC
1 3.713572 5.247024 2.001480 4.204693   May 0.0000000 2.967561
2 3.583519 4.770685 2.079442 4.276666   May 0.6931472 3.101093
3 2.484907 5.003946 2.533697 4.304065   May 1.0986123 3.149883
4 2.890372 5.746203 2.442347 4.127134   May 1.3862944 2.813411
5       NA       NA 2.660260 4.025352   May 1.6094379 2.590267
6 3.332205       NA 2.701361 4.189655   May 1.7917595 2.938574

Not that difficult for someone with a bit of R programming experience, but a little unwieldy compared to mutate_if. Of course, if we want to do this kind of thing a lot, we could also define our own apply_if function:

apply_if <- function(df, condition_function, func)
{
    df |> 
    lapply(\(x) if(condition_function(x)) func(x) else x) |> 
    as.data.frame()
}

airquality |>
    transform(TempC = (Temp-32)/1.8,
              Month = month.name[Month]) |>
    apply_if(is.numeric, log) |> 
    head()

     Ozone  Solar.R     Wind     Temp Month       Day    TempC
1 3.713572 5.247024 2.001480 4.204693   May 0.0000000 2.967561
2 3.583519 4.770685 2.079442 4.276666   May 0.6931472 3.101093
3 2.484907 5.003946 2.533697 4.304065   May 1.0986123 3.149883
4 2.890372 5.746203 2.442347 4.127134   May 1.3862944 2.813411
5       NA       NA 2.660260 4.025352   May 1.6094379 2.590267
6 3.332205       NA 2.701361 4.189655   May 1.7917595 2.938574
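
A variation on the same idea, sketched under the assumption that we’d rather modify the data frame than rebuild it: assigning to a subset of the columns replaces them in place, so no as.data.frame step is needed and the column names are left exactly as they were. The name apply_if2 is hypothetical.

```r
# Hypothetical variant of apply_if that modifies the columns in place
apply_if2 <- function(df, condition_function, func)
{
    # Find the columns that satisfy the condition
    selected <- vapply(df, condition_function, logical(1))
    # Replace only those columns; all other columns are untouched
    df[selected] <- lapply(df[selected], func)
    df
}

logged <- airquality |>
    transform(Month = month.name[Month]) |>
    apply_if2(is.numeric, log)

head(logged)
```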

Grouped summaries

Now, let’s compute the mean temperature (in Celsius) for the different months. Using tidyverse syntax, we might do something like this:

airquality |> 
    mutate(TempC = (Temp-32)/1.8,
           Month = month(ymd(glue("2025-0{mon}-01", mon = Month)), label = TRUE)) |>
    group_by(Month) |> 
    summarise(MeanTemp = mean(TempC))

# A tibble: 5 × 2
  Month MeanTemp
  <ord>    <dbl>
1 May       18.6
2 Jun       26.2
3 Jul       28.8
4 Aug       28.9
5 Sep       24.9

Using base R, we can use aggregate instead:

airquality |>
    transform(TempC = (Temp-32)/1.8,
              Month = month.name[Month]) |>
    aggregate(TempC ~ Month, mean)

      Month    TempC
1    August 28.87097
2      July 28.83513
3      June 26.16667
4       May 18.63799
5 September 24.94444
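
Note that aggregate sorts the months alphabetically here, because Month is now a character vector. One way around this, sketched below, is to store Month as a factor whose levels follow the calendar (the object names are just for illustration):

```r
# Store Month as a factor with levels in calendar order, so that the
# groups are sorted May, June, ... rather than alphabetically
monthly <- airquality |>
    transform(TempC = (Temp - 32) / 1.8,
              Month = factor(month.name[Month], levels = month.name))

mean_temp <- aggregate(TempC ~ Month, data = monthly, FUN = mean)
mean_temp
```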

Long-to-wide and wide-to-long

Reshaping data is another common task. The Indometh dataset is in long format:

head(Indometh)

  Subject time conc
1       1 0.25 1.50
2       1 0.50 0.94
3       1 0.75 0.78
4       1 1.00 0.48
5       1 1.25 0.37
6       1 2.00 0.19

To reshape it to wide format, we can use the tidyverse function pivot_wider:

Indometh |> 
    pivot_wider(id_cols = Subject,
                names_from = time,
                values_from = conc) ->
    Indometh_wide

Indometh_wide

# A tibble: 6 × 12
  Subject `0.25` `0.5` `0.75`   `1` `1.25`   `2`   `3`   `4`   `5`   `6`   `8`
  <ord>    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         1.5   0.94   0.78  0.48   0.37  0.19  0.12  0.11  0.08  0.07  0.05
2 2         2.03  1.63   0.71  0.7    0.64  0.36  0.32  0.2   0.25  0.12  0.08
3 3         2.72  1.49   1.16  0.8    0.8   0.39  0.22  0.12  0.11  0.08  0.08
4 4         1.85  1.39   1.02  0.89   0.59  0.4   0.16  0.11  0.1   0.07  0.07
5 5         2.05  1.04   0.81  0.39   0.3   0.23  0.13  0.11  0.08  0.1   0.06
6 6         2.31  1.44   1.03  0.84   0.64  0.42  0.24  0.17  0.13  0.1   0.09

And then back to long format using pivot_longer:

Indometh_wide |> 
    pivot_longer(cols = -Subject,
                 names_to = "time",
                 values_to = "conc") |> 
    head()

# A tibble: 6 × 3
  Subject time   conc
  <ord>   <chr> <dbl>
1 1       0.25   1.5 
2 1       0.5    0.94
3 1       0.75   0.78
4 1       1      0.48
5 1       1.25   0.37
6 1       2      0.19

In base R, we can instead use reshape in both directions. From long to wide:

Indometh |> 
    reshape(direction = "wide",
            idvar = "Subject",
            timevar = "time") ->
    Indometh_wide

Indometh_wide

   Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3 conc.4
1        1      1.50     0.94      0.78   0.48      0.37   0.19   0.12   0.11
12       2      2.03     1.63      0.71   0.70      0.64   0.36   0.32   0.20
23       3      2.72     1.49      1.16   0.80      0.80   0.39   0.22   0.12
34       4      1.85     1.39      1.02   0.89      0.59   0.40   0.16   0.11
45       5      2.05     1.04      0.81   0.39      0.30   0.23   0.13   0.11
56       6      2.31     1.44      1.03   0.84      0.64   0.42   0.24   0.17
   conc.5 conc.6 conc.8
1    0.08   0.07   0.05
12   0.25   0.12   0.08
23   0.11   0.08   0.08
34   0.10   0.07   0.07
45   0.08   0.10   0.06
56   0.13   0.10   0.09

…and from wide to long:

Indometh_wide |> 
    reshape(direction = "long") |> 
    head()

       Subject time conc.0.25
1.0.25       1 0.25      1.50
2.0.25       2 0.25      2.03
3.0.25       3 0.25      2.72
4.0.25       4 0.25      1.85
5.0.25       5 0.25      2.05
6.0.25       6 0.25      2.31
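
In the output above, the value column has ended up being called conc.0.25 rather than conc. This seems to happen because reshape wasn’t told the name of the value variable in the long-to-wide step. Here is a sketch of a round trip where we supply it through the v.names argument (and, to be on the safe side, strip the extra classes from Indometh with as.data.frame first):

```r
# Long to wide, naming the value variable explicitly
wide <- Indometh |>
    as.data.frame() |>
    reshape(direction = "wide",
            idvar = "Subject",
            timevar = "time",
            v.names = "conc")

# Wide to long; reshape() reuses the settings stored by the previous call
long <- wide |> reshape(direction = "long")

names(long)
```

The rows of the result are ordered by time rather than by Subject, so an additional sorting step is needed if we want to recover the original row order.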

Similarly, merge from base R can be used instead of the various _join functions from tidyverse.
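
As a minimal sketch of how merge maps onto the join functions, here are two small made-up data frames (the names and values are purely illustrative):

```r
# Two small made-up data frames sharing an id column
measurements <- data.frame(id = c(1, 2, 3), value = c(10.5, 8.2, 7.9))
groups <- data.frame(id = c(2, 3, 4), group = c("a", "b", "a"))

# Like inner_join: only ids present in both data frames
inner <- merge(measurements, groups, by = "id")

# Like left_join: all rows from the first data frame
left <- merge(measurements, groups, by = "id", all.x = TRUE)

# Like full_join: all rows from both data frames
full <- merge(measurements, groups, by = "id", all = TRUE)
```

Note that merge sorts the result on the key column by default (sort = TRUE), whereas the dplyr joins keep the row order of the first data frame.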

Other operations

Other tasks, like ordering a data frame, are done using bracket notation in base R. Using tidyverse syntax, we can sort the airquality data by Temp and then move the Month variable to the leftmost column as follows:

airquality |> 
    arrange(Temp) |> 
    relocate(Month) |> 
    head()

  Month Ozone Solar.R Wind Temp Day
1     5    NA      NA 14.3   56   5
2     5     6      78 18.4   57  18
3     5    NA      66 16.6   57  25
4     5    NA      NA  8.0   57  27
5     5    18      65 13.2   58  15
6     5    NA     266 14.9   58  26

Using only base R, we’d do the following:

airquality |> 
    _[order(airquality$Temp),] |> 
    _[union("Month", names(airquality))] |> 
    head()

   Month Ozone Solar.R Wind Temp Day
5      5    NA      NA 14.3   56   5
18     5     6      78 18.4   57  18
25     5    NA      66 16.6   57  25
27     5    NA      NA  8.0   57  27
15     5    18      65 13.2   58  15
26     5    NA     266 14.9   58  26

A clear win for the tidyverse in this case, I think. The union step of the base solution is particularly headache-inducing.
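
For what it’s worth, the two bracket steps can be collapsed into a single indexing call, which also avoids the _[...] placeholder syntax (which appears to require R 4.3 or later). A sketch, with sorted as a made-up object name:

```r
# Sort the rows by Temp and put Month first, in a single indexing step
sorted <- airquality[order(airquality$Temp),
                     union("Month", names(airquality))]
head(sorted)
```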

Why use base R instead of the tidyverse?

Over the years, I’ve met a lot of R users who know very little about base R functions, and even a small number who brag about only using tidyverse syntax, as if it somehow makes them better R users. Even though I’m an avid tidyverse user, I find this a little strange. Base R contains a lot of gems, and it certainly doesn’t hurt to know a few of them. In some cases, they are cleaner than pure tidyverse solutions. Consider the convert-month-number-to-name example we looked at earlier. A colleague of mine wrote this code (albeit for a different dataset) a few years ago:

airquality |> 
    mutate(Month = month(ymd(glue("2025-0{mon}-01", mon = Month)), label = TRUE))

It requires us to load three packages: dplyr (for mutate), lubridate (for month and ymd), and glue. Alternatively, we could just use the much cleaner base solution without having to load any packages at all:

airquality |> transform(Month = month.name[Month])

In some cases, I’d argue that “why use base R instead of the tidyverse?” is probably the wrong question to ask. This is particularly true when developing packages, where having fewer dependencies is an advantage. 90% of the data wrangling I do using the tidyverse makes use of filter, select, mutate, pivot_wider and pivot_longer. If I only need those verbs, I might as well use base R. Instead of asking “why use base R instead of the tidyverse?” we should sometimes ask ourselves why we’d want to load more packages than necessary.

What do I use these days then? Base R and the tidyverse, of course. Knowing both lets me select the right tool for each task. Sometimes base R is quicker and cleaner, and sometimes tidyverse is. This is also the reason why I include both (along with other approaches, like data.table) in Modern Statistics with R. For statisticians and programmers alike, it is always better to have a larger toolbox.