Did you know that common tidyverse verbs like filter, select, and mutate have base R equivalents? In this post, I’ll show how many of the data wrangling tasks frequently done with the tidyverse can be done with base R functions that don’t require you to load any additional packages.
For the first few examples, we’ll be using the airquality dataset:
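A minimal sketch of getting started: airquality ships with base R, and subset can filter rows and select columns in a single call. The threshold and the columns picked below are arbitrary choices for illustration.

```r
# airquality is part of the datasets package that comes with base R
data(airquality)
head(airquality)

# Filter rows and select columns in one call
# (Temp > 90 and the three columns are illustrative assumptions)
hot_days <- subset(airquality, Temp > 90, select = c(Month, Day, Temp))
hot_days
```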
I quite like this, as I often want to do both things at once.
Modifying variables
What about modifying variables and creating new ones? Let’s create a new variable containing the temperature measurements on the Celsius scale, and replace the numeric Month variable with month names. A tidyverse approach can look like this:
Ozone Solar.R Wind Temp Month Day TempC
1 41 190 7.4 67 May 1 19.44444
2 36 118 8.0 72 May 2 22.22222
3 12 149 12.6 74 May 3 23.33333
4 18 313 11.5 62 May 4 16.66667
5 NA NA 14.3 56 May 5 13.33333
6 28 NA 14.9 66 May 6 18.88889
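For comparison, a base R sketch that should give the same result, using transform and the built-in month.name constant. The name aq is just a placeholder for this example.

```r
# transform() replaces existing columns in place and appends new ones
aq <- transform(airquality,
                TempC = (Temp - 32) * 5 / 9,   # Fahrenheit to Celsius
                Month = month.name[Month])     # month number used as an index into month.name
head(aq)
```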
There is no advantage to using mutate here. Indeed, using the base R constant month.name (which you can of course also use with mutate) makes the code more succinct.
Where mutate really starts to shine is in more advanced use cases, for instance when we want to apply the same transformation to several variables. We may, for instance, want to take the logarithm of all numeric variables.
The solution recommended by the tidyverse developers uses across and where:
Ozone Solar.R Wind Temp Month Day TempC
1 3.713572 5.247024 2.001480 4.204693 May 0.0000000 2.967561
2 3.583519 4.770685 2.079442 4.276666 May 0.6931472 3.101093
3 2.484907 5.003946 2.533697 4.304065 May 1.0986123 3.149883
4 2.890372 5.746203 2.442347 4.127134 May 1.3862944 2.813411
5 NA NA 2.660260 4.025352 May 1.6094379 2.590267
6 3.332205 NA 2.701361 4.189655 May 1.7917595 2.938574
In base R, we could for instance use lapply and a custom function. As this creates a list, we’d then have to add a step to turn it into a data frame again:
Ozone Solar.R Wind Temp Month Day TempC
1 3.713572 5.247024 2.001480 4.204693 May 0.0000000 2.967561
2 3.583519 4.770685 2.079442 4.276666 May 0.6931472 3.101093
3 2.484907 5.003946 2.533697 4.304065 May 1.0986123 3.149883
4 2.890372 5.746203 2.442347 4.127134 May 1.3862944 2.813411
5 NA NA 2.660260 4.025352 May 1.6094379 2.590267
6 3.332205 NA 2.701361 4.189655 May 1.7917595 2.938574
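A minimal sketch of that lapply approach. The helper name log_if_numeric and the object names are made up for this illustration.

```r
# Recreate the modified data: Celsius temperatures and month names
aq <- transform(airquality,
                TempC = (Temp - 32) * 5 / 9,
                Month = month.name[Month])

# Custom function: take the log of numeric columns, leave everything else alone
log_if_numeric <- function(x) if (is.numeric(x)) log(x) else x

# lapply() returns a list, so convert the result back to a data frame
aq_log <- as.data.frame(lapply(aq, log_if_numeric))
head(aq_log)
```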
Not that difficult for someone with a bit of R programming experience, but a little unwieldy compared to mutate_if. Of course, if we want to do this kind of thing a lot, we could also define our own apply_if function:
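One way such a function could look. The argument names and their order are assumptions made for this sketch, not an established API.

```r
# Apply f to every column of data for which condition() returns TRUE
# (hypothetical helper; the signature is an assumption)
apply_if <- function(data, condition, f, ...) {
  stopifnot(is.data.frame(data))
  # One TRUE/FALSE per column
  selected <- vapply(data, condition, logical(1))
  if (any(selected)) {
    data[selected] <- lapply(data[selected], f, ...)
  }
  data  # still a data frame, so no conversion step is needed
}

aq <- transform(airquality,
                TempC = (Temp - 32) * 5 / 9,
                Month = month.name[Month])
head(apply_if(aq, is.numeric, log))
```

Because only the selected columns are assigned into, the result keeps its data frame class and column order.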
Similarly, merge from base R can be used instead of the various _join functions from the tidyverse.
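A small sketch of how merge maps onto the join types. The two tables, monthly and month_lookup, are constructed here purely for illustration.

```r
# Mean temperature per month (airquality covers months 5 to 9)
monthly <- aggregate(Temp ~ Month, data = airquality, FUN = mean)

# Lookup table with all twelve months
month_lookup <- data.frame(Month = 1:12, MonthName = month.name)

# Inner join: only keys present in both tables
inner <- merge(monthly, month_lookup, by = "Month")

# Left join: keep every row of the first table (all.x = TRUE)
left <- merge(month_lookup, monthly, by = "Month", all.x = TRUE)

# Full join: keep every row of both tables (all = TRUE)
full <- merge(month_lookup, monthly, by = "Month", all = TRUE)

inner
left
```

Note that merge sorts the result by the key columns by default, so the row order can differ from what the corresponding _join function returns.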
Other operations
Other tasks, like ordering a data frame, are done using bracket notation in base R. Using tidyverse syntax, we can sort the airquality data by Temp and then move the Month variable to the leftmost column of the data frame as follows:
Month Ozone Solar.R Wind Temp Day
5 5 NA NA 14.3 56 5
18 5 6 78 18.4 57 18
25 5 NA 66 16.6 57 25
27 5 NA NA 8.0 57 27
15 5 18 65 13.2 58 15
26 5 NA 266 14.9 58 26
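A base R sketch of the same operation with bracket notation: order sorts the rows, and union puts Month first without repeating it. The object name sorted is a placeholder.

```r
# Rows: sorted by Temp. Columns: "Month" first, then the rest.
# union() keeps the first occurrence of each name, so Month is not duplicated.
sorted <- airquality[order(airquality$Temp),
                     union("Month", names(airquality))]
head(sorted)
```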
A clear win for the tidyverse in this case, I think. The union step of the base solution is particularly headache-inducing.
Why use base R instead of the tidyverse?
Over the years, I’ve met a lot of R users who know very little about base R functions, and even a small number who brag about only using tidyverse syntax, as if it somehow makes them better R users. Even though I’m an avid tidyverse user, I find this a little strange. Base R contains a lot of gems, and it certainly doesn’t hurt to know a few of them. In some cases, they are cleaner than pure tidyverse solutions. Consider the convert-month-number-to-name example we looked at earlier. A colleague of mine wrote this code (albeit for a different dataset) a few years ago:
It requires us to load three packages: dplyr (for mutate), lubridate (for month and ymd), and glue. Alternatively, we could just use the much cleaner base solution without having to load any packages at all:
airquality |> transform(Month = month.name[Month])
In some cases, I’d argue that “why use base R instead of the tidyverse?” is probably the wrong question to ask. This is particularly true when developing packages, where having fewer dependencies is an advantage. About 90% of the data wrangling I do using the tidyverse makes use of filter, select, mutate, pivot_wider and pivot_longer. If I only need those verbs, I might as well use base R. Instead of asking “why use base R instead of the tidyverse?” we should sometimes ask ourselves why we’d want to load more packages than necessary.
What do I use these days then? Base R and the tidyverse, of course. Knowing both lets me select the right tool for each task. Sometimes base R is quicker and cleaner, and sometimes the tidyverse is. This is also the reason why I include both (along with other approaches, like data.table) in Modern Statistics with R. For statisticians and programmers alike, it is always better to have a larger toolbox.