CRAN Recipes: DPLYR, Stringr, Lubridate, and RegEx in R
About this ebook
Want to use the power of R sooner rather than later? Don’t have time to plow through wordy texts and online manuals? Use this book for quick, simple code to get your projects up and running. It includes code and examples applicable to many disciplines. Written in everyday language with a minimum of complexity, each chapter provides the building blocks you need to fit R’s astounding capabilities to your analytics, reporting, and visualization needs.
CRAN Recipes recognizes how needless jargon and complexity get in your way. Busy professionals need simple examples and intuitive descriptions; side trips and meandering philosophical discussions are left for other books.
Here R scripts are condensed, to the extent possible, to copy-paste-run format. Chapters and examples are structured by purpose rather than by particular functions (e.g., “dirty data cleanup” rather than the R package name “janitor”). Everyday language eliminates the need to know functions/packages in advance.
What You Will Learn
- Carry out input/output; visualizations; data munging; manipulations at the group level; and quick data exploration
- Handle forecasting (multivariate, time series, logistic regression, Facebook’s Prophet, and others)
- Use text analytics; sampling; financial analysis; and advanced pattern matching (regex)
- Manipulate data using DPLYR: filter, sort, summarize, add new fields to datasets, and apply powerful IF functions
- Create combinations or subsets of files using joins
- Write efficient code using pipes to eliminate intermediate steps (MAGRITTR)
- Work with string/character manipulation of all types (STRINGR)
- Discover counts, patterns, and how to locate whole words
- Do wild-card matching, extraction, and invert-match
- Work with dates using LUBRIDATE
- Fix dirty data, apply attractive formatting, and avoid bad habits
Who This Book Is For
Programmers/data scientists with at least some prior exposure to R.
CRAN Recipes - William Yarberry
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
W. Yarberry, CRAN Recipes, https://doi.org/10.1007/978-1-4842-6876-6_1
1. DPLYR
William Yarberry, Kingwood, TX, USA
DPLYR is one of my favorite R packages. Its logical and consistent rules replace the older, motley collection of syntactically inconsistent packages and functions. It’s like a Swiss Army knife in the woods—don’t leave home without it.
Most of the book’s code examples use built-in R datasets or toy dataframes hard-coded into the program. For practice, you should substitute your own data when running the snippets of code.
1.1 Filter Commands
The filter command eliminates rows (records) you do not want. The following commands use built-in datasets as the input dataframe. The first example uses the dataset mtcars; its output shows cars with six cylinders only.
Note
The libraries loaded in the following code are used in all examples unless otherwise noted. DPLYR is included in the mega-package tidyverse.
1.1.1 Single-Condition Filter
library(tidyverse)
data(mtcars)
#select only cars with six cylinders
six.cyl.only <- filter(mtcars, cyl == 6)
six.cyl.only
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
In the filter command, equality is tested with a double equals sign (==).
1.1.2 Multiple-Condition Filter
Filter the dataset mtcars for both six cylinders and 110 horsepower:
six.cylinders.and.110.horse.power <- filter(mtcars, cyl == 6,
hp == 110)
six.cylinders.and.110.horse.power
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
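Conditions separated by commas are combined with a logical AND, so the same filter can also be written with the & operator. The following is a minimal sketch comparing the two forms; the object names with.comma and with.ampersand are illustrative and not from the book:

```r
library(dplyr)

# Comma-separated conditions: every condition must be TRUE for a row to be kept
with.comma <- filter(mtcars, cyl == 6, hp == 110)

# The same logic written explicitly with the AND operator
with.ampersand <- filter(mtcars, cyl == 6 & hp == 110)

# Compare the two results; all.equal() returns TRUE when they match
all.equal(with.comma, with.ampersand)
```

Which form to use is mostly a matter of taste; commas tend to read more cleanly when there are many conditions.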
1.1.3 OR Logic for Filtering
You can use as many OR symbols (pipe |) as needed.
Filter based on the OR logical operator:
gear.eq.4.or.cyl.more.than.6 <- filter(mtcars, gear == 4 | cyl > 6)
gear.eq.4.or.cyl.more.than.6
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
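OR conditions can be combined with AND conditions in the same filter. As a hedged sketch (the object name four.cyl.4.or.5.gears is made up for illustration), the parentheses below group the OR test so that it is evaluated as one condition alongside the cylinder test:

```r
library(dplyr)

# Keep four-cylinder cars that have either four or five gears.
# The parentheses make the grouping of the OR explicit.
four.cyl.4.or.5.gears <- filter(mtcars, (gear == 4 | gear == 5), cyl == 4)

# Cross-tabulate cylinders by gears to confirm which combinations remain
table(four.cyl.4.or.5.gears$cyl, four.cyl.4.or.5.gears$gear)
```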
1.1.4 Filter by Minimums, Maximums, and Other Numeric Criteria
The output shows, as one would expect, a single row with the smallest engine displacement:
smallest.engine.displacement <- filter(mtcars, disp ==
min(disp))
smallest.engine.displacement
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
Filter with conditions separated by commas:
data(ChickWeight)
chick.subset <- filter(ChickWeight, Time < 3, weight > 53)
chick.subset
## weight Time Chick Diet
## 1 55 2 22 2
## 2 55 2 40 3
## 3 55 2 43 4
## 4 54 2 50 4
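Other summary functions work the same way as min() inside a filter. The sketch below, with illustrative object names, filters on the column mean and on a numeric range using dplyr’s between(), which includes both endpoints:

```r
library(dplyr)

# Cars whose fuel economy is above the dataset average
above.average.mpg <- filter(mtcars, mpg > mean(mpg))

# Cars whose horsepower falls in a closed range (both ends included)
mid.range.hp <- filter(mtcars, between(hp, 100, 150))

# Number of rows kept by each filter
nrow(above.average.mpg); nrow(mid.range.hp)
```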
1.1.5 Filter Out Missing Values (NAs) for a Specific Column
The built-in dataset airquality has a missing value in the fifth row of the first column (Ozone):
data(airquality)
head(airquality,10) #before filter
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
Remove any row with missing values in the Ozone column:
no.missing.ozone <- filter(airquality, !is.na(Ozone))
head(no.missing.ozone,8) #after filter
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 28 NA 14.9 66 5 6
## 6 23 299 8.6 65 5 7
## 7 19 99 13.8 59 5 8
## 8 8 19 20.1 61 5 9
Note that although the row with NA for Ozone has been eliminated, the row with an NA for Solar.R is still there.
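One way to check a filter like this is to count the missing values beforehand and compare that count with the number of rows removed. This is a small sketch using only base R helpers alongside filter:

```r
library(dplyr)

# Count missing values in each column before filtering
colSums(is.na(airquality))

no.missing.ozone <- filter(airquality, !is.na(Ozone))

# The number of rows removed should equal the number of NAs in Ozone
nrow(airquality) - nrow(no.missing.ozone)
sum(is.na(airquality$Ozone))
```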
1.1.6 Filter Rows with NAs Anywhere in the Dataset
Use complete.cases() to remove any rows containing an NA in any column:
airqual.no.NA.anywhere <- filter(airquality[1:10,],
complete.cases(airquality[1:10,]))
airqual.no.NA.anywhere
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 23 299 8.6 65 5 7
## 6 19 99 13.8 59 5 8
## 7 8 19 20.1 61 5 9
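The same result can probably be reached with less typing. The sketch below stores the ten-row subset once (first.ten is an illustrative name) and also shows drop_na() from the tidyr package, which is part of the tidyverse, as an alternative:

```r
library(dplyr)
library(tidyr)

# Store the ten-row subset once so it is not repeated inside the call
first.ten <- airquality[1:10, ]

via.complete.cases <- filter(first.ten, complete.cases(first.ten))

# drop_na() with no column arguments drops rows with an NA in any column
via.drop.na <- drop_na(first.ten)

# Both approaches should keep the same number of rows
nrow(via.complete.cases); nrow(via.drop.na)
```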
1.1.7 Filter by %in%
%in% is a powerful operator, providing a convenient shorthand for including/excluding specified values:
data(iris)
table(iris$Species) #counts of species in the dataset
##
## setosa versicolor virginica
## 50 50 50
iris.two.species <- filter(iris,
  Species %in% c("setosa", "virginica"))
table(iris.two.species$Species)
##
## setosa versicolor virginica
## 50 0 50
Show the number of rows before and after filtering:
nrow(iris); nrow(iris.two.species)
## [1] 150
## [1] 100
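To exclude values rather than include them, the %in% test can be negated with !. The sketch below (iris.other.species is an illustrative name) also applies base R’s droplevels() so that the emptied species no longer appear with a count of zero in the table:

```r
library(dplyr)

# Invert the match: keep every species except the two listed
iris.other.species <- filter(iris, !(Species %in% c("setosa", "virginica")))

# droplevels() removes factor levels that no longer occur in the data
iris.other.species <- droplevels(iris.other.species)
table(iris.other.species$Species)
```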
1.1.8 Filter for Ozone > 29 and Include Only Three Columns
data(airquality)
airqual.3.columns <- filter(airquality, Ozone > 29)[,1:3]
head(airqual.3.columns)
## Ozone Solar.R Wind
## 1 41 190 7.4
## 2 36 118 8.0
## 3 34 307 12.0
## 4 30 322 11.5
## 5 32 92 12.0
## 6 45 252 14.9
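Selecting columns by position works, but it can break silently if the column order changes. As a sketch of an alternative, the same result can be written as a pipeline that names the columns (the object name airqual.3.columns.by.name is illustrative):

```r
library(dplyr)

airqual.3.columns.by.name <- airquality %>%
  filter(Ozone > 29) %>%          # keep rows with Ozone above 29
  select(Ozone, Solar.R, Wind)    # keep three columns, chosen by name

head(airqual.3.columns.by.name)
```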
1.1.9 Filter by Total Frequency of a Value Across All Rows
This logic uses group_by to enable counting of rows based on number of gears. After the counts are made, only rows belonging to a gear group with more than ten rows are included in the output. In other words, all you want to see here are records whose number of gears appears in at least 11 rows. The filter is driven solely by frequency of occurrence. Your question might be phrased as “just show me records where common gear configurations occur.”
Five gears are not nearly as common as three and four, so in the filtered dataframe, they are omitted. In the first table that follows, there are 15 records with a car having three gears, 12 records for four gears, and five records for five gears. After applying the filter and creating a new dataframe, there are no records having five gears:
table(mtcars$gear)
##
## 3 4 5
## 15 12 5
more.frequent.no.of.gears <- mtcars %>%
group_by(gear) %>%
filter(n() > 10) #keep only gear groups with more than ten rows
table(more.frequent.no.of.gears$gear)
##
## 3 4
## 15 12
Additional criteria can be added to the filter by including a requirement that the horsepower be less than 105:
more.frequent.no.of.gears.and.low.horsepower <- mtcars %>%
group_by(gear) %>%
filter(n() > 10, hp < 105)
table(more.frequent.no.of.gears.and.low.horsepower$gear)
##
## 3 4
## 1 7
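An alternative that avoids leaving the result grouped is to attach the group size as an ordinary column and filter on it. This is a sketch using dplyr’s count() and add_count(); the object name frequent.gears is illustrative:

```r
library(dplyr)

# count() gives the same tally as table(), but as a dataframe
count(mtcars, gear)

# add_count() appends the group size as a column named n,
# which can then be used in an ordinary filter
frequent.gears <- mtcars %>%
  add_count(gear) %>%
  filter(n > 10)

table(frequent.gears$gear)
```

The extra n column stays in the output, which can be handy for checking the result; drop it with select(-n) if it is not wanted.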
1.1.10 Filter by Column Name Using starts_with
In this code, columns (rather than rows) are selected where the column name starts with an “S”. Note that the function used is select, not filter:
names(iris) #show the column names
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
iris.display <- iris %>% dplyr::select(starts_with("S"))
head(iris.display) #use head to reduce number of rows output
## Sepal.Length Sepal.Width Species
## 1 5.1 3.5 setosa
## 2 4.9 3.0 setosa
## 3 4.7 3.2 setosa
## 4 4.6 3.1 setosa
## 5 5.0 3.6 setosa
## 6 5.4 3.9 setosa
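starts_with has sibling helpers that match other parts of a column name. A brief sketch with ends_with and contains follows; the object names are illustrative:

```r
library(dplyr)

# Columns whose names end with "Width"
iris.widths <- iris %>% select(ends_with("Width"))
names(iris.widths)

# Columns whose names contain "Petal" anywhere
iris.petals <- iris %>% select(contains("Petal"))
names(iris.petals)
```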
1.1.11 Filter Rows: Columns Meet Criteria (filter_at)
Use filter_at to find rows in which all of the chosen columns meet a criterion, such as equaling the column maximum:
new.mtcars <- mtcars %>% filter_at(vars(cyl, hp),
all_vars(. == max(.)))
new.mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Maserati Bora 15 8 301 335 3.54 3.57 14.6 0 1 5 8
Note that only one car, the Maserati Bora, has both the maximum number of cylinders and the maximum horsepower.
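Recent versions of dplyr (1.0.4 and later, as far as I know) offer if_all() as a replacement for the filter_at/all_vars pairing. Assuming such a version is installed, the mtcars example might be written as follows; the object name new.mtcars.if.all is illustrative:

```r
library(dplyr)

# Keep rows where cyl AND hp both equal their own column maximum
new.mtcars.if.all <- mtcars %>%
  filter(if_all(c(cyl, hp), ~ .x == max(.x)))

new.mtcars.if.all
```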
Another example dataset comes from Suzan Baert’s blog (https://suzan.rbind.io/2018/02/dplyr-tutorial-3/#filter-at), using sleep study research.
Load the msleep dataframe from the package ggplot2:
msleep <- ggplot2::msleep
msleep
## # A tibble: 83 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chee~ Acin~ carni Carn~ lc 12.1 NA NA 11.9
## 2 Owl ~ Aotus omni Prim~ <NA> 17 1.8 NA 7
## 3 Moun~ Aplo~ herbi Rode~ nt 14.4 2.4 NA 9.6
## 4 Grea~ Blar~ omni Sori~ lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti~ domesticated 4 0.7 0.667 20
## 6 Thre~ Brad~ herbi Pilo~ <NA> 14.4 2.2 0.767 9.6
## 7 Nort~ Call~ carni Carn~ vu 8.7 1.4 0.383 15.3
## 8 Vesp~ Calo~ <NA> Rode~ <NA> 7 NA NA 17
## 9 Dog Canis carni Carn~ domesticated 10.1 2.9 0.333 13.9
## 10 Roe ~ Capr~ herbi Arti~ lc 3 NA NA 21
## # ... with 73 more rows, and 2 more variables: brainwt <dbl>, bodywt <dbl>
msleep.over.5 <- msleep %>%
select(name, sleep_total:sleep_rem, brainwt:bodywt) %>%
  filter_at(vars(contains("sleep")), all_vars(. > 5))
msleep.over.5
## # A tibble: 2 x 5
## name sleep_total sleep_rem brainwt bodywt
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Thick-tailed opposum 19.4 6.6 NA 0.37
## 2 Giant armadillo 18.1 6.1 0.081 60
For the preceding code, ignore the select statement for the moment (covered later). The filter_at function says to look only at variables whose names contain the word sleep. Within those variables (in this case, two of them), all_vars keeps only the rows where every value is greater than 5. The . is a placeholder for each variable with sleep in its name. Only two rows met the criteria for the filter in this case.
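The counterpart of all_vars is any_vars, which keeps a row when at least one of the chosen columns passes the test. The following sketch contrasts the two on the sleep columns; the object names sleep.columns, all.over.5, and any.over.5 are illustrative:

```r
library(dplyr)

msleep <- ggplot2::msleep

# Keep the animal name plus the two sleep columns
sleep.columns <- msleep %>% select(name, sleep_total:sleep_rem)

# all_vars(): every selected column must exceed 5
all.over.5 <- sleep.columns %>%
  filter_at(vars(contains("sleep")), all_vars(. > 5))

# any_vars(): at least one selected column must exceed 5
any.over.5 <- sleep.columns %>%
  filter_at(vars(contains("sleep")), any_vars(. > 5))

# The any_vars result can never have fewer rows than the all_vars result
nrow(all.over.5); nrow(any.over.5)
```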
1.2 Arrange (Sort)
Arrange, the sorting function, performs a task as old as the alphabet. It rearranges the rows of a dataframe in ascending or descending sequence; character columns are sorted alphabetically. Sort keys are defined as primary, secondary, and so on, in the order you list them.
Load the msleep dataframe from the package ggplot2:
msleep <- ggplot2::msleep
msleep[,1:4]
## # A tibble: 83 x 4
## name genus vore order
## <chr> <chr> <chr> <chr>
## 1 Cheetah Acinonyx carni Carnivora
## 2 Owl monkey Aotus omni Primates
## 3 Mountain beaver Aplodontia herbi Rodentia
## 4 Greater short-tailed shrew Blarina omni Soricomorpha
## 5 Cow Bos herbi Artiodactyla
## 6 Three-toed sloth Bradypus herbi Pilosa
## 7 Northern fur seal Callorhinus carni Carnivora
## 8 Vesper mouse Calomys <NA> Rodentia
## 9 Dog Canis carni Carnivora
## 10 Roe deer Capreolus herbi Artiodactyla
## # ... with 73 more rows
1.2.1 Ascending
animal.name.sequence <- arrange(msleep, vore, order)
animal.name.sequence[,1:4]
## # A tibble: 83 x 4
## name genus vore order
## <chr> <chr> <chr> <chr>
## 1 Cheetah Acinonyx carni Carnivora
## 2 Northern fur seal Callorhinus carni Carnivora
## 3 Dog Canis carni Carnivora
## 4 Domestic cat Felis carni Carnivora
## 5 Gray seal Halichoerus carni Carnivora
## 6 Tiger Panthera carni Carnivora
## 7 Jaguar Panthera carni Carnivora
## 8 Lion Panthera carni Carnivora
## 9 Caspian seal Phoca carni Carnivora
## 10 Genet Genetta carni Carnivora
## # ... with 73 more rows
1.2.2 Descending
animal.name.sequence.desc <- arrange(msleep, vore, desc(order))
head(animal.name.sequence.desc[,1:4])
## # A tibble: 6 x 4
## name genus vore order
## <chr> <chr> <chr> <chr>
## 1 Northern grasshopper mouse Onychomys carni Rodentia
## 2 Slow loris Nyctibeus carni Primates
## 3 Thick-tailed opposum Lutreolina carni Didelphimorphia
## 4 Long-nosed armadillo Dasypus carni Cingulata
## 5 Pilot whale Globicephalus carni Cetacea
## 6 Common porpoise Phocoena carni Cetacea
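Numeric columns sort the same way as character columns. As a small sketch with mtcars (cars.sorted is an illustrative name), the primary key below is ascending and the secondary key is descending:

```r
library(dplyr)

# Primary key: cyl ascending; secondary key: mpg descending within each cyl
cars.sorted <- arrange(mtcars, cyl, desc(mpg))

# Show only the two sort columns for the first few rows
head(cars.sorted[, c("mpg", "cyl")])
```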
In the section “Mutate,” you’ll see how a variable can be created on the fly and then used in the same statement for sorting.
1.3 Rename
Rename allows you to change the name of one or more columns. It is a convenience function and changes no data.
Rename one or more columns in a dataset:
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Show new column names:
renamed.iris <- rename(iris, width.of.petals = Petal.Width,
various.plants.and.animals = Species)
names(renamed.iris)
## [1] "Sepal.Length" "Sepal.Width"
## [3] "Petal.Length" "width.of.petals"
## [5] "various.plants.and.animals"
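When many columns need the same kind of change, renaming them one by one gets tedious. Assuming dplyr 1.0.0 or later, rename_with() applies a function to the column names; the sketch below replaces the period in each name with an underscore (iris.underscores is an illustrative name):

```r
library(dplyr)

# Replace the period in every column name with an underscore.
# fixed = TRUE treats "." as a literal period, not a regex wildcard.
iris.underscores <- rename_with(iris, ~ gsub(".", "_", .x, fixed = TRUE))
names(iris.underscores)
```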
1.4 Mutate
Mutate adds new variables to a dataframe. It requires the original dataframe as the first argument and then arguments to create new variables as the remaining arguments. The following example adds the base-10 logarithm of weight to the ChickWeight dataframe.
Add a new, calculated variable to a dataframe:
data(ChickWeight)
ChickWeight[1:2,] #first two rows
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
First two rows, with new field added:
Chickweight.with.log <- mutate(ChickWeight,
log.of.weight = log10(weight))
Chickweight.with.log[1:2,]
## weight Time Chick Diet log.of.weight
## 1 42 0 1 1 1.623249
## 2 51 2 1 1 1.707570
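Several variables can be created in one mutate call, and a later variable may refer to one defined earlier in the same call. A hedged sketch follows, with illustrative column names and assuming weight is recorded in grams, as the ChickWeight help page states:

```r
library(dplyr)

chick.with.kg <- mutate(ChickWeight,
  weight.kg = weight / 1000,          # grams to kilograms
  log.weight.kg = log10(weight.kg))   # uses the column created just above

chick.with.kg[1:2, ]
```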