regex - Extract pattern from string, strip text, convert to numeric and sum in R data.table? - Stack Overflow
Extract pattern from string, strip text, convert to numeric and sum in R data.table?
I have a data.table mydata (100k rows) with one column that looks like this:
library(data.table)
library(stringr)
How do I efficiently, preferably in one line of code, pull out all the numbers that precede an M (they can have varying numbers of digits), convert them to
numeric, and find their sum?
I have managed to do this with three rounds of sapply, creating some additional columns that I don't need.
You did not provide a reproducible example, so here is a pseudo code answer in a comment :) f = function(x)
unlist(lapply(strsplit(x, "M"), `[[`, 1L)) then dt[, .(col = f(col))]; the sum shouldn't be a problem. –
jangorecki Nov 27 '15 at 0:59
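As written, the f() in the comment keeps only the first piece of each split string. A minimal base R sketch of how the strsplit idea could be carried through to a sum follows; it assumes every string ends in "M" (true for the two sample strings in this thread), and the variable names are hypothetical.

```r
x <- c("17M1I26M570M20S1M", "17M1I260M570M20S1M")

# Split each string on the literal "M"; strsplit() returns one
# character vector per input string, e.g. c("17", "1I26", "570", "20S1").
pieces <- strsplit(x, "M", fixed = TRUE)

# What the comment's f() returns: only the first piece of each string.
first_piece <- unlist(lapply(pieces, `[[`, 1L))

# Assumption: each string ends in "M", so every piece ends with the
# digits that preceded an "M". Drop everything up to the last non-digit
# in each piece, convert to numeric, and sum per string.
totals <- vapply(
  pieces,
  function(p) sum(as.numeric(sub(".*\\D", "", p))),
  numeric(1)
)
```

If a string could end in digits without a trailing "M", this approach would count those digits too, so a pattern that checks for the "M" explicitly is safer.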
2 Answers
You can write a function that grabs the sum of the numbers appearing before every instance of the
letter M in the string, then use it to create a column in your data.table.
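The answer's own code is not reproduced here. A minimal sketch of one way to write such a function, assuming the column is named A and the new column c as in the printed output, and using the lookahead pattern suggested in the comments; the helper name sum_before_m is hypothetical.

```r
library(data.table)
library(stringr)

# Sample data built from the two strings shown in this thread.
mydata <- data.table(A = c("17M1I26M570M20S1M", "17M1I260M570M20S1M"))

# Hypothetical helper: for each string, sum the digit runs that are
# immediately followed by "M".
sum_before_m <- function(x) {
  # str_extract_all() returns a list with one character vector per string;
  # the lookahead (?=M) requires an "M" without including it in the match.
  matches <- str_extract_all(x, "\\d+(?=M)")
  vapply(matches, function(m) sum(as.numeric(m)), numeric(1))
}

# Add the totals as a new column by reference.
mydata[, c := sum_before_m(A)]
```

A string with no digits before an M gets a total of 0 here, since the sum of an empty numeric vector is 0.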
mydata
# A c
# 1: 17M1I26M570M20S1M 614
# 2: 17M1I260M570M20S1M 848
Edit: Changed regex using @thelatemail's suggestion in comments for more efficient
matching.
https://stackoverflow.com/questions/33948847/extract-pattern-from-string-strip-text-convert-to-numeric-and-sum-in-r-data-ta 1/2
9/17/2017 regex - Extract pattern from string, strip text, convert to numeric and sum in R data.table? - Stack Overflow
ialm
Using str_match_all(x, "\\d+(?=M)") will remove the need to subset later, and will store less data in the
intermediate matches variable. – thelatemail Nov 27 '15 at 1:38
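To illustrate the comment, here is a small sketch comparing a capture-group pattern with the lookahead. The capture-group pattern "(\\d+)M" is an assumption about what an earlier version might have looked like, not something taken from the answer.

```r
library(stringr)

x <- "17M1I26M570M20S1M"

# With a capture group, each match is a two-column matrix: column 1 holds
# the full match (e.g. "17M") and column 2 the captured digits, so the
# digits have to be subset out afterwards.
with_group <- str_match_all(x, "(\\d+)M")[[1]]
total_group <- sum(as.numeric(with_group[, 2]))

# With a lookahead, the "M" is checked but not consumed, so the match is
# just the digits and only one column is stored.
with_lookahead <- str_match_all(x, "\\d+(?=M)")[[1]]
total_lookahead <- sum(as.numeric(with_lookahead[, 1]))
```

Both totals agree; the lookahead version simply carries half as many columns in the intermediate result.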
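library(dplyr)
library(tidyr)
library(stringi)
library(rex)

# One token = a run of digits followed by a single letter, e.g. "17M"
regex_1 =
  rex(capture(digits),
      capture(letter))

# tibble() replaces the deprecated data_frame()
data =
  tibble(
    a = c("17M1I26M570M20S1M",
          "17M1I260M570M20S1M"))

# One row per token, split into its number and its letter
key =
  data %>%
  select(a) %>%
  distinct() %>%
  mutate(match_list =
           a %>%
           stri_extract_all_regex(regex_1)) %>%
  unnest(match_list) %>%
  extract(match_list,
          c("number", "letter"),
          regex_1) %>%
  group_by(a) %>%
  mutate(order = 1:n(),
         number = as.numeric(number))

# Keep only the numbers that precede an M, then total them per string
key %>%
  filter(letter == "M") %>%
  group_by(a) %>%
  summarize(total = sum(number)) %>%
  right_join(data, by = "a")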