Regex - Extract Pattern From String, Strip Text, Convert To Numeric and Sum in R Data

This document reproduces a Stack Overflow question about efficiently extracting the numbers that precede the letter "M" in strings such as "17M1I26M570M20S1M", converting them to numeric, and summing them for each row of an R data.table. Two answers are given. The first wraps str_match_all() from stringr, with a lookahead regular expression, in a helper function that returns one sum per string. The second uses a dplyr/tidyr pipeline that extracts number-letter pairs and aggregates them with group_by().

Uploaded by

gprasadatvu
9/17/2017 regex - Extract pattern from string, strip text, convert to numeric and sum in R data.table? - Stack Overflow

Extract pattern from string, strip text, convert to numeric and sum in R data.table?

I have a data.table mydata (100k rows) in which one of the columns looks like this:

library(data.table)
library(stringr)

mydata <- data.table(A = c("17M1I26M570M20S1M", "17M1I260M570M20S1M"))

How do I efficiently, preferably in one line of code, pull out all the numbers that precede the M's (they can be a varying number of digits long), convert them to numeric, and find their sum?

I have managed to do this with three rounds of the sapply function, creating some additional columns which I don't need:

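# Round 1: extract every "<digits>M" token from each string
mydata$c <- sapply(mydata[, A], function(x) unlist(str_extract_all(x, "\\d+M")))
# Round 2: strip the trailing "M" and convert to numeric
mydata$c2 <- sapply(mydata[, c], function(x) unlist(as.numeric(gsub("M", "", x))))
# Round 3: sum the numbers for each row
mydata$c3 <- sapply(mydata[, c2], function(x) sum(x))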

Is there a cleaner, computationally more efficient way to do this?

regex r data.table sapply

asked Nov 27 '15 at 0:39 by dvanic, edited Nov 27 '15 at 2:03

You did not provide a reproducible example, so here is a pseudo-code answer in a comment :) f = function(x) unlist(lapply(strsplit(x, "M"), `[[`, 1L)), then dt[, .(col = f(col))]; sum shouldn't be a problem. – jangorecki Nov 27 '15 at 0:59

@jangorecki Edited to reproducible example. – dvanic Nov 27 '15 at 2:05

2 Answers

You can write a function that returns the sum of the numbers appearing before each instance of the letter M in a string, then use it to create a column in your data.table.

Example code below:

# Load data.table and stringr packages
library(data.table)
library(stringr)

# Data provided in the question
mydata <- data.table(A = c("17M1I26M570M20S1M", "17M1I260M570M20S1M"))

# Function to grab the sum of numbers before the letter M in a string
sum_before_m <- function(x) {
  # Grab all numbers that appear before M
  matches <- str_match_all(x, "\\d+(?=M)")
  # Grab the matches column in the list, transform to numeric, then sum
  sapply(matches, function(y) sum(as.numeric(y)))
}

# Run the function for the column A
mydata[, c := sum_before_m(A)]

mydata
#                     A   c
# 1:  17M1I26M570M20S1M 614
# 2: 17M1I260M570M20S1M 848

Edit: Changed regex using @thelatemail's suggestion in comments for more efficient
matching.
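For readers without stringr installed, the same idea can be sketched in base R. This is only an illustration of the lookahead technique, not part of the original answer; the function name sum_before_m_base and the variable names are hypothetical.

```r
# Base R sketch of the same technique (assumption: stringr is unavailable).
# sum_before_m_base is a hypothetical name, not from the original answer.
sum_before_m_base <- function(x) {
  # perl = TRUE enables the lookahead (?=M), so only the digits are matched
  hits <- regmatches(x, gregexpr("[0-9]+(?=M)", x, perl = TRUE))
  # One sum per input string; a string with no match sums to 0
  vapply(hits, function(h) sum(as.numeric(h)), numeric(1))
}

strings <- c("17M1I26M570M20S1M", "17M1I260M570M20S1M")
totals <- sum_before_m_base(strings)
print(totals)
```

Because the function is vectorised over its input, it could presumably be used in the same way as sum_before_m, for example mydata[, c := sum_before_m_base(A)].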

answered Nov 27 '15 at 1:11 by ialm, edited Nov 30 '15 at 17:30

Using str_match_all(x, "\\d+(?=M)") will remove the need to subset later, and will store less data in the intermediate matches variable. – thelatemail Nov 27 '15 at 1:38
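The difference the comment describes can be seen in a small base R sketch. It assumes the alternative being compared is a pattern with a capture group, such as "([0-9]+)M"; the thread does not show the earlier regex, so that pattern is a guess used only for illustration.

```r
x <- "17M1I26M570M20S1M"

# Capture-group style (assumed earlier pattern): regexec() returns the full
# match followed by the captured group, so the digits must be picked out
# afterwards. Only the first match is shown here.
first_capture <- regmatches(x, regexec("([0-9]+)M", x))[[1]]

# Lookahead style: the M is checked but not consumed, so each match is
# already just the digits and nothing needs subsetting.
all_lookahead <- regmatches(x, gregexpr("[0-9]+(?=M)", x, perl = TRUE))[[1]]

print(first_capture)
print(all_lookahead)
```

With the capture group, each match carries both the full text and the group; with the lookahead, the result is just the digit strings.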

Here is a tidy way to do it.

library(dplyr)
library(tidyr)
library(stringi)
library(rex)

# Regex with two capture groups: a run of digits, then a single letter
regex_1 =
  rex(capture(digits),
      capture(letter))

data =
  tibble(
    a = c("17M1I26M570M20S1M",
          "17M1I260M570M20S1M"))

# One row per number-letter pair found in each distinct string
key =
  data %>%
  select(a) %>%
  distinct %>%
  mutate(match_list =
           a %>%
           stri_extract_all_regex(regex_1)) %>%
  unnest(match_list) %>%
  extract(match_list,
          c("number", "letter"),
          regex_1) %>%
  group_by(a) %>%
  mutate(order = 1:n(),
         number = as.numeric(number))

# Sum only the numbers that precede an M, as the question asks
key %>%
  filter(letter == "M") %>%
  group_by(a) %>%
  summarize(total = sum(number)) %>%
  right_join(data, by = "a")

answered Nov 27 '15 at 1:56 by bramtayl
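The pair-then-group idea behind this answer can also be sketched in base R for a single string. This is an illustrative assumption about how the steps fit together, not the answerer's code: split the string into number-letter pairs, separate the two parts, then total the numbers per letter.

```r
x <- "17M1I26M570M20S1M"

# Every "<digits><letter>" pair in the string
pairs <- regmatches(x, gregexpr("[0-9]+[A-Za-z]", x))[[1]]

# Separate each pair into its numeric part and its letter
number <- as.numeric(sub("[A-Za-z]$", "", pairs))
letter <- sub("^[0-9]+", "", pairs)

# Total per letter; the "M" entry is the sum the question asks for
totals_by_letter <- tapply(number, letter, sum)
print(totals_by_letter)
```

Keeping the letter alongside each number makes it easy to total any other letter as well, at the cost of extracting more than the question strictly needs.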

https://stackoverflow.com/questions/33948847/extract-pattern-from-string-strip-text-convert-to-numeric-and-sum-in-r-data-ta