Working With Text Data in Python

The document provides examples of using string methods in pandas to manipulate and extract information from text data. It shows how to format, detect patterns, extract matches, split, modify case, pad and join strings using methods like .str.contains(), .str.split(), .str.lower(), .str.pad(), and others. The examples use pandas Series containing string data about suits of cards and rock/paper/scissors to demonstrate the various string manipulation techniques.

Uploaded by

Clóvis Nóbrega

Working with text data in Python

Learn Python online at www.DataCamp.com

> Example data used throughout this cheat sheet

Throughout this cheat sheet, we'll be using two pandas Series named suits and rock_paper_scissors.

import pandas as pd

suits = pd.Series(["clubs", "Diamonds", "hearts", "Spades"])
rock_paper_scissors = pd.Series(["rock ", " paper", "scissors"])

> String lengths and substrings

# Get the number of characters with .str.len()
suits.str.len() # Returns 5 8 6 6

# Get substrings by position with .str[]
suits.str[2:5] # Returns "ubs" "amo" "art" "ade"

# Get substrings by negative position with .str[]
suits.str[:-3] # "cl" "Diamo" "hea" "Spa"

# Remove whitespace from the start/end with .str.strip()
rock_paper_scissors.str.strip() # "rock" "paper" "scissors"

# Pad strings to a given length with .str.pad() (pads on the left by default)
suits.str.pad(8, fillchar="_") # "___clubs" "Diamonds" "__hearts" "__Spades"

> Changing case

# Convert to lowercase with .str.lower()
suits.str.lower() # "clubs" "diamonds" "hearts" "spades"

# Convert to uppercase with .str.upper()
suits.str.upper() # "CLUBS" "DIAMONDS" "HEARTS" "SPADES"

# Convert to title case with .str.title()
pd.Series("hello, world!").str.title() # "Hello, World!"

# Convert to sentence case with .str.capitalize()
pd.Series("hello, world!").str.capitalize() # "Hello, world!"

> Formatting settings

# Generate an example DataFrame named df
df = pd.DataFrame({"x": [0.123, 4.567, 8.901]})
#        x
# 0  0.123
# 1  4.567
# 2  8.901

# Visualize and format table output
df.style.format(precision = 1)
#      x
# 0  0.1
# 1  4.6
# 2  8.9
# The output of style.format is an HTML table

> Splitting strings

# Split strings into list of characters with .str.split(pat="")
suits.str.split(pat="")
# ["", "c", "l", "u", "b", "s", ""]
# ["", "D", "i", "a", "m", "o", "n", "d", "s", ""]
# ["", "h", "e", "a", "r", "t", "s", ""]
# ["", "S", "p", "a", "d", "e", "s", ""]

# Split strings by a separator with .str.split()
suits.str.split(pat = "a")
# ["clubs"]
# ["Di", "monds"]
# ["he", "rts"]
# ["Sp", "des"]

# Split strings and return DataFrame with .str.split(expand=True)
suits.str.split(pat = "a", expand=True)
#        0      1
# 0  clubs   None
# 1     Di  monds
# 2     he    rts
# 3     Sp    des

> Joining or concatenating strings

# Combine two strings with +
suits + "5" # "clubs5" "Diamonds5" "hearts5" "Spades5"

# Collapse a Series into a single string with .str.cat()
suits.str.cat(sep=", ") # "clubs, Diamonds, hearts, Spades"

# Duplicate and concatenate strings with *
suits * 2 # "clubsclubs" "DiamondsDiamonds" "heartshearts" "SpadesSpades"

> Detecting matches

# Detect if a regex pattern is present in strings with .str.contains()
suits.str.contains("[ae]") # False True True True

# Count the number of matches with .str.count()
suits.str.count("[ae]") # 0 1 2 2

# Locate the position of substrings with .str.find()
suits.str.find("e") # -1 -1 1 4

> Extracting matches

# Extract matches from strings with .str.findall()
suits.str.findall(".[ae]") # [] ["ia"] ["he"] ["pa", "de"]

# Extract capture groups with .str.extractall()
suits.str.extractall("([ae])(.)")
#          0  1
#   match
# 1 0      a  m
# 2 0      e  a
# 3 0      a  d
#   1      e  s

# Get subset of strings that match with x[x.str.contains()]
suits[suits.str.contains("d")] # "Diamonds" "Spades"

> Replacing matches

# Replace a match with another string with .str.replace()
# (pass regex=True to treat the pattern as a regex; in pandas 2.0 and later the default is a literal match)
suits.str.replace("a", "4") # "clubs" "Di4monds" "he4rts" "Sp4des"

# Remove a suffix with .str.removesuffix()
suits.str.removesuffix("s") # "club" "Diamond" "heart" "Spade"

# Replace a substring with .str.slice_replace()
rhymes = pd.Series(["vein", "gain", "deign"])
rhymes.str.slice_replace(0, 1, "r") # "rein" "rain" "reign"
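The .str methods above can be chained and tuned with keyword arguments. The following is a minimal sketch rather than part of the original cheat sheet: the `messy` Series is hypothetical data invented for illustration, and the choice of `case=False`, `na=False`, and `regex=True` is an assumption about what a typical cleaning step might need.

```python
import pandas as pd

# Same example Series as the cheat sheet
suits = pd.Series(["clubs", "Diamonds", "hearts", "Spades"])

# Hypothetical messy data (not from the cheat sheet): stray whitespace,
# mixed case, and one missing value
messy = pd.Series(["  Clubs", "DIAMONDS ", None, "spades"])

# Chain .str methods: trim whitespace, then lowercase.
# The missing value stays missing instead of raising an error.
clean = messy.str.strip().str.lower()

# case=False makes the match case-insensitive;
# na=False reports missing values as "no match"
has_d = suits.str.contains("d", case=False, na=False)

# regex=True states explicitly that the pattern is a regular expression
masked = suits.str.replace("[ae]", "*", regex=True)

# .str.extract() keeps only the first match of the capture group;
# expand=False returns a Series rather than a one-column DataFrame
first_vowel = suits.str.extract("([aeiou])", expand=False)

print(clean.dropna().tolist())  # ['clubs', 'diamonds', 'spades']
print(has_d.tolist())           # [False, True, False, True]
print(masked.tolist())          # ['clubs', 'Di*monds', 'h**rts', 'Sp*d*s']
print(first_vowel.tolist())     # ['u', 'i', 'e', 'a']
```

Passing `regex=True` explicitly keeps the replacement behaving the same way across pandas versions, since the default for `.str.replace()` has differed between releases.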
