A Tutorial of Text Mining in R Using TM Package
Anyone working in Data Analytics will sooner or later come across Data Mining. Data Mining is the
examination of large to extremely large amounts of structured and unstructured data to derive
actionable insights.
This article is a guide to getting started with Text Mining in R using the TM package. It shows the
power that R and its packages offer for Text Mining. A reader with elementary R knowledge can use
this article to get started; it covers the steps up to exploratory data analysis and N-Gram
generation.
Important Terms:
Before we dig deep into Text Mining, we need to get familiar with some important concepts
related to it.
a. TM package: R package for Text Mining [1]
b. Corpus & Corpora: A corpus is a large collection of text: a body of written or spoken material
upon which a linguistic analysis is based. The plural of corpus is corpora, i.e. several such
collections of documents containing natural language text. [2]
c. Document Term Matrix (DTM): A Document Term Matrix is a matrix that describes the frequency of
terms occurring in a collection of documents. It has one row per document and one column per term;
each cell holds the frequency of that term in that document.
d. Stemming: Stemming is the process of reducing words to their base form, which makes analysis
easier, e.g. words like win, winning and winner are reduced to, and counted under, their base form
win.
e. Stop Words: These are the most common words in a language. They occur very frequently but add
little value to text mining, e.g. I, our, they’ll, etc. The English stop word list in the TM package
contains 174 words.
f. Bad Words: These are offensive words which need to be removed before we start data mining.
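The terms above can be seen working together in a minimal sketch. The two sample sentences and the bad_words vector are invented for illustration only, and stemming through stemDocument() assumes that the SnowballC package is installed alongside TM.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

# Hypothetical sample documents, invented for this illustration
docs <- c("Our team is winning the darn game",
          "They will win the game again")

# Hypothetical list of offensive words; a real list would be read from a file
bad_words <- c("darn")

corpus <- VCorpus(VectorSource(docs))                        # a corpus of two documents
corpus <- tm_map(corpus, content_transformer(tolower))       # normalise case
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, removeWords, bad_words)             # drop bad words
corpus <- tm_map(corpus, stemDocument)                       # reduce words to their stems
corpus <- tm_map(corpus, stripWhitespace)                    # tidy up leftover spaces

dtm <- DocumentTermMatrix(corpus)  # documents in rows, terms in columns
inspect(dtm)
```

Lower-casing comes first because the stop word list is lower case; stemming comes after stop word removal so that the list is matched against unstemmed words.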
With the above introduction and basics, let’s get started with implementing Text Mining in R.
Step 1: Install & load the necessary libraries. Of these, TM is R’s text mining package. The other
packages are supplementary and are used for reading lines from a file, plotting, preparing word
clouds, N-Gram generation, etc.
Note: If any of the above libraries is not installed, use install.packages() to install it.
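A possible way to write this step is sketched below. Only TM is named in this article; the other package names (stringi, ggplot2, wordcloud, RWeka) are assumptions chosen to match the purposes listed above, and other packages may serve equally well.

```r
# tm is named in the article; the remaining package names are assumptions
packages <- c("tm",         # text mining
              "stringi",    # assumed: string and word-count helpers
              "ggplot2",    # assumed: plotting
              "wordcloud",  # assumed: word clouds
              "RWeka")      # assumed: N-Gram tokenisation

# Find the packages that are not installed yet
available <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
not_installed <- packages[!available]

if (length(not_installed) > 0) {
  message("Install first with install.packages(): ",
          paste(not_installed, collapse = ", "))
}

# Attach only the packages that are available
for (pkg in setdiff(packages, not_installed)) {
  library(pkg, character.only = TRUE)
}
```

Checking with requireNamespace() before calling library() lets the script report every missing package at once instead of stopping at the first one.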
Set the constants that are used multiple times. Defining them once, in one place, is considered good programming practice.
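As an illustration, such constants might look as follows. All names and values here are hypothetical placeholders, not the ones used by the original article.

```r
# All names and values below are hypothetical placeholders
INPUT_FILE    <- "sample.txt"  # path of the text file to analyse
FILE_ENCODING <- "UTF-8"       # encoding used when reading the file
TOP_N_TERMS   <- 20L           # number of most frequent terms to plot
```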
Step 2: Read the text file contents [3]. Optionally, gather and display basic file attributes, viz. file size,
number of lines in the file and number of words in the file.
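A minimal sketch of this step using base R only is shown below. To keep it self-contained, it first writes a small temporary sample file; in practice input_file would point to the real text file. Words are counted as whitespace-separated tokens, which is a simplifying assumption.

```r
# Hypothetical input: create a small sample file so the sketch runs on its own
input_file <- tempfile(fileext = ".txt")
writeLines(c("Text mining in R is fun.", "The tm package helps."), input_file)

# Read the file contents line by line
text_lines <- readLines(input_file, encoding = "UTF-8", warn = FALSE)

# Optional: gather basic file attributes
size_bytes <- file.info(input_file)$size           # file size in bytes
line_count <- length(text_lines)                   # number of lines
tokens     <- unlist(strsplit(text_lines, "\\s+")) # split on whitespace
word_count <- sum(nzchar(tokens))                  # ignore empty tokens

cat("File size (bytes):", size_bytes, "\n")
cat("Number of lines:  ", line_count, "\n")
cat("Number of words:  ", word_count, "\n")
```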