
A Tutorial of Text Mining in R Using TM Package

Anyone working in data analytics will sooner or later come across data mining. Data mining is about
examining large to extremely large amounts of structured and unstructured data to derive actionable
insights. Text mining applies the same idea to natural-language text.
This article is a guide to getting started with text mining in R using the tm package. It shows what
R and its packages have to offer for text mining. Anyone with elementary R knowledge can use this
article to get started. It takes the reader as far as exploratory data analysis and N-gram generation.
Important Terms:
Before we dig deep into text mining, we need to get familiar with some of the important concepts
related to it.
a. tm package: The R package for text mining. [1]
b. Corpus & Corpora: A corpus is a large collection of text, a body of written or spoken material
upon which a linguistic analysis is based. The plural of corpus is corpora, that is, collections of
documents containing natural language text. [2]
c. Document Term Matrix (DTM): A matrix that describes the frequency of terms occurring in a
collection of documents. It has documents in rows and terms in columns; each cell holds the number
of times that term occurs in that document.
d. Stemming: The process of reducing words to their base form, which makes analysis easier. For
example, words like win, winning and winner are converted to, and counted under, their base form,
i.e. win.
e. Stop Words: The most common words in a language, which get repeated often but add little value
to text mining, e.g. I, our, they'll. The English stop word list that ships with tm contains 174 words.
f. Bad Words: Offensive words that need to be removed before we start mining the text.
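To make the terms above concrete, here is a minimal sketch that builds a Document Term Matrix from two toy documents. The document texts are invented for illustration, and the sketch assumes the tm package is installed.

```r
library(tm)  # assumes the tm package is installed

# Two toy documents (invented for illustration)
docs <- c("text mining finds patterns in text",
          "mining text is fun")

# A corpus is simply a collection of documents
corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term counts.
# By default tm drops terms shorter than 3 characters ("in", "is").
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)
print(m)

# A peek at the English stop word list shipped with tm
head(stopwords("english"))
```

The first row of the printed matrix counts "text" twice because the word occurs twice in the first document.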
With the above introduction and basics, let's get started with implementing text mining in R.
Step 1: Install and load the necessary libraries. Of these, tm is R's text mining package. The other
packages are supplementary and are used for reading lines from a file, plotting, preparing word
clouds, N-gram generation, etc.
Note: If any of these libraries are not installed, use install.packages() to install them.
Also set constants for values that are used multiple times. This is considered good programming practice.
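A minimal sketch of this step follows. The package list is an assumption (tm, plus SnowballC for stemming and wordcloud for word clouds), and the constant names and values are hypothetical placeholders rather than the ones the article used.

```r
# Packages assumed for this walkthrough (an assumption, not the article's exact list)
pkgs <- c("tm",         # corpus handling and cleaning
          "SnowballC",  # stemming backend used by tm::stemDocument()
          "wordcloud")  # word clouds

# Check which packages are installed and attach only those
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!available)) {
  message("Not installed: ", paste(pkgs[!available], collapse = ", "),
          " (install with install.packages())")
}
invisible(lapply(pkgs[available], library, character.only = TRUE))

# Constants used in several places (hypothetical names and values)
FILE_PATH <- "text_mining.txt"  # hypothetical path to the input text file
TOP_N     <- 10                 # how many top N-grams to plot
SEED      <- 1234               # fixes the random layout of word clouds
```

Keeping the path and the top-N cutoff in one place means a different input file or cutoff needs only a single change.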

Step 2: Read the text file contents [3]. Optionally, gather and display basic file attributes, viz.
file size, number of lines in the file, and number of words in the file.
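A minimal sketch of this step in base R. So that it runs on its own, it writes three invented lines to a temporary file; in practice file_path would point at the real text dump. Words are counted as whitespace-separated tokens, which is one simple convention among several.

```r
# Stand-in for the real input: a temporary file with three invented lines
file_path <- tempfile(fileext = ".txt")
writeLines(c("Text mining turns text into data.",
             "It is also called text analytics.",
             "R has several packages for it."), file_path)

# Read the file into a character vector, one element per line
lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE)

# Basic file attributes
file_size_bytes <- file.size(file_path)
n_lines <- length(lines)

# Whitespace-delimited tokens; empty strings are not counted as words
tokens  <- unlist(strsplit(trimws(lines), "\\s+"))
n_words <- sum(nzchar(tokens))

cat("Size (bytes):", file_size_bytes, "\n")
cat("Lines:", n_lines, "\n")
cat("Words:", n_words, "\n")
```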

Step 3: Create a corpus from the file contents and clean the corpus.
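A minimal sketch of corpus creation and cleaning with tm. The exact set and order of cleaning steps is an assumption based on the terms introduced earlier (stop words, bad words, stemming); the input lines and the bad word list are invented for illustration. stemDocument() additionally needs the SnowballC package.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package

# Invented input lines; in practice these come from readLines()
lines <- c("Text Mining is FUN, and miners love mining texts!",
           "In 2024 the darn corpus had 3 documents.")

# One document per line
corpus <- VCorpus(VectorSource(lines))

# Hypothetical bad word list; a real one would be loaded from a file
bad_words <- c("darn")

# Cleaning pipeline: each tm_map() call applies one transformation
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, removeWords, bad_words)             # drop bad words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse spaces

# Keep a stemmed copy alongside the un-stemmed corpus
stemmed <- tm_map(corpus, stemDocument)

# Inspect the cleaned text of each document
cleaned_text <- vapply(corpus, as.character, character(1))
print(cleaned_text)
```

Lower-casing comes first because removeWords() matches case-sensitively and the stop word list is lower-case.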


Step 4: This step illustrates a few basic exploratory data analysis steps that can act as a reference
for more detailed exploratory data analysis.
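A minimal sketch of such exploratory steps using tm: term frequencies, frequent terms, term associations and sparse term removal. The three documents and the thresholds are invented for illustration.

```r
library(tm)

# Invented documents for illustration
docs <- c("text mining extracts patterns from text",
          "data analysis extracts patterns from data",
          "text analytics is another name for text mining")
corpus <- VCorpus(VectorSource(docs))

# Build the DTM, dropping English stop words on the way
dtm <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE))

# Total frequency of each term across all documents, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 5)

# Terms that occur at least twice overall
frequent <- findFreqTerms(dtm, lowfreq = 2)
print(frequent)

# Terms whose per-document counts correlate with "text" at 0.5 or more
assoc <- findAssocs(dtm, "text", corlimit = 0.5)
print(assoc)

# Drop terms that are absent from more than 60% of the documents
dtm_small <- removeSparseTerms(dtm, sparse = 0.6)
dim(dtm_small)
```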

Output is not shown.


Step 5: Visualize the frequency of words occurring in the text file using word clouds. Two word
clouds are generated, one for the un-stemmed corpus and one for the stemmed corpus.
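A minimal sketch of the two word clouds, assuming the wordcloud and SnowballC packages are installed. The documents, the helper name term_freq and the plotting parameters are illustrative choices, not the article's.

```r
library(tm)
library(wordcloud)  # assumes wordcloud is installed; stemming needs SnowballC

# Invented documents for illustration
docs <- c("miners mine text while mining tools mined data",
          "text mining and data mining")
corpus  <- VCorpus(VectorSource(docs))
corpus  <- tm_map(corpus, removeWords, stopwords("english"))
stemmed <- tm_map(corpus, stemDocument)

# Hypothetical helper: total frequency per term, most frequent first
term_freq <- function(corp) {
  sort(colSums(as.matrix(DocumentTermMatrix(corp))), decreasing = TRUE)
}
freq_raw  <- term_freq(corpus)
freq_stem <- term_freq(stemmed)

# Draw both clouds side by side; the seed makes the layout repeatable
set.seed(1234)
par(mfrow = c(1, 2))
wordcloud(names(freq_raw), freq_raw, min.freq = 1, max.words = 50,
          random.order = FALSE)
wordcloud(names(freq_stem), freq_stem, min.freq = 1, max.words = 50,
          random.order = FALSE)
par(mfrow = c(1, 1))
```

The stemmed cloud has fewer distinct words because variants such as "mining" and "mined" collapse into one stem and their counts add up.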
Step 6: The last step of this guide is to generate N-grams (unigrams, bigrams and trigrams) and plot
histograms of the top 10 most frequent N-grams.
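A minimal sketch of N-gram generation. It uses ngrams() from the NLP package (attached together with tm) as the tokenizer, which is one possible approach and not necessarily the one the article used. The documents and the helper names ngram_tokenizer and top_ngrams are hypothetical.

```r
library(tm)  # attaches NLP, which provides ngrams() and words()

# Invented documents for illustration
docs <- c("text mining in r is fun",
          "text mining in r uses the tm package")
corpus <- VCorpus(VectorSource(docs))

# Returns a tokenizer that emits n-grams of size n as space-joined strings
ngram_tokenizer <- function(n) {
  function(x) unlist(lapply(ngrams(words(x), n), paste, collapse = " "),
                     use.names = FALSE)
}

# Top N-grams by total frequency across the corpus
top_ngrams <- function(corp, n, top = 10) {
  dtm <- DocumentTermMatrix(
    corp,
    control = list(tokenize = ngram_tokenizer(n),
                   wordLengths = c(1, Inf))  # keep short terms such as "r"
  )
  head(sort(colSums(as.matrix(dtm)), decreasing = TRUE), top)
}

uni <- top_ngrams(corpus, 1)
bi  <- top_ngrams(corpus, 2)
tri <- top_ngrams(corpus, 3)

# Bar chart of the most frequent bigrams; repeat for uni and tri as needed
barplot(bi, las = 2, main = "Top bigrams", ylab = "Frequency")
```

A closure is used for the tokenizer so that the same code serves all three N-gram sizes.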
Further steps could use the N-grams generated above for text mining activities such as word prediction.
References:
a. [1] tm package: https://cran.r-project.org/web/packages/tm/tm.pdf
b. [2] Corpus & Corpora: http://language.worldofcomputing.net/linguistics/introduction/what-is-corpus.html
c. [3] Text file: this guide uses a text dump of the following Wikipedia page:
https://en.wikipedia.org/wiki/Text_mining

Taken from: https://medium.com/text-mining-in-data-science-a-tutorial-of-text/text-mining-in-data-science-51299e4e594
