Natural Language Processing with Java (Sample Chapter)

Richard M Reese

Explore various approaches to organize and extract useful text from unstructured data using Java
Natural Language Processing (NLP) has been used to address a wide range of
problems, including support for search engines, summarizing and classifying text
for web pages, and incorporating machine learning technologies to solve problems
such as speech recognition and query analysis. It has found use wherever documents
contain useful information.
NLP is used to enhance the utility and power of applications. It does so by making user
input easier and converting text to more usable forms. In essence, NLP processes natural
text found in a variety of sources, using a series of core NLP tasks to transform or extract
information from the text.
This book focuses on core NLP tasks that will likely be encountered in an NLP
application. Each NLP task presented in this book starts with a description of the problem
and where it can be used. The issues that make each task difficult are introduced so that
you can understand the problem in a better way. This is followed by the use of numerous
Java techniques and APIs to support an NLP task.
Introduction to NLP
Natural Language Processing (NLP) is a broad topic focused on the use of
computers to analyze natural languages. It addresses areas such as speech
processing, relationship extraction, document categorization, and summarization of
text. However, these types of analysis are based on a set of fundamental techniques
such as tokenization, sentence detection, classification, and extracting relationships.
These basic techniques are the focus of this book. We will start with a detailed
discussion of NLP, investigate why it is important, and identify application areas.
There are many tools available that support NLP tasks. We will focus on the Java
language and how various Java Application Programming Interfaces (APIs) support
NLP. In this chapter, we will briefly identify the major APIs, including Apache's
OpenNLP, Stanford NLP libraries, LingPipe, and GATE.
This is followed by a discussion of the basic NLP techniques illustrated in this book.
The nature and use of these techniques are presented and illustrated using one of the
NLP APIs. Many of these techniques will use models. Models are similar to a set
of rules that are used to perform a task such as tokenizing text. They are typically
represented by a class that is instantiated from a file. We round off the chapter with
a brief discussion on how data can be prepared to support NLP tasks.
NLP is not easy. While some problems can be solved relatively easily, there are many
others that require the use of sophisticated techniques. We will strive to provide a
foundation for NLP processing so that you will be able to understand better which
techniques are available and applicable for a given problem.
NLP is a large and complex field. In this book, we will only be able to address a
small part of it. We will focus on core NLP tasks that can be implemented using Java.
Throughout this book, we will demonstrate a number of NLP techniques using both
the Java SE SDK and other libraries, such as OpenNLP and Stanford NLP. To use these
libraries, there are specific API JAR files that need to be associated with the project in
which they are being used. A discussion of these libraries is found in the Survey of
NLP tools section and contains download links to the libraries. The examples in this
book were developed using NetBeans 8.0.2. These projects required the API JAR files
to be added to the Libraries category of the Projects Properties dialog box.
What is NLP?
A formal definition of NLP frequently includes wording to the effect that it is a
field of study using computer science, artificial intelligence, and formal linguistics
concepts to analyze natural language. A less formal definition suggests that it is a
set of tools used to derive meaningful and useful information from natural language
sources such as web pages and text documents.
Meaningful and useful implies that it has some commercial value, though it is
frequently used for academic problems. This can readily be seen in its support of
search engines. A user query is processed using NLP techniques in order to generate
a result page that a user can use. Modern search engines have been very successful
in this regard. NLP techniques have also found use in automated help systems and
in support of complex query systems as typified by IBM's Watson project.
When we work with a language, the terms syntax and semantics are frequently
encountered. The syntax of a language refers to the rules that control a valid sentence
structure. For example, a common sentence structure in English starts with a subject
followed by a verb and then an object such as "Tim hit the ball". We are not used
to unusual sentence order such as "Hit ball Tim". Although the rule of syntax for
English is not as rigorous as that for computer languages, we still expect a sentence
to follow basic syntax rules.
The semantics of a sentence is its meaning. As English speakers, we understand
the meaning of the sentence "Tim hit the ball". However, English and other natural
languages can be ambiguous at times and a sentence's meaning may only be
determined from its context. As we will see, various machine learning techniques
can be used to attempt to derive the meaning of text.
As we progress with our discussions, we will introduce many linguistic terms that
will help us better understand natural languages and provide us with a common
vocabulary to explain the various NLP techniques. We will see how the text can be
split into individual elements and how these elements can be classified.
In general, these approaches are used to enhance applications, thus making them
more valuable to their users. The uses of NLP can range from relatively simple
uses to those that are pushing what is possible today. In this book, we will show
examples that illustrate simple approaches, which may be all that is required for
some problems, to the more advanced libraries and classes available to address
sophisticated needs.
Parts of Speech (POS) tagging: In this task, the elements of the text are labeled
with their grammatical roles, such as noun or verb. This is useful in analyzing
the text further.
Stemming is another task that may need to be applied. Stemming is the process of
finding the word stem of a word. For example, words such as "walking", "walked",
or "walks" have the word stem "walk". Search engines often use stemming to assist
in processing a query.
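The idea behind stemming can be sketched with ordinary string operations. The following is an illustration only, not a real stemming algorithm such as Porter's; the suffix list and the minimum-length check are our own simplifications:

```java
// A naive suffix-stripping stemmer, for illustration only. Real stemmers,
// such as the Porter algorithm, apply ordered rule sets with conditions
// on the remaining stem.
public class NaiveStemmer {
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Strip the suffix only if a reasonably long stem remains
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[]{"walking", "walked", "walks"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

All three inputs reduce to the stem "walk". A scheme this crude mangles words such as "bus" or "ring", which is why production systems use rule-based or dictionary-backed stemmers instead.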
Closely related to stemming is the process of Lemmatization. This process
determines the base form of a word called its lemma. For example, for the word
"operating", its stem is "oper" but its lemma is "operate". Lemmatization is a more
refined process than stemming and uses vocabulary and morphological techniques
to find a lemma. This can result in more precise analysis in some situations.
Words are combined into phrases and sentences. Sentence detection can be
problematic and is not as simple as looking for the periods at the end of a sentence.
Periods are found in many places including abbreviations such as Ms. and in
numbers such as 12.834.
We often need to understand which words in a sentence are nouns and which
are verbs. We are sometimes concerned with the relationship between words.
For example, coreference resolution determines the relationship between
certain words in one or more sentences. Consider the following sentence:
"The city is large but beautiful. It fills the entire valley."
The word "it" is a coreference to "city". When a word has multiple meanings,
we might need to perform Word Sense Disambiguation to determine the meaning
that was intended. This can be difficult to do at times. For example, "John went
back home".
Does the home refer to a house, a city, or some other unit? Its meaning can
sometimes be inferred from the context in which it is used. For example,
"John went back home. It was situated at the end of a cul-de-sac."
In spite of these difficulties, NLP is able to perform these tasks reasonably
well in most situations and provide added value to many problem
domains. For example, sentiment analysis can be performed on customer
tweets resulting in possible free product offers for dissatisfied customers.
Medical documents can be readily summarized to highlight the relevant
topics and improve productivity.
Summarization is the process of producing a short description of
different units. These units can include multiple sentences, paragraphs,
a document, or multiple documents. The intent may be to identify
those sentences that convey the meaning of the unit, determine the
prerequisites for understanding a unit, or to find items within these units.
Frequently, the context of the text is important in accomplishing this task.
There also exists a number of NLP libraries/APIs for Java. A partial list of
Java-based NLP APIs is found in the following table. Most of these are open
source. In addition, there are a number of commercial APIs available. We will
focus on the open source APIs:
API: URL
Apertium: http://www.apertium.org/
General Architecture for Text Engineering (GATE): http://gate.ac.uk/
Cognitive Computation Group software: http://cogcomp.cs.illinois.edu/page/software_view/11
LinguaStream: http://www.linguastream.org/
LingPipe: http://alias-i.com/lingpipe/
Mallet: http://mallet.cs.umass.edu/
MontyLingua: http://web.media.mit.edu/~hugo/montylingua/
Apache OpenNLP: http://opennlp.apache.org/
Apache UIMA: http://uima.apache.org/
Stanford Parser: http://nlp.stanford.edu/software
Many of these NLP tasks are combined to form a pipeline. A pipeline consists
of various NLP tasks, which are integrated into a series of steps to achieve some
processing goal. Examples of frameworks that support pipelines are GATE and
Apache UIMA.
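The chaining idea behind a pipeline can be sketched in plain Java. The stages below (lowercasing, punctuation stripping, and whitespace normalization) are our own toy examples; frameworks such as GATE and Apache UIMA pass rich annotation structures between stages rather than bare strings:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class PipelineSketch {
    // Each stage transforms the text in turn; a real NLP pipeline would
    // carry tokens, tags, and other annotations between stages
    static final List<Function<String, String>> STAGES = Arrays.asList(
        String::toLowerCase,                   // normalize case
        s -> s.replaceAll("[^a-z0-9 ]", ""),   // strip punctuation
        s -> s.trim().replaceAll("\\s+", " ")  // collapse whitespace
    );

    public static String run(String text) {
        for (Function<String, String> stage : STAGES) {
            text = stage.apply(text);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println("[" + run("  The Meaning, and Purpose!  ") + "]");
    }
}
```

Running this prints [the meaning and purpose], showing how the output of each stage becomes the input of the next.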
In the next section, we will cover several NLP APIs in more depth. A brief
overview of their capabilities will be presented along with a list of useful links
for each API.
Apache OpenNLP
The Apache OpenNLP project addresses common NLP tasks and will be used
throughout this book. It consists of several components that perform specific
tasks, permit models to be trained, and support testing of those models. The general
approach used by OpenNLP is to instantiate a model that supports the task from
a file and then execute methods against the model to perform that task.
For example, in the following sequence, we will tokenize a simple string. For
this code to execute properly, it must handle the FileNotFoundException
and IOException exceptions. We use a try-with-resource block to open a
FileInputStream instance using the en-token.bin file. This file contains a
model that has been trained using English text:
try (InputStream is = new FileInputStream(
        new File(getModelDir(), "en-token.bin"))) {
    // Insert code to tokenize the text here
} catch (FileNotFoundException ex) {
    ex.printStackTrace();
} catch (IOException ex) {
    ex.printStackTrace();
}
An instance of the TokenizerModel class is then created using this file inside
the try block. Next, we create an instance of the Tokenizer class, as shown here:
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
The tokenize method is then applied, whose argument is the text to be tokenized.
The method returns an array of String objects:
String tokens[] = tokenizer.tokenize("He lives at 1511 W. "
    + "Randolph.");
A for-each statement displays the tokens as shown here. The open and close brackets
are used to clearly identify the tokens:
for (String a : tokens) {
System.out.print("[" + a + "] ");
}
System.out.println();
In this case, the tokenizer recognized that "W." was an abbreviation and that the last
period was a separate token demarcating the end of the sentence.
We will use the OpenNLP API for many of the examples in this book. OpenNLP
links are listed in the following table:
OpenNLP Home: https://opennlp.apache.org/
Documentation: https://opennlp.apache.org/documentation.html
Download: https://opennlp.apache.org/cgi-bin/download.cgi
Wiki: https://cwiki.apache.org/confluence/display/OPENNLP/Index
Stanford NLP
The Stanford NLP Group conducts NLP research and provides tools for NLP tasks.
The Stanford CoreNLP is one of these toolsets. In addition, there are other tool
sets such as the Stanford Parser, Stanford POS tagger, and the Stanford Classifier.
The Stanford tools support English and Chinese languages and basic NLP tasks,
including tokenization and named entity recognition.
These tools are released under the full GPL, which does not permit their use in
commercial applications, though a separate commercial license is available. The API is well
organized and supports the core NLP functionality.
There are several tokenization approaches supported by the Stanford group. We will
use the PTBTokenizer class to illustrate the use of this NLP library. The constructor
demonstrated here uses a Reader object, a LexedTokenFactory<T> argument, and a
string to specify which of the several options is to be used.
We will use the Stanford NLP library extensively in this book. A list of Stanford
links is found in the following table. Documentation and download links are
found in each of the distributions:
Stanford NLP Home: http://nlp.stanford.edu/index.shtml
CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml#Download
Parser: http://nlp.stanford.edu/software/lex-parser.shtml
POS Tagger: http://nlp.stanford.edu/software/tagger.shtml
java-nlp-user Mailing List: https://mailman.stanford.edu/mailman/listinfo/java-nlp-user
LingPipe
LingPipe consists of a set of tools to perform common NLP tasks. It supports model
training and testing. There are both royalty-free and licensed versions of the tool.
Production use of the free version is limited.
To demonstrate the use of LingPipe, we will illustrate how it can be used to tokenize
text using the Tokenizer class. Start by declaring two lists, one to hold the tokens
and a second to hold the whitespace:
List<String> tokenList = new ArrayList<>();
List<String> whiteList = new ArrayList<>();
Now, create an instance of the Tokenizer class. As shown in the following code
block, a static tokenizer method is used to create an instance of the Tokenizer
class based on an Indo-European factory class:
Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(text.toCharArray(), 0, text.length());
The tokenize method of this class is then used to populate the two lists:
tokenizer.tokenize(tokenList, whiteList);
LingPipe links are listed in the following table:

LingPipe Home: http://alias-i.com/lingpipe/index.html
Tutorials: http://alias-i.com/lingpipe/demos/tutorial/read-me.html
JavaDocs: http://alias-i.com/lingpipe/docs/api/index.html
Download: http://alias-i.com/lingpipe/web/install.html
Core: http://alias-i.com/lingpipe/web/download.html
Models: http://alias-i.com/lingpipe/web/models.html
GATE
General Architecture for Text Engineering (GATE) is a set of tools written in
Java and developed at the University of Sheffield in England. It supports many
NLP tasks and languages. It can also be used as a pipeline for NLP processing.
It supports an API along with GATE Developer, a document viewer that displays
text along with annotations. This is useful for examining a document using
highlighted annotations. GATE Mimir, a tool for indexing and searching text
generated by various sources, is also available. Using GATE for many NLP tasks
involves a bit of code. GATE Embedded is used to embed GATE functionality
directly in code. Useful GATE links are listed in the following table:
Gate
Home
Website
https://gate.ac.uk/
Documentation
https://gate.ac.uk/documentation.html
JavaDocs
Download
http://jenkins.gate.ac.uk/job/GATE-Nightly/
javadoc/
https://gate.ac.uk/download/
Wiki
http://gatewiki.sf.net/
UIMA
The Organization for the Advancement of Structured Information Standards
(OASIS) is a consortium focused on information-oriented business technologies.
It developed the Unstructured Information Management Architecture (UIMA)
standard as a framework for NLP pipelines. It is supported by the Apache UIMA project.
Although it supports pipeline creation, it also describes a series of design patterns,
data representations, and user roles for the analysis of text. UIMA links are listed
in the following table:
Apache UIMA Home: https://uima.apache.org/
Documentation: https://uima.apache.org/documentation.html
JavaDocs: https://uima.apache.org/d/uimaj-2.6.0/apidocs/index.html
Download: https://uima.apache.org/downloads.cgi
Wiki: https://cwiki.apache.org/confluence/display/UIMA/Index
Combined approaches
Many of these tasks are used together with other tasks to achieve some objective.
We will see this as we progress through the book. For example, tokenization is
frequently used as an initial step in many of the other tasks. It is a fundamental
and basic step.
Simple words: These are the common connotations of what a word means,
including the 17 words of this sentence.
Synonyms: A synonym is a word that has the same meaning as another word.
Words such as "small" and "tiny" can be recognized as synonyms. Addressing
this issue requires word sense disambiguation.
Identifying these parts is useful for other NLP tasks. For example, to determine
the boundaries of a sentence, it is necessary to break it apart and determine which
elements terminate a sentence.
The process of breaking text apart is called tokenization. The result is a stream of
tokens. The elements of the text that determine where elements should be split are
called Delimiters. For most English text, whitespace is used as a delimiter. This type
of a delimiter typically includes blanks, tabs, and new line characters.
The split method uses a regular expression argument to specify how the text
should be split. In the next code sequence, its argument is the string \\s+.
This specifies that one or more whitespaces be used as the delimiter:
String tokens[] = text.split("\\s+");
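As a complete, runnable sketch (reusing a sample sentence that appears later in this chapter):

```java
public class SplitDemo {
    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        // One or more whitespace characters act as the delimiter
        String tokens[] = text.split("\\s+");
        for (String token : tokens) {
            System.out.print("[" + token + "] ");
        }
        System.out.println();
    }
}
```

This prints [Mr.] [Smith] [went] [to] [123] [Washington] [avenue.], seven tokens in all. Note that the punctuation stays attached to its word; a whitespace-only split cannot separate "avenue." into a word and a period.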
In Chapter 2, Finding Parts of Text, we will explore the tokenization process in depth.
Finding sentences
We tend to think of the process of identifying sentences as a simple process.
In English, we look for termination characters such as a period, question mark,
or exclamation mark. However, as we will see in Chapter 3, Finding Sentences, this
is not always that simple. Factors that make it more difficult to find the end of
sentences include the use of embedded periods in such phrases as "Dr. Smith"
or "204 SW. Park Street".
This process is also called Sentence Boundary Disambiguation (SBD). This is
a more significant problem in English than it is in languages such as Chinese or
Japanese that have unambiguous sentence delimiters.
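Before turning to the NLP libraries, note that the JDK itself offers a baseline through java.text.BreakIterator. Its rule-based boundaries can be fooled by many abbreviations, but it illustrates the task; the class and sample text here are our own:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSketch {
    public static List<String> sentences(String text) {
        // The sentence instance uses locale-specific boundary rules
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            result.add(text.substring(start, end).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("The first sentence. The second sentence.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

For this simple input, two sentences are found. The trained models used by the NLP libraries generally handle the harder cases, such as "Dr. Smith", more reliably than these fixed rules.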
Identifying sentences is useful for a number of reasons. Some NLP tasks, such as
POS tagging and entity extraction, work on individual sentences. Question-answering
applications also need to identify individual sentences. For these processes to work
correctly, sentence boundaries must be determined correctly.
The following example demonstrates how sentences can be found using the Stanford
DocumentPreprocessor class. This class will generate a list of sentences based on
either simple text or an XML document. The class implements the Iterable interface
allowing it to be easily used in a for-each statement.
Start by declaring a string containing the sentences, as shown here:
String paragraph = "The first sentence. The second sentence.";
Create a StringReader object based on the string. This class supports simple read
type methods and is used as the argument of the DocumentPreprocessor constructor:
Reader reader = new StringReader(paragraph);
DocumentPreprocessor documentPreprocessor =
new DocumentPreprocessor(reader);
The DocumentPreprocessor object will now hold the sentences of the paragraph. In
the next statement, a list of strings is created and is used to hold the sentences found:
List<String> sentenceList = new LinkedList<String>();
Some searches can be very simple. For example, the String class and related classes
have methods, such as indexOf and lastIndexOf, that can find the
occurrence of a string. In the simple example that follows, the index of the
occurrence of the target string is returned by the indexOf method:
String text = "Mr. Smith went to 123 Washington avenue.";
String target = "Washington";
int index = text.indexOf(target);
System.out.println(index);
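Slightly more flexible searches can be performed with the java.util.regex package. Here, a pattern finds runs of digits in the same sample text; the pattern and example are our own:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSearch {
    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        // \d+ matches one or more consecutive digits
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group() + " at index " + matcher.start());
        }
    }
}
```

This prints "123 at index 18". Unlike indexOf, a pattern can find every match and can describe a class of strings, such as house numbers, rather than one literal target.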
Before the sentences can be processed, we need to tokenize the text. Set up the
tokenizer using the Tokenizer class, as shown here:
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
We will need to use a model to detect sentences. This is needed to avoid grouping
terms that may span sentence boundaries. We will use the TokenNameFinderModel
class based on the model found in the en-ner-person.bin file. An instance of
TokenNameFinderModel is created from this file as follows:
TokenNameFinderModel model = new TokenNameFinderModel(
new File("C:\\OpenNLP Models", "en-ner-person.bin"));
The NameFinderME class will perform the actual task of finding the name.
An instance of this class is created using the TokenNameFinderModel instance,
as shown here:
NameFinderME finder = new NameFinderME(model);
The primary focus of Chapter 4, Finding People and Things, is name recognition.
Detecting the Parts of Speech (POS) is useful in other tasks such as extracting
relationships and determining the meaning of text. Determining these relationships
is called Parsing. POS processing is useful for enhancing the quality of data sent to
other elements of a pipeline.
The internals of a POS process can be complex. Fortunately, most of the complexity
is hidden from us and encapsulated in classes and methods. We will use a couple of
OpenNLP classes to illustrate this process. We will need a model to detect the POS.
The POSModel class will be used and instanced using the model found in the
en-pos-maxent.bin file, as shown here:
POSModel model = new POSModelLoader().load(
    new File("../OpenNLP Models/", "en-pos-maxent.bin"));
The POSTaggerME class is used to perform the actual tagging. Create an instance
of this class based on the previous model as shown here:
POSTaggerME tagger = new POSTaggerME(model);
The tag method is then used to find the parts of speech, storing the results
in an array of strings:
String[] tags = tagger.tag(tokens);
Each token is followed by an abbreviation, contained within brackets, for its part of
speech. For example, NNP means that it is a proper noun. These abbreviations will
be covered in Chapter 5, Detecting Parts of Speech, which is devoted to exploring this
topic in depth.
Extracting relationships
Relationship extraction identifies relationships that exist in text. For example, with
the sentence "The meaning and purpose of life is plain to see", we know that the topic
of the sentence is "The meaning and purpose of life". It is related to the last phrase
that suggests that it is "plain to see".
Humans can do a pretty good job at determining how things are related to each
other, at least at a high level. Determining deep relationships can be more difficult.
Using a computer to extract relationships can also be challenging. However,
computers can process large datasets to find relationships that would not be
obvious to a human or that could not be done in a reasonable period of time.
There are numerous relationships possible. These include relationships such as
where something is located, how two people are related to each other, what are
the parts of a system, and who is in charge. Relationship extraction is useful for
a number of tasks including building knowledge bases, performing analysis
of trends, gathering intelligence, and performing product searches. Finding
relationships is sometimes called Text Analytics.
There are several techniques that we can use to perform relationship extractions.
These are covered in more detail in Chapter 7, Using a Parser to Extract Relationships.
Here, we will illustrate one technique to identify relationships within a sentence
using the Stanford NLP StanfordCoreNLP class. This class supports a pipeline
where annotators are specified and applied to text. Annotators can be thought of as
operations to be performed. When an instance of the class is created, the annotators
are added using a Properties object found in the java.util package.
First, create an instance of the Properties class and assign the annotators
to it. Then create the pipeline from these properties, as follows:
Properties properties = new Properties();
properties.put("annotators", "tokenize, ssplit, parse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
Next, an Annotation instance is created, which uses the text as its argument:
Annotation annotation = new Annotation(
"The meaning and purpose of life is plain to see.");
Apply the annotate method against the pipeline object to process the annotation
object. Finally, use the prettyPrint method to display the result of the processing:
pipeline.annotate(annotation);
pipeline.prettyPrint(annotation, System.out);
The output of this example is as follows:
[Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT]
[Text=meaning CharacterOffsetBegin=4 CharacterOffsetEnd=11 PartOfSpeech=NN]
[Text=and CharacterOffsetBegin=12 CharacterOffsetEnd=15 PartOfSpeech=CC]
[Text=purpose CharacterOffsetBegin=16 CharacterOffsetEnd=23 PartOfSpeech=NN]
[Text=of CharacterOffsetBegin=24 CharacterOffsetEnd=26 PartOfSpeech=IN]
[Text=life CharacterOffsetBegin=27 CharacterOffsetEnd=31 PartOfSpeech=NN]
[Text=is CharacterOffsetBegin=32 CharacterOffsetEnd=34 PartOfSpeech=VBZ]
[Text=plain CharacterOffsetBegin=35 CharacterOffsetEnd=40 PartOfSpeech=JJ]
[Text=to CharacterOffsetBegin=41 CharacterOffsetEnd=43 PartOfSpeech=TO]
[Text=see CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=VB]
[Text=. CharacterOffsetBegin=47 CharacterOffsetEnd=48 PartOfSpeech=.]
(ROOT
(S
(NP
(NP (DT The) (NN meaning)
(CC and)
(NN purpose))
(PP (IN of)
(NP (NN life))))
(VP (VBZ is)
(ADJP (JJ plain)
(S
(VP (TO to)
(VP (VB see))))))
(. .)))
root(ROOT-0, plain-8)
det(meaning-2, The-1)
nsubj(plain-8, meaning-2)
conj_and(meaning-2, purpose-4)
prep_of(meaning-2, life-6)
cop(plain-8, is-7)
aux(see-10, to-9)
xcomp(plain-8, see-10)
The first part of the output displays the text along with the tokens and POS.
This is followed by a tree-like structure showing the organization of the sentence.
The last part shows relationships between the elements at a grammatical level.
Consider the following example:
prep_of(meaning-2, life-6)
This shows how the preposition, "of", is used to relate the words "meaning" and
"life". This information is useful for many text simplification tasks.
Selecting a model
Many of the tasks that we will examine are based on models. For example, if we need
to split a document into sentences, we need an algorithm to do this. However, even the
best sentence boundary detection techniques have problems doing this correctly every
time. This has resulted in the development of models that examine the elements of text
and then use this information to determine where sentence breaks occur.
The right model can be dependent on the nature of the text being processed. A model
that does well for determining the end of sentences for historical documents might
not work well when applied to medical text.
Many models have been created that we can use for the NLP task at hand. Based on
the problem that needs to be solved, we can make informed decisions as to which
model is the best. In some situations, we might need to train a new model. These
decisions frequently involve trade-offs between accuracy and speed. Understanding
the problem domain and the required quality of results permits us to select the
appropriate model.
Preparing data
An important step in NLP is finding and preparing data for processing. This includes
data for training purposes and the data that needs to be processed. There are several
factors that need to be considered. Here, we will focus on the support Java provides
for working with characters.
We need to consider how characters are represented. Although we will deal
primarily with English text, other languages present unique problems. Not only are
there differences in how a character can be encoded, the order in which text is read
will vary. For example, Japanese orders its text in columns going from right to left.
There are also a number of possible encodings. These include ASCII, Latin, and
Unicode to mention a few. A more complete list is found in the following table.
Unicode, in particular, is a complex and extensive encoding scheme:
Encoding: Description
ASCII: A 7-bit encoding that represents 128 characters.
Latin: There are several Latin variations that use 256 values. They include
various characters with diacritics, such as the umlaut (ä, ö, ü). Various
versions of Latin have been introduced to address various Indo-European
languages, such as Turkish and Esperanto.
Big5: A double-byte encoding used for traditional Chinese characters.
Unicode: There are three common encodings for Unicode: UTF-8, UTF-16, and
UTF-32. These use 1 to 4 bytes, 2 or 4 bytes, and 4 bytes per character,
respectively. Unicode is able to represent the characters of all major
written languages in use today.
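The variable-width nature of these encodings can be observed directly in Java; the class name and sample characters here are our own:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        // A plain ASCII letter needs one byte in UTF-8, while an accented
        // character needs two; UTF-16 uses two bytes for both
        System.out.println("A in UTF-8:  "
            + "A".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        System.out.println("é in UTF-8:  "
            + "é".getBytes(StandardCharsets.UTF_8).length + " byte(s)");
        System.out.println("é in UTF-16: "
            + "é".getBytes(StandardCharsets.UTF_16BE).length + " byte(s)");
    }
}
```

This variability is why byte counts and character counts must not be confused when processing multilingual text.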
Java is capable of handling these encoding schemes. The javac executable's -encoding
command-line option is used to specify the encoding scheme to use. In the following
command line, the Big5 encoding scheme is specified:
javac -encoding Big5
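When reading or writing text at runtime, the encoding should likewise be stated explicitly rather than relying on the platform default. A minimal sketch using the NIO file API, with a sample string of our own:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("nlp", ".txt");
        // Write and read back using an explicit encoding; the platform
        // default charset varies between systems and can corrupt text
        Files.write(file, "naïve café".getBytes(StandardCharsets.UTF_8));
        String text = new String(Files.readAllBytes(file),
            StandardCharsets.UTF_8);
        System.out.println(text);
        Files.delete(file);
    }
}
```

If the bytes were decoded with a mismatched charset, such as ISO-8859-1, the accented characters would be mangled, a common source of dirty input for NLP tasks.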
Character processing is supported using the primitive data type char, the Character
class, and several other classes and interfaces, as summarized in the following table:

Character type: Description
char: Primitive data type holding a single 16-bit UTF-16 code unit.
Character: Wrapper class for char.
CharBuffer: A buffer supporting get/put operations on char sequences.
CharSequence: An interface implemented by String, StringBuffer, StringBuilder, and CharBuffer.
String: An immutable string.
StringBuffer: A mutable, thread-safe sequence of characters.
StringBuilder: A mutable, unsynchronized sequence of characters.
Segment: A segment of text within a character array.
CharacterIterator: An interface for bidirectional iteration over text.
StringCharacterIterator: A CharacterIterator implementation backed by a String.
We also need to consider the file format if we are reading from a file. Often data is
obtained from sources where the words are annotated. For example, if we use a web
page as the source of text, we will find that it is marked up with HTML tags. These
are not necessarily relevant to the analysis process and may need to be removed.
The Multipurpose Internet Mail Extensions (MIME) type is used to characterize
the format used by a file. Common file types are listed in the following table. Either
we need to explicitly remove or alter the markup found in a file or use specialized
software to deal with it. Some of the NLP APIs provide tools to deal with specialized
file formats.
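In the absence of specialized software, a naive first pass at removing HTML markup can be made with a regular expression. This sketch is our own and breaks down on script blocks, comments, and attributes containing '>'; a proper HTML parser should be used for anything beyond trivial pages:

```java
public class MarkupStripper {
    // Naive approach: replace every tag with a space, then collapse the
    // resulting runs of whitespace; fails on scripts, comments, and
    // attribute values that contain '>'
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]+>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>The <b>meaning</b> of life.</p>"));
    }
}
```

This prints "The meaning of life.", leaving only the text content for subsequent NLP processing.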
File format: MIME type (description)
Text: text/plain
Office documents: application/msword (Microsoft Office); application/vnd.oasis.opendocument.text (OpenOffice)
PDF: application/pdf (Adobe Portable Document Format)
HTML: text/html (web pages)
XML: text/xml
Database: not applicable
Many of the NLP APIs assume that the data is clean. When it is not, it needs to be
cleaned lest we get unreliable and misleading results.
Summary
In this chapter, we introduced NLP and its uses. We found that it is used in many
places to solve many different types of problems, ranging from simple searches to
sophisticated classification problems. The Java support for NLP, in terms of core
string support and advanced NLP libraries, was presented. The basic NLP tasks
were explained and illustrated using code. We also examined the process of training,
verifying, and using models.
In this book, we will lay the foundation for using the basic NLP tasks using both
simple and more sophisticated approaches. You may find that some problems
require only simple approaches and when that is the case, knowing how to use the
simple techniques may be more than adequate. In other situations, a more complex
technique may be needed. In either case, you will be prepared to identify what tool is
needed and be able to choose the appropriate technique for the task.
In the next chapter, we will examine the process of tokenization in depth and see
how it can be used to find parts of text.