Apache Solr For Indexing Data - Sample Chapter
Anshul Johri
Anshul Johri has more than 10 years of technical experience in software engineering. He earned his master's degree in computer science from the University of Pune. Anshul has always had a start-up mindset, working on fast-paced development with cutting-edge technologies and doing multiple things at a time. His core strength has always been search technology, in which Solr has played an important role throughout his career. Anshul started using Solr around 9 years ago, and since then, he has never looked back, getting better and better with Solr, whether using it or contributing to the open source search community. He has used Solr extensively across various projects in all of his organizations.
As mentioned earlier, Anshul has always had a start-up mindset. Because of that, he has worked with many start-ups in his career so far, including early-stage as well as mid-size ones, such as Ibibo.com, Asklaila.com, and Bookadda.com. His last company was Amazon, where he spent around 2 years building scalable systems for Amazon Prime (a global product). Anshul recently started his own company in India with another friend from Amazon and founded http://www.rentomo.com/, a unique concept of a peer-to-peer sharing platform in a trusted community. He heads the technology and other core pillars of his own start-up.
Anshul did the technical review of the book Indexing with Solr, published by
Packt Publishing.
Preface
Welcome to Apache Solr for Indexing Data. Solr is a powerful enterprise search engine that offers many ways to index data and give users a better search experience. This book covers the various methods you can use to improve the indexing process, with step-by-step examples.
The book is all about indexing in Solr, and we'll cover all the indexing-related topics in Solr that developers can apply to their own use cases, by following simple examples.
Chapter 6, Indexing Data Using Apache Tika, illustrates the integration of Apache Tika
with Solr for the indexing of documents.
Chapter 7, Apache Nutch, covers the integration of Apache Nutch with Solr for indexing crawled data from the Internet.
Chapter 8, Commits, Real-Time Index Optimizations, and Atomic Updates, shows us how
we can use the real-time indexing features available in Solr and utilize these features
to provide a real-time search experience.
Chapter 9, Advanced Topics: Multilanguage, Deduplication, and Others, covers advanced topics such as indexing multilanguage documents and removing duplicate documents from Solr.
Chapter 10, Distributed Indexing, tells us how we can utilize SolrCloud to provide a
high-availability and fault-tolerant cluster.
Chapter 11, Case Study of Using Solr in E-Commerce, covers a case study by going
through easy-to-use, simple examples that can be used in an e-commerce website.
Getting Started
We will start this chapter with a quick overview of Solr, followed by a section that helps you get Solr up and running. We will also cover some basic building blocks of the Solr architecture, its directory structure, and its configuration files. This chapter covers the following topics:
Running Solr
Multicore Solr
Let's go through the installation process of Solr. This section describes how to install Solr on various operating systems, such as Mac, Windows, and Linux. We will go through each of them one by one.
Running Solr
To test whether your installation was completed successfully, you need to run Solr.
Type these commands in the terminal to run it:
$ cd /usr/local/Cellar/solr/4.4.0/libexec/example/
$ java -jar start.jar
After you run the preceding commands, you will see a lot of log messages in the terminal. Don't worry, this is normal; just fix any errors if they appear. Once the messages stop and there are no error messages, simply go to any web browser and open http://localhost:8983/solr/#/.
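If you prefer the command line, you can also verify that Solr is up by hitting the ping handler with curl. This is just a quick sanity check, assuming the default port and the example configuration (which enables the /admin/ping handler):
$ curl "http://localhost:8983/solr/admin/ping?wt=json"
A healthy instance replies with a response whose status field is OK.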
Downloading the example code
You can download the example code files from your account at
http://www.packtpub.com for all the Packt Publishing books
you have purchased. If you purchased this book elsewhere, you
can visit http://www.packtpub.com/support and register
to have the files e-mailed directly to you.
A fresh Solr installation does not contain any data. In Solr terminology, data is termed as a document. You will learn how to index data in Solr in the upcoming chapters.
3. Unzip the Solr download. You should have the files shown in the following screenshot. Open the example folder.
4. Copy the etc, lib, logs, solr, and webapps folders and start.jar to C:\solr (you will need to create the folder at C:\solr), as shown in the following screenshot:
5. Now open the C:\solr\solr folder and copy its contents back to the root C:\solr folder. When you are done, you can delete the C:\solr\solr folder (the folder selected in the following screenshot):
At this point, your C:\solr directory should look like what is shown in the
following screenshot:
6. Solr can be run at this point if you start it from the command line. Change your directory to C:\solr and then run java -Dsolr.solr.home=C:/solr/ -jar start.jar.
7. If you go to http://localhost:8983/solr/, you should see the
Solr dashboard.
8. Now that Solr is up and running, we can work on getting Jetty to run as a Windows service. Since Jetty comes bundled with Solr, all we need to do is run it as a service. There are several options for doing this, but the one I prefer is the Non-Sucking Service Manager (NSSM), which is among the most compatible service managers across Windows environments. NSSM can be downloaded from http://nssm.cc/download.
9. Once you have downloaded NSSM, open the win32 or win64 folder as
appropriate and copy nssm.exe to your C:\solr folder.
10. Open Command Prompt, change the directory to C:\solr, and then
run nssm install Solr.
11. A dialog will open. Select java.exe, located at C:\Windows\System32\, as the application.
12. In the options input box, enter -Dsolr.solr.home=C:/solr/ -Djetty.home=C:/solr/ -Djetty.logs=C:/solr/logs/ -cp C:/solr/lib/*.jar;C:/solr/start.jar -jar C:/solr/start.jar.
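If you prefer to script the service setup instead of using the dialog, NSSM also accepts the program and its parameters on the command line. The following is a sketch of the equivalent commands, assuming the paths used above and nssm.exe copied to C:\solr:
C:\solr> nssm install Solr "C:\Windows\System32\java.exe" "-Dsolr.solr.home=C:/solr/ -Djetty.home=C:/solr/ -Djetty.logs=C:/solr/logs/ -cp C:/solr/lib/*.jar;C:/solr/start.jar -jar C:/solr/start.jar"
C:\solr> nssm start Solr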
The Solr download comes with example data bundled in it. We can use the same
data for indexing as an example. Go to the exampledocs directory under the example
directory. Here, you will see a lot of files. Now go to the command line (terminal) and
type the following commands:
$ cd $SOLR_HOME/example/exampledocs/
$ ./post.sh vidcard.xml
Now let's try to check our imported data from a web browser. Open http://localhost:8983/solr/select?q=*:*&wt=json to fetch all of the data in your Solr instance, like this:
When you see the preceding data, it means that your Solr server is running properly and is ready to index your desired feed. You will read about indexing in depth in the upcoming chapters.
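Incidentally, the post.sh script is just a thin wrapper around an HTTP POST to Solr's update handler. If you prefer, you can index the same file directly with curl; the following sketch assumes the default example setup on port 8983:
$ cd $SOLR_HOME/example/exampledocs/
$ curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @vidcard.xml
The commit=true parameter makes the newly added documents visible to searches immediately.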
[Diagram: Solr architecture. A Solr core is configured by solrconfig.xml and schema.xml. On the indexing side, document management (add/update/delete) passes through update processors (deduplication, language detection) into the text analysis pipeline: tokenization followed by a chain of token filters (stop words, lowercasing, synonyms, stemming, and so on), configured by the <fields> and <types> sections of schema.xml. On the query side, query processing and caching serve search components (spell checker, faceting, highlighting, more like this, clustering, and so on) for millions of users. A ZooKeeper ensemble handles shard management: add more shards for faster queries and more documents (scale out), and add more replicas for better throughput (queries/sec) and fault tolerance (replication). Soft commits stay in memory for near real-time search; hard commits are flushed to disk.]
Do not worry if you are not able to understand the preceding diagram right now. We will cover every component related to indexing in detail. The purpose of this diagram is to give you a feel for the current architecture of Solr and how it works in the real world. If you look at the preceding diagram carefully, you will find two .xml files, named schema.xml and solrconfig.xml. These are the two most important files in the Solr configuration and are considered the building blocks of Solr.
Listeners
These are some of the important configurations defined in solrconfig.xml. This file
is well commented; I would advise you to go through it from the start and read all
the comments. You will get a very good understanding of the various components
involved in the Solr configuration.
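As a small taste of what solrconfig.xml contains, here is a trimmed-down sketch of a listener definition. It assumes the stock QuerySenderListener, which fires warming queries whenever a new searcher is opened; the query values themselves are hypothetical:
<!-- run a warming query each time a new searcher is opened -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
  </arr>
</listener>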
The second most important configuration file is called schema.xml. This file can be found in the solr/collection1/conf/ directory. As the name suggests, this file is used to define the schema of the data (content) that you want to index and make searchable. Data is called a document in Solr terminology. The schema.xml file contains all the details about the fields that your documents can contain, and how these fields should be dealt with when adding documents to the index or when querying those fields. This file can be divided broadly into two sections:
The fields section (the definitions of the document structure using types)
The types section (the definitions of the field types themselves)
The structure of your document should be defined as a field under the fields section. Let's say you have to define a book as a document in Solr with the fields isbn, title, author, and price. The schema will be as follows:
<field name="isbn" type="string" required="true" indexed="true"
stored="true"/> <field name="title" type="text_general"
indexed="true" stored="true"/>
<field name="author" type="text-general" indexed="true"
stored="true" multiValued="true"/>
<field name="price" type="int" indexed="true" stored="true"/>
In the preceding schema, you see a type attribute, which defines the data type
of the field. You can change the behavior of the field by changing the type. The
multiValued attribute is used to tell Solr that the field can hold multiple values,
while the required attribute makes the field mandatory for creating a document.
After the fields section ends, we need to mention which field is going to be unique.
In our case, it is going to be isbn:
<uniqueKey>isbn</uniqueKey>
The schema.xml file is also a well-commented file. I would again advise you to go through the comments of this file to start with; this will help you understand the various field types and data types in detail.
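For reference, field types such as text_general used in the preceding schema are declared in the types section of schema.xml. The following is a trimmed-down sketch based on the stock example schema; your copy may differ:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split text into tokens, then normalize them -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>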
Cores in Solr are managed through a configuration file called solr.xml. The solr.xml file is present in your Solr home directory. Since its inception, solr.xml has evolved from configuring one core to managing multiple cores and eventually defining parameters for SolrCloud. Do not worry much about SolrCloud if you are not aware of it, as we have a dedicated chapter that covers SolrCloud in detail. In brief, SolrCloud is the terminology used for distributed search and indexing. When we need to index huge amounts of data, we need to think of scalability and performance. This is where SolrCloud comes into the picture.
Starting from Solr 4.3, Solr maintains two distinct formats for solr.xml: one is legacy and the other is discovery mode. The legacy format is supported through the 4.x series and will be removed in the 5.0 release of Solr. The default solr.xml config file looks something like this:
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
</solr>
The preceding configuration shows that the Solr configuration is SolrCloud friendly, but this does not mean that Solr is running in SolrCloud mode unless you start it with some special parameters (explained in Chapter 10, Distributed Indexing). To configure multiple cores in Solr in the legacy format, you need to edit the solr.xml file with the following code snippet and remove the existing discovery code from solr.xml:
<solr persistent="false">
<cores adminPath="/admin/cores" defaultCoreName="core1">
<core name="core1" instanceDir="core1" />
<core name="core2" instanceDir="core2" />
</cores>
</solr>
Now you need to create two cores (new directories, core1 and core2) in the Solr directory. You also need to create Solr configuration files for the new cores. To do this, just copy the same configuration files (the conf directory in collection1) into both cores for now, and restart the Solr server after you have made these changes.
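On Mac or Linux, the copying can be done with a couple of commands. The following is a sketch assuming the default example layout under $SOLR_HOME/example/solr:
$ cd $SOLR_HOME/example/solr
$ mkdir -p core1 core2
$ cp -r collection1/conf core1/
$ cp -r collection1/conf core2/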
Once you restart the Solr server with the preceding configuration, two cores will be created, named core1 and core2, with the existing default Solr configuration settings. The instanceDir attribute defines the directory, relative to solr.xml, in which Solr looks for configuration and data files. You can modify the paths of these cores and their configuration files according to your use case. You can also change the names of the cores.
You can verify your settings by opening the following URL in your browser:
http://localhost:8983/solr/.
You will see two new cores created in the Solr dashboard. Currently, there is
no document in any of the cores because we have not indexed any data so far.
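You can also check the cores over HTTP through the CoreAdmin API instead of the dashboard; the following is a sketch assuming the default port:
$ curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
The response lists every core (core1 and core2 here) along with index statistics such as numDocs.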
So, this concludes the process of creating multiple cores in Solr.
Summary
Thus, by the end of the first chapter, you have learned what Solr is, how to install and run it on various operating systems, what the various components and basic building blocks of Solr are (such as its configuration files and directory structure), and how to set up the configuration files. You have also learned, in brief, about the architecture of Solr. In the last section, we covered multicore setup in the Solr 4.x series. However, the legacy method of multicore setup is going to be removed in the Solr 5.x release, leaving only discovery mode, which is what SolrCloud uses.
In the next chapter, we will look deeply into the various components used in Solr
configuration files, such as tokenizers, analyzers, filters, field types, and so on.