Xplore 1.3 Admin PDF
xPlore
Version 1.3
EMC Corporation
Corporate Headquarters:
Hopkinton, MA 01748-9103
1-508-435-1000
www.EMC.com
Table of Contents
Chapter 1
Introduction to xPlore.....................................................................................13
Features ..........................................................................................................13
Limitations .......................................................................................................14
xPlore compared to FAST .................................................................................17
Architectural overview.......................................................................................19
Physical architecture ........................................................................................19
xPlore disk areas ..........................................................................................20
xPlore instances ...........................................................................................20
xDB libraries and Lucene index......................................................................21
Indexes ........................................................................................................22
Logical architecture ..........................................................................................23
Documentum domains and categories............................................................25
Mapping of domains to xDB ...........................................................................26
How Content Server documents are indexed......................................................27
How Content Server documents are queried ......................................................29
Chapter 2
Chapter 3
Managing Security..........................................................................................51
About security ..................................................................................................51
Changing search results security.......................................................................51
Manually updating security................................................................................52
Changing the administrator password ................................................................53
Configuring the security cache ..........................................................................54
Troubleshooting security ...................................................................................55
Chapter 4
Chapter 5
Chapter 7
Chapter 9
Chapter 11
Facets........................................................................................................... 273
About Facets.................................................................................................. 273
Configuring facets in xPlore ............................................................................ 274
Creating a DFC facet definition........................................................................ 276
Facet datatypes ............................................................................................. 276
Creating a DFS facet definition........................................................................ 278
Defining a facet handler .................................................................................. 281
Chapter 13
Chapter 14
Chapter 15
Appendix B
Appendix C
Preface
Intended Audience
This guide contains information for xPlore administrators who configure xPlore and Java developers
who customize xPlore:
Configuration is defined for support purposes as changing an XML file or an administration
setting in the UI.
Customization is defined for support purposes as using xPlore APIs to customize indexing and
search. The xPlore SDK is a separate download that supports customization.
You must be familiar with the installation guide, which describes the initial configuration of the
xPlore environment. When Documentum functionality is discussed, this guide assumes familiarity
with EMC Documentum Content Server administration.
Revision history
The following changes have been made to this document.
Revision Date
Description
November 2012
Initial publication
Additional documentation
This guide provides overview, administration, and development information. For information on
installation, supported environments, and known issues, see:
EMC Documentum xPlore Release Notes
EMC Documentum xPlore Installation Guide
EMC Documentum Environment and System Requirements Guide
For additional information on Content Server installation and Documentum search client
applications, see:
EMC Documentum Content Server Installation Guide
EMC Documentum Search Development Guide
Chapter 1
Introduction to xPlore
This chapter contains the following topics:
Features
Limitations
Architectural overview
Physical architecture
Logical architecture
Features
Documentum xPlore is a multi-instance, scalable, high-performance, full-text index server that can be
configured for high availability and disaster recovery.
The xPlore architecture is designed with the following principles:
Uses standards as much as possible, like XQuery
Uses open source tools and libraries, like Lucene
Supports enterprise readiness: High availability, backup and restore, analytics, performance tuning,
reports, diagnostics and troubleshooting, administration GUI, and configuration and customization
points.
Supports virtualization, with accompanying lower total cost of ownership.
Indexing features
Collection topography: xPlore supports creating collections online, and collections can span multiple
file systems.
Transactional updates and purges: xPlore supports transactional updates and purges of indexes as well
as transactional commit notification to the caller.
Multithreaded insertion into indexes: xPlore ingestion through multiple threads supports vertical
scaling on the same host.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
Dynamic allocation and deallocation of capacity: For periods of high ingestion, you can add a CPS
instance and new collection. Add content to this collection, then move the collection to another
instance for better search performance. You can then decommission the CPS instance.
Temporary high query load: For high query load, like a legal investigation, add an xPlore instance for
the search service and bind collections to it in read-only mode.
Growing ingestion or query load: If your ingestion or query load increases due to growing business,
you can add instances as needed.
Extensible indexing pipeline using the open-source UIMA framework.
Configurable stop words and special characters.
Search features
Case sensitivity: xPlore queries are lower-cased (rendered case-insensitive).
Full-text queries: To query metadata, set up a specific index on the metadata.
Faceted search: Facets in xPlore are computed over the entire result set or over a configurable number
of results.
Security evaluation: When a user performs a search, permissions are evaluated for each result. Security
can be evaluated in the xPlore full-text engine before results are returned to Content Server, resulting
in faster query results. This feature is turned on by default and can be configured or turned off.
Native XQuery syntax: The xPlore full-text engine supports XQuery syntax.
Thesaurus search to expand query terms.
Fuzzy search finds misspelled words or letter reversals.
Boost specific metadata in search results.
Extensive testing and validation of search on supported languages.
Administration features
Multiple instance configuration and management.
Reports on ingestion metrics and errors, search performance and errors, and user activity.
Collections management: Creating, configuring, deleting, binding, routing, rebuilding, querying.
Command-line interface for automating backup and restore.
Limitations
ACLs and aspects are not searchable by default
For security reasons, ACLs and aspects are not searchable by default. You can reverse the default
by editing indexserverconfig.xml: set full-text-search to true in the sub-path definitions for
acl_name and r_aspect_name, and then reindex your content.
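As an illustrative sketch (the attribute set on sub-path elements varies; verify the names against your own indexserverconfig.xml), the changed definitions might look like:

```xml
<!-- Hypothetical excerpt from indexserverconfig.xml. Setting full-text-search
     to "true" makes acl_name and r_aspect_name searchable; the other
     attributes shown are illustrative placeholders. -->
<sub-path path="dmftmetadata//acl_name" type="string"
          full-text-search="true" value-comparison="true"/>
<sub-path path="dmftmetadata//r_aspect_name" type="string"
          full-text-search="true" value-comparison="true"/>
```

After editing, reindex the affected content so the new settings take effect.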
Batch failure
Indexing requests are processed in batches. When one request in a batch fails when the index is
written to xDB, the entire batch fails.
Workaround using xPlore administrator: Select an instance and click Configuration. Change the
following to smaller values:
Batch size: Decrease the batch size to reduce the number of documents in a failed batch. (All
documents in a batch fail to be indexed if one document fails.)
Max text threshold
Thread pool size
An adopted collection is a collection that has been moved to a parent collection. The adopted collection
becomes a subcollection. This kind of collection is created to boost ingestion rate, and later adopted
for better search performance.
Lemmatization
xPlore supports lemmatization, but you cannot configure the parts of speech that are lemmatized.
The part of speech for a word can be misidentified when there is not enough context. Workaround:
Enable alternative lemmatization if you have disabled it (see Configuring indexing lemmatization,
page 105).
Punctuation at the end of the sentence is included in the lemmatization of the last word. For example,
a phrase Mary likes swimming and dancing is lemmatized differently depending on whether there is a
period at the end. Without the period, dancing is identified as a verb with the lemma dance. With the
period, it is identified as a noun with the lemma dancing. A search for the verb dance does not find
the document when the word is at the end of the sentence. The likelihood of errors in Part-Of-Speech
(POS) tagging increases with sentence length. Workaround: Enable alternate lemmatization.
Phrase searches
The content of a phrase search is not lemmatized.
Search fails for parts of common phrases. A common phrase like because of, a good many, or
status quo is tokenized as a phrase and not as individual words. A search for an individual word
in such a phrase, like because, fails.
Documentum client applications, such as Webtop and DCO, materialize emails so they are fully
searchable. SourceOne does not, so emails in SourceOne are sometimes not fully searchable.
If a query matches a lightweight sysobject (LWSO) that has not been materialized, no result is
returned: DQL filters out unmaterialized LWSOs, but XQuery does not.
Chinese
Space in query causes incorrect tokenization
A space within a Chinese term is treated in DQL as white space, so the term is split at the space
and treated as two AND-ed terms. A search for the original string fails.
Dictionary must be customized for Chinese name and place search
For Chinese documents, the names of persons and places cannot be searched. To be found, they must
be added to the Chinese dictionary in xPlore. See Adding dictionaries to CPS, page 120.
Administration differences
xPlore has an administration console. FAST does not. Many features in xPlore are configurable
through xPlore administrator. These features were not configurable for FAST. Additionally,
administrative tasks are exposed through Java APIs.
Ports required: During xPlore instance configuration, the installer prompts for the HTTP port for the
JBoss instance (base port). The installer validates that the next 100 consecutive ports are available.
During index agent configuration, the installer prompts for the HTTP port for the index agent JBoss
instance and validates that the next 20 consecutive ports are available. FAST used 4000 ports.
High availability: xPlore supports N+1, active/passive with clusters, and active/active shared data
configurations. FAST supports only active/active. xPlore supports spare indexing instances that are
activated when another instance fails. The EMC Documentum xPlore Installation Guide describes
high availability options for xPlore.
Disaster recovery: xPlore supports online backup, including full and incremental. FAST supports
only offline (cold) backup.
Storage technology: xPlore supports SAN and NAS. FAST supports SAN only.
Virtualization: xPlore runs in VMware environments. FAST does not.
64-bit address space: 64-bit systems are supported in xPlore but not in FAST.
xPlore requires less temporary disk space than FAST. In addition to the index itself, xPlore requires
temporary space equal to twice the index space used by all collections. This space is used for merges
and optimizations. FAST requires 3.5 times the space.
Indexing differences
Back up and restore: xPlore supports warm backups.
High availability: xPlore automatically restarts content processing after a CPS crash. After a VM
crash, the xPlore watchdog sends an email notification.
Transactional updates and purges: xPlore supports transactional updates and purges as well as
transactional commit notification to the caller. FAST does not.
Collection topography: xPlore supports creating collections online, and collections can span multiple
file systems. FAST does not support these features.
Lemmatization: FAST supports configuration for which parts of speech are lemmatized. In xPlore,
lemmatization is enabled or disabled. You can configure lemmatization for specific Documentum
attribute values.
Search differences
One-box search: Searches from the Webtop client default to ANDed query terms in xPlore.
Query a specific collection: Targeted queries are supported in xPlore but not FAST.
Folder descend: Queries are optimized in xPlore but not in FAST.
Results ranking: FAST and xPlore use different ranking algorithms.
Excluding from index: xPlore allows you to configure non-indexed metadata to save disk space
and improve ingestion and search performance. With this configuration, the number of hits differs
between FAST and xPlore queries on the non-indexed content. For example, if xPlore does not index
docbase_id, a full-text search on "256" returns no hits in xPlore, whereas FAST returns all indexed
documents for the repository whose ID is 256.
Security evaluation: Security is evaluated by default in the xPlore full-text engine before results are
returned to Content Server, resulting in faster query results. FAST returns results to the Content Server,
resulting in many hits that the user is not able to view.
Underprivileged user queries: Optimized in xPlore but not in FAST.
Native XQuery syntax: Supported by xPlore.
Facets: Facets are limited to 350 hits in FAST, but xPlore supports many more hits.
XML attributes: Attribute values on XML elements are part of the FAST index. xPlore does
not index XML attribute values.
Special characters: Special character lists are configurable. The default in xPlore differs from FAST
when terms such as email addresses or contractions are tokenized. For example, in FAST, an email
address is split up into separate tokens with the period and @ as boundaries. However, in xPlore,
only the @ serves as the boundary, since the period is considered a context character for part of
speech identification.
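The non-indexed metadata configuration mentioned under "Excluding from index" above is typically expressed as a sub-path definition in indexserverconfig.xml. The following is a hedged sketch; the path and attribute names are assumptions to be checked against your file:

```xml
<!-- Hypothetical sketch: exclude docbase_id from full-text search and
     value comparison so it is not indexed. Path and attribute names
     are illustrative. -->
<sub-path path="dmftinternal/docbase_id"
          full-text-search="false" value-comparison="false"/>
```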
Architectural overview
xPlore provides query and indexing services that can be integrated into external content sources such
as the Documentum content management system. External content source clients like Webtop or
CenterStage, or custom Documentum DFC clients, can send indexing requests to xPlore.
Each document source is configured as a domain in xPlore. You can set up domains using xPlore
administrator. For Documentum environments, the Documentum index agent creates a domain for
each repository and a default collection within that domain.
Documents are provided in an XML representation to xPlore for indexing through the indexing APIs.
In a Documentum environment, the Documentum index agent prepares an XML representation of each
document. The document is assigned to a category, and each category corresponds to one or more
collections as defined in xPlore. To support faceted search in Documentum repositories, you can define
a special type of index called an implicit composite index.
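The XML representation submitted for indexing has roughly the following shape. This is a simplified, hypothetical sketch: only the dmftdoc and dmftmetadata element names appear elsewhere in this guide; the remaining element names and values are illustrative.

```xml
<!-- Simplified, hypothetical dftxml sketch. Real documents produced by the
     Documentum index agent contain many more elements and attributes. -->
<dmftdoc>
  <dmftmetadata>
    <object_name>Quarterly Report</object_name>
    <r_object_type>dm_document</r_object_type>
  </dmftmetadata>
  <dmftcontents>
    <dmftcontent>Extracted indexable text of the document goes here.</dmftcontent>
  </dmftcontents>
</dmftdoc>
```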
xPlore instances are web application instances that reside on application servers. When an xPlore
instance receives an indexing request, it uses the document category to determine what is tokenized
and saved to the index. A local or remote instance of the content processing service (CPS) fetches the
content. CPS detects the primary language and format of a document. CPS then extracts indexable
content from the request stream and parses it into tokens. The tokens are used for building a full-text
index.
xPlore manages the full-text index. An external Apache Lucene full-text engine is embedded into
the EMC XML database (xDB). xDB tracks indexing and updates requests, recording the status of
requests and the location of indexed content. xDB provides transactional updates to the Lucene
index. Indexes are still searchable during updates.
When an instance receives a query request, the request is processed on all included collections, then
the assembled query results are returned.
xPlore provides a web-based administration console.
Physical architecture
The xPlore index service and search service are deployed as a WAR file to a JBoss application server
that is included in the xPlore installer. xPlore administrator and online help are installed as war files
in the same JBoss application server. The index is stored in the storage location that was selected
during configuration of xPlore.
xPlore disk areas

Area: xplore_home/config/log
Description: Stores transaction information.
Use in indexing: Non-committed data is stored to the log.
Use in search: Provides snapshot information during some retrievals.

Area: xplore_home/data
Description: Holds content; temporarily holds content during the indexing process.
Use in indexing:
1. (CPS) Intermediate processing
2. (CPS) Exports to the index service
3. Index: Updates to the Lucene index (non-transactional)
Use in search: None.
xPlore instances
An xPlore instance is a web application instance (WAR file) that resides on an application server. You
can have multiple instances on the same host (vertical scaling), although it is more common to have
one xPlore instance per host (horizontal scaling). You create an instance by running the xPlore installer.
The first instance that you install is the primary instance. You can add secondary instances after
you have installed the primary instance. The primary instance must be running when you install
a secondary instance.
Adding or deleting an instance
To add an instance to the xPlore system, run the xPlore configurator script. If an xPlore instance exists
on the same host, select a different port for the new instance, because the default port is already in use.
To delete an instance from the xPlore system, use the xPlore configurator script. Shut down the
instance before you delete it.
You manage instances in xPlore administrator. Click Instances in the left panel to see a list of
instances in the right content pane. You see the following instance information:
OS information: Host name, status, OS, and architecture.
JVM information: JVM version, active thread count, and number of classes loaded.
xPlore information: xDB version, instance version, instance type, and state.
An instance can have one or more of the following features enabled:
Content processing service (CPS)
Indexing service
Search service
xPlore Administrator (includes analytics, instance, and data management services)
Spare: A spare instance can be manually activated to take over for a disabled or stopped instance.
See Replacing a failed instance with a spare, page 34.
You manage an instance by selecting the instance in the left panel. Collections that are bound to the
instance are listed on the right. Click a collection to go to the Data Management view of the collection.
The application server instance name for each xPlore instance is recorded in indexserverconfig.xml. If
you change the name of the JBoss instance, change the value of the attribute appserver-instance-name
on the node element for that instance. This attribute is used for registering and unregistering instances.
Back up the xPlore federation after you change this file.
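For example, the instance entry in indexserverconfig.xml might resemble the following sketch. Only the appserver-instance-name and url attributes are described in this guide; the element layout and example values are assumptions:

```xml
<!-- Hypothetical sketch of an instance (node) element in indexserverconfig.xml.
     Update appserver-instance-name if you rename the JBoss instance; the
     name and url values here are illustrative. -->
<node name="node1" appserver-instance-name="PrimaryDsearch"
      url="http://host1:9300/dsearch"/>
```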
When xPlore processes an XML representation of an input document and supplies tokens to xDB,
xDB stores them into a Lucene index. Optionally, xPlore can be configured to store the content
along with the tokens. A tracking database in xDB manages deletes and updates to the index. When
documents are updated or deleted, changes to the index are propagated. When xPlore supplies XQuery
expressions to xDB, xDB passes them to the Lucene index. To query the correct index, xDB tracks
the location of documents.
xDB manages parallel dispatching of queries to more than one Lucene index when parallel queries
are enabled. For example, if you have set up multiple collections on different storage locations, you
can query each collection in parallel.
Figure 1
An xDB library is stored on a data store. If you install more than one instance of xPlore, the storage
locations must be accessible by all instances. The xDB data stores and indexes can reside on a separate
data store, SAN or NAS. The locations are configurable in xPlore administrator. If you do not have
heavy performance requirements, xDB and the indexes can reside on the same data store.
Indexes
xDB has several possible index structures that are queried using XQuery. The Lucene index is
modeled as a multi-path index (a type of composite index) in xDB. The Lucene index services both
value-based and full-text probes of the index.
Covering indexes are also supported. When the query needs values, they are pulled from the index and
not from the data pages. Covering indexes are used for security evaluation and facet computation.
You can configure none, one, or multiple indexes on a collection. An explicit index is based on values
of XML elements, paths within the XML document, path-value combination, or full-text content.
For example, the following is a value-indexed field:
/dmftdoc[dmftmetadata//object_name="foo"]
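A query that probes this index could be written in XQuery along these lines. This is a sketch: the collection path and the exact query form are illustrative assumptions, not verbatim xPlore syntax:

```xquery
(: Hypothetical XQuery sketch: a value probe that the path-value index
   above can service. The collection path is illustrative. :)
for $doc in collection("MyRepo/dsearch/Data")/dmftdoc
where $doc/dmftmetadata//object_name = "foo"
return $doc
```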
Indexes are defined and configured in indexserverconfig.xml. For information on viewing and updating
this file, see Modifying indexserverconfig.xml, page 43.
Logical architecture
A domain contains indexes for one or more categories of documents. A category is logically
represented as one or more collections. Each collection contains indexes on the content and metadata.
When a document is indexed, it is assigned to a category or class of documents and indexed into one
of the category collections.
Documentum domains and categories, page 25
Mapping of domains to xDB, page 26
Domains
A domain is a separate, independent, logical grouping of collections within an xPlore deployment. For
example, a domain could contain the indexed contents of a single Documentum content repository.
Domains are defined in xPlore administrator in the data management screen. A domain can have
multiple collections in addition to the default collection.
The Documentum index agent creates a domain for the repository to which it connects. This domain
receives indexing requests from the Documentum index agent.
Categories
A category defines how a class of documents is indexed. All documents submitted for ingestion
must be in XML format. (For example, the Documentum index agent prepares an XML version for
Documentum repository indexing.) The category is defined in indexserverconfig.xml and managed
by xPlore. A category definition specifies the processing and semantics that are applied to an ingested
XML document. You can specify the XML elements that are used for language identification. You
can specify the elements that have compression, text extraction, tokenization, and storage of tokens.
You also specify the indexes that are defined on the category and the XML elements that are not
indexed. A collection belongs to one category.
Collections
A collection is a logical group of XML documents that is physically stored in an xDB detachable
library. A collection represents the most granular data management unit within xPlore. All documents
submitted for indexing are assigned to a collection. A collection generally contains one category of
documents. In a basic deployment, all documents in a domain are assigned to a single default collection.
You can create subcollections under each collection and route documents to user-defined collections.
A collection is bound to a specific instance in read-write state (index and search, index only, or update
and search). A collection can be bound to multiple instances in read-only state (search-only). Three
collections (two hot and one cold) with their corresponding instances are shown.
Figure 3
Example
A document is submitted for indexing. The client indexing application, for example, Documentum
index agent, has not specified the target collection for the document. If the document exists, the index
service updates the document. If it is a new document, the document is assigned to an instance in
round-robin order. If that instance has more than one collection, collection routing is applied. If
collection routing is not supplied by a client routing class or the Documentum index agent, the
document is assigned to a collection in round-robin order.
Documentum categories
A document category defines the characteristics of XML documents that belong to that category and
their processing. All documents are sent to a specific index based on the document category. For
example, xPlore pre-defines a category called dftxml that defines the indexes. All Documentum
indexable content and metadata are sent to this category.
The following Documentum categories are defined within the domain element in indexserverconfig.xml.
For information on viewing and updating this file, see Modifying indexserverconfig.xml, page 43.
dftxml: XML representation of object metadata and content for full text indexing. To view the
dftxml representation using xPlore administrator, click the document in the collection view.
acl: ACLs that are defined in the repository are indexed so that security can be evaluated in the full-text
engine. See About security, page 51 for more information.
group: Groups defined in the repository are indexed to evaluate security in the full-text engine.
One content source (Documentum repository A) is mapped to a domain library. The library is
stored in a defined storage area on either instance.
A second repository, Repository B, has its own domain.
All xPlore domains share the system metrics and audit databases (SystemData library in xDB
with libraries MetricsDB and AuditDB). The metrics and audit databases have a subcollection for
each xPlore instance.
The ApplicationInfo library contains Documentum ACL and group collections for a specific
domain (repository).
The SystemInfo library has two subcollections: TrackingDB and StatusDB. Each collection in
TrackingDB matches a collection in Data and is bound to the same instance as that data collection.
There is a subcollection in StatusDB for each xPlore instance. The instance-specific subcollection
has a file status.xml that contains processing information for objects processed by the instance.
The Data collection has a default subcollection.
queue item and applies index agent filters. After an index request is submitted to xPlore, the client
application can move on to the next task. (Indexing is asynchronous.)
3. The index agent retrieves the object associated with the queue item from the repository. The content
is retrieved or staged to a temporary area. The agent then creates a dftxml (XML) representation
of the object that can be used for full-text and metadata indexing.
4. The Index Agent sends the dftxml representation of the content and metadata to the xPlore Server.
5. The xPlore indexing service calls CPS to perform text extraction, language identification, and
transformation of metadata and content into indexable tokens.
6. The xPlore indexing service performs the following steps:
Routes documents to their target collections.
Merges the extracted content into the dftxml representation of the document.
Calls xDB to store the dftxml in xDB.
Returns the tokens from CPS to xDB.
Stores the document location (collection) and document ID in the TrackingDB.
Saves indexing metrics in the MetricsDB.
Tracks document indexing status in the StatusDB.
7. The indexing service notifies the index agent of the indexing status. If indexing succeeded, the
index agent removes the queue item from the Content Server; otherwise, the queue item is left
behind with the error status and error message.
The object is now searchable. (The index service does not provide any indication that an object is
searchable.) For information on how to troubleshoot latency between index agent submission and
searchability, see Troubleshooting indexing, page 146.
Reindexing
The index agent does not recreate all the queue items for reindexing. Instead, it creates a watermark
queue item (type dm_ftwatermark) to indicate the progress of reindexing. It picks up all the objects for
indexing in batches by running a query. The index agent updates the watermark as it completes each
batch. When the reindexing is completed, the watermark queue item is updated to done status.
You can submit for reindexing one or all documents that failed indexing. In Documentum
Administrator, open Indexing Management > Index Queue. Choose Tools > Resubmit all failed
queue items, or select a queue item and choose Tools > Resubmit queue item.
Chapter 2
Managing the System
This chapter contains the following topics:
Managing instances
Configuring an instance
Modifying indexserverconfig.xml
Customizations in indexserverconfig.xml
Administration APIs
host: DNS name of the computer on which the xPlore primary instance is installed.
port: xPlore primary instance port (default: 9300).
Log in as the Administrator with the password that you entered when you installed the primary
instance. The xPlore administrator home page displays a navigation tree in the left pane and links
to the four management areas in the content pane.
2. Click System Overview in the left tree to get the status of each xPlore instance. Click Global
Configuration to configure system-wide settings.
If you are unable to open the xPlore administrator URL and see the error message Not configured,
or the index agent reports the error message Not responding, a firewall is likely preventing access.
For information on changing the administrator password, see Changing the administrator password,
page 53.
If you did not stop secondary instances, they report a failed connection to the primary instance when
you restart it.
Managing instances
Auditing. See Auditing queries, page 244, Troubleshooting data management, page 167, and
Configuring the audit record, page 39.
Configuring an instance
You can configure the indexing service, search service, or content processing service for a secondary
instance. Select the instance in xPlore administrator and then click Stop Instance.
Requirements
All instances in an xPlore deployment must have their host clocks synchronized to the primary
xPlore instance host.
b. Use the object ID returned by the previous step to get the parameters and values and their
index positions. For example:
?,c,select param_name, param_value from dm_ftengine_config where
r_object_id='080a0d6880000d0d'
c. Enter your new port at the SET command line. If the port was returned as the second parameter,
set the index to 2 as shown in the following example:
set,c,l,param_value[2]
SET>new_port
save,c,l
d. Enter your new host name at the SET command line. For example:
retrieve,c,dm_ftengine_config
set,c,l,param_value[3]
SET>new_hostname
save,c,l
For information on changing a failed instance to spare, see Changing a failed instance into a spare,
page 37.
d. Edit indexserver-bootstrap.properties in all other xPlore instances to reference the new primary
instance.
7. Edit xdb.properties in the directory WEB-INF/classes of the new primary instance.
a. Find the XHIVE_BOOTSTRAP entry and edit the URL to reflect the new primary instance
host name and port. (This bootstrap file is not the same as the indexserver bootstrap file.)
b. Change the host name to match your new primary instance host.
c. Change the port to match the port for the value of the attribute xdb-listener-port on the new
instance.
For example:
XHIVE_BOOTSTRAP=xhive://NewHost:9330
d. Edit xDB.properties in all other xPlore instances to reference the new primary instance.
8. Update xdb.bat in xplore_home/dsearch/xhive/admin. Your new values must match the values in
indexserverconfig.xml for the new primary instance.
Change the path for XHIVE_HOME to the path to the new primary instance web application.
Change ESS_HOST to the new host name.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
Change ESS_PORT to match the value of the port in the url attribute of the new primary
instance (in indexserverconfig.xml).
9. Start the xPlore primary instance, then start the secondary instances.
10. Update the index agent.
a. Shut down the index agent instance and modify indexagent.xml in
xplore_home/jboss5.1.0/server/DctmServer_Indexagent/deploy/IndexAgent.war/WEB-INF/classes.
b. Change parameter values for parameters that are defined in the element
indexer_plugin_config/generic_indexer/parameter_list/parameter.
Change the parameter_value of the parameter dsearch_qrserver_host to the new host name.
Change the parameter_value of the parameter dsearch_qrserver_port to the new port.
11. Update dm_ftengine_config on the Content Server. Use iAPI to change the parameters for the
host name and port in the dm_ftengine_config object. This change takes effect when you restart
the repository.
a. To find the port and host parameter index values for the next step, run the following iAPI
command:
retrieve,c,dm_ftengine_config
b. Use the object ID to get the parameters and values and their index positions. For example:
?,c,select param_name, param_value from dm_ftengine_config where
r_object_id=080a0d6880000d0d
c. To set the port, enter your new port at the SET command line. If the port was returned as the
third parameter in step 3, substitute 3 for the parameter index. For example:
retrieve,c,dm_ftengine_config
set,c,l,param_value[3]
SET>new_port
save,c,l
d. To set the host name, enter your new host name at the SET command line:
retrieve,c,dm_ftengine_config
set,c,l,param_value[4]
SET>new_hostname
save,c,l
recurrence timeunit and frequency: Specifies how often the task is executed. For example, the
disk space task with a frequency of 2 and time unit of hours checks disk space every two hours.
Default: Every minute.
start-date: date and time the task should be invoked, in UTC format. If the date is in the past,
the task will be executed as soon as possible.
expiry-date: Specifies the date and time a task stops executing, in UTC format.
max-response-timeout: Specifies how long to wait between detection of a hung task and execution of
the notification (or other task). For example, a wait-time value of 6 and a time unit of hours
indicates a wait of 6 hours before notification about a non-responding instance.
max-retry-threshold: Specifies the maximum number of times the task can be retried. For example,
if the task is notification, a value of 10 indicates the notification task is retried 10 times. Recurring
tasks are retried at the next scheduled invocation time.
max-iterations: Maximum number of times to attempt to ping an instance that has no response.
Default: -1 (no limit)
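Taken together, a task entry using these properties might be sketched as follows. This fragment is hypothetical: the element names and attribute placement are illustrative, and you should verify them against the actual schema of dsearch-watchdog-config.xml.

```xml
<!-- Hypothetical sketch; verify element and attribute names against your file -->
<task name="disk-space-check">
  <recurrence timeunit="hours" frequency="2"/>
  <start-date>2014-01-01T00:00:00Z</start-date>
  <expiry-date>2015-01-01T00:00:00Z</expiry-date>
  <max-response-timeout timeunit="hours" wait-time="6"/>
  <max-retry-threshold>10</max-retry-threshold>
  <max-iterations>-1</max-iterations>
</task>
```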
You can also configure the timing properties for the index agent. If you change
the installation owner password, modify the property docbase_password in
dsearch-watchdog-config.xml with the new encrypted password. To encrypt the password, run
xplore_home/watchdog/tools/encrypt-password.bat|sh.
When the data in a remote shared environment is referenced by a local symbolic link in Windows, the
watchdog cannot monitor the disk space of the remote environment.
For information on viewing and updating this file, see Modifying indexserverconfig.xml, page 43.
2. Conserve disk space on the primary host: Purge the status database when the xPlore
primary instance starts up. Set the value of the purge-statusdb-on-startup attribute on the
index-server-configuration element to true.
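For example, the attribute can be set directly on the existing element (only the relevant attribute is shown; keep the other attributes already present in your file):

```xml
<index-server-configuration purge-statusdb-on-startup="true">
  <!-- existing attributes and child elements unchanged -->
</index-server-configuration>
```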
audit-save-batch-size: Specifies how many records are batched before a save. Default: 100.
lifespan-in-days: Specifies the number of days before an audit record is purged. Default: 30
preferred-purge-time: Specifies the time of day at which the audit record is purged. Format:
hours:minutes:seconds in 24-hour time. Default: midnight (00:00:00)
audit-file-size-limit: Specifies the maximum size of the audit file. Units: K | M | G | T (KB, MB, GB, TB).
audit-file-rotate-frequency: Period in hours that a file serves as the active storage for the audit
records.
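As an illustration, the audit settings above might appear together as follows. The element name and attribute grouping here are assumptions; verify them against your indexserverconfig.xml.

```xml
<!-- Hypothetical grouping of the audit attributes described above -->
<audit-record audit-save-batch-size="100" lifespan-in-days="30"
    preferred-purge-time="00:00:00" audit-file-size-limit="100M"
    audit-file-rotate-frequency="24"/>
```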
Connection refused
Indexing fails when one of the xPlore instances is down. The error in dsearch.log is similar to the following:
CONNECTION_FAILED: Connect to server at 10.32.112.235:9330 failed,
Original message:
Connection refused
1. Update indexserverconfig.xml with the new value of the URL attribute on the node element. For
information on viewing and updating this file, see Modifying indexserverconfig.xml, page 43.
2. Change the JBoss startup (script or service) so that it starts correctly. If you run a stop
script, run it as the same administrator user who started the instance.
The dm_fulltext_index object attribute is_standby must be set to false (0). Substitute your object ID:
retrieve,c,dm_fulltext_index
3b0012a780000100
?,c,select is_standby from dm_fulltext_index where r_object_id=3b0012a780000100
0
Multiple instances of xPlore: Storage areas must be accessible from all other instances. If not, you
see an I/O error when you try to create a collection. Use the following cleanup procedure:
1. Edit indexserverconfig.xml to remove binding elements from the collection that has the issue.
For information on viewing and updating this file, see Modifying indexserverconfig.xml, page 43.
2. Restart xPlore instances.
I/O error indexing a large collection: Switch to the 64-bit version of xPlore and use 4+ GB of
memory when a single collection has more than 5 million documents.
I/O error during index merge: Documents are added to small Lucene indexes within a single
collection. These indexes are merged into a larger final index to help query response time. The final
merge stage can require large amounts of memory. If memory is insufficient, the merge process
fails and corrupts the index. Switch to the 64-bit version of xPlore and allocate 4 GB of memory
or more to the JVM.
com.xhive.error.XhiveException: IO_ERROR:
Failure while merging external indexes, Original message:
Insufficient system resources exist to complete the requested service
To fix a corrupted index, see Repairing a corrupted index, page 175. To delete a corrupted domain, see
Delete a corrupted domain, page 157.
Error on startup
Non-ASCII characters in indexserverconfig.xml can cause startup to fail.
If you edit indexserverconfig.xml using a simple text editor like Notepad, non-ASCII characters are
saved in the native (OS) encoding. For example, Windows uses ISO-8859-1. xPlore uses UTF-8
encoding, and the mismatch results in unexpected text errors.
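The mismatch can be reproduced in a few lines of Python. This is a general illustration of the encoding problem, not an xPlore API:

```python
# A non-ASCII character written by a Latin-1 editor is not valid UTF-8.
text = "café"                      # contains the non-ASCII character "é"
saved = text.encode("iso-8859-1")  # bytes as a Western-locale editor writes them
try:
    saved.decode("utf-8")          # xPlore reads the file as UTF-8
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "invalid UTF-8 byte sequence"
print(outcome)  # invalid UTF-8 byte sequence
```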
Use an XML editor to edit the file, and validate your changes using the xplore.bat (Windows) or
xplore.sh (Linux) script in xplore_home/dsearch/xhive/admin. Restart the xPlore instances.
If the indexing is very slow, you can modify the timeout for a request.
The index-request-time-out property specifies in milliseconds how much time is allowed before an
index request times out. The default value is 3600000 (1 hour).
Normally, you do not need to add this parameter to indexserverconfig.xml; do this only in one of the
following situations:
You want to import a very large thesaurus whose estimated processing time may exceed one hour. In
this case, set the value to a time sufficient for the thesaurus import to complete successfully.
You set the Index Agent timeout setting to longer than one hour. In this case, set
index-request-time-out to the same value as the Index Agent timeout setting for the latter to take
effect; otherwise, index-request-time-out overrides the Index Agent setting.
To set the index-request-time-out parameter, add it to indexserverconfig.xml under
index-config/properties element and specify its value.
Remove this parameter from indexserverconfig.xml to use its default value. Here is an example
of a two-hour setting:
<property name="index-request-time-out" value="7200000"/>
Modifying indexserverconfig.xml
Some tasks are not available in xPlore administrator. These rarely needed tasks require manual editing
of indexserverconfig.xml. This file is located in xplore_home/config on the primary instance. It is
loaded into xPlore memory during the bootstrap process, and it is maintained in parallel as a versioned
file in xDB. All changes to the file are saved into the xDB file at xPlore startup.
On Windows 2008, you cannot save the file under its original name in place, and the file extension
is not shown. By default, when you save the file, it is given a .txt extension. Be sure to replace
indexserverconfig.xml with a file of the same name and .xml extension.
Note: Do not edit this file in xDB, because the changes are not synchronized with xPlore.
For example:
xplore validateConfigFile "C:/xPlore/config/indexserverconfig.xml"
Customizations in indexserverconfig.xml
Define and configure indexes for facets.
Add and configure categories: Specify the XML elements that have text extraction, tokenization,
and storage of tokens. Specify the indexes that are defined on the category and the XML elements
that are not indexed. Change the collection for a category.
Configure system, indexing, and search metrics.
Conserve disk space by purging the status database on startup.
Specify a custom routing-class for user-defined domains.
Change the xDB listener port and admin RMI port.
Turn off lemmatization.
Lemmatize specific categories or element content.
Configure indexing depth (leaf node).
Change the xPlore host name and URL.
Boost metadata and freshness in results scores.
Add or change special characters for CPS processing.
Trace specific classes. See Tracing, page 304.
Set the security filter batch size and the user and group cache size.
The following actions are configured in indexserverconfig.xml, in xDB, or in both:
Define/change a category of documents (see Configuring categories, page 158)
Turn off xPlore native security (see Changing search results security, page 51).
Make types and attributes searchable (Making types and attributes searchable, page 223).
Turn off XQuery generation to support certain DQL operations (DQL, DFC, and DFS queries,
page 224).
Configure search for word fragments and wildcards (Configuring wildcards and fragment search,
page 218).
Route a query to a specific collection (Routing a query to a specific collection, page 257).
Turn on tracing for the Documentum query plugin (see Tracing Documentum queries, page 227).
Customize facets and queries (see About Facets, page 273).
Administration APIs
The xPlore Admin API supports all xPlore administrative functions. The Admin API provides you
with full control of xPlore and its components.
Note: Administration APIs are not supported in this release. The information is provided for planning
purposes.
Each API is described in the javadocs. Index service APIs are available in the interface IFtAdminIndex
in the package com.emc.documentum.core.fulltext.client.admin.api.interfaces. This package is in
the SDK jar file dsearchadmin-api.jar.
System administration APIs are available in the interface IFtAdminSystem in the package
com.emc.documentum.core.fulltext.client.admin.api.interfaces in the SDK jar file dsearchadmin-api.jar.
Administration APIs are wrapped in a command-line interface tool (CLI). The syntax and CLIs are
described in the chapter Automated Utilities (CLI).
Configuration APIs
Configuration APIs are available in the interface IFtAdminConfig in the package
com.emc.documentum.core.fulltext.client.admin.api.interfaces. This package is in the SDK jar file
dsearchadmin-api.jar.
Chapter 3
Managing Security
This chapter contains the following topics:
About security
Troubleshooting security
About security
xPlore does not have a security subsystem. Anyone with access to the xPlore host port can connect
to it. You must secure the xPlore environment using network security components such as a firewall
and restriction of network access. Secure the xPlore administrator port and open it only to specific
client hosts.
Passwords are encrypted with a FIPS 140-2 validated encryption module using SHA1. Existing
passwords encrypted with MD5 are decrypted and re-encrypted with SHA1.
Documentum repository security is managed through individual and group permissions (ACLs). By
default, security is applied to results before they are returned to the Content Server (native xPlore
security), providing faster search results. xPlore security minimizes the result set that is returned
to the Content Server.
Content Server queues changes to ACLs and groups. The queue sometimes causes a delay between
changes in the Content Server and propagation of security to the search server. If the index agent has
not yet processed a document for indexing or updated changes to a permission set, users cannot
find the document.
You can set up a separate index agent to handle changes to ACLs and groups. See Setting up index
agents for ACLs and groups, page 65.
2. Open the iAPI tool from the Documentum Server Manager on the Content Server host or in
Documentum Administrator.
3. To check your existing security mode, enter the following command:
retrieve,c,dm_ftengine_config
get,c,l,ftsearch_security_mode
4. Enter the following command to turn off xPlore native security. Note the lowercase letter l
(not the numeral 1) in the set and save commands:
retrieve,c,dm_ftengine_config
set,c,l,ftsearch_security_mode
0
save,c,l
reinit,c
You can manually populate or update the ACL and group information in xPlore. A similar job in
Content Server 6.7 and higher allows you to selectively replicate ACLs and groups. The script
replicates all ACLs and groups. Use the job or script for the following use cases:
You are testing Documentum indexing before migration.
You use xPlore to index a repository that has no full-text system (no migration).
Security in the index is out of sync with the repository, as shown by ftintegrity counts.
Note: To speed up security updates in the index, you can create a separate index agent for ACLs and
groups. See Setting up index agents for ACLs and groups, page 65.
1. Locate the script aclreplication_for_repositoryname.bat or .sh in
xplore_home/setup/indexagent/tools.
2. Edit the script before you run it. Locate the line beginning with "%JAVA_HOME%\bin\java". Set
the repository name, repository user, password, xPlore primary instance host, xPlore port, and
xPlore domain (optional).
Check the Java Method Server log and job report for any errors or exceptions thrown. When you run
the script, it prints the status of each object it tried to replicate.
Alternatively, you can run the ACL replication job dm_FTACLReplication in Documentum
Administrator. (Do not confuse this job with the dm_ACLReplication job.) By default, the job reports
only the number of objects replicated. Setting the job argument verbose to true writes the status of
each object to the job report. You can selectively replicate only dm_acl or dm_group objects.
Table 4. dm_FTACLReplication job arguments:
-acl_where_clause
-group_where_clause
-max_object_count
-replication_option
-verbose
Note: The arguments -dsearch_host, -dsearch_port, -dsearch_domain, and -ftengine_standby are not
supported in xPlore 1.3. The argument -ftengine_standby was used for dual mode (FAST and xPlore,
two Content Servers) which is not supported in xPlore 1.3.
Managing Security
2. Edit indexserverconfig.xml. For information on viewing and updating this file, see Modifying
indexserverconfig.xml, page 43.
3. Change the size of a cache in the security-filter-class element:
<security-filter-class name="documentum" default="true" class-name="
com.emc.documentum.core.fulltext.indexserver.services.security.SecurityJoin">
<properties>
<property name="groups-in-cache-size" value="1000"/>
<property name="not-in-groups-cache-size" value="1000"/>
<property name="acl-cache-size" value="400"/>
<property name="batch-size" value="800"/>
<property value="10000" name="max-tail-recursion-depth"/>
</properties>
</security-filter-class>
4. If necessary, change the Groups-in cache cleanup interval by adding a property to the
security-filter-class properties. The default is 7200 sec (2 hours).
<property name="groupcache-clean-interval" value="7200"/>
5. Validate your changes using the validation tool described in Modifying indexserverconfig.xml,
page 43.
Troubleshooting security
Viewing security in the log
Check dsearch.log using xPlore administrator. Choose an instance and click Logging. Click dsearch
log to view the following information:
The XQuery expression. For example, search for the term default:
QueryID=PrimaryDsearch$f3087f7a-fb55-496a-bf0a-50fb1e688fa1,
query-locale=en,query-string=
declare option xhive:fts-analyzer-class
com.emc.documentum.core.fulltext.indexserver.core.index.xhive.IndexServerAnalyzer;
declare option xhive:ignore-empty-fulltext-clauses true;
declare option xhive:index-paths-values
"dmftmetadata//owner_name,dmftsecurity/acl_name,dmftsecurity/acl_domain";
let $libs := collection(/TechPubsGlobal/dsearch/Data)
let $results := for $dm_doc score $s
in $libs/dmftdoc[(dmftmetadata//a_is_hidden = "false") and
(dmftversions/iscurrent = "true") and
(. ftcontains "test" with stemming using stop words default)]
order by $s descending
return $dm_doc return (for $dm_doc in subsequence($results,1,351)
return <r>
{for $attr in $dm_doc/dmftmetadata//*[local-name()=(
object_name,r_modify_date,r_object_id,r_object_type,
r_lock_owner,owner_name,r_link_cnt,r_is_virtual_doc,
r_content_size,a_content_type,i_is_reference,r_assembled_from_id,
r_has_frzn_assembly,a_compound_architecture,i_is_replica,r_policy_id,
subject,title)] return <attr name={local-name($attr)} type=
{$attr/@dmfttype}>{string($attr)}</attr>}
{xhive:highlight(($dm_doc/dmftcontents/dmftcontent/
dmftcontentref,$dm_doc/dmftcustom))}
<attr name=score type=dmdouble>{string(dsearch:get-score($dm_doc))}
</attr></r>) is running
When DEBUG is enabled for security package, the following information is saved in dsearch.log:
Minimum-permit-level. Returns the minimum permit level for results for the user. Levels: 0 = null |
1 = none | 2 = browse | 3 = read | 4 = relate | 5 = version | 6 = write | 7 = delete
Total-group-probes: Total number of groups checked for user
Filter-output: Total number of hits after security has filtered the results.
Total-values-from-index-keys: Number of index hits on owner_name, acl_name and acl_domain
for the document.
QueryID: Generated by xPlore to uniquely identify the query.
Total-values-from-data-page: Number of hits on owner_name, acl_name and acl_domain for the
document retrieved from the data page.
Filter-input: Number of results returned before security filtering.
Total-not-in-groups-cache-hits: Number of times the groups-out cache contained a hit (groups
the user does not belong to)
Total-matching-group-probes: How many times the query added a group to the group-in cache.
Total-ACL-index-probes: How many times the query added an ACL to the cache. If this value is
high, you can speed up queries by increasing the ACL cache size.
Total-groups-in-cache-hits: Number of times the group-in cache contained a hit.
Total-ACL-cache-hits: Number of times the ACL cache contained a hit.
Total-res-with-no-dmftdoc: Total number of hits in documents with no rendered dftxml. Should be 0.
In the following example from the log, the query returned 2200 hits to filter. Of these hits, 2000 were
filtered out, returning 200 results to the client application. The not-in-groups cache was probed 30
times for this query. The GroupOut cache was filled 3 times, for groups that the user did not belong to:
<USER_NAME>tuser4</USER_NAME>
<TOTAL_INPUT_HITS_TO_FILTER>2200</TOTAL_INPUT_HITS_TO_FILTER>
<HITS_FILTERED_OUT>2000</HITS_FILTERED_OUT>
<GROUP_IN_CACHE_HIT>0</GROUP_IN_CACHE_HIT>
<GROUP_OUT_CACHE_HIT>30</GROUP_OUT_CACHE_HIT>
<GROUP_IN_CACHE_FILL>0</GROUP_IN_CACHE_FILL>
<GROUP_OUT_CACHE_FILL>3</GROUP_OUT_CACHE_FILL>
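These log fragments can be summarized with a short script. The sketch below assumes the element names shown in the excerpt and wraps them in a root element for parseability; a real log line may contain additional fields:

```python
import xml.etree.ElementTree as ET

# Fragment assembled from the log excerpt above; the wrapping element
# name is an assumption made so the snippet parses as well-formed XML.
fragment = """<SECURITY_FILTER_STATS>
  <USER_NAME>tuser4</USER_NAME>
  <TOTAL_INPUT_HITS_TO_FILTER>2200</TOTAL_INPUT_HITS_TO_FILTER>
  <HITS_FILTERED_OUT>2000</HITS_FILTERED_OUT>
  <GROUP_OUT_CACHE_HIT>30</GROUP_OUT_CACHE_HIT>
  <GROUP_OUT_CACHE_FILL>3</GROUP_OUT_CACHE_FILL>
</SECURITY_FILTER_STATS>"""

stats = ET.fromstring(fragment)
total = int(stats.findtext("TOTAL_INPUT_HITS_TO_FILTER"))
filtered_out = int(stats.findtext("HITS_FILTERED_OUT"))
returned = total - filtered_out
print(returned)  # 200 results reach the client
```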
Verify that the ACL IDs are registered for the events dm_save, dm_destroy, and dm_saveasnew. Verify
that the group IDs are registered for the events dm_save and dm_destroy, for example:
?,c,select registered_id,event from dmi_registry where user_name=
'dm_fulltext_index_user'
Chapter 4
Managing the Documentum Index
Agent
This chapter contains the following topics:
Migrating documents
Using ftintegrity
ftintegrity output
The xPlore installer includes the index agent and its configurator. Install the index agent on a Content
Server host or a separate host.
A dm_ftindex_agent_config object represents the index agent in normal mode. This object is
configured by the index agent configurator. For more information about the index agent config object,
refer to the EMC Documentum Object Reference Manual.
Attributes
dm_ftengine_config
dm_acl
object_name: dm_fulltext_admin_acl
owner_name: Name of user specified at installation
acl_class: 3
accessor_name: dm_owner, dm_fulltext_admin, dm_world
accessor_permit: 7, 7, 3
dm_fulltext_index
object_name
is_standby: Indicates whether the index agent is in use or in standby
mode.
install_loc: Type of agent (dsearch or fast)
ft_engine_id: Specifies the associated dm_ftengine_config object.
dm_ftindex_agent_config
index_name
queue_user: Identifies the full-text user who is registered in dmi_registry
for full-text events.
A rendition whose can_index property is set to true is indexed. Other renditions of the object are not indexed. If the
primary content of an object is not in an indexable format, you can ensure indexing by creating a
rendition in an indexable format. Use Documentum Content Transformation Services or third-party
client applications to create the rendition. For a full list of supported formats, see Oracle Outside In
8.3.7 documentation.
Some formats are not represented in the repository by a format object. Only the properties of objects in
that format are indexed. The formats.csv file, which is located in DM_HOME/install/tools, contains a
complete list of supported mime_types and the formats with which they are associated. If a supported
mime_type has no format object, create a format object in the repository and map the supported
mime_type to the format in formats.csv.
Documents are selected for indexing in the Content Server based on the following criteria:
If a_full_text attribute is false, the content is not indexed. Metadata is indexed.
If a_full_text attribute is true, content is indexed based on the can_index and format_class attributes
on the dm_format associated with the document:
1. If an object has multiple renditions and none of the renditions have a format_class value of
ft_always or ft_preferred, each rendition is examined starting with the primary rendition. The
first rendition for which can_index is true is indexed, and no other renditions are indexed.
2. If an object has a rendition whose format_class value is ft_preferred, each ft_preferred rendition
is examined in turn starting with the primary rendition. The first ft_preferred rendition that is
found is indexed, and no other renditions are indexed.
3. If an object has renditions with a format_class value of ft_always, those renditions are always
indexed.
Note: Index agent filters can override the settings of a_full_text and can_index. See Configuring
index agent filters, page 69.
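The selection rules can be sketched as a small Python function. This is a simplified model for illustration only; the authoritative logic runs inside the Content Server:

```python
def renditions_to_index(a_full_text, renditions):
    """Pick which renditions of an object are indexed.

    renditions: list of dicts with 'can_index' (bool) and 'format_class'
    (None | 'ft_preferred' | 'ft_always'), ordered starting with the
    primary rendition.
    """
    if not a_full_text:
        return []  # metadata only; no content is indexed

    # Rule 3: renditions with format_class ft_always are always indexed.
    chosen = [r for r in renditions if r["format_class"] == "ft_always"]

    preferred = [r for r in renditions if r["format_class"] == "ft_preferred"]
    if preferred:
        # Rule 2: the first ft_preferred rendition found is indexed.
        chosen.append(preferred[0])
    elif not chosen:
        # Rule 1: no ft_always or ft_preferred renditions exist, so the
        # first rendition (starting with the primary) with can_index=True
        # is indexed, and no others.
        for r in renditions:
            if r["can_index"]:
                chosen.append(r)
                break
    return chosen
```

For example, for an object whose primary rendition is not indexable and that has a text rendition marked ft_preferred, only the text rendition is indexed.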
Sample DQL to determine these attribute values for the format bmp:
select can_index, format_class from dm_format where name = 'bmp'
To find all formats that are indexed, use the following command from iAPI:
?,c,select name,can_index from dm_format
The dm_ftengine_config object has a repeating attribute ft_collection_id that references a collection
object of the type dm_fulltext_collection. Each ID points to a dm_fulltext_collection object. It is
reserved for use by Content Server client applications.
Syntax
Description
ADD_FTINDEX ALL
ADD_FTINDEX property_list
Syntax
Description
DROP_FTINDEX ALL
DROP_FTINDEX property_list
When you add or drop indexing for aspect properties, clean the DFC BOF cache for the changes to
take effect.
1. Stop the index agent.
2. On the index agent host, delete the directory for the DFC bof cache. The directory is set by
dfc.data.dir in dfc.properties. For example:
xplore_home\jboss5.1.0\server\DctmServer_Indexagent\data\Indexagent\cache\
content_server_version\bof\repository_name
Every index agent URL has the same URL ending: IndexAgent/login_dss.jsp. Only the port
and host differ.
host is the DNS name of the machine on which you installed the index agent.
port is the index agent port number that you specified during configuration (default: 9200).
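Assembled from these parts, a hypothetical index agent URL looks like this (host and port are placeholders):

```
http://myhost.example.com:9200/IndexAgent/login_dss.jsp
```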
3. In the login page, enter the user name and password for a valid repository user and optional
xPlore domain name.
4. Choose one of the following:
Start Index Agent in Normal Mode: The index agent will index content that is added or
modified after you start.
Start new reindexing operation: All content in the repository is indexed (migration mode) or
reindexed. Filters and custom routing are applied. Proceed to the next step in this task.
Continue: If you had started to index this repository but had stopped, start indexing. The date
and time you stopped is displayed.
For information on adding CPS processing daemons, see .
Viewing index agent details
Start the index agent and click Details. You see accumulated statistics since the last index agent
restart and the objects in the indexing queue. To refresh statistics, return to the previous screen
and click Refresh, then view Details again.
index_name : Repo_ftindex_01
...
Now use the retrieve and dump commands to get the object_name attribute of the
dm_ftindex_agent_config object. You use this attribute value in the start or stop script. For example:
retrieve,c,dm_ftindex_agent_config
...
0800277e80000e42
API> dump,c,l
...
USER ATTRIBUTES
object_name : Config13668VM0_9200_IndexAgent
Use the apply command to start or stop the index agent. Syntax:
apply,c,,FTINDEX_AGENT_ADMIN,NAME,S,<index_name of dm_fulltext_index>,
AGENT_INSTANCE_NAME,S,<object_name of dm_ftindex_agent_name>,ACTION,
S,start|stop|status
To start or stop all index agents, replace the index agent name with all. For example:
apply,c,NULL,FTINDEX_AGENT_ADMIN,NAME,S,LH1_ftindex_01,
AGENT_INSTANCE_NAME,S,all,ACTION,S,start
Status results:
0: The index agent is running.
100: The index agent has shut down.
200: The index agent has a problem.
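If you script the status check, the numeric results can be decoded with a trivial helper (illustrative only; not part of the product):

```python
def describe_agent_status(code):
    """Map FTINDEX_AGENT_ADMIN status results to readable text."""
    statuses = {0: "running", 100: "shut down", 200: "problem"}
    return statuses.get(code, "unknown status")

print(describe_agent_status(100))  # shut down
```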
where -action argument value is one of the following: start | shutdown | status | reset.
3. Add one parameter set to your new indexagent.xml file. Set the value of parameter_name to
index_type_mode, and set the value of parameter_value to aclgroup as follows:
<indexer_plugin_config>
<generic_indexer>
<class_name> </class_name>
<parameter_list>
...
<parameter>
<parameter_name>index_type_mode</parameter_name>
<parameter_value>aclgroup</parameter_value>
</parameter>
</parameter_list>
</generic_indexer>
</indexer_plugin_config>
4. In the indexagent.xml for sysobjects (the original index agent), add a similar parameter set. Set the
value of parameter_name to index_type_mode, and set the value of parameter_value to sysobject.
5. Restart both index agents. (Use the scripts in indexagent_home/jboss5.1.0/server or the Windows
services.)
Supporting millions of ACLs
If you have many ACLs (users or groups), turn off facet compression. In indexserverconfig.xml, find
the sub-path element whose path attribute value is dmftsecurity/acl_name. Change the value of
the compress attribute to false. For information on viewing and updating this file, see Modifying
indexserverconfig.xml, page 43.
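A minimal sketch of the edited element (hypothetical; keep the other attributes that already exist on your sub-path element):

```xml
<!-- illustrative; preserve the remaining attributes from your existing file -->
<sub-path path="dmftsecurity/acl_name" compress="false"/>
```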
<emc.install dar="C:\Downloads\tempIndexAgentDefaultFilters.dar"
docbase="DSS_LH1" username="Administrator" password="password" />
Verify filter loading in the index agent log, which is located in the logs subdirectory of the
index agent JBoss deployment directory. In the following example, the FoldersToExclude
filter was loaded:
2010-06-09 10:49:14,693 INFO FileConfigReader [http-0.0.0.0-9820-1]
Filter FoldersToExclude Value:/Temp/Jobs,
/System/Sysadmin/Reports, /System/Sysadmin/Jobs,
6. Configure the filters in the index agent UI. See Configuring index agent filters, page 69.
Troubleshooting the index agent filters
To verify that the filters are installed, use the following iAPI command:
?,c,select primary_class from dmc_module where any a_interfaces =
'com.documentum.fc.indexagent.IDfCustomIndexFilter'
Open dfc.properties in the composerheadless package. This package is installed with Content Server
at $DOCUMENTUM/product/version/install/composer/ComposerHeadless. The file dfc.properties
is located in the subdirectory plugins/com.emc.ide.external.dfc_1.0.0/documentum.config. Find the
following lines and verify that the IP address and port of the connection broker for the target repository
are accurate.
dfc.docbroker.host[N]=connection_broker_ip_address
dfc.docbroker.port[N]=connection_broker_port
dmi_expr_code
dm_process
dm_docbase_config
dmc_jar
dmc_tcf_activity_template
dm_esign_template
dm_method
dm_ftwatermark
dm_format_preferences
dm_activity
dmc_wfsd_type_info
dm_ftengine_config
dmc_module
dm_menu_system
dm_ftfilter_config
dmc_aspect_type
dm_plugin
dm_ftindex_agent_config
dm_registered
dm_script
dm_jms_config
dm_validation_descriptor
dmc_preset_package
dm_job
dm_location
dm_acs_config
dm_mount_point
dmc_java_library
dm_business_pro
dm_outputdevice
dm_public_key_certificate
dm_client_rights
dm_server_config
dm_client_registration
dm_cont_transfer_config
dm_xml_application
dm_procedure
dm_cryptographic_key
dm_xml_config
dmc_dar
Note: Update the file_system_path attribute of the dm_location object in the repository to match
this local_mount value, and then restart the Content Server.
3. Save indexagent.xml and restart the index agent instance.
For better performance, you can mount the content storage to the xPlore index server host and set
all_filestores_local to true. Create a local file store map as shown in the following example:
<all_filestores_local>true</all_filestores_local>
<local_filestore_map>
<local_filestore>
<store_name>filestore_01</store_name>
<local_mount>\\192.168.195.129\DCTM\data\ftwinora\content_storage_01
</local_mount>
</local_filestore>
<!-- similar entry for each file store -->
</local_filestore_map>
You can create multiple full-text collections for a repository for the following purposes:
Partition data
Migrating documents
Migrating content (reindexing), page 72
Migrating documents by object type, page 72
Migrating a limited set of documents, page 73
Note: The parameter_list element can contain only one parameter element.
3. Stop and restart the index agent using the scripts in indexagent_home/jboss5.1.0/server or using the Windows services panel.
4. Log in to the index agent UI and choose Start new reindexing operation.
5. When indexing has completed (on the Details page, no more documents in the queue), click Stop IA.
6. Run the aclreplication script to update permissions for users and groups in xPlore. See Manually updating security, page 52.
7. Update the indexagent.xml file to index another type or change the parameter_value to dm_document.
Using ftintegrity
ftintegrity output, page 75
ftintegrity result files, page 76
Running the state of index job, page 76
state of index and ftintegrity arguments, page 77
ftintegrity and the state of index job (in Content Server 6.7 or higher) are used to verify indexing after
migration or normal indexing. The utility verifies all types that are registered in the dmi_registry_table
with the user dm_fulltext_index_user. The utility compares the object ID and i_vstamp between the
repository and xPlore. You can compare metadata values, which compares object IDs and the specified
attributes.
Run ftintegrity as the same administrator user who started the instance.
Note: ftintegrity can be very slow, because it performs a full scan of the index and content. Do not run
ftintegrity when an index agent is migrating documents.
Run the ftintegrity index verification tool after migration or restoring a federation, domain,
or collection. The tool is a standalone Java program that checks index integrity against
repository documents. It verifies all types that are registered to dmi_registry_table with the user
dm_fulltext_index_user, comparing the object ID and i_vstamp between the repository and xPlore.
Use the option -checkType to check a specific object type. Use the option -checkMetadata to check
specific single-value attributes (requires -checkType).
1. Navigate to xplore_home/setup/indexagent/tools.
2. Open the script ftintegrity_for_repositoryname.bat (Windows) or ftintegrity_for_repositoryname.sh
(Linux) and edit the script. Substitute the repository instance owner password in the script (replace
<password> with your password). The tool automatically resolves all parameters except for the
password.
3. Optional: Add the option -checkfile to the script. The value of this parameter is the full path to
a file that contains sysobject IDs, one on each line. This option compares the i_vstamp on the
ACL and any groups in the ACL that is attached to each object in a specified list. If this option
is used with the option -checkUnmaterializeLWSO, -CheckType, -StartDate, or -EndDate, these
latter options are not executed.
For example:
....FTStateOfIndex DSS_LH1 Administrator mypassword
Config8518VM0 9300 -checkfile ...
4. Optional: Add the option -checkType to compare a specific type in the Content Server and index.
You can run the script for one type at a time. The tool checks sysobject types or subtypes. It does
not check dm_acl and dm_group objects or custom types that are not subtypes of dm_sysobject.
For example:
$JAVA_HOME/bin/java ... -checkType dm_document
5. Optional: Add the option -checkMetadata at the end of the script. This argument requires a path
to a metadata.txt file that contains a list of required single-valued (not repeating) metadata fields
to check, one attribute name per line. (Create this file if it does not exist.) This option applies
only to a specific type.
For example, add the following to the ftintegrity script in xplore_home/setup/indexagent/tools:
$JAVA_HOME/bin/java ... -checkType dm_document
-checkMetadata C:/xplore/setup/indexagent/tools/metadata.txt
The ftintegrity tool generates the following files:
ObjectId-common-version-mismatch.txt: contains ACLs that are out of sync, as the content of the elements acl_name|domain/acl i_vstamp in docbase/acl i_vstamp in xDB.
ObjectId-common-version-match.txt: contains all the object IDs with consistent versions.
ObjectId-dctmOnly.txt: contains groups that are out of sync, as the content of the elements Mismatching i_vstamp group:/Sysobject ID: id/Group ids in dctm only:/group id.
ObjectId-indexOnly.txt: contains all the object IDs that exist only in xPlore.
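The comparison that produces these four report files can be sketched as follows. This is a minimal, illustrative Java sketch (not the actual ftintegrity implementation): it assumes each side is represented as a map of object ID to i_vstamp and classifies every ID into one of the four categories.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FtIntegritySketch {
    // Classify object IDs the way the ftintegrity report files group them:
    // version-match, version-mismatch, docbase-only, and index-only.
    public static Map<String, List<String>> compare(
            Map<String, Integer> docbase, Map<String, Integer> index) {
        Map<String, List<String>> report = new LinkedHashMap<>();
        report.put("version-match", new ArrayList<>());
        report.put("version-mismatch", new ArrayList<>());
        report.put("dctmOnly", new ArrayList<>());
        report.put("indexOnly", new ArrayList<>());
        for (Map.Entry<String, Integer> e : docbase.entrySet()) {
            Integer idxStamp = index.get(e.getKey());
            if (idxStamp == null) {
                // In the repository but never indexed (or filtered out)
                report.get("dctmOnly").add(e.getKey());
            } else if (idxStamp.equals(e.getValue())) {
                report.get("version-match").add(e.getKey());
            } else {
                // Same object ID, different i_vstamp
                report.get("version-mismatch").add(e.getKey());
            }
        }
        for (String id : index.keySet()) {
            if (!docbase.containsKey(id)) {
                // Indexed but removed from the repository
                report.get("indexOnly").add(id);
            }
        }
        return report;
    }
}
```

A mismatch entry here corresponds to a line in ObjectId-common-version-mismatch.txt, which can then be fed back into the index agent UI for resubmission.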
Note: All optional arguments must be appended to the end of the java command line.
ftintegrity output
Output from the script is like the following:
Executing stateofindex
Connected to the docbase D65SP2M6DSS
2011/03/14 15:41:58:069 Default network framework: http
2011/03/14 15:41:58:163 Session Locale:en
2011/03/14 15:41:59:913 fetched 1270 object from docbase for type dm_acl
2011/03/14 15:41:59:913 fetched 1270 objects from xPlore for type dm_acl
2011/03/14 15:42:08:428 fetched 30945 object from docbase for type dm_sysobject
2011/03/14 15:42:08:428 fetched 30798 objects from xPlore for type dm_sysobject
2011/03/14 15:42:08:756 fetched 347 object from docbase for type dm_group
2011/03/14 15:42:08:756 fetched 347 objects from xPlore for type dm_group
2011/03/14 15:42:09:194 **** Total objects from docbase : 32215 ****
2011/03/14 15:42:09:194 **** Total objects from xPlore : 32068 ****
2011/03/14 15:42:09:194 3251 objects with match ivstamp in both DCTM and
Index Server
2011/03/14 15:42:09:194 17 objects with different ivstamp in DCTM and Index Server
2011/03/14 15:42:09:194 147 objects in DCTM only
2011/03/14 15:42:09:194 0 objects in Index Server only
ftintegrity is completed.
In the example, the ACL and group totals were identical in the repository and xPlore, so security is up to date. There are 147 objects in the repository that are not in the xPlore index. They were filtered out by index agent filters, or they are objects in the index agent queue that have not yet been indexed.
To eliminate filtered objects from the repository count, add the usefilter argument to ftintegrity
(slows performance).
ObjectId-indexOnly.txt
This report contains the object IDs and i_vstamp values of objects in the index but not in the
repository.
These objects were removed from the repository during or after migration, before the corresponding event updated the index.
You can input the ObjectId-common-version-mismatch.txt file into the index agent UI to see errors for those files. After you have started the index agent, select Index selected list of objects and then select Object file. Navigate to the file and then choose Submit. Open xPlore Administrator > Reports and choose Document processing error summary. The error codes and reasons are displayed.
You can also use the ftintegrity tool to check the consistency between the repository and the xPlore index.
The ftintegrity script calls the dm_FTStateOfIndex job.
Note: ftintegrity and the dm_FTStateOfIndex job can be very slow, because they perform a full scan of
the index and content. Do not run ftintegrity or the dm_FTStateOfIndex job when an index agent is
migrating documents.
The state of index job compares the index content with the repository content. Execute the state of
index job from Documentum Administrator (DA). The job generates reports that provide the following
information:
Index completeness and comparison of document version stamps.
Status of the index server: Disk space usage, instance statistics, and process status.
Total number of objects: Content correctly indexed, content that had some failure during indexing,
and objects with no content
To disable the job, view the job properties in Documentum Administrator and change the state
to inactive.
Job arg                         ftintegrity option
-batchsize                      batchsize (argument, not option)
-check_file                     -CheckFile
-check_type                     -checkType
-check_metadata                 -checkMetadata
-check_unmaterialized_lwso      -checkLWSO
-collection_name                Not available
-end_date                       -EndDate
-ftengine_standby               -ftEngineStandby
-fulltext_user                  -fulltextUser
-get_id_in_indexing             Not available
-sort_order                     -sortOrder
-start_date                     -StartDate
-timeout_in_minute              -timeout
-usefilter value                -usefilter
In addition, the job is installed with the -queueperson and -windowinterval arguments set. The
-queueperson and -windowinterval arguments are standard arguments for administration jobs and are
explained in the EMC Documentum Content Server Administration and Configuration Guide.
If you make this change after indexing, reindex objects to make the metadata non-searchable.
Documentum object types can be marked as non-indexed in Documentum Administrator. See Making
types non-indexable, page 80.
Figure 8
try
{
    domBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
}
catch (ParserConfigurationException e)
{
    throw new DfException(e);
}
IDfCollection childRelations = getChildRelatives("dm_annotation");
while (childRelations.next())
{
    Element annotationNode = document.createElement("annotation");
    mediaAnnotations.appendChild(annotationNode);
    try
    {
        IDfId id = childRelations.getTypedObject().getId("child_id");
        // This will get the dm_note object
        IDfDocument note = (IDfDocument) getSession().getObject(id);
        ByteArrayInputStream xmlContent = note.getContent();
        Document doc = domBuilder.parse(xmlContent);
        // Add the note content (import the parsed root into this document)
        annotationNode.appendChild(document.importNode(doc.getDocumentElement(), true));
        // Add a node for the author of a note
        Element authorElement = document.createElement("author");
        authorElement.setTextContent(note.getString("r_modifier"));
        annotationNode.appendChild(authorElement);
    }
    catch (SAXException e)
    {
        // Log the error
    }
    catch (IOException e)
    {
        // Log the error
    }
}
childRelations.close();}}
Generated dftxml
<dmftdoc>
...
<dmftcustom>
<mediaAnnotations>
<annotation>
<content>
This is my first note
</content>
<author>Marc</author>
</annotation>
<annotation>
<content>
This is my second note
</content>
<author>Marc</author>
</annotation>
</mediaAnnotations>
</dmftcustom>
</dmftdoc>
Table 10
Metric      Index agent     xPlore administrator
Failed
Warning
Success
To check the status of queue items that have been submitted for reindexing, use the following DQL. For username, specify the user who logged in to the index agent UI and started reindexing.
select task_name,item_id,task_state,message from dmi_queue_item where name='username' and event='FT re-index'
If task_state is done, the message is "Successful batch..." If task_state is failed, the message is "Incomplete batch..."
To resubmit one document for reindexing
Put the object ID into a temporary text file. Use the index agent UI to submit the upload: choose Index selected list of objects > Object File.
To remove queue items from reindexing
Use the following DQL. For username, specify the user who logged in to the index agent UI and started reindexing.
delete dmi_queue_item object where name='username' and event='FT re-index'
event
-------------------------------
dm_move_content
dm_checkin
dm_readonlysave
dm_destroy
dm_save
dm_destroy
dm_save
dm_save
dm_destroy
dm_saveasnew
Element name
Description
error_config
error_code
error_threshold
time_threshold
action
Table 12    Error codes
error_code                  Description
UNSUPPORTED_DOCUMENT        Unsupported format
XML_ERROR
DATA_NOT_AVAILABLE          No information available
PASSWORD_PROTECTED
MISSING_DOCUMENT
INDEX_ENGINE_NOT_RUNNING
CONNECTION FAILURE
By default, if the xPlore server is down (CONNECTION FAILURE error), indexing and data ingestion stop after the specified number of errors occurs within the specified time period. In this case, the index agent status is displayed as finished. When the problem is solved and xPlore is up and running, use the index agent UI to stop and restart the index agent.
Problem: DM_SYSOBJECT_E_CANT_SAVE_NO_LINK error
The error in the index agent log is Cannot save xxx sysobject without any link. Possible causes:
The index agent configurator failed to retrieve full-text repository objects.
The index agent installation user does not have a default folder defined in the repository, or the
folder no longer exists.
To verify, dump the user with the following iAPI commands. Substitute the installation owner name.
retrieve,c,dm_user where user_name='installation_owner'
get,c,l,default_folder
group by task_state
To check the indexing status of a single object, get the queue item ID for the document in the details
screen of the index agent UI. Use the following DQL to check the status of the queue item:
select task_name,item_id,task_state,message from dmi_queue_item where name='username' and event='FT re-index'
To check registered types and the full-text user name, use the following iAPI command.
?,c,select distinct t.name, t.r_object_id, i.user_name from dm_type t, dmi_registry i where t.r_object_id = i.registered_id and i.user_name like '%fulltext%'
Enable connections between the index agent host, the Content Server, and xPlore through the firewall.
Startup problems
Make sure that the index agent web application is running. On Windows, verify that the Documentum
Indexagent service is running. On Linux, verify that you have instantiated the index agent using
the start script in xplore_home/jboss5.1.0/server.
Make sure that the user who starts the index agent has permission in the repository to read all content
that is indexed.
If the repository name is reported as null, restart the repository and the connection broker and try again.
If you see a status 500 on the index agent UI, examine the stack trace for the index agent instance. If a
custom routing class cannot be resolved, this error appears in the browser:
org.apache.jasper.JasperException: An exception occurred processing JSP page
/action_dss.jsp at line 39
...
root cause
com.emc.documentum.core.fulltext.common.IndexServerRuntimeException:
com.emc.documentum.core.fulltext.client.index.FtFeederException:
Error while instantiating collection routing custom class...
If the index agent web application starts with port conflicts, stop the index agent with the stop script. Run the script as the same administrator user who started the instance. The index agent locks several ports, and they are not released by closing the command window. You can kill the JVM process and run the index agent configurator to assign the agents different ports.
Chapter 5
Document Processing (CPS)
This chapter contains the following topics:
About CPS
Administering CPS
Indexable formats
Lemmatization
About CPS
The content processing service (CPS) performs the following functions:
Retrieves indexable content from content sources
Determines the document format and primary language
Parses the content into index tokens that xPlore can process into full-text indexes
If you test Documentum indexing before performing migration, first replicate security. See Manually
updating security, page 52.
For information on customizations to the CPS pipeline, see Custom content processing, page 122.
Language identification
Some languages have been tested in xPlore. Many other languages can be indexed. Some languages
are identified fully including parts of speech, and others require an exact match. For a list of languages
that CPS detects, see Basistech documentation. If a language is not listed as one of the tested languages
in the xPlore release notes, search must be for an exact match. For tested languages, linguistic features
and variations that are specific to these languages are identified, improving the quality of search
experience.
White space
White space such as a space separator or line feed identifies word separation. Then, special characters
are substituted with white space. See Handling special characters, page 108.
For Asian languages, white space is not used. Entity recognition and logical fragments guide the
tokenization of content.
Case sensitivity
All characters are stored as lowercase in the index. For example, the phrase "I'm runNiNg iN THE Rain" is lemmatized and tokenized as "I be run in the rain."
There is a limited effect of case on lemmatization. In some languages, a word can have different
meanings and thus different lemmas depending on the case.
Case sensitivity is not configurable.
For example:
http://DR:8080/services
b. From the Instance list, select an instance you want to add the CPS to.
c. From the Usage list, specify whether the CPS instance processes indexing requests (the index
option), search requests (the search option), or both (the all option).
d. Click OK.
Note: Once added, the remote CPS appears as UNREACHABLE. Restart all xPlore instances
for it to take effect.
6. Specify whether the CPS instance performs linguistic processing (lp) or text extraction (te). If a
value is not specified, TE and LP are sent to CPS as a single request.
a. In indexserverconfig.xml, locate the content-processing-services element. This element identifies each CPS instance. The element is added when you install and configure a new CPS instance.
b. Add or change the capacity attribute on this element. The capacity attribute determines whether
the CPS instance performs text extraction, linguistic processing, or all. In the following
example, a local CPS instance analyzes linguistics, and the remote CPS instance processes
text extraction.
<content-processing-services analyzer="rlp" context-characters="!,.;?""
special-characters="@#$%^_~*&:()-+=<>/\[]{}">
<content-processing-service capacity="lp" usage="all" url="local"/>
<content-processing-service capacity="te" usage="index" url="http://myhost:9700/services"/>
</content-processing-services>
7. Restart the CPS instance using the start script startCPS.bat or startCPS.sh in
xplore_home/jboss5.1.0/server. (On Windows, the standalone instance is installed as an automatic
service.)
8. Test the remote CPS service using the WSDL testing page, with the following syntax:
http://hostname:port/services/cps/ContentProcessingService?wsdl
After you install and register the remote instance, you see it in the Content Processing Service UI of
xPlore administrator. You can check the status and see version information and statistics.
Check the CPS daemon log file cps_daemon.log for processing event messages. For a local process,
the log is in xplore_home/jboss5.1.0/server/DctmServer_ PrimaryDsearch/logs. For a remote CPS
instance, cps_daemon.log is located in cps_home/jboss5.1.0/server/cps_instance_name/logs. If a CPS
instance is configured to process text only, TE is logged in the message. For linguistic processing, LP
is logged. Both kinds of processing log CF messages.
Note: An xPlore instance must have at least one CPS configured for it. If an xPlore instance has only
one CPS, either local or remote, you cannot disable or remove it.
3. Create a content-processing-services element under each node element, after the node/properties
element, like the following:
<node appserver-instance-name="PrimaryDsearch" ...primaryNode="true"...
url="http://host1:9300/dsearch/"... hostname="host1" name="PrimaryDsearch">
...
<properties>
<property value="10000" name="statusdb-cache-size"/>
</properties>
<content-processing-services analyzer="rlp" context-characters="!,.;?""
special-characters="@#$%^_~*&:()-+=<>/\[]{}">
</content-processing-services>
<logging>...
<logging>...
4. Move the shared CPS instances to each node, where they become dedicated CPS instances for that
node. Place their definitions within the content-processing-services element that you created. For
example:
<node appserver-instance-name="PrimaryDsearch" ...primaryNode="true"...
url="http://host1:9300/dsearch/"... hostname="host1" name="PrimaryDsearch">
...
<properties>
<property value="10000" name="statusdb-cache-size"/>
</properties>
<content-processing-services analyzer="rlp" context-characters="!,.;?""
special-characters="@#$%^_~*&:()-+=<>/\[]{}">
<content-processing-service usage="all" url="http://host1:20000/services"/>
</content-processing-services>
<logging>...
Make sure that you have a remaining global CPS instance after the last node element. (You have
moved one or more of the content-processing-service elements under a node.) For example:
<content-processing-services analyzer="rlp" context-characters="!,.;?""
special-characters="@#$%^_~*&:()-+=<>/\[]{}">
<content-processing-service usage="all" url="local"/>
</content-processing-services>
Administering CPS
Starting and stopping CPS
You can configure CPS tasks in xPlore administrator. In the left pane, expand the instance and click Content Processing Service. Click Configuration.
1. Stop CPS: Select an instance in the xPlore administrator tree, expand it, and choose Content
Processing Service. Click Stop CPS and then click Suspend.
2. Start CPS: Select an instance in the xPlore administrator tree, expand it, and choose Content
Processing Service. Click Start CPS and then click Resume.
If CPS crashes or malfunctions, the CPS manager tries to restart it to continue processing.
To force a different default locale for metadata, add a property to every category definition like the
following:
<category name="dftxml">
<properties>
...<property value="fr" name="index-metadata-default-locale"/>
</properties>
For a query, the session locale from one of the following is used as the language for linguistic analysis:
Webtop login locale
name="object_name"/>
name="title"/>
name="subject"/>
name="keywords"/>
6. (Optional) Check the session locale for a query. Look at the xPlore log event that prints the query
string (in dsearch.log of the primary xPlore instance). The event includes the query-locale setting
used for the query. For example:
query-locale=en
7. (Optional) Change the session locale of a query. The session_locale attribute on a Documentum session config object is automatically set based on the OS environment. To search for documents in a different language, change the locale per session in DFC or iAPI. The iAPI command to change the session_locale:
set,c,sessionconfig,session_locale
The DFC command to set session locale on the session config object (IDfSession.getSessionConfig):
IDfTypedObject.setString("session_locale", locale)
Handling apostrophes
Some languages use apostrophes as part of a name or another part of speech. The default list of special context characters includes the apostrophe. Apostrophes in words are treated as white space. You can remove the apostrophe from the list if words are not correctly found on search. See Handling special characters, page 108.
If you have already indexed objects in PDF format that would be affected by this change, you must
reindex them.
Indexable formats
Some formats are fully indexed. For some formats, only the metadata is indexed. For a full list of
supported formats, see Oracle Outside In 8.3.7 documentation.
If a format cannot be identified, it is listed in the xPlore administrator report Document Processing
Error Detail. Choose File format unsupported to see the list.
Lemmatization
About lemmatization
Configuring indexing lemmatization
Lemmatizing specific types or attributes
Troubleshooting lemmatization
Saving lemmatization tokens
About lemmatization
Lemmatization is a normalization process that reduces a word to its canonical form. For example, a word like books is normalized into book by removing the plural marker. Am, are, and is are normalized to "be." This behavior contrasts with stemming, a different normalization process in which stemmed words are reduced to a string that sometimes is not a valid word. For example, ponies becomes poni. xPlore uses an indexing analyzer that performs lemmatization. Studies have found that some form of stemming or lemmatization is almost always helpful in search.
Lemmatization is applied to indexed documents and to queries. Lemmatization analyzes a word for
its context (part of speech), and the canonical form of a word (lemma) is indexed. The extracted
lemmas are actual words.
Alternate lemmas
Alternative forms of a lemma are also saved. For example, swim is identified as a verb. The noun
lemma swimming is also saved. A document that contains swimming is found on a search for swim.
If you turn off alternate lemmas, you see variable results depending on the context of a word. For
example, saw is lemmatized to see or to saw depending on the context. See Configuring indexing
lemmatization, page 105.
Query lemmatization
Lemmatization of queries is more prone to error because less context is available in comparison to
indexing.
The following queries are lemmatized:
IDfXQuery: The "with stemming" option is included.
The query from the client application contains a wildcard.
The query is built with the DFC search service.
The DQL query has a search document contains (SDC) clause (except phrases). For example, the query select r_object_id from dm_document search document contains 'companies winning' produces the following tokens: companies, company, winning, and win.
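The behavior of indexing both surface forms and lemmas can be illustrated with a toy Java sketch. The lemma table below is a hypothetical stand-in for CPS's context-aware linguistic analysis; it only shows how a term and its lemma both become tokens:

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class LemmaSketch {
    // Toy lemma lookup standing in for CPS linguistic processing.
    // The real analyzer derives lemmas from context; this table is illustrative.
    private static final Map<String, String> LEMMAS = Map.of(
            "companies", "company",
            "winning", "win",
            "books", "book",
            "is", "be");

    // Produce both the original token and its lemma, so that a search
    // for either form finds the document.
    public static Set<String> tokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            out.add(word);
            String lemma = LEMMAS.get(word);
            if (lemma != null) {
                out.add(lemma);
            }
        }
        return out;
    }
}
```

For the example query above, this sketch yields the same four tokens the document lists: companies, company, winning, and win.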
...
<property value="true" name="query-exact-phrase-match"/>
</properties>
</search-config>
linguistic-process element
Element
Description
element-with-name
Element
Description
save-tokens-for-summary-processing
element-with-attribute
element-for-language-identification
In the following example, the content of an element with the attribute dmfttype with a value of
dmstring is lemmatized. These elements are in a dftxml file that the index agent generates. For the
dftxml extensible DTD, see Extensible Documentum DTD, page 348.
If the extracted text does not exceed 262144 bytes (extract-text-size), the specified element is
processed. In the following example, an element with the name dmftcustom is processed. Several
elements are specified for language identification.
<linguistic-process>
<element-with-attribute name="dmfttype" value="dmstring"/>
<element-with-name name="dmftcustom">
<save-tokens-for-summary-processing extract-text-size-less-than="262144" token-size="65536"/>
</element-with-name>
<element-for-language-identification name="object_name"/> ...
</linguistic-process>
Note: If you wish to apply your lemmatization changes to the existing index, reindex your documents.
Troubleshooting lemmatization
If a query does not return expected results, examine the following:
Test the query phrase or terms for lemmatization and compare to the lemmatization in the context of the document. (You can test each sample using the xPlore administrator Test Tokenization utility.)
View the query tokens by setting the dsearch logger level to DEBUG using xPlore administrator.
Expand Services > Logging and click Configuration. Set the log level for dsearchsearch. Tokens
are saved in dsearch.log.
Check whether some parts of the input were not tokenized because they were excluded from
lemmatization: Text size exceeds the configured value of the extract-text-size-less-than attribute.
Check whether a subpath excludes the dftxml element from search. (The sub-path attribute
full-text-search is set to false.)
If you have configured a collection to save tokens, you can view them in the xDB admin tool. (See Debugging queries, page 259.) Token files are generated under the Tokens library, located at the same level as the Data library. If dynamic summary processing is enabled, you can also view tokens in the stored dftxml using xPlore administrator. The number of tokens stored in the dftxml depends on the configured number of tokens to save. To see the dftxml, click a document in a collection.
Figure 9
Tokens in dftxml
3. You can view the saved tokens in the xDB tokens database. Open the xDB admin tool in
xplore_home/dsearch/xhive/admin.
Figure 10
Tokens in xDB
Characters enclosing Chinese, Japanese, and Korean letters and months. These characters are derived from a number of character ranges that have bidirectional properties, falling in the Unicode 3200-32FF range (the Enclosed CJK Letters and Months block). The specific character ranges are:
3200-3243
3260-327B
327F-32B0
32C0-32CB
32D0-32FE
Note: The special characters list must contain only ASCII characters.
For example, a phrase extract-text is tokenized as extract and text, and a search for either term finds the
document.
Note: The context characters list must contain only ASCII characters.
White space is substituted after the parts of speech have been identified. For example, in the phrase "Is John Smith working for EMC?" the question mark is filtered out because it functions as a context special character (punctuation).
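The lowercasing and special-character substitution described above can be sketched in a few lines of Java. The character list is an illustrative subset of the default special-characters configuration shown elsewhere in this chapter; this is not CPS code:

```java
public class TokenPrep {
    // Characters from the default special-characters list in
    // indexserverconfig.xml (illustrative subset).
    private static final String SPECIAL = "@#$%^_~*&:()-+=<>/\\[]{}";

    // Lowercase the input and substitute special characters with
    // white space before splitting into words.
    public static String[] prepare(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toLowerCase().toCharArray()) {
            sb.append(SPECIAL.indexOf(c) >= 0 ? ' ' : c);
        }
        return sb.toString().trim().split("\\s+");
    }
}
```

With this treatment, a term like extract-text yields the two tokens extract and text, so a search for either term finds the document.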
Testing tokenization
Test the tokenization of a word or phrase to see what is indexed. Expand Diagnostic and Utilities in
the xPlore administrator tree and then choose Test tokenization. Input the text and select the language.
Different tokenization rules are applied for each language. (Only languages that have been tested are
listed. See the release notes for supported languages. Other languages are not tokenized.)
Uppercase characters are rendered as lowercase. White space replaces special characters.
The results table displays the original input words. The root form is the token used for the index. The
Start and End offsets display the position in raw input. Components are displayed for languages that
support component decomposition, such as German.
Results can differ from tokenization of a full document for the following reasons:
The document language that is identified during indexing does not match the language that is
identified from the test.
The context of the indexed document does not match the context of the text.
Use the executable CASample in xplore_home/dsearch/cps/cps_daemon/bin to test the processing
of a file. Syntax:
casample path_to_input_file
1. Stop the CPS instance in xPlore administrator. Choose Instances > Instance_name > Content
Processing Service and click Stop CPS.
2. Edit the CPS configuration file in the CPS host directory xplore_home/dsearch/cps/cps_daemon.
3. Change the value of element daemon_count to 3 or more (default: 1, maximum 7).
4. Change the value of connection_pool_size to 2.
5. Restart all xPlore instances.
6. (Optional for temporary ingestion loads) Reset the CPS daemon_count to 1 and
connection_pool_size to 4 after reindexing is complete.
A similar error is Failed to create temporary file or error code 47 (file write error).
Ensure that the directory for the CPS temporary file path is large enough. Set the value of temp_file_folder in the file PrimaryDsearch_local_configuration.xml. This file is located in xplore_home/dsearch/cps/cps_daemon. The directory should have at least 20 GB of free space for file processing.
Stop all xPlore instances and restart. A restart finalizes the temporary index items in the /tmp
directory.
Stop all xPlore instances again.
Add the following Java option to the primary instance start script, for example,
startPrimaryDsearch.sh in jboss5.1.0/server. Substitute your temp directory for MYTEMP_DIR:
-Djava.io.tmpdir=MYTEMP_DIR
4.
Check the file using the casample utility to see if it is recognized. See CPS troubleshooting methods,
page 112. If the file is XML, check to see that it is well-formed. For a list of supported formats, see
Oracle Outside In 8.3.7 documentation.
If the document uses an unsupported encoding, a 1027 error code is displayed. For supported
encodings, see Basistech documentation.
For the error message Unknown language provided, check to see whether you have configured an
invalid locale. See Configuring languages and encoding, page 100.
If the error message is Not enough data to process, the file has very little text content and the
language was not detected.
File corrupted
If there are processing errors for the file, they will be displayed after the processing statistics. A corrupt
file returns the following error. The XML element that contains the error is displayed:
*** Error: file is corrupt in element_name.
A file with bad content can also return the error message Served data invalid.
Check the file using the casample utility to see if it is corrupted. See CPS troubleshooting methods, page 112. Check the list of supported formats for the format utility.
You can register or unregister a type through Documentum Administrator. The type must be
dm_sysobject or a subtype of it. If a supertype is registered for indexing, the system displays the
Enable Indexing checkbox selected but disabled. You cannot clear the checkbox.
Is the format indexable? Check the class attribute of the document format. See Documentum attributes
that control indexing, page 60 for more information.
Is the document too large? See Maximum document and text size, page 98.
Insufficient CPU
Content extraction and text analysis are CPU-intensive. CPU is consumed for each document creation,
update, or change in metadata. Check CPU consumption during ingestion.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
Suggested workarounds: For migration, add temporary CPU capacity. For day-forward (ongoing)
ingestion, add permanent CPU or new CPS instances. CPS instances are used in a round-robin order.
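The round-robin behavior can be pictured with a small sketch. The class and method names here are hypothetical, not part of the xPlore API; the sketch only illustrates cyclic dispatch over a list of CPS instances:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: cyclic (round-robin) selection over CPS instances.
// Class and method names are hypothetical, not part of xPlore.
class RoundRobinDispatcher {
    private final List<String> instances;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinDispatcher(List<String> instances) {
        this.instances = instances;
    }

    // Each call returns the next instance in cyclic order.
    String nextInstance() {
        int i = Math.floorMod(next.getAndIncrement(), instances.size());
        return instances.get(i);
    }
}
```

Adding a CPS instance therefore spreads document processing evenly across instances rather than favoring any one host.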
Insufficient memory
When xPlore indexes large documents, it loads a large amount of content, which can exhaust memory. In this case, you get an out-of-memory error in dsearch.log such as:
Internal server error. [Java heap space] java.lang.OutOfMemoryError
Suggested workarounds:
Add more memory to xPlore.
Limit the document and text size as described in Maximum document and text size, page 98.
Enable the throttle mechanism as described in Throttling indexing, page 326.
Large documents
Large documents can tie up a slow network. These documents also contain more text to process. Use
xPlore administrator reports to see the average size of documents and indexing latency and throughput.
The average processing latency is the average number of seconds between the time a request is created in the indexing client and the time xPlore receives it. The State of repository report in Content Server also reports document size. For example, the Documents ingested per hour report shows the number of documents and text bytes ingested. Divide bytes ingested by document count to get the average number of bytes per document processed.
Several configuration properties affect the size of documents that are indexed and consequently the ingestion performance. Maximum document and text size, page 98, describes these settings.
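The average-size arithmetic above can be sketched as follows. The class name is ours and the figures in the test are hypothetical; substitute the values from your own reports:

```java
// Sketch: derive average document size from report figures, as described
// above — divide bytes ingested by document count.
class IngestionStats {
    static long averageBytesPerDocument(long bytesIngested, long documentCount) {
        if (documentCount == 0) {
            return 0; // no documents ingested in the reporting window
        }
        return bytesIngested / documentCount;
    }
}
```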
2. Verify that the SAN has sufficient memory to handle the I/O rate.
3. If the SAN is multiplexing a set of drives over multiple applications, move the disk space to a less contentious set of drives.
4. If other measures have not resolved the problem, change the underlying drives to solid state.
5. Use striped mapping instead of concatenated mapping so that all drives can be used to service I/O.
Slow network
A slow network between the Documentum Content Server and xPlore results in low CPU consumption
on the xPlore host. Consumption is low even when the disk subsystem has a high capacity. File
transfers via FTP or network share are also slow, independent of xPlore operations.
Suggested workarounds: Verify that the network is not set to half duplex. Check for faulty hubs or switches. Increase network capacity.
Use xPlore administrator to select the instance, and then choose Configuration. Change the following
to smaller values:
Max text threshold
Thread pool size
You can add a separate CPS instance that is dedicated to processing. This processor does not interfere
with query processing. You can also throttle ingestion to limit document content size or count. See
Throttling indexing, page 326.
Chinese:
xplore_home\dsearch\cps\cps_daemon\shared_libraries\rlp\cma\source\samples\build_user_dict.sh
Japanese:
xplore_home\dsearch\cps\cps_daemon\shared_libraries\rlp\jma\source\samples\build_user_dict.sh
Korean:
xplore_home\dsearch\cps\cps_daemon\shared_libraries\rlp\kma\source\samples\build_user_dict.sh
On Linux:
1. Export the following variables:
export BT_ROOT=xplore_home/dsearch/cps/cps_daemon/shared_libraries
export BT_BUILD=variableDir
Where variableDir is a subdirectory under
xplore_home/dsearch/cps/cps_daemon/shared_libraries/rlp/bin/ and its name differs from
computer to computer. For example:
export BT_BUILD=amd64-glibc25-gcc42
2. Use the chmod command to change the file permissions on the compilation script and some
other files:
chmod a+x build_user_dict.sh
chmod a+x build_cla_user_dictionary
chmod a+x cla_user_dictionary_util
chmod a+x t5build
chmod a+x t5sort
3. Run the compilation script build_user_dict.sh. Use the following as an example:
./build_user_dict.sh mydict.txt mydict.bin
On Windows:
1. Download and install Cygwin from http://www.cygwin.com/.
2. Launch the Cygwin terminal.
3. Export the following variables:
export BT_ROOT=xplore_home/dsearch/cps/cps_daemon/shared_libraries
export BT_BUILD=variableDir
Where variableDir is a subdirectory under
xplore_home/dsearch/cps/cps_daemon/shared_libraries/rlp/bin/ and its name differs from
computer to computer. For example:
export BT_BUILD=amd64-glibc25-gcc42
4. Edit the build_user_dict.sh file to make sure that line endings are Unix-style (\n) rather than Windows-style (\r\n).
5. In the Cygwin terminal, run the compilation script build_user_dict.sh specific to the
dictionary language; for example:
./build_user_dict.sh mydict.txt mydict.bin
3. Put the compiled dictionary into the directory specific to the dictionary language:
Chinese: xplore_home/cps/cps_daemon/shared_libraries/rlp/cma/dicts
Japanese: xplore_home/cps/cps_daemon/shared_libraries/rlp/jma/dicts
Korean: xplore_home/cps/cps_daemon/shared_libraries/rlp/kma/dicts
4. Edit the CLA configuration file to include the user dictionary. You add a dictionarypath element to
cla-options.xml in xplore_home/cps/cps_daemon/shared_libraries/rlp/etc.
The following example adds a Chinese user dictionary named mydict.bin:
<claconfig>
...
...
<dictionarypath><env name="root"/>/cma/dicts/mydict.bin
</dictionarypath>
</claconfig>
5. To prevent a word that is also listed in a system dictionary from being decomposed,
modify cps_context.xml in xplore_home/cps/cps_daemon. Add the property
com.basistech.cla.favor_user_dictionary if it does not exist, and set it to true.
For example:
<contextconfig><properties>
<property name="com.basistech.cla.favor_user_dictionary"
value="true"/>...
Custom text extractors are usually third-party modules that you configure as text extractors for
certain formats (mime types). You must create adaptor code based on proprietary, public xPlore
adaptor interfaces.
2. Annotators: Classify elements in the text, annotate metadata, and perform name indexing for
faster retrieval.
Custom annotators require software development of modules in the Apache UIMA framework.
Figure 11
Customization steps
1. Write plugin (Java or C/C++) or UIMA annotator.
2. Place DLLs or jar files in CPS classpath.
3. Repeat for each CPS instance.
4. Test content processing.
5. Perform a backup of your customization DLLs or jars when you back up the xPlore federation.
Text extraction
The text extraction phase of CPS can be customized at the following points:
Pre-processing plugin
Plugins for text extraction based on mime type, for example, the xPlore default extractor Oracle
Outside In (formerly Stellent) or Apache Tika.
Post-processing plugin
For best reliability, deploy a custom text extractor on a separate CPS instance. For instructions on
configuring a remote CPS instance for text extraction, see Adding a remote CPS instance, page 94.
The following diagram shows three different mime types processed by different plugins.
Figure 12
Pipeline elements: text_extractor_preprocessor, text_extractor, text_extractor_postprocessor.
Text extractor configuration elements: name, type, lib_path, formats, properties, property.
</formats>
</text_extractor_preprocessor>
<text_extraction>
<text_extractor>
<name>tika</name>
<type>java</type>
<lib_path>
com.emc.cma.cps.processor.textextractor.CPSTikaTextExtractor
</lib_path>
<properties>
<property name="return_attribute_name">false</property>
</properties>
<formats>
<format>application/pdf</format>
</formats>
</text_extractor> ...
</text_extraction>
...
</cps_pipeline>
Annotation
Documents from a Content Server are submitted to CPS as dftxml files. The CPS annotation
framework analyzes the dftxml content. Text can be extracted to customer-defined categories, and
the metadata can be annotated with information.
Annotation employs the Apache UIMA framework. A UIMA annotator module extracts data from the
content, optionally adds information, and puts it into a consumable XML output.
A UIMA annotator module has the following content:
Note: To deploy the UIMA module as a remote service, you can use the Vinci service that is included
in the UIMA SDK.
The pipeline configuration elements are: name, descriptor, process-class, properties.
The following example configures a UIMA module. You specify a descriptor XML file, a processing class, and an optional name. The descriptor file and process class are hypothetical. The path to the descriptor file is relative to the base of the UIMA module jar file.
<pipeline-config>
<pipeline descriptor="descriptors/PhoneNumberAnnotator.xml" process-class="
com.emc.documentum.core.fulltext.indexserver.uima.UIMAProcessFactory" name="
phonenumber_pipeline"/>
</pipeline-config>
The xPlore UIMAProcessFactory class is provided with xPlore. It executes the pipeline based on the
definitions provided.
The pipeline-usage configuration elements are: name, root-result-element, mapper-class, input-element, type-mapping, feature-mapping, properties.
In the following example of the Apache UIMA room number example, the annotated content is placed
under dmftcustom in the dftxml file (root-result-element). The content of the element r_object_type
(input-element element-path) and object_name are passed to the UIMA analyzer. (If you are annotating
content and not metadata, you do not need input-element.) For the object_name value, a room element
is generated by the RoomNumber class. Next, the building and room-number elements are generated
by a lookup of those features (data members) in the RoomNumber class.
See the Apache UIMA documentation for more information on creating the annotator class and descriptor files. For an xPlore example, see the simple annotator example, page 130.
When an annotator is applied (tracing for dsearchindex is INFO), you see the domain name and
category in the log like the following:
Domain test category dftxml,
Apply Annotator Phone Number Annotator to document
testphonenumbers5_txt1318881952107
UIMA example
The following example is from the UIMA software development kit. It is used to create a UIMA
module that normalizes phone numbers for fast identification of results in xPlore.
PhoneAnnotationTypeDef.xml creates the following string-type features. The class that handles the types is specified as the value of types/typeDescription/name: FtPhoneNumber.
phoneNumber: Phone number as it appears in a document.
normalizedForm: Normalized phone number.
Note that the type descriptor must import the xPlore type definition descriptors, which handle file
access.
<?xml version="1.0" encoding="UTF-8" ?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>TutorialTypeSystem</name>
<description>Phone number Type System Definition</description>
<vendor>The Apache Software Foundation</vendor>
<version>1.0</version>
<imports>
<import location="AnnotationTypeDef.xml" />
<import location="FtDocumentAnnotationTypeDef.xml"/>
</imports>
<types>
<typeDescription>
<name>FtPhoneNumber</name>
<description></description>
<supertypeName>uima.tcas.Annotation</supertypeName>
<features>
<featureDescription>
<name>phoneNumber</name>
<description />
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
<name>normalizedForm</name>
<description />
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
</types>
</typeSystemDescription>
<imports>
<import location="PhoneAnnotationTypeDef.xml"/>
</imports>
</typeSystemDescription>
<capabilities>
<capability>
<inputs></inputs>
<outputs>
<type>FtPhoneNumber</type>
<feature>FtPhoneNumber:phoneNumber</feature>
<feature>FtPhoneNumber:normalizedForm</feature>
</outputs>
<languagesSupported></languagesSupported>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
</analysisEngineDescription>
Imports:
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.UimaContext;
import org.apache.uima.resource.ResourceInitializationException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Class:
public class PhoneNumberAnnotator extends JCasAnnotator_ImplBase {
private Pattern[] mPatterns;
public void initialize(UimaContext aContext)
throws ResourceInitializationException {
super.initialize(aContext);
// Get config. parameter values from PhoneNumberAnnotator.xml
String[] patternStrings = (String[]) aContext.getConfigParameterValue("Patterns");
// compile regular expressions
mPatterns = new Pattern[patternStrings.length];
for (int i = 0; i < patternStrings.length; i++) {
mPatterns[i] = Pattern.compile(patternStrings[i]);
}
}
private String normalizePhoneNumber(String input)
{
StringBuffer buffer = new StringBuffer();
for (int index = 0; index < input.length(); ++index)
{
char c = input.charAt(index);
if (c >= '0' && c <= '9')
buffer.append(c);
}
return buffer.toString();
}
public void process(JCas aJCas) throws AnalysisEngineProcessException {
// get document text
String docText = aJCas.getDocumentText();
// loop over patterns
for (int i = 0; i < mPatterns.length; i++) {
Matcher matcher = mPatterns[i].matcher(docText);
while (matcher.find()) {
// found one - create annotation
FtPhoneNumber annotation = new FtPhoneNumber(aJCas);
annotation.setBegin(matcher.start());
annotation.setEnd(matcher.end());
annotation.addToIndexes();
String text = annotation.getCoveredText();
annotation.setPhoneNumber(text);
annotation.setNormalizedForm(normalizePhoneNumber(text));
}
}
}
}
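The digit-filtering logic in normalizePhoneNumber can be exercised on its own. The following is a minimal standalone sketch without the UIMA dependencies; the class name is ours, not part of the SDK sample:

```java
// Standalone sketch of the normalizePhoneNumber logic shown above,
// without the UIMA dependencies. Keeps only the digit characters.
class PhoneNormalizer {
    static String normalize(String input) {
        StringBuilder buffer = new StringBuilder();
        for (int index = 0; index < input.length(); ++index) {
            char c = input.charAt(index);
            if (c >= '0' && c <= '9') { // note the character literals
                buffer.append(c);
            }
        }
        return buffer.toString();
    }
}
```

Note the character literals '0' and '9' in the comparison: comparing against the integers 0 and 9 would match no printable characters.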
Configure the usage of the UIMA module within a category element in indexserverconfig.xml. Add one pipeline-usage element after the indexes element. Most applications annotate the dftxml category, so you place pipeline-usage in category name="dftxml". For input-element, specify a name for logging and a path to the input element in dftxml. (For a sample dftxml, see Extensible Documentum DTD, page 348.) For type-mapping, specify an element name for logging (usually the same as type-name). For type-name, specify the type definition class. For feature-mapping, specify the output XML element for element-name and the feature name that is registered in the annotator descriptor PhoneNumberAnnotator.xml. For example:
</indexes>
<pipeline-usage root-result-element="ae-result" name="phonenumber_pipeline">
<input-element name="content" element-path="/dmftdoc/dmftcontents"/>
<type-mapping element-name="FtPhoneNumber" type-name="FtPhoneNumber">
<feature-mapping element-name="phoneNumber" feature="phoneNumber"/>
<feature-mapping element-name="phoneNormlizd" feature="normalizedForm"/>
</type-mapping>
</pipeline-usage>
You can identify the origin of the CPS processing error in the files cps.log and dsearch.log.
CPS log examples: failed to extract text from password-protected files:
2011-08-02 23:27:23,288 ERROR
[MANAGER-CPSLinguisticProcessingRequest-(CPSWorkerThread-1)]
Failed to extract text for req 0 of doc VPNwithPassword_zip1312352841145,
err-code 770, err-msg: Corrupt file (native error:TIKA-198:
Illegal IOException from org.apache.tika.parser.pkg.PackageParser@161022a6)
2011-08-02 23:36:27,188 ERROR
[MANAGER-CPSLinguisticProcessingRequest-(CPSWorkerThread-2)]
Failed to extract text for req 0 of doc tf_protected_doc1312353385777,
err-code 770, err-msg: Corrupt file (native error:
Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@3b11d63f)
Chapter 6
Indexing
This chapter contains the following topics:
About indexing
Configuring an index
Troubleshooting indexing
Indexing APIs
About indexing
The indexing service receives batches of requests to index from a custom indexing client like
the Documentum index agent. The index requests are passed to the content processing service,
which extracts tokens for indexing and returns them to the indexing service. You can configure all
indexing parameters by choosing Global Configuration from the System Overview panel in xPlore
administrator. You can configure the same indexing parameters on a per-instance basis by choosing
Indexing Service on an instance and then choosing Configuration.
For information on creating Documentum indexes, see Creating custom indexes, page 145.
Modify indexes by editing indexserverconfig.xml. For information on viewing and updating this file,
see Modifying indexserverconfig.xml, page 43. By default, Documentum content and metadata are
indexed. You can tune the indexing configuration for specific needs. A full-text index can be created as
a path-value index with the FULL_TEXT option.
For information on scalability planning, see EMC Documentum xPlore Installation Guide.
Indexing depth: Only the leaf (last node) text values from subelements of an XML node with implicit
composite indexes are returned. You can configure indexing to return all node values instead of the leaf
node value. (This change negatively affects performance.) Set the value of index-value-leaf-node-only
in the index-plugin element to false. Reindex your documents to see the other nodes in the index.
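As a rough sketch of such a setting — the exact shape of the index-plugin element is not shown in this section, so treat this fragment as hypothetical and check your own indexserverconfig.xml (the option might be expressed as an attribute rather than a property child):

```xml
<!-- Hypothetical sketch only: verify the real index-plugin layout in
     your indexserverconfig.xml. Only the option name and value are
     taken from the text above. -->
<index-plugin>
  <property name="index-value-leaf-node-only" value="false"/>
</index-plugin>
```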
The paths in the configuration file are in XPath syntax and refer to the path within the dftxml representation of the object. (For information on dftxml, see Extensible Documentum DTD, page 348.) Specify an XPath value to the element whose content requires text extraction for indexing.
Table 17. Text extraction options: do-text-extraction; for-element-with-name; for-element-with-name/xml-content; for-element-with-name/save-tokens-for-summary-processing; xml-content on-embed-error; xml-content index-as-sub-path; xml-content file-limit.
Table 17 (continued). Compression options: compress; compress/for-element.
Configuring an index
Indexes are configured within an indexes element in the file indexserverconfig.xml. For information on
modifying this file, see Modifying indexserverconfig.xml, page 43. The path to the indexes element
is category-definitions/category/indexes. Four types of indexes can be configured: fulltext-index,
value-index, path-value index, and multi-path.
By default, multi-path indexes do not have all content indexed. If an element does not match a
configuration option, it is not indexed. To index all element content in a multi-path index, add a
subpath element on //*. For example, to index all metadata content, use the path dmftmetadata//*.
The following child elements of node/indexes/index define an index.
Table 18. Index options: path-value-index; path-value-index/sub-path.
Table 18 (continued). sub-path attributes:
enumerate-repeating-elements: Boolean; specifies whether the position of the element in the path should be indexed. Used for correlated repeating attributes, for example, media objects with prop_name=dimension and prop_value=800x600,blue.
Subpaths
A subpath definition specifies the path to an element. The path information is saved with the indexed
value. A subpath increases index size while enhancing performance. For most Documentum
applications, you do not need to modify the definitions of the subpath indexes, except for the following
use cases:
Store facet values in the index. For example:
<sub-path ...returning-contents="
true" compress="true" value-comparison="true" full-text-search="true"
enumerate-repeating-elements="false" type="string"
path="dmftcustom/entities/person"/>
Add a subpath for non-string metadata. By default, all metadata is indexed as string type. To
speed up searches for non-string attributes, add a subpath like the following. Valid types: string |
integer | double | date | datetime.
<sub-path leading-wildcard="false" compress="false" boost-value="1.0"
include-descendants="false" returning-contents="false" value-comparison="true"
full-text-search="true" enumerate-repeating-elements="false" type="datetime"
path="dmftmetadata//r_creation_date"/>
Note: Starting from xPlore 1.3, boolean is no longer a supported subpath type. Any subpath with its
type set to boolean in indexserverconfig.xml will be automatically converted to string type.
Add paths for dmftcustom area elements (metadata or content that a TBO injects).
Add paths to support XQuery of XML content.
Modify the capabilities of existing subpaths, such as supporting leading wildcard searches for
certain paths. For example:
<sub-path description="leading wildcard queries"
returning-contents="false"
value-comparison="true" full-text-search="true"
enumerate-repeating-elements="
false" leading-wildcard="true" type="string"
path="dmftmetadata//object_name"/>
Add subpaths for metadata that should not be indexed, for example, Documentum attributes. Set
full-text-search to false.
Add a subpath for sorting. Requires an attribute value of value-comparison=true. See Sort
support, page 143.
Note: For all subpath changes that affect existing documents, you must rebuild the index of every
affected collection. To verify that your subpath was used during indexing, open the xDB admin tool
and drill down to the multi-path index. Double-click Multi-Path index in the Index type column. You
see a Rebuild Index page that lists all paths and subpaths. For example:
Table 19 compares index features without and with a subpath: flexibility (limited without a subpath, flexible with one); ftcontains (full-text) support; update cost (low overhead without a subpath, high overhead with one).
Sort support
You can configure sorting by Documentum attributes. Add a subpath in indexserverconfig.xml for each attribute that is used for sorting. This requires an attribute value of value-comparison="true".
Note: Reindexing is required to return documents sorted by attributes.
Follow these rules for attribute sort subpath definitions:
Configure the most specific subpath for sorting. This ensures that the index is used for lookup. For example, if the document has elements root/pathA/my_attribute and root/pathB/my_attribute (same attribute), do not set root//some_attribute as sortable. The following example configures an attribute for sorting using the full XPath:
<sub-path sortable="true" ... value-comparison="true" full-text-search="
true" ...path="dmftmetadata/dm_document/r_modify_date"/>
If there is more than one path to the same attribute, you must create a subpath definition for each
path. Ambiguous paths may result in a slow search outside the index. In the example above, you
need two subpath definitions:
<sub-path sortable="true" ... value-comparison="true" full-text-search="
true" ...path="dmftmetadata/root/pathA/my_attribute"/>
<sub-path sortable="true" ... value-comparison="true" full-text-search="
true" ...path="dmftmetadata/root/pathB/my_attribute"/>
If there is only one instance of an attribute in the rendered dftxml, a shortened path notation is treated as the most specific subpath. For example, there is only one object_name element in dftxml, so dmftmetadata//object_name is the most specific path.
Sort debugging
You can use the Query debug and Optimizer tools in xPlore administrator to troubleshoot sorting. When
an index is properly configured and used for the query, you see the following in the Query debug pane:
Found an index to support all order specs. No sort required.
If you do not see this confirmation, check the Optimizer pane. Find the following section:
<endingimplicitmatch indexname="dmftdoc">
In the following example, there are two order-by clauses in the query. The first fails because there
is no sub-path definition for sorting.
<endingimplicitmatch indexname="dmftdoc">
<LucenePlugin>
<ImplicitIndexOptimizer numconditionsall="1"
numorderbyclausesall="2">
<condition accepted="true" expr="child::dmftmetadata/
descendant-or-self::node()/child::object_name
[. contains text rule]"/>
<orderbyclause accepted="true" expr="child::dmftmetadata/
descendant-or-self::node()/child::object_name"/>
<orderbyclause accepted="false" reason="
No exact matching subpath configuration found that
matches order-by clause" expr="child::dmftmetadata/
descendant-or-self::node()/child::r_modify_data"/>
</ImplicitIndexOptimizer>
<numacceptedandrejected numconditionsaccepted="1"
numconditionsskipped="0" numorderbyclausessaccepted="1"
numorderbyclausesskipped="1"/>
</LucenePlugin>
<conditions numaccepted="1">
<pnodecondition>
child::dmftmetadata/descendant-or-self::
node()/child::object_name[. contains text rule]
</pnodecondition>
<externallyoptimizedcondition accepted="true">
child::dmftmetadata/descendant-or-self::node()/
child::object_name[. contains text rule]
</externallyoptimizedcondition>
</conditions>
<orderbyclauses numaccepted="1">
<orderbyclause accepted="true">child::dmftmetadata/
descendant-or-self::node()/child::object_name
</orderbyclause>
<orderbyclause accepted="false">child::dmftmetadata/
descendant-or-self::node()/child::r_modify_data
</orderbyclause>
</orderbyclauses>
</endingimplicitmatch>
The output shows that dmftmetadata//r_modify_data is not accepted. The reason is a typo in the sub-path definition in indexserverconfig.xml: the proper sub-path is dmftmetadata//r_modify_date.
Troubleshooting indexing
You can use reports to troubleshoot indexing and content processing issues. See Document processing (CPS) reports, page 290, and Indexing reports, page 290, for more information on these reports.
Under certain conditions, CPS fails while processing a document. xPlore restarts the CPS process,
but the restart causes a delay. Restart is logged in cps.log and cps_daemon.log. For information on
these logs, see CPS logging, page 299.
Large documents tie up ingestion
A large document in the ingestion pipeline can delay smaller documents that are further back in
the queue. Detect this issue using the Documents ingested per hour report in xPlore administrator.
(Only document size averages are reported.)
If a document is larger than the configured maximum limit for document size or text size, the
document is not indexed. The document metadata are indexed but the content is not. These
documents are reported in the xPlore administrator report Content too large to index.
Workarounds: Attempt to refeed a document that was too large. Increase the maximum size for
document processing or set cut_off_text to true in PrimaryDsearch_local_configuration.xml. This file
is located in xplore_home/dsearch/cps/cps_daemon. See Maximum document and text size, page 98.
Ingestion batches are large
During periods of high ingestion load, documents can take a long time to process. Review the
ingestion reports in xPlore administrator to find bytes processed and latency. Use dsearch.log to
determine when a specific document was ingested.
Workaround: Set up a dedicated index agent for the batch workload.
Insufficient hardware resources
If CPU, disk I/O, or memory is highly utilized, increase the capacity. Performance on a virtual server is slower than on a dedicated host. For a comparison of performance on various storage types, see Disk space and storage, page 313.
As a consequence, all new objects with this attribute are not indexed.
To resolve the conflict, delete the collection and reindex the objects as described in Deleting a
collection and recreating indexes, page 165.
Rebuilding the indexes or reindexing without deleting the collection does not resolve the conflict
because the conflict still exists in xDB.
2. Stop all indexing activity on the instance. Set the target collection, domain, or federation to
read-only or maintenance mode.
3. Invoke the checker using CLI. (See Using the CLI, page 184.) The default batch size is 1000. Edit
the script to change the batch size. Syntax (on one line):
xplore checkDataConsistency unit, domain, collection,
is_fix_trackDB, batch-size, report-directory
Valid values:
unit: collection or domain
domain: domain name.
collection: Collection name (null for domain consistency check)
is_fix_trackDB: true or false. Set to false first and check the report. If indexing has not been turned off, inconsistencies are reported.
batch-size: Numeric value greater than or equal to 1000. Non-numeric, negative, or null values
default to 1000.
report-directory: Path for consistency report. Report is created in a subdirectory
report-directory/time-stamp/domain_name|collection_name. Default base directory is the
current working directory.
Windows example: Checks consistency of a default collection in defaultDomain and fixes the tracking database:
xplore "checkDataConsistency collection, defaultDomain, default1, true, 2000, C:\\tmp"
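The defaulting rule for batch-size can be illustrated with a short sketch. This mirrors the documented behavior; it is not the actual xPlore CLI code, and treating positive values below 1000 as the default is our assumption:

```java
// Sketch of the documented batch-size rule: numeric values >= 1000 are
// used as-is; non-numeric, negative, or null values fall back to 1000.
// Not the actual xPlore code; handling of values in 1..999 is assumed.
class BatchSize {
    static final int DEFAULT = 1000;

    static int resolve(String raw) {
        if (raw == null) {
            return DEFAULT;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value >= DEFAULT ? value : DEFAULT;
        } catch (NumberFormatException e) {
            return DEFAULT; // non-numeric input
        }
    }
}
```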
Indexing APIs
Access to indexing APIs is through the interface
com.emc.documentum.core.fulltext.client.IDSearchClient. Each API is described in the javadocs. The
following topics describe the use of indexing APIs.
For a detailed example of routing to specific collections and targeting queries to that collection, see
"Improving Webtop Search Performance Using xPlore Partitioning" on the EMC Community Network
(ECN).
3. Create your custom class. (See example.) Import IFtIndexRequest in the package
com.emc.documentum.core.fulltext.common.index. This class encapsulates all aspects of an
indexing request:
public interface IFtIndexRequest
{
String getDocId ();
long getRequestId ();
FtOperation getOperation (); //returns add, update or delete
String getDomain ();
String getCategory ();
String getCollection ();
IFtDocument getDocument (); //returns doc to be indexed
public String getClientId();
void setCollection(String value);
public void setClientId(String id);
}
SimpleCollectionRouting example
This example routes a document to a specific collection based on Documentum version.
The sample Java class file in the SDK/samples directory assumes that the Documentum index agent establishes a connection to the xPlore server. Place the compiled class SimpleCollectionRouting in the Documentum index agent classpath, for example, xplore_home/jboss5.1.0/server/DctmServer_Indexagent/deploy/IndexAgent.war/WEB-INF/classes.
This class parses the input dftxml representation from the index agent. The class gets a metadata value, tests it, and then routes the document to a custom collection if it meets the criterion (a new document).
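As a self-contained illustration of the routing decision, the following sketch uses a stripped-down stand-in for IFtIndexRequest (only getCollection/setCollection) and a hypothetical version criterion; it is not the SDK sample itself:

```java
// Self-contained illustration of version-based collection routing.
// MinimalRequest is a stand-in for the xPlore IFtIndexRequest interface
// shown above; only the methods needed for this example are included.
interface MinimalRequest {
    String getCollection();
    void setCollection(String value);
}

class SimpleRequest implements MinimalRequest {
    private String collection = "default";
    public String getCollection() { return collection; }
    public void setCollection(String value) { collection = value; }
}

class VersionRouter {
    // Route documents whose version label starts with "1." (a new
    // document, by this example's hypothetical criterion) to a
    // custom collection; others keep their assigned collection.
    static void route(MinimalRequest request, String versionLabel) {
        if (versionLabel != null && versionLabel.startsWith("1.")) {
            request.setCollection("new_documents");
        }
    }
}
```

The real SDK sample extracts the version value from the dftxml DOM with XPath; here the value is passed in directly to keep the sketch runnable.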
Imports
import com.emc.documentum.core.fulltext.client.index.custom.IFtIndexCollectionRouting;
import com.emc.documentum.core.fulltext.client.index.FtFeederException;
import com.emc.documentum.core.fulltext.client.common.IDSearchServerInfo;
import com.emc.documentum.core.fulltext.common.index.IFtIndexRequest;
import com.emc.documentum.core.fulltext.common.index.IFtDocument;
import java.util.List;
import javax.xml.xpath.*;
import org.w3c.dom.*;
return m_version;
}
Chapter 7
Index Data: Domains, Categories, and
Collections
This chapter contains the following topics:
Managing domains
Configuring categories
Managing collections
Check DB consistency
Perform this check before backup and after restore. It determines whether there are any corrupted or missing files, such as configuration files or Lucene indexes. Lucene indexes are checked to see whether they are consistent with the xDB records: tree segments, xDB page owners, and xDB DOM nodes.
Note: You must set the domain to maintenance mode before running this check.
Select the following options to check. Some options require extensive processing time:
Segments and admin structures: Efficient check, does not depend on index size.
Free and used pages: Touches all pages in database.
Note: When the result from this check is inconsistent, run it two more times.
Pages owner: Touches all pages in database.
Indexes: Traverses all the indexes and checks DOM nodes referenced from the index.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
Basic checks of indexes: Efficient check of the basic structure. The check verifies that the necessary
Lucene index files exist and that the internal xDB configuration is consistent with files on the
file system.
DOM nodes: Expensive operation, accesses all the nodes in the database.
The basic check of indexes inspects only the external Lucene indexes. The check verifies that the
necessary files exist and that the internal xDB configuration is consistent with files on the file system.
This check runs much faster than the full consistency check.
Use the standalone consistency checker to check data consistency for specific collections in a domain
or specific domains in a federation. If inconsistencies are detected, the tool can rebuild the tracking
database: missing tracking entries are re-created, and tracking entries that point to nothing are
deleted. See Running the standalone consistency checker, page 149.
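The reconciliation rule the checker applies can be sketched generically as two set differences (an illustration, not the xPlore implementation; "tracked" stands for IDs in the tracking database and "indexed" for IDs actually present in the collection):

```java
import java.util.Set;
import java.util.TreeSet;

// Generic sketch of the tracking-database reconciliation rule described above.
public class TrackingReconcileSketch {
    /** IDs that are indexed but have no tracking entry: entries to re-create. */
    public static Set<String> entriesToCreate(Set<String> tracked, Set<String> indexed) {
        Set<String> missing = new TreeSet<>(indexed);
        missing.removeAll(tracked);
        return missing;
    }

    /** Tracking entries that point to nothing: entries to delete. */
    public static Set<String> entriesToDelete(Set<String> tracked, Set<String> indexed) {
        Set<String> dangling = new TreeSet<>(tracked);
        dangling.removeAll(indexed);
        return dangling;
    }
}
```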
View DB statistics
Displays performance statistics from xDB operations.
Backup
See Backup in xPlore administrator, page 179.
Managing domains
A domain is a separate, independent, logical, or structural grouping of collections. Domains are
managed through the Data Management screen in xPlore administrator. The Documentum index agent
creates a domain for the repository to which it connects. This domain receives indexing requests
from the repository.
New domain
Select Data Management in the left panel and then click New Domain in the right panel. Choose a
default document category. (Categories are specified in indexserverconfig.xml.) Choose a storage
location from the dropdown list. To create a storage location, see Creating a collection storage
location, page 164.
A Documentum index agent creates a domain for a repository and creates ACL and group collections
in that domain. Note: When you create a domain in xPlore administrator, ACL and group collections
are not created automatically.
New Collection
Create a collection and configure it. See Changing collection properties, page 161.
Configuration
The document category and storage location are displayed (read-only). You can set the runtime mode
to normal (default) or maintenance (for a corrupt domain). The mode does not persist across xPlore
sessions; it reverts to normal when xPlore restarts.
Delete domain
If you cannot start xPlore and a log entry reports that the domain is
corrupted, force recovery: add the property force-restart-xdb=true in
xplore_home/jboss5.1.0/server/%INSTANCE_NAME%/deploy/dsearch.war/
WEB-INF/classes/indexserver-bootstrap.properties.
Remove the domain with the following steps.
1. Bind all collections in the domain to the primary instance using xPlore administrator.
2. Verify that all collections meet the following conditions before you delete the domain:
No collections are detached.
No collections are off-line.
No collections are in search-only mode.
If the delete domain transaction encounters a detached collection, it displays an error message.
3. (Optional) Back up the xplore_home/config and xplore_home/data/domain_name folders.
4. In xPlore administrator, choose Data Management and click the red X next to a domain to delete
it. This option is not enabled if the domain is detached. A corrupted domain cannot be deleted
using xPlore administrator. For steps to manually delete a corrupted domain, see Delete a corrupted
domain, page 157.
If there is an error in deleting any collection in the domain, the entire delete domain transaction
is rolled back.
Configuring categories
A category defines a class of documents and their XML structure. The category definition specifies the
processing and semantics that are applied to the ingested XML document. You can specify the XML
elements that have text extraction, tokenization, and storage of tokens. You also specify the indexes
that are defined on the category and the XML elements that are not indexed. More than one collection
can map to a category. xPlore manages categories.
The default categories include dftxml (Documentum content), security (ACLs and groups), the tracking
database, metrics database, audit database, and thesaurus database.
When you create a collection, choose a category from the categories defined in indexserverconfig.xml.
When you view the configuration of a collection, you see the assigned category. It cannot be changed
in xPlore administrator. To change the category, edit indexserverconfig.xml.
You can configure the indexes, text extraction settings, and compression setting for each category. The
paths in the configuration file are in XPath syntax and refer to the path within the XML representation
of the document. (All documents are submitted for ingestion in an XML representation.) Specify an
XPath value to the element whose content requires text extraction for indexing.
Table 20 lists the category configuration options in indexserverconfig.xml: category-definitions,
category, properties/property, and track-location.
Managing collections
About collections, page 159
Planning collections for scalability, page 160
Uses of subcollections, page 160
Adding or deleting a collection, page 161
Changing collection properties, page 161
Routing documents to a specific collection, page 162
Attaching and detaching a collection, page 162
Moving a collection, page 162
Creating a collection storage location, page 164
Rebuilding the index, page 164
Deleting a collection and recreating indexes, page 165
Querying a collection, page 165
About collections
A collection is a logical group of XML documents that is physically stored in an xDB detachable
library. A collection represents the most granular data management unit within xPlore. All documents
submitted for indexing are assigned to a collection. A collection generally contains one category of
documents. In a basic deployment, all documents in a domain are assigned to a single default collection.
You can create subcollections under each collection and route documents to user-defined collections.
A collection is bound to a specific instance in read-write state (index and search, index only, or update
and search). A collection can be bound to multiple instances in read-only state (search-only).
Viewing collections
To view the collections for a domain, choose Data Management and then choose the domain in the left
pane. In the right pane, you see each collection name, category, usage, state, and instances that the
collection is bound to.
There is a red X next to the collection to delete it. For xPlore system collections, the X is grayed out,
and the collection cannot be deleted.
Note: You cannot back up a collection in the offline state. You must detach it or bring it online.
Usage: Type of xDB library. Valid types: data (index), ApplicationInfo and SystemInfo.
Current size on disk, in KB
Uses of subcollections
Create subcollections for the following uses:
Create multiple top-level collections for migration to boost the ingestion rate. After ingestion
completes, move the temporary collection to a parent collection. The temporary collection is
now a subcollection. The parent and subcollections are searched faster than a search of multiple
top-level collections.
Create subcollections for data management. For example, you create a collection for 2011 data
with a subcollection to store November data.
The following restrictions are enforced for subcollections, including a collection that is moved
to become a subcollection:
Subcollections cannot be detached or reattached when the parent collection is indexed. For
example, a path-value index is defined with no subpaths, such as the folder-list-index.
Subcollections cannot be backed up or restored separately from the parent.
Subcollections must be bound to the same instance as the parent.
Subcollection state cannot contradict the state of the parent. For example, if the parent is
search_only, the subcollection cannot be index_only or index_and_search. If the parent is
searchable, the adopted collection cannot be search-only.
Exception: When the parent collection is in update_and_search or index_and_search state,
subcollections can be any state.
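The state-compatibility rule above can be sketched as a predicate covering only the cases named (a simplification for illustration, not xPlore's actual validation code):

```java
// Sketch of the stated subcollection-state rule. States beyond the examples
// given in the text are not modeled and are treated as allowed.
public class SubcollectionStateSketch {
    public static boolean isAllowed(String parentState, String childState) {
        // Exception: a writable-and-searchable parent accepts any child state.
        if (parentState.equals("update_and_search") || parentState.equals("index_and_search")) {
            return true;
        }
        // A search_only parent rejects children that can index.
        if (parentState.equals("search_only")) {
            return !(childState.equals("index_only") || childState.equals("index_and_search"));
        }
        // Other combinations are not specified here; allow them in this sketch.
        return true;
    }
}
```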
index and search is the default state when a new collection is created. A collection can have
only one binding that is index_and_search.
Use index only to repair the index. You cannot query a collection that is set to index only.
Use update and search for read-only collections that have updates to existing content or
metadata. You cannot add new content to the collection.
Use search only (read-only) on multiple instances for query load balancing and scalability.
3. Change binding to another xPlore instance:
a. Set the state of the collection. If the collection state is index_and_search, update_and_search,
or index_only, you can bind to only one instance. If the collection state is search_only, you can
bind the collection to multiple instances for better resource allocation.
b. Choose a Binding instance.
Limitations:
If a binding instance is unreachable, you cannot edit the binding.
You cannot change the binding of a subcollection to a different instance from the parent
collection.
To change the binding on a failed instance, restore the collection to the same instance or to a
spare instance.
4. Change storage location. To set up storage locations, see Creating a collection storage location,
page 164.
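The binding rule in step 3 can be sketched as a small lookup (an illustration only; the state names come from the text above, and this is not xPlore code):

```java
// Sketch of the binding rule: read-write states bind to exactly one instance,
// search_only may bind to many.
public class BindingRuleSketch {
    public static int maxInstances(String state) {
        switch (state) {
            case "index_and_search":
            case "update_and_search":
            case "index_only":
                return 1;                 // single read-write binding
            case "search_only":
                return Integer.MAX_VALUE; // read-only, multiple instances allowed
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }
}
```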
Moving a collection
If you are moving a collection to another xPlore instance, choose the collection and click
Configuration. Change Binding instance.
You can create a collection for faster ingestion, and then move it to become a subcollection of another
collection after ingestion has completed. When you move it as a subcollection, search performance is
improved.
If a collection meets the following requirements, you can move it to become a subcollection:
Facet compression was disabled for all facets before the documents were indexed.
The collection is not itself a subcollection (does not have a parent collection).
The collection does not have subcollections.
The collection has the same category and index definitions (in indexserverconfig.xml) as the new
parent.
xPlore enforces additional restrictions after the collection has been moved. For information on
subcollection restrictions, see Uses of subcollections, page 160.
1. Choose the collection and click Move.
2. In the Move to dialog, select a target collection. This collection can be a subcollection or a
top-level collection.
In xplore_home/config/XhiveDatabase.bootstrap, update the path of each segment to the new path. For example:
<segment reserved="false" library-id="0" library-path=
"/Repository1/dsearch/Data/default" usable="true"
usage="detachable_root" state="read_write" version="1"
temp="false" id="Repository1#dsearch#Data#default">
<file id="12" path="C:\xPlore_1/data-new\Repository1\default\
xhivedb-Repository1#dsearch#Data#default-0.XhiveDatabase.DB"/>
<binding-server name="primary"/>
</segment>
5. On all instances, change the path in indexserver-bootstrap.properties to match the new bootstrap
location. For example:
xhive-data-directory=C\:/xPlore_1/data-new
6. Delete the JBoss cache (/work folders) from the index agent and primary instance web applications.
Also delete JBoss cache folders for the secondary instances. The path of the /work folder is
xPlore_home/jboss5.1.0/server/DctmServer_InstanceName/work.
7. Start the xPlore instances.
C:/xPlore/data/mydomain/default/dmftdoc_2er90/ids.txt
4. In the index agent UI, provide the path to the list of object IDs. See Indexing documents in normal
mode, page 79.
After the index is rebuilt, run ftintegrity or the State of Index job in Content Server 6.7 or higher. See
Using ftintegrity, page 73 or Running the state of index job, page 76.
If there is only one collection to process documents, you cannot delete it. As a workaround, you
can create a temporary collection to be able to delete the problematic one.
2. Modify the index definition, if necessary.
3. Create a collection with the same name as the one you deleted.
4. If you created a temporary collection, remove it before refeeding documents.
5. Refeed the documents to launch a full reindexing: in the index agent UI, select Start new
reindexing operation.
If custom routing is defined, it is applied. Otherwise, default routing is applied.
Querying a collection
1. Choose a collection in the Data Management tree.
2. In the right pane, click Execute XQuery.
3. Check Get query debug to debug your query. The query optimizer is for technical support use.
To route queries to a specific collection, see Routing a query to a specific collection, page 257.
The -d argument (xhivedb) is the same for all xPlore installations. The path argument is the path
within xDB to your collection, which you can check in the XHAdmin tool. The output is large
for most collections. Redirect output to a file using the -o option. (The file must exist before you run
the command.) For example:
statistics-ls -d xhivedb -o C:/temp/DefaultCollStats.txt
/TechPubsGlobal/dsearch/Data/default --lucene-params --details
The output gives the status of each collection (Merge State). For example:
LibPath=/PERFXCP1/dsearch/Data/default IndexName={dmftdoc}
ActiveSegment=EI-8b38b821-4e29-42b2-9fe0-8e6c82764a6b-211106232537097-luceneblobs-1
EntryName=LI-3bb3483d-38c9-4d14-8a90-5a13a9a19717
MergeState=NONE isFinalIndex=FINAL
LastMergeTime=12/09/2012-07:31:11 MinLSN=0 MinLSN=0
LibPath=/PERFXCP1/dsearch/Data/default IndexName={dmftdoc}
ActiveSegment=EI-8b38b821-4e29-42b2-9fe0-8e6c82764a6b-211106232537097-luceneblobs-1
EntryName=LI-2ea1578b-0d82-496a-9c81-ee15502b3cbe
MergeState=NONE isFinalIndex=NOT-FINAL
LastMergeTime=14/09/2012-10:15:48 MinLSN=786124
MinLSN=485357881901
Other statistics
You can check other statistics such as returnable fields, size of index, and number of documents. The
statistics command has the same arguments as statistics-ls.
For example:
statistics --docs-num -d xhivedb
/TechPubsGlobal/dsearch/Data/default dmftdoc
Statistics options:
--lucene-sz: Size of Lucene fields (.fdt), Index to fields (.fdx), and total size of Lucene index in
bytes (all).
--lucene-rf: Statistics of returnable fields (configured in indexserverconfig.xml). Includes total
count of path mapping and value mapping and compression mapping consistency.
--lucene-list: Shows whether each index is final.
--lucene-params: Lists the xDB parameters set in xDB.properties.
--docs-num: Displays the number of documents in the collection. This value should match the
number displayed for a collection in xPlore administrator.
2.
3.
Remove the collection-related segment including the data segment and its corresponding
tracking DB, tokens, and xmlContent segment (if any) from XhiveDatabase.bootstrap in
xplore_home/config. In the following example, the segment ID starts with defaultDomain and
ends with default:
<segment id="defaultDomain#dsearch#Data#default" temp="false"
version="1" state="detach_point" usage="detachable_root" usable="false">
<file path="c:\DSSXhive\defaultDomain\default\xhivedb-defaultDomain#
dsearch#Data#default-0.XhiveDatabase.DB" id="14"/>
<binding-server name="primary"/>
</segment>
<segment id="defaultDomain#dsearch#SystemInfo#TrackingDB#default"
temp="false" version="1" state="detach_point" usage="detachable_root"
usable="false">
<file path="c:\DSSXhive\defaultDomain\default\xhivedb-defaultDomain#
dsearch#SystemInfo#TrackingDB#default-0.XhiveDatabase.DB" id="15"/>
<binding-server name="primary"/>
</segment>
4.
5.
For example:
xdb>run-server-repair --port 9330 -f C:\xPlore\config\XhiveDatabase.bootstrap
When the xDB server starts successfully, you see a message like the following:
xDB 10_2@1448404 server listening at xhive://0.0.0.0:9330
Command arguments
To list all available commands, enter help at the xdb prompt. To get the arguments for a
command, enter help <command>, for example:
xdb>help repair-merge
The path_to_index argument in repair commands is a library path, not a file system path. For example:
repository_name/dsearch/Data/default. The index_name value is dmftdoc. dmftdoc is the multi-path
index for Documentum repositories.
For example:
repair-merge -d xhivedb LH1/dsearch/Data/default dmftdoc --final
For example:
repair-segments -d xhivedb LH1/dsearch/Data/default dmftdoc
For example:
repair-blacklists -d xhivedb LH1/dsearch/Data/default dmftdoc --check-dups
Chapter 8
Backup and Restore
This chapter contains the following topics:
About backup
About restore
Offline restore
About backup
Back up a domain or xPlore federation after you make xPlore environment changes: Adding or
deleting a collection, or changing a collection binding. If you do not back up, then restoring the
domain or xPlore federation puts the system in an inconsistent state. Perform all your anticipated
configuration changes before performing a full federation backup.
You can back up an xPlore federation, domain, or collection using xPlore administrator or use your
preferred volume-based or file-based backup technologies. The EMC Documentum xPlore Installation
Guide describes high availability and disaster recovery planning.
You can use external automatic backup products like EMC NetWorker. All backup and restore
commands are available as command-line interfaces (CLI) for scripting. See the chapter Automated
Utilities (CLI).
You cannot back up and restore to a different location. If the disk is full, set the collection state to
update_and_search and create a collection in a new storage location.
Note: If you remove segments from xDB, your backups cannot be restored.
Backup technology
xPlore supports the following backup approaches.
Native xDB backups: These backups are performed through xPlore administrator. They are
incremental, cumulative (differential), or full. A cumulative backup contains all changes since the last
full backup. You can back up hot (while indexing), warm (search only), or cold (offline). See Backup
in xPlore administrator, page 179.
File-based backups: Back up the xPlore federation directory xplore_home/data, xplore_home/config,
and /dblog files. Backup is warm (search only) or cold (offline). Incremental file-based backups
are not recommended, since most files are touched when they are opened. In addition, Windows
file-based backup software requires exclusive access to a file during backup and thus requires
a cold backup.
Volume-based (snapshot) backups: Backup is warm (search only) or cold (offline). Can be
cumulative or full backup of disk blocks. Volume-based backups require a third-party product such
as EMC Timefinder.
A snapshot, which is a type of point-in-time backup, is a backup of the changes since the last
snapshot. A copy-on-write snapshot is a differential copy of the changes that are made every time
new data is written or existing data is updated. In a split-mirror snapshot, the mirroring process is
stopped and a full copy of the entire volume is created. A copy-on-write snapshot uses less disk
space than a split-mirror snapshot but requires more processing overhead.
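The copy-on-write idea can be illustrated generically (a conceptual sketch only, not how any particular storage product implements snapshots):

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of a copy-on-write snapshot: the snapshot shares the live
// data and preserves a block's old content only the first time it changes.
public class CowSnapshotSketch {
    private final Map<Integer, String> live;                        // current blocks
    private final Map<Integer, String> preserved = new HashMap<>(); // pre-change copies

    public CowSnapshotSketch(Map<Integer, String> live) {
        this.live = live;
    }

    /** Write a block, preserving its pre-snapshot content on first change. */
    public void write(int block, String data) {
        if (!preserved.containsKey(block)) {
            preserved.put(block, live.get(block)); // copy on first write only
        }
        live.put(block, data);
    }

    /** Read a block as it was when the snapshot was taken. */
    public String snapshotRead(int block) {
        return preserved.containsKey(block) ? preserved.get(block) : live.get(block);
    }
}
```

A split-mirror snapshot, by contrast, would copy every block up front, which is why it uses more disk space but less ongoing processing.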
Backup combinations
Periodic full backups are recommended in addition to differential backups. You can perform
incremental or cumulative backups only on the xPlore federation and not on a domain or collection.
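Which backups a restore then needs can be sketched as follows (an illustration of the full/cumulative/incremental combinations described above, not an xPlore tool; it assumes an incremental covers changes since the previous backup of any kind, and a cumulative covers everything since the last full):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: given a time-ordered backup history of "F" (full), "C" (cumulative),
// and "I" (incremental), return the indices of the backups a restore needs.
public class RestoreChainSketch {
    public static List<Integer> backupsToRestore(List<String> history) {
        int lastFull = history.lastIndexOf("F");
        List<Integer> chain = new ArrayList<>();
        chain.add(lastFull);
        // A cumulative supersedes everything between it and the last full,
        // so start the incremental chain at the last cumulative, if any.
        int start = lastFull + 1;
        for (int i = history.size() - 1; i > lastFull; i--) {
            if (history.get(i).equals("C")) { start = i; break; }
        }
        for (int i = start; i < history.size(); i++) {
            if (!history.get(i).equals("F")) chain.add(i);
        }
        return chain;
    }
}
```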
Table 21. Backup scenarios

Level               Backup state   DR technology   Backup scope
collection          warm           xPlore          full only
domain              warm           xPlore          full only
xPlore federation   warm or hot    xPlore          full, incremental, or cumulative
xPlore federation   cold or warm   volume*         full or cumulative
xPlore federation   cold or warm   file*           full only
About restore
All restore operations are performed offline. If you performed a hot backup using xPlore administrator,
the backup file is restored to the point at which backup began.
Each xPlore instance owns the index for one or more domains or collections, and a transaction log. If
there are multiple instances, one instance can own part of the index for a domain or collection. The
system uses the transaction log to restore data on an instance.
Restore a backup to the same location. If the disk is full, set the collection state to update_and_search
and create a collection in a new storage location.
Note: You cannot restore the backup of a previous installation of xPlore after you upgrade to 1.3,
because the xDB version has changed. Back up your installation immediately after upgrading xPlore.
Scripted restore
Use the CLI for scripted restore of a federation, collection, or domain. See the chapter Automated
Utilities (CLI). xPlore supports offline restore only. The xPlore server must be shut down to restore
a collection or an xPlore federation.
If the consistency check passes, check the number of folders named LI-* in
xplore_home/data../lucene-index directories.
If the consistency check fails, perform the xDB command repair-segments.
1. Open a command window and navigate to xplore_home/dsearch/xhive/admin.
2. Launch XHCommand.bat or XHCommand.sh.
3. Enter the following xDB command.
xdb repair-segments -d database -p path target
Dead objects
Use this procedure when xPlore reports XhiveException: OBJECT_DEAD.
1. In xPlore administrator, make sure that all collections have a state of index_and_search.
2. Stop xPlore instances.
3. Start xDB in repair mode.
a. Change the memory in xdb.properties for all nodes. This file is located in
%XPLORE%/jboss5.1.0/server/%NODE_NAME%/deploy/dsearch.war/WEB-INF/classes/xdb.properties.
This change can remain after repair.
XHIVE_MAX_MEMORY=1536M
b. Start each instance in repair mode: Open a shell, go to the directory
xplore_home/dsearch/xhive/admin/, run XHCommand.bat (Windows) or ./XHCommand.sh (Linux),
and input the instance name, port, and path to the bootstrap file on the host.
For example:
run-server-repair --address 0.0.0.0 --port 9330 --nodename primary
-f %XPLORE%/config/XhiveDatabase.bootstrap
4. Enter the following XHCommand, specifying the domain, collection name, and parameter. The
repair command can take a long time if the index size is more than a few GB. To scan without
removing dead objects, omit the option --repair-index:
repair-blacklists -d xhivedb /%DOMAIN%/dsearch/Data/%COLLECTION% dmftdoc
--check-dups --repair-index
The command reports counts such as:
total docs processed =
total checked blacklisted objects =
total unaccounted for blacklisted objects =
total duplicate entries found =
total intranode dups found =
If "Total potential impacted normal objects" is not 0, a file is generated with the following name
convention: %DOMAIN%#dsearch#Data#%COLLECTION%_objects_2012-03-12-21-02-06.
Resubmit this file using the index agent UI.
1. Log in to the index agent UI.
2. Choose Object File.
3. Browse to the file and choose Submit.
a. Navigate to xplore_home/dsearch/xhive/admin.
b. Launch the command-line tool with the following command. You supply the administrator
password (same as xPlore administrator).
XHCommand suspend-diskwrites
2. Set all domains in the xPlore federation to the read_only state using the CLI. (Do not use native
xPlore to set state.) See Collection and domain state CLIs, page 191.
3. Use your third-party backup software to back up or restore. The white paper Backup and Recovery
of EMC Documentum Content Server using the NetWorker Module for Documentum available on
EMC Online Support (https://support.emc.com) provides information on backup using EMC
Networker Module for Documentum.
4. Resume xDB with the following command:
XHCommand suspend-diskwrites --resume
5. Set all backed up domains to the reset state and then turn on indexing. (This state is not displayed
in xPlore administrator and is used only for the backup and restore utilities.) Use the script in
Collection and domain state CLIs, page 191.
Offline restore
xPlore supports offline restore only. The xPlore server must be shut down to restore a collection,
domain, or xPlore federation. If you are restoring a full backup and an incremental backup, perform
both restore procedures before restarting the xPlore instances.
This procedure assumes that no system changes (new or deleted collections, changed bindings) have
occurred since backup. (Perform a full federation backup every time you make configuration changes
to the xPlore environment.)
If you are restoring a full backup and an incremental backup, restore both before restarting xPlore
instances.
If you are restoring a federation and a collection that was added after the federation backup, do the
following:
1.
2.
3.
For automated (scripted) restore, see Scripted federation restore, page 186, Scripted domain restore,
page 187, or Scripted collection restore, page 188. The following instructions include some
non-scripted steps in xPlore administrator.
1. Shut down all xPlore instances.
2. Federation only: Clean up all existing data files.
Delete everything under xplore_home/data.
Delete everything under xplore_home/config.
3. Detach the domain or collection:
180
1. Collection only : Set the collection state to off_line using xPlore administrator. Choose the
collection and click Configuration.
2. Domain or collection: Detach the domain or collection using xPlore administrator. Note:
Force-detach corrupts the domain or collection.
4. Domain only: Generate the orphaned segment list. Use the CLI purgeOrphanedSegments. See
Orphaned segments CLIs, page 189.
5. Stop all xPlore instances.
6. Run the restore CLI. See Scripted federation restore, page 186, Scripted domain restore, page
187, or Scripted collection restore, page 188.
7. Start all xPlore instances. No further steps are needed for federation restore. Do the following
steps for domain or collection restore.
8. Domain only: If orphaned segments are reported before restore, run the CLI
purgeOrphanedSegments. If an orphaned segment file is not specified, the orphaned segment IDs
are read from stdin.
9. Force-attach the domain or collection using xPlore administrator.
10. Perform a consistency check and test search. Select Data Management in xPlore Administrator
and then choose Check DB Consistency.
11. Run the ACL and group replication script to update any security changes since the backup. See
Manually updating security, page 52.
12. Run ftintegrity. For the start date argument, use the date of the last backup, and for the end date use
the current date. See Using ftintegrity, page 73.
CLI troubleshooting
If a CLI does not execute correctly, check the following:
The output message may describe the source of the error.
Check whether the host and port are set correctly in xplore_home/dsearch/admin/xplore.properties.
Check the CLI syntax.
Linux requires double quotes before the command name and after the entire command and
arguments.
Separate each parameter by a comma.
Do not put Boolean or null arguments in quotation marks. For example: xplore
"backupFederation null,false,null"
Chapter 9
Automated Utilities (CLI)
This chapter contains the following topics:
CLI properties
Property
Description
host
port
password
183
Property
Description
bootstrap
verbose
protocol
The xPlore installer updates the file dsearch-set-env.bat or dsearch-set-env.sh. This file contains the
path to dsearch.war/WEB-INF.
The CLI uses the Groovy script engine.
Examples:
xplore.bat resumeDiskWrites
./xplore.sh resumeDiskWrites
4. Run a CLI command with parameters using the following syntax appropriate for your environment
(Windows or Linux). The command is case-insensitive. Use double quotes around the command,
single quotes around parameters, and a forward slash for all paths:
xplore.bat "<command> [parameters]"
./xplore.sh "<command> [parameters]"
Examples:
xplore.bat "backupFederation c:/xPlore/dsearch/backup, true, null"
./xplore.sh "dropIndex dftxml, folder-list-index"
The command executes, prints a success or error message, and then exits.
5. (Optional) Run CLI commands from a file using the following syntax:
xplore.bat -f <filename>
Examples:
xplore.bat -f file.txt
./xplore.sh -f file.txt
Call the wrapper without a parameter to view a help message that lists all CLIs and their arguments.
To view help for a specific command, enter help followed by the command name.
For example:
xplore help backupFederation
./xplore.sh help backupFederation
For example, the following script file sample.gvy suspends index writes and performs an incremental
backup of the xPlore federation. Use a forward slash for paths.
suspendDiskWrites
folder=c:/folder
isIncremental=true
backupFederation folder, isIncremental, null
println Done
Scripted backup
The default backup location is specified in indexserverconfig.xml as the value of the path attribute
on the element admin-config/backup-location. Specify any path as the value of [backup_path].
is_incremental
Boolean. Set to false for full backup, true for incremental. For
incremental backups, set keep-xdb-transactional-log to true in
xPlore administrator. Choose Home > Global Configuration
> Engine.
Examples:
xplore "backupFederation null, true, null"
xplore "backupFederation c:/xplore/backup, false, null"
Examples:
xplore "backupDomain myDomain, null"
xplore "backupDomain myDomain, c:/xplore/backup "
Examples:
"backupCollection collection(myDomain, default), null"
"backupCollection collection(myDomain, default), c:/xplore/backup "
For example:
xplore "restoreFederation C:/xPlore/dsearch/backup/federation/
2011-03-23-16-02-02 "
2.
3.
For example:
xplore "detachDomain defaultDomain, true"
2. Generate the orphaned segment list using the CLI listOrphanedSegments. If an orphaned segment
file is not specified, the IDs of orphaned segments are sent to stdout. See Orphaned segments
CLIs, page 189.
3. Stop all xPlore instances.
4. Run the restore CLI. If no bootstrap path is specified, the default location in the WEB-INF classes
directory of the xPlore primary instance is used.
"restoreDomain [backup_path], [bootstrap_path] "
For example:
3. Generate the orphaned segment list using the CLI listOrphanedSegments. If an orphaned segment
file is not specified, the IDs of orphaned segments are sent to stdout. See Orphaned segments
CLIs, page 189.
4. Stop all xPlore instances.
5. Run the restore CLI. If no bootstrap path is specified, the default location in the WEB-INF classes
directory of the xPlore primary instance is used.
"restoreCollection [backup_path], [bootstrap_path]"
Examples:
"detachDomain myDomain, true"
or
"detachCollection collection(myDomain,default), true"
2. Attach syntax:
"attachDomain [domain_name], true"
or
"attachCollection collection([domain_name], [collection_name]), true"
Examples:
"attachDomain myDomain, true"
or
"attachCollection collection(myDomain,default), true"
For example:
"listOrphanedSegments domain, backup/myDomain/2009-10,
c:/temp/orphans.lst
C:/xplore/jboss5.1.0/server/DctmServer_PrimaryDsearch/deploy/dsearch.war/
WEB-INF/classes/indexserver-bootstrap.properties "
or
"listOrphanedSegments collection, backup/myDomain/default/2009-10, null,
C:/xplore/jboss5.1.0/server/DctmServer_PrimaryDsearch/deploy/dsearch.war/
WEB-INF/classes/indexserver-bootstrap.properties "
2. If orphaned segments are reported before restore, run the CLI purgeOrphanedSegments. If
[orphan_file_path] is not specified, the segment IDs are read in from stdin. For file path, use
forward slashes. Syntax:
For example:
purgeOrphanedSegments c:/temp/orphans.lst
or
purgeOrphanedSegments null
Arguments:
[backup_path]: Path to your backup file. If not specified, the default backup location
in indexserverconfig.xml is used: the value of the path attribute on the element
admin-config/backup-location. Specify any path as the value of this attribute.
[bootstrap_path]: Path to the bootstrap file in the WEB-INF classes directory of the xPlore primary
instance.
3. Open a command line, change to the directory xplore_home/dsearch/admin, and run the following
command:
xplore.bat -f sample_scripts\removeInConstructionIndexes.groovy
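On Linux, the equivalent invocation would presumably use the shell wrapper shown elsewhere in this chapter:

```
./xplore.sh -f sample_scripts/removeInConstructionIndexes.groovy
```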
Domain state
Set collection_name to null. Valid states: read_only and reset.
Example:
setState domain, myDomain, null, reset
Collection state
Valid states: index_and_search (read/write), update_and_search (read and update existing documents
only), search_only, index_only (write only), and off_line. The update_and_search state changes a flag
so that new documents cannot be added to the collection. Existing documents can be updated. The
state change is not propagated to subcollections.
Example:
setState collection, myDomain, default, off_line
Example:
activateSpareNode node2, spare1
To remove the rebuild index property and clean up the temporary index folder, use the following CLI:
Syntax:
removeInConstructionIndexes [domain], [collection]
For example:
xplore "removeInConstructionIndexes myDocbase, myCollection"
or
./xplore.sh "removeInConstructionIndexes myDocbase, myCollection"
For example:
xplore "isFinalMergeOngoing myDocbase, myCollection"
./xplore.sh "getFinalMergeStatus myDocbase, myCollection"
You can also see merge status using xPlore administrator. A Merging icon is displayed during the
merge progress.
For example:
xplore "startFinalMerge myDocbase, myCollection"
Chapter 10
Search
This chapter contains the following topics:
About searching
Administering search
Troubleshooting search
About searching
Specific attributes of the dm_sysobject type support full-text indexing. Use Documentum Administrator
to make object types and attributes searchable or not searchable and to set allowed search operators
and default search operator.
Set the is_searchable attribute on an object type to allow or prevent searches for objects of that type
and its subtypes. Valid values: 0 (false) and 1 (true). The client application must read this attribute;
the indexing process does not use it. If is_searchable is false for a type or attribute, Webtop does
not display it in the search UI. Default: true.
Set allowed_search_ops to set the allowed search operators and default_search_op to set the default
operator. Valid values for allowed_search_ops and default_search_op:
Value   Operator
2       <>
3       >
4       <
5       >=
6       <=
7       begins with
8       contains
10      ends with
11      in
12      not in
13      between
14      is null
15      is not null
16      not
The default_search_arg attribute sets a default argument for the default operator. The client
must read these attributes; the indexing process does not use them. Webtop displays the allowed
operators and the default operator.
Content Server client applications issue queries through the DFC search service or through DQL. DFC
6.6 and higher translates queries directly to XQuery for xPlore. DQL queries are handled by the
Content Server query plugin, which translates DQL into XQuery unless XQuery generation is turned
off. Not all DQL operators are available through the DFC search service. In some cases, a DQL search
of the Server database returns different results than a DFC/xPlore search. For more information on
DQL and DFC search differences, see DQL, DFC, and DFS queries, page 224.
DFC generates XQuery expressions by default. If XQuery is turned off in DFC, FTDQL queries are
generated. The FTDQL queries are evaluated in the xPlore server. If all or part of the query does not
conform to FTDQL, that portion of the query is converted to DQL and evaluated in the Content Server
database. Results from the XQuery are combined with database results. For more information on
FTDQL and SDC criteria, see the EMC Documentum Content Server DQL Reference.
xPlore search is case-insensitive and ignores white space or other special characters. Special characters
are configurable.
Related topics:
Handling special characters, page 108
Search reports, page 290
Troubleshooting slow queries, page 246
Changing search results security, page 51
Query operators
Operators in XQuery expressions, DFC, and DQL are interpreted in the following ways:
DQL operators: All string attributes are searched with the ftcontains operator in XQuery. All other
attribute types use value operators (= != < >). In DQL, dates are automatically normalized to
UTC representation when translated to XQuery.
DFC: When you use the DFC interface IDfXQuery, your application must specify dates in UTC to
match the format in dftxml.
XQuery operators
The value operators = != < > specify a value comparison search. Search terms are not tokenized.
Can be used for exact match or range searching on dates and IDs.
Any subpath that can be searched with a value operator must have the value-comparison attribute
set to true for the corresponding subpath configuration in indexserverconfig.xml. For example,
an improper configuration of the r_modify_date attribute sets full-text-search to true. A date of
2010-04-01T06:55:29 is tokenized into 5 tokens: 2010 04 01T06 55 29. A search for
04 returns any document modified in April. The user gets many non-relevant results. Therefore,
r_modify_date must have value-comparison set to true. Then the date attribute is indexed as one
token. A search for 04 would not hit all documents modified in April.
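To see why full-text tokenization breaks date values, the split can be sketched in Python (an illustration of the behavior described above, not xPlore code):

```python
import re

# Date value as it appears in dftxml
date_value = "2010-04-01T06:55:29"

# With full-text-search="true", the tokenizer treats "-" and ":" as
# separators, so the date falls apart into small, ambiguous tokens.
tokens = re.split(r"[-:]", date_value)
print(tokens)  # ['2010', '04', '01T06', '55', '29']

# A search for "04" then matches any document modified in April.
print("04" in tokens)  # True
```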
The ftcontains operator (XQFT syntax) specifies that the search term is tokenized before
searching against the index.
If a subpath can be searched by ftcontains, set the full-text-search attribute to true in the
corresponding subpath configuration in indexserverconfig.xml.
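For instance, a sub-path entry that enables value comparison (and disables full-text tokenization) for r_modify_date could look like the following sketch. The attribute set is modeled on the sub-path examples later in this chapter; the type value is an assumption:

```xml
<sub-path value-comparison="true" full-text-search="false"
    type="date" path="dmftmetadata//r_modify_date"/>
```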
Administering search
Common search service tasks
You can configure all search service parameters by choosing Global Configuration from the System
Overview panel in xPlore administrator. You can configure the same search service parameters on
a per-instance basis by choosing Search Service on an instance and then choosing Configuration.
The default values have been optimized for most environments.
Enabling search
Enable or disable search by choosing an instance of the search service in the left pane of the
administrator. Click Disable (or Enable).
Canceling running queries
Open an instance and choose Search Service. Click Operations. All running queries are displayed.
Click a query and delete it.
Viewing search statistics
Choose Search Service and click an instance:
Accumulated number of executed queries
Number of failed queries
name="script"/>
<property value="600" name="timeout-in-secs"/>
</properties>
</warmup>
</performance>
Set the child element warmup status to on or off. You can set the warmup timeout in seconds. If the
warmup hangs, it is canceled after this timeout period.
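Putting the fragment above together, a complete performance/warmup element might look like this sketch (the status attribute placement is an assumption inferred from the surrounding text; the script property value is elided):

```xml
<performance>
  <warmup status="on">
    <properties>
      <property value="..." name="script"/>
      <property value="600" name="timeout-in-secs"/>
    </properties>
  </warmup>
</performance>
```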
Configuring warmup
Configure warmup in query.properties. This file is in xplore_home/dsearch/xhive/admin. Restart all
xPlore instances to enable your changes.
Table 24
Auto-warmup configuration
Key
Description
xplore_qrserver_host
xplore_qrserver_port
xplore_domain
security_eval
user_name
super_user
query_file
query_plan
batch_size
timeout
max_retries
print_result
fetch_result_byte
load_index_before_query
read_from_audit
number_of_unique_users
number_of_queries_per_user
data_path
index_cache_mb
cache_index_components
schedule_warmup
schedule_warmup_period
schedule_warmup_units
initial_delay
query_response_time
exclude_users
Select the queries from audit records that are not run
by these users. Set a comma-separated list of users.
Default: unknown.
IndexServerAnalyzer;
declare option xhive:ignore-empty-fulltext-clauses true;
declare option xhive:index-paths-values
dmftmetadata//owner_name,dmftsecurity/acl_name,
dmftsecurity/acl_domain,/dmftinternal/r_object_id;
for $i score $s in collection(/dm_notes/dsearch/Data) /dmftdoc[( ( (
dmftinternal/i_all_types = 030000018000010d) ) and
( (
dmftversions/iscurrent = true) ) ) and ( (. ftcontains ( (((
augmenting) with stemming)) using stop words ("") ) )) ]
order by $s descending
return <dmrow>{if ($i/dmftinternal/r_object_id)
then $i/dmftinternal/r_object_id
else <r_object_id/>}{if ($i/dmftsecurity/ispublic)
then $i/dmftsecurity/ispublic
else <ispublic/>}{if ($i/dmftinternal/r_object_type)
then $i/dmftinternal/r_object_type
else <r_object_type/>}{if ($i/dmftmetadata/*/owner_name)
then $i/dmftmetadata/*/owner_name else <owner_name/>}{
if ($i/dmftvstamp/i_vstamp)
then $i/dmftvstamp/i_vstamp else <i_vstamp/>}{xhive:highlight(
$i/dmftcontents/dmftcontent/dmftcontentref)}</dmrow>
Warmup logging
All the queries that are replayed for warmup from a file or the audit record are tagged as a
QUERY_WARMUP event in the audit records. The log includes the query to get warmup queries. You
can see this type in the admin report Top N Slowest Queries. To view all warmup queries in the audit
record, run the report Audit records for warmup component in xPlore administrator.
The DFC client application can specify a set of Documentum attributes for sorting results using the
IDfQueryBuilder API. If the query contains an order by attribute, results are returned
based on that attribute and not on the computed score.
These ranking principles are applied in a complicated Lucene algorithm. The Lucene scoring details
are logged when xDB logging is set to DEBUG. See Configuring logging, page 297.
Some of the following settings require reindexing, as noted. Freshness boosting is supported by default.
1. Edit indexserverconfig.xml. For information on viewing and updating this file, see Modifying
indexserverconfig.xml, page 43.
2. Add a boost-value attribute to a sub-path element. The default boost-value is 1.0. A change requires
reindexing. In the following example, a hit in the keywords metadata increases the score for a result:
<sub-path returnable="true" boost-value="2.0" path="
dmftmetadata/keywords"/>
3. By default the Documentum attribute r_modify_date is used to boost scores in results (freshness
boost). You can remove the freshness boost factor, change how much effect it has, or boost a
custom date attribute.
To remove this boost, edit indexserverconfig.xml and set the property enable-freshness-score
to false on the parent category element. This change affects only query results and does not
require reindexing.
<category name="dftxml"><properties>
...
<property name="enable-freshness-score" value="false" />
</properties></category>
Change the freshness boost factor. Changes do not require reindexing. Only documents that are
six years old or less have a freshness factor. The weight for freshness is equal to the weight for the
Lucene relevancy score. Set the value of the property freshness-weight in index-config/properties
to a decimal between 0 (no boost) and 1.0 (override the Lucene relevancy score). For example:
<index-config><properties>
...
<property name="enable-subcollection-ftindex" value="false"/>
<property name="freshness-weight" value="0.75" />
To boost a different date attribute, specify the path to the attribute in dftxml as the value of
a freshness-path property. This change requires reindexing. In the following example, the
r_creation_date attribute is boosted:
<index-config><properties>
...
<property name="enable-subcollection-ftindex" value="false"/>
<property name="freshness-weight" value="0.75" />
<property name="freshness-path" value="dmftmetadata/.*/r_creation_date" />
4. Configure weighting for query term source: original term, alternative lemma, or thesaurus. Does
not require reindexing. By default, they are equally weighted. Edit the following properties in
search-config/properties. The value can range from 0 to 1000.
<property name="query-original-term-weight" value="1.0"/>
<property name="query-alternative-term-weight" value="1.0"/>
<property name="query-thesaurus-term-weight" value="1.0"/>
Note: If your documents containing XML have already been indexed, they must be reindexed to
include parsing with the DTD and entities.
And set the full-text-search attribute value to true in the consolidated sub-path:
<sub-path leading-wildcard="false" compress="false"
    boost-value="1.0" include-descendants="false"
    returning-contents="false" value-comparison="false"
    full-text-search="true" enumerate-repeating-elements="false"
    type="string" path="dmftcontents/dmftcontent//*"/>
Note: If your documents containing XML have already been indexed, they must be reindexed
to include the XML content.
Note: If the content exceeds the CPS max text threshold, XML content is not embedded.
The following illustration shows how XML content is processed depending on your configuration.
The table assumes that the document submitted for indexing does not exceed the size limit in index
agent configuration and the content limit in CPS configuration.
Figure 15
A search for the word staff generates the following simple XQuery:
let $j := for $x in collection(/XMLTest)/dmftdoc
[. ftcontains staff with stemming]
You can use IDfXQuery to generate the following query, which is much more specific and performs
better:
let $j := for $x in collection(/XMLTest)/dmftdoc
[dmftcontents/dmftcontent/dmftcontentref/company/staff
ftcontains John with stemming]
Adding a thesaurus
A thesaurus provides results with terms that are related to the search terms. For example, when a user
searches for car, a thesaurus expands the search to documents containing auto or vehicle. When
you provide a thesaurus, xPlore expands search terms in full-text expressions to similar terms. This
expansion takes place before the query is tokenized. Terms from the query and thesaurus expansion
are highlighted in search results summaries.
A thesaurus can have terms in multiple languages. Linguistic analysis of all the terms that are returned,
regardless of language, is based on the query locale.
Thesaurus support is available for DFC clients. The thesaurus is not used for DFC metadata searches
unless you use a DFC API for an individual query. For DQL queries, the thesaurus is used for both
search document contains (SDC) and metadata searches.
The thesaurus must be in SKOS format, a W3C specification. FAST-based thesaurus dictionaries must
be converted to the SKOS format. Import your thesaurus to the file system on the primary instance host
using xPlore administrator. You can also provide a non-SKOS thesaurus by implementing a custom
class that defines thesaurus expansion behavior. See Custom access to a thesaurus, page 268.
SKOS format
The format starts with a concept (term) that includes a preferred label and a set of alternative labels.
The alternative labels expand the term (the related terms or synonyms). Here is an example of such an
entry in SKOS:
<skos:Concept rdf:about="http://www.my.com/#canals">
<skos:prefLabel>canals</skos:prefLabel>
<skos:altLabel>canal bends</skos:altLabel>
<skos:altLabel>canalized streams</skos:altLabel>
<skos:altLabel>ditch mouths</skos:altLabel>
<skos:altLabel>ditches</skos:altLabel>
<skos:altLabel>drainage canals</skos:altLabel>
<skos:altLabel>drainage ditches</skos:altLabel>
</skos:Concept>
In this example, the main term is canals. When a user searches for canals, documents are returned
that contain words like canals, canal bends, and canalized streams. The SKOS format supports
two-way expansion, but it is not implemented by xPlore; a search on ditch does not return documents
with canals.
A SKOS thesaurus must use the following RDF namespace declarations:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<skos:Concept ...
</skos:Concept>
</rdf:RDF>
Terms from multiple languages can be added like the following example:
<skos:Concept rdf:about="http://www.fao.org/aos/concept#25695">
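The truncated multilingual example above would continue with language-tagged labels. A sketch of the SKOS pattern (the labels and languages shown are illustrative, not from the original example):

```xml
<skos:Concept rdf:about="http://www.fao.org/aos/concept#25695">
  <skos:prefLabel xml:lang="en">maize</skos:prefLabel>
  <skos:prefLabel xml:lang="fr">maïs</skos:prefLabel>
  <skos:altLabel xml:lang="en">corn</skos:altLabel>
</skos:Concept>
```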
The following example enables thesaurus expansion for the metadata object_name. The global
thesaurus setting does not enable metadata search in the thesaurus.
IDfSimpleAttrExpression aSimpleAttrExpr = rootExpressionSet.addSimpleAttrExpression(
"object_name", IDfValue.DF_STRING, IDfSearchOperation.SEARCH_OP_CONTAINS,
false, false, "IIG");
aSimpleAttrExpr.setThesaurusEnabled(true);
To specify a specific thesaurus in a DQL query, use the hint ft_use_thesaurus_library, which
takes a string URI for the thesaurus. The following example overrides the thesaurus setting in
dm_ftengine_config because it adds the ft_thesaurus_search hint. If thesaurus search is enabled, use
only the ft_use_thesaurus_library hint.
select object_name from dm_document search document contains 'test'
enable(ft_thesaurus_search, ft_use_thesaurus_library(
'http://search.emc.com/myDomain/myThesaurus.rdf'))
Argument values that are passed into getTermsFromThesaurus: input terms, relationship,
minValueLevel, maxValueLevel. For example:
<message><![CDATA[calling getTermsFromThesaurus with terms [leaVe],
relationship null, minLevelValue -2147483648, maxLevelValue
2147483647]]></message>
Tokens that are looked up in the thesaurus. The query term leaVe is rendered
case-insensitive:
<message><![CDATA[executing the thesaurus lookup query to get related
terms for [leaVe]]]></message>
...
<![CDATA[Returned token: leave]]>
...
<![CDATA[Total tokens count for reader: 1]]>
Query plan for thesaurus XQuery execution. Provide the query plan to technical support if you are
not able to resolve an issue. For example:<![CDATA[thesaurus lookup execution plan:
query:6:1:Creating query
plan on node /testenv/dsearch/SystemInfo/ThesaurusDB
query:6:1:for expression ...[xhive:metadata(., "uri") = "
http://search.emc.com/testenv/skos.rdf"]/child::
{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF/child::
{http://www.w3.org/2004/02/skos/core#}
Concept[child::{http://www.w3.org/2004/02/skos/core#}prefLabel
[. contains text terms@0]]/child::
{http://www.w3.org/2004/02/skos/core#}altLabel/child::text()
Related terms that are returned from the thesaurus. For example:<![CDATA[related terms
from thesaurus lookup query
[Absence from work, Absenteeism, Annual leave, Employee vacations,
Holidays from work, Leave from work, Leave of absence, Maternity
leave, Sick leave]]]>
You can also inspect the final Lucene query. This query is different from the original query because
it contains the expanded terms (alternate labels) from the thesaurus. In xPlore administrator, open
Services > Logging and expand xhive. Change the log level of com.xhive.index.multipath.query
to DEBUG. The query is in the xDB log as generated Lucene query clauses. xdb.log is in
xplore_home/jboss5.1.0/server/DctmServer_PrimaryDsearch/logs. The tokens are noted as tkn. For
example:
generated Lucene query clauses(before optimization):
+(((<>/dmftmetadata<0>/dm_sysobject<0>/a_is_hidden<0>/ txt:false)^0.0)) +
(((<>/dmftversions<0>/iscurrent<0>/ txt:true)^0.0))
+(<>/ tkn:shme <>/dmftcontents<0>/ tkn:shme <>/dmftcontents<0>/dmftcontent<0>/
tkn:shme <>/dmftfolders<0>/ tkn:shme <>/dmftinternal<0>/
tkn:shme <>/dmftinternal<0>/r_object_id<0>/
tkn:shme <>/dmftinternal<0>/r_object_type<0>/
tkn:shme <>/dmftkey<0>/ tkn:shme <>/dmftmetadata<0>/
tkn:shme <>/dmftsecurity<0>/
tkn:shme <>/dmftsecurity<0>/ispublic<0>/ tkn:shme <>/dmftversions<0>/
tkn:shme <>/dmftvstamp<0>/
tkn:shme) _xhive_stored_payload_:_xhive_stored_payload_
Troubleshooting a thesaurus
Make sure that your thesaurus is in xPlore. You can view thesauri and their properties in the xDB
admin tool. Navigate to the /xhivedb/root-library/<domain>/dsearch/SystemInfo/ThesaurusDB library.
To view the default and URI settings, click the Metadata tab.
Make sure that your thesaurus is used. Compare the specified thesaurus URI in the XQuery to the
URI associated with the dictionary. View the URI in the xDB admin tool or the thesaurus list in
xPlore administrator. Compare this URI to the thesaurus URI used by the XQuery, in dsearch.log.
For example:
for $i score $s in collection(/testenv/dsearch/Data) /dmftdoc[.
ftcontains food products using thesaurus at http://www.emc.com/skos]
order by $s descending return $i/dmftinternal/r_object_id
If the default thesaurus on the file system is used, the log records a query like the following:
for $i score $s in collection(/testenv/dsearch/Data) /dmftdoc[.
ftcontains food products using thesaurus default] order by $s
descending return $i/dmftinternal/r_object_id
You can view thesaurus terms that were added to a query by inspecting the final query. Set
xhive.index.multipath.query = DEBUG in xPlore administrator. Search for generated Lucene query
clauses.
The indexing service stores the content of each object as an XML node in dftxml called dmftcontentref.
If the content exceeds the limit for indexing, only the metadata are indexed. When you examine the
dftxml for a document, the attribute islocalcopy has a value of true if the content is stored in that
element. When the value is false, only the metadata has been stored.
For all documents in which an indexed term has been found, xPlore retrieves the content node and
computes a summary. The summary is a phrase of text from the original indexed document that
contains the searched word. Search terms are highlighted in the summary.
Dynamic summaries have a performance impact. Unselective queries can require massive processing
to produce summaries. After the summary is computed, the summary is reprocessed for highlighting,
causing a second performance impact. You can disable dynamic summaries for better performance.
All of the following must be true in order to have dynamic summaries. If any condition is false, a
static summary is generated.
query-enable-dynamic-summary must be set to true.
The result must be within the first X rows defined by the max-dynamic-summary-threshold parameter.
The size of the extracted text must be less than the value of the extract-text-size-less-than attribute.
The query term must appear within the first X characters defined by the token-size attribute.
If security_mode is set to BROWSE, the user must have at least READ permission.
Static summaries are computed when the summary conditions do not match the conditions configured
for dynamic summaries. Static summaries are much faster to compute but less specific than dynamic
summaries.
1. Configure general summary characteristics.
a. If you want to turn off dynamic summaries in xPlore administrator, choose Services > Search
Service and click Configuration. Set query-enable-dynamic-summary to false.
The first n characters of the document are displayed, where n is the value of the parameter
query-summary-display-length.
b. To configure number of characters displayed in the summary, choose Services > Search
Service in xPlore administrator. Set query-summary-display-length (default: 256 characters
around the search terms). If no search term is found, a static summary of the specified length
from the beginning of the text is displayed, and no terms are highlighted.
c. To configure the size of a summary fragment, edit indexserverconfig.xml. Search
terms can be found in many places in a document. Add a property for fragment size to
search-config/properties (default 64). The following example changes the default to 32,
allowing up to 8 fragments for a display length of 256:
<property value="32" name="query-summary-fragment-size"/>
b. Configure the number of characters at the beginning of the document in which the query
term must appear. If the query term is not found in this snippet, a static summary is
returned and term hits are not highlighted. Set the value of the token-size attribute on the
category-definitions/category/do-text-extraction/save-tokens-for-summary-processing element.
The default value is 65536 (64K). A value of -1 indicates no maximum content size, but this
value negatively impacts performance. For faster summary calculation, set this value lower.
c. Configure the maximum number of results that have a dynamic summary. Dynamic
summaries require much more computation time than static summaries. Set the value of
max-dynamic-summary-threshold (default: 50). Additional results have a static summary.
If most users do not go beyond the first page of results, set this value to the page size, for
example, 10, for faster performance.
With native xPlore security and the security_mode property of the dm_ftengine_config object set to
BROWSE, the user must have at least READ permission to see dynamic summaries.
save,c,l
3. If the fuzzy_search_enable parameter does not exist, use iAPI, DQL, or DFC to modify the
dm_ftengine_config object. To add a parameter using iAPI in Documentum Administrator, use
append like the following:
retrieve,c,dm_ftengine_config
append,c,l,param_name
fuzzy_search_enable
append,c,l,param_value
true
save,c,l
4. To change the allowed similarity between a word and similar words, set the parameter
default_fuzzy_search_similarity in dm_ftengine_config. This default also applies to custom fuzzy
queries in DFC and DFS for full-text and properties. Set a value between 0 (terms can differ
by more than one letter) and 1. Default: 0.5.
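Following the append pattern shown in step 3, the similarity parameter could be added like this (the value 0.7 is illustrative):

```
retrieve,c,dm_ftengine_config
append,c,l,param_name
default_fuzzy_search_similarity
append,c,l,param_value
0.7
save,c,l
```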
To verify that your fuzzy search setting has been applied, view the query in dsearch.log. You
should see the following argument in the query with a similarity value that you have set:
using option xhive:fuzzy "similarity=xyz"
5. Edit the xdb.properties file located in the directory WEB-INF/classes of the primary instance.
6. Set the xdb.lucene.fuzzyQueryPrefixLength property to the number of leading characters that are
exempt from fuzzy matching; these characters must match exactly. Default: 1.
For example, when you set the prefix value to 0, searching for explore returns xplore, but this has a
large impact on performance. Set it to 0 only if matching terms that differ in the first character is
critical to your business. Setting the prefix to a high value improves performance, but similar terms
can be omitted and you lose the benefit of the feature.
7. Set the xdb.lucene.fuzzyTermsExpandedNumber property to the maximum number of similar terms
used in the query. The most similar terms are used. A smaller value improves query response
time. Default: 10.
8. Make the same changes in the xdb.properties file for all instances.
Fuzzy search in DFC and DFS
You can enable fuzzy search on individual queries in DFC or DFS. Set fuzzy search in individual
full-text and property queries with APIs on IDfFulltextExpression and IDfSimpleAttrExpression. Use
the operators CONTAINS, DOES_NOT_CONTAIN, and EQUALS for String object types:
setFuzzySearchEnabled(Boolean fuzzySearchEnabled)
setFuzzySearchSimilarity(Float similarity): Sets a similarity value between 0 and 1. Overrides the
value of the parameter default_fuzzy_search_similarity in dm_ftengine_config.
To disable fuzzy search, set the property fuzzy_search_enable in the dm_ftengine_config object to false.
[Table: compatible query and index data type pairs (string, integer, double, date, dateTime, time,
float, long) with the default xdb.lucene.strictIndexTypeCheck setting]
When xdb.lucene.strictIndexTypeCheck is True, a stricter index type checking rule is enforced. This
may lower query performance if you did not specify index data types when creating subpath
definitions.
The following table shows compatible query and index data type pairs (indicated as Yes) when
xdb.lucene.strictIndexTypeCheck is set to True.
[Table: compatible query and index data type pairs (string, integer, double, date, dateTime, time,
float, long) when xdb.lucene.strictIndexTypeCheck is set to True]
For example, the following elements in a Content Server document are declared as type integer and
dateTime respectively:
<owner_permit dmfttype="dmint">7</owner_permit>
<r_creation_date dmfttype="dmdate">2010-09-27T22:54:48</r_creation_date>
However, without an explicit subpath definition in indexserverconfig.xml, both element values are
indexed as string values in multi-path indexes.
Performing the following queries returns different results depending on how you set the
xdb.lucene.strictIndexTypeCheck value:
/dmftdoc[dmftmetadata//owner_permit = xs:integer(7)]
/dmftdoc[dmftmetadata//r_creation_date = xs:dateTime("2010-09-27T22:54:48")]
If xdb.lucene.strictIndexTypeCheck = True, the elements are not returned, because the query data
types integer and dateTime do not match the index data type string.
If xdb.lucene.strictIndexTypeCheck = False, the element will be returned since the query data types
integer and dateTime are considered compatible with the index data type string.
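To make these value comparisons type-safe, you could declare explicit sub-path types for the two attributes in indexserverconfig.xml. A sketch, modeled on the sub-path example earlier in this chapter; the exact type names for integer and date attributes are assumptions:

```xml
<sub-path value-comparison="true" full-text-search="false"
    type="integer" path="dmftmetadata//owner_permit"/>
<sub-path value-comparison="true" full-text-search="false"
    type="date" path="dmftmetadata//r_creation_date"/>
```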
The following DQL wildcards are supported. All other characters are treated as literals.
search document contains (SDC) wildcards: * and ?.
wildcards in a where clause: % and _.
In a DQL phrase search, fragments are matched. For example, dogs*cats in a phrase matches
dogslovecats. In addition, DQL queries that contain the DQL hint FT_CONTAIN_FRAGMENT in the
where clause match fragments instead of whole words.
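The fragment-matching behavior can be illustrated with simple glob matching in Python (an analogy only; xPlore does not use fnmatch):

```python
import fnmatch

# In a DQL phrase search, the pattern dogs*cats matches fragments,
# so a token like "dogslovecats" is a hit.
print(fnmatch.fnmatchcase("dogslovecats", "dogs*cats"))  # True

# Without the wildcard, the pattern only matches the literal word.
print(fnmatch.fnmatchcase("dogslovecats", "dogscats"))  # False
```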
Note: If both leading and trailing wildcards appear in a DQL metadata condition, the wildcards are
dropped even when fast_wildcard_compatible is true. This behavior is the same as FAST indexing. To
override, use the FT_CONTAIN_FRAGMENT hint.
Limitations of wildcards
A wildcard cannot be preceded by white space. For example, a search for word_* is treated as word
* and cannot be resolved. If the special character must be searchable, remove it from the special
characters list so that it is not treated as white space.
If you have configured xPlore to match phrases exactly in queries, using a period followed by a plus
sign (.+) as leading wildcard in XQueries may return inaccurate results.
2. Use the object ID to get the dm_ftengine_config parameters and values. In the following example,
the value of r_object_id that was returned in step 1 is used to get the parameters.
?,c,select param_name, param_value from dm_ftengine_config
where r_object_id='080a0d6880000d0d'
3. If the wildcards configuration parameters are not returned, configure them. Append a param_name
and param_value element and set its value. For example:
retrieve,c,dm_ftengine_config
append,c,l,param_name
ft_wildcards_mode
append,c,l,param_value
explicit
save,c,l
4. To change an existing parameter, locate the position of the param_name attribute value of the
parameter. Use set as follows:
retrieve,c,dm_ftengine_config
dump,c,l //locates the position
set,c,l,param_value[i] //position of ft_wildcards_mode
implicit
save,c,l
2. Adjust the metadata_endswith_wildcard_mode parameter. For example, you can set it to implicit.
3. To apply the changes, rebuild the index.
Use this setting to support wildcards in DQL. Query performance is negatively affected when
fast_wildcard_compatible is set to true.
To check your current dm_ftengine_config settings, use iAPI, DQL, or DFC. To view existing
parameters using iAPI in Documentum Administrator:
First get the object ID:
retrieve,c,dm_ftengine_config
... <dm_ftengine_config_object_id>
set,c,l,param_value
false
save,c,l
Set the folder_cache_limit in the dm_ftengine_config object to the expected maximum number of
folders in the query (default = 2000). If the folder descend condition evaluates to less than the
folder_cache_limit value, then folder IDs are pushed into the index probe, making the query much
faster. If the condition exceeds the folder_cache_limit value, the folder constraint is evaluated
separately for each result.
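As a sketch, the threshold behavior described above amounts to a simple comparison (illustrative Java; the class and method names are invented for clarity and are not Content Server internals):

```java
// Illustrative model of the folder_cache_limit threshold described above.
// Not Content Server source code; names are invented for clarity.
public class FolderCacheDecision {
    public static final int DEFAULT_FOLDER_CACHE_LIMIT = 2000;

    // true  -> folder IDs are pushed into the index probe (fast path)
    // false -> the folder constraint is evaluated separately per result
    public static boolean pushIntoIndexProbe(int descendantFolderCount, int folderCacheLimit) {
        return descendantFolderCount < folderCacheLimit;
    }

    public static void main(String[] args) {
        System.out.println(pushIntoIndexProbe(150, DEFAULT_FOLDER_CACHE_LIMIT));  // fast path
        System.out.println(pushIntoIndexProbe(5000, DEFAULT_FOLDER_CACHE_LIMIT)); // per-result filtering
    }
}
```

Raising folder_cache_limit therefore trades memory in the cache against per-result evaluation of the folder constraint.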
Table 27
DQL
No VQL equivalent
No facets
No hit count
Hit count
Fragment search
No fuzzy search
Sequential queries
No paging
Paging of results
Stemming
Feature
Content Server
DFC on client
6.5 SP2, SP3. (6.6 and higher if
XQuery generation is turned off.)
6.6, 6.7.x unless XQuery
generation is turned off.
Facets
No dependency
6.6 or higher
Thesaurus
Fuzzy search
6.7.x
Query subscription
6.7 SP1
6.7 SP1
6.6 or higher
Wildcards in metadata
6.6 or higher
Debugging enhancement
6.7.x
DQL Processing
The DFC and DFS search services by default generate XQuery expressions, not DQL, for xPlore.
DQL hints in a hints file are not applied. You can turn off XQuery generation in dfc.properties so
that DQL is generated and hints are applied. Do not turn off XQuery generation if you want xPlore
capabilities like facets.
If query constraints conform to FTDQL, the query is evaluated in the full-text index. If all or part of
the query does not conform to FTDQL, only the SDC portion is evaluated in the full-text index. All
metadata constraints are evaluated in the Content Server database, and the results are combined.
The following configurations turn off XQuery and render a query in DQL:
dfc.search.xquery.generation.enable = false in dfc.properties
ftsearch_security_mode is 0. See Changing search results security, page 51.
acl_check_db is true. See Changing search results security, page 51.
Unsupported DQL
xPlore does not support the DQL SEARCH TOPIC clause or pass-through DQL.
DQL
xPlore
RETURN TOP N
FT_CONTAIN_FRAGMENT
ENABLE(dm_fulltext(qtf_lemmatize=0|1))
FT_COLLECTION
TRY_FTDQL_FIRST, NOFTDQL
FTDQL
No equivalent
cs
Traces Content Server search operations such as initializing full-text in-memory objects and the
options used in a query.
ftplugin
Traces the query plugin front-end operations such as DQL translation to XQuery, calls to the back
end, and fetching of each result.
ftengine
Traces back-end operations: HTTP transactions between the query plugin and xPlore, the request
stream sent to xPlore, the result stream returned from xPlore, and the query execution plan.
none
Turns off tracing.
Overview
Query subscriptions are a feature with which a user can:
Specify to automatically run a particular saved search (full-text or metadata-only) at specified
intervals (once an hour, day, week, or month) and return any new results.
The results can be discarded or saved. If the results are saved, they can be merged with or replace
the previous results.
Unsubscribe from a query.
Retrieve a list of their query subscriptions.
Be notified of the results via a dmi_queue_item in the subscribing user's Inbox and, optionally, an
email.
Execute a workflow, for example, a business process defined in xCP.
Query subscriptions run in Content Server 6.7 SP1 or higher with DFC 6.7 SP1 or higher. Support for
query subscriptions is installed with the Content Server. A DFC client like Webtop or CenterStage
must be customized using DFC 6.7 SP1 or higher to present query subscriptions to the user.
Because automatically running queries at specified intervals can negatively affect xPlore performance,
tune and monitor query subscription performance.
Figure 16
If new results are found, then the new results are returned and one of the following occurs:
(Default) A dmi_queue_item is created and, optionally, an email is sent to the subscribing
user.
A custom workflow is executed. If the workflow fails, then a dmi_queue_item describing
the failure is created.
Note: You must create this workflow.
b. If no new results are found, then the next matching query subscription is executed.
3. Depending on the result_strategy attribute value of the dm_ftquery_subscription object, the new
results:
Replace the current results in the dm_ftquery_subscription object.
Merge with the current results in the dm_ftquery_subscription object.
Are discarded.
Note: The number of results returned per query as well as the total number of results saved are
set in the dm_ftquery_subscription object max_results attribute.
4. The next matching query subscription is executed.
5. After all matching query subscriptions have been executed, the job stops and a job report is saved.
Note: If the stop_before_timeout value (the default is 60 seconds) is reached, then the job is
stopped and any remaining query subscriptions are executed when the job runs next time.
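The job behavior in the steps above can be modeled as a sort-then-execute loop with a time budget (an illustrative sketch; the QbsJobSketch class and its fields are invented, and the shipped job method operates on repository objects rather than in-memory lists):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Rough model of one dm_FTQBS_* job run: subscriptions are executed in
// ascending last_exec_date order, and the job stops gracefully once the
// remaining time budget (stop_before_timeout) would be exceeded.
public class QbsJobSketch {
    public static class Subscription {
        public final String id;
        public final long lastExecDate; // epoch seconds
        public Subscription(String id, long lastExecDate) {
            this.id = id;
            this.lastExecDate = lastExecDate;
        }
    }

    // Returns the IDs executed this run; skipped ones wait for the next run.
    public static List<String> run(List<Subscription> subs,
                                   int budgetSeconds, int costPerQuerySeconds) {
        List<Subscription> ordered = new ArrayList<>(subs);
        ordered.sort(Comparator.comparingLong((Subscription s) -> s.lastExecDate));
        List<String> executed = new ArrayList<>();
        int elapsed = 0;
        for (Subscription s : ordered) {
            if (elapsed + costPerQuerySeconds > budgetSeconds) {
                break; // graceful stop before timeout
            }
            executed.add(s.id);
            elapsed += costPerQuerySeconds;
        }
        return executed;
    }

    public static void main(String[] args) {
        List<Subscription> subs = Arrays.asList(
            new Subscription("b", 200), new Subscription("a", 100),
            new Subscription("c", 300));
        System.out.println(run(subs, 20, 10)); // oldest two fit in the budget
    }
}
```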
set ECLIPSE="C:\Documentum\product\6.6\install\composer\ComposerHeadless"
b. Specify the path to the file DarInstall.xml in a temporary working directory (excluding the file
name) as the value of BUILDFILE. For example:
set BUILDFILE="C:\DarInstall\temp"
c. Specify a workspace directory for the generated Composer files. For example:
set WORKSPACE="C:\DarInstall\work"
4. Launch DarInstall.bat (Windows) or DarInstall.sh (Linux) to install the query subscription SBO.
On Windows 2008, run the script as administrator.
Subscription reports
When you support query subscriptions, monitor the usage and query characteristics of the users with
subscription reports. If there are many frequent or poorly performing subscriptions, increase capacity.
Subscription logging
Subscribed queries are logged in dsearch.log with the event name QUERY_AUTO. The following
information is logged:
<event name="QUERY_AUTO" component="search" timestamp="2011-08-23T14:45:09-0700">
..
<application_context>
<query_type>QUERY_AUTO</query_type>
<app_name>QBS</app_name>
<app_data>
<attr name="subscriptionID" value="0800020080009561"/>
<attr name="frequency" value="DAILY"/>
<attr name="range" value="1015"/>
<attr name="jobintervalinseconds" value="86400"/>
</app_data>
</application_context>
</event>
Key:
subscriptionID is set by the QBS application
frequency is the subscription frequency as set by the client. Values: HOURLY, DAILY, WEEKLY,
MONTHLY.
range reports time elapsed since last query execution. For example, if the job runs hourly but the
frequency was set to 20 minutes, the range is between 0 and 40 minutes (2400 seconds). Not
recorded if the frequency is greater than one day.
jobintervalinseconds is how often the subscription is set to run, in seconds. For example, a value
86400 indicates a setting of one day in the client. Not recorded if the frequency is greater than
one day.
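Because the audit event is plain XML, the app_data attributes can be extracted with standard JDK parsing when post-processing dsearch.log. For example (an illustrative helper class, not part of xPlore):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Pulls the attr name/value pairs out of a QUERY_AUTO app_data fragment.
// Illustrative post-processing helper; not part of xPlore itself.
public class QueryAutoEvent {
    public static Map<String, String> appData(String fragmentXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(fragmentXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList attrs = doc.getElementsByTagName("attr");
            for (int i = 0; i < attrs.getLength(); i++) {
                Element e = (Element) attrs.item(i);
                result.put(e.getAttribute("name"), e.getAttribute("value"));
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException("cannot parse audit fragment", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<app_data>"
            + "<attr name=\"subscriptionID\" value=\"0800020080009561\"/>"
            + "<attr name=\"frequency\" value=\"DAILY\"/>"
            + "</app_data>";
        System.out.println(appData(xml)); // {subscriptionID=0800020080009561, frequency=DAILY}
    }
}
```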
dm_ftquery_subscription
Represents a subscribed query.
Description
Supertype: SysObject
Subtypes: None
Internal name: dm_ftquery_subscription
Object type tag: 08
A dm_ftquery_subscription object represents subscription-specific information but not the saved query
itself, which is contained in a dm_smart_list object.
Properties
The table describes the object properties.
Table 30. dm_ftquery_subscription properties

Property           Datatype    Single or repeating    Description
frequency          CHAR(32)
last_exec_date     TIME
subscriber_name    CHAR(32)
zone_value         INTEGER
result_strategy    INTEGER
workflow_id        ID
dm_qbs_relation object
A dm_relation_type object that relates the subscription (dm_ftquery_subscription object) to the original
dm_smart_list object.
Table 31. dm_qbs_relation properties

Property          Value                      Notes
object_name       dm_qbs_relation
parent_type       dm_smart_list              None.
child_type        dm_ftquery_subscription    None.
security_type     CHILD                      None.
direction_kind
integrity_kind                               If a dm_smart_list object is deleted, then the
                                             corresponding dm_relation and
                                             dm_ftquery_subscription objects are deleted.
Overview
Each of these jobs executes all query subscriptions that are specified to execute at the corresponding
interval:
Job Name
Description
dm_FTQBS_HOURLY
dm_FTQBS_DAILY
dm_FTQBS_WEEKLY
dm_FTQBS_MONTHLY
Each job executes its query subscriptions in ascending order based on each subscription's last_exec_date
property value. If a query subscription is not executed, it is executed when the job runs next.
Note: A job is stopped gracefully just before it is timed out.
Method arguments

Argument              Description
-frequency
-stop_before_timeout
-zone_value
-search_timeout
-max_result
Reports
Job reports are stored in:
$DOCUMENTUM\dba\log\sessionID\sysadmin
Job Name            Report File
dm_FTQBS_HOURLY     FTQBS_HOURLYDoc.txt
dm_FTQBS_DAILY      FTQBS_DAILYDoc.txt
dm_FTQBS_WEEKLY     FTQBS_WEEKLYDoc.txt
dm_FTQBS_MONTHLY    FTQBS_MONTHLYDoc.txt
Custom jobs
The job method -zone_value parameter is meant for partitioning the execution of query subscriptions
among multiple custom jobs that run on the same interval. A custom job executes every
dm_ftquery_subscription that has the same zone_value and frequency attribute values as the custom
job. You must specify a -zone_value value for every custom job that runs on the same interval, and that
value must be unique among all those custom jobs. If a job does not specify a -zone_value value, it
executes all subscriptions on the same interval regardless of each subscription's zone_value value.
Note: None of your custom jobs should run on the same interval as any of the pre-installed jobs, because
the pre-installed jobs do not have a -zone_value specified and will execute all subscriptions on the
same interval regardless of their zone_value value.
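The partitioning rule can be sketched as a single predicate (illustrative Java; the names are invented, and the real logic lives in the job method):

```java
// Illustrative model of zone partitioning: a job picks up a subscription
// when the frequencies match and either the job declares no -zone_value
// (it runs everything on its interval) or the zone values are equal.
public class ZonePartition {
    public static boolean jobExecutes(String jobFrequency, Integer jobZone,
                                      String subFrequency, int subZone) {
        if (!jobFrequency.equalsIgnoreCase(subFrequency)) {
            return false; // different interval: not this job's subscription
        }
        // null models a job started without the -zone_value argument
        return jobZone == null || jobZone == subZone;
    }
}
```

This is why a custom job without -zone_value conflicts with the pre-installed jobs: with a null zone it matches every subscription on its interval.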
Requirements
Activities:
One starting activity is required.
Only one starting activity can be specified.
The starting activity's name must be: QBS-Activity-1
Packages:
One package is required.
Only one package can be specified.
The package name must be: QBS-Package0
The package type must be dm_ftquery_subscription.
The subscription ID must be passed as the package.
IQuerySubscriptionSBO
Provides the functionality to subscribe to, unsubscribe from, and list query subscriptions.
Interface name
com.documentum.server.impl.fulltext.qbs.IQuerySubscriptionSBO
Imports
import com.documentum.server.impl.fulltext.qbs.IQuerySubscriptionSBO;
import com.documentum.server.impl.fulltext.qbs.QuerySubscriptionInfo;
import com.documentum.server.impl.fulltext.qbs.impl.QuerySubscriptionException;
DAR
QBS.dar
Methods
public IDfId subscribe(String docbaseName, IDfId smartListID,
String subscriber, String frequency, IDfId workFlowID, int
zoneValue, IDfTime lastExecDate, int resultStrategy) throws
DfException, QuerySubscriptionException
Validates the dm_smart_list object ID and subscriber name in the specified repository; validates
the frequency value against all query subscription jobs with the job method argument -frequency.
Creates dm_ftquery_subscription and dm_relation objects. The object ID of the
dm_ftquery_subscription object is returned.
The workflow template ID can be set to null, if not applicable.
For zone_value, specify -1, if not applicable.
For lastExecDate, specify DfTime.DF_NULLDATE, if not applicable.
For resultStrategy: Integer that indicates whether existing results that are saved in the dm_smart_list
are replaced with the new results (0, the default), merged with the new results (1), or the new results
are discarded (2). Specify -1, if not applicable.
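The resultStrategy codes amount to a replace/merge/discard choice, which can be sketched as follows (illustrative only; the SBO applies this server-side to results saved with the dm_smart_list, and the real merge semantics may differ in detail):

```java
import java.util.ArrayList;
import java.util.List;

// Models the resultStrategy codes: 0 = replace, 1 = merge, 2 = discard.
public class ResultStrategy {
    public static List<String> apply(int strategy, List<String> saved, List<String> fresh) {
        switch (strategy) {
            case 0: // replace the saved results with the new ones
                return new ArrayList<>(fresh);
            case 1: // merge: keep saved results, append unseen new ones
                List<String> merged = new ArrayList<>(saved);
                for (String r : fresh) {
                    if (!merged.contains(r)) {
                        merged.add(r);
                    }
                }
                return merged;
            case 2: // discard the new results
                return new ArrayList<>(saved);
            default:
                throw new IllegalArgumentException("unknown strategy: " + strategy);
        }
    }
}
```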
public IDfId subscribe(String docbaseName, IDfId smartListID, String
subscriber, String frequency, IDfId workFlowID, int zoneValue,
IDfTime lastExecDate, int resultStrategy, String subTypeName, Map
customAttrAndValue) throws DfException, QuerySubscriptionException
You can create a subtype of dm_ftquery_subscription that has custom attributes. It enables you to
display additional information related to the subscriptions in your application.
Creates a subscription with a subtype of dm_ftquery_subscription and its relation object based
on the passed-in parameters.
The method parameters are similar to the ones of the previous method with two additional
parameters: subTypeName and customAttrAndValue.
For subTypeName, specify the type name which is a subtype of dm_ftquery_subscription.
For customAttrAndValue, specify a map with attribute name and attribute value as key-value pair.
For single-value attributes, indicate the value in its original datatype in value. For repeating
attributes, indicate a List of values.
public boolean unsubscribe (String docbaseName, IDfId smartListID, String
subscriber) throws DfException,QuerySubscriptionException
Unsubscribe service destroys the dm_relation and dm_ftquery_subscription objects that are
associated with the specified dm_smart_list and subscriber.
public List getSubscribedSmartList(String docbaseName, String
subscriber) throws DfException
Returns information for a subscription based on the dm_smart_list object ID and subscriber name.
The information includes: the dm_smart_list object ID and name, the subscription ID, the frequency,
the workflow ID, the last execution date, and the zone value.
IQuerySubscriptionTBO
Manages basic query subscription execution.
Interface name
com.documentum.server.impl.fulltext.qbs.IQuerySubscriptionTBO
Imports
import com.documentum.server.impl.fulltext.qbs.IQuerySubscriptionTBO;
import com.documentum.server.impl.fulltext.qbs.results.DfResultsSetSAXDeserializer;
DAR
QBS.DAR
Methods
public void setSmartListId(IDfId smartListId)
Sets the dm_smart_list object ID associated with the dm_ftquery_subscription object. This method
must be called before calling runRangeQuery().
public IDfResultsSet runRangeQuery(String docbaseName, IDfTime from)
throws DfException, IOException, InterruptedException
Executes a query saved in a dm_smart_list object from the specified date/time in the from parameter.
If from is not a nulldate, a range is added to the search query with a condition like "r_modify_date
>= from". If from is a nulldate, then no range condition is added to the search query.
public void setResults(IDfResultsSet results)
Saves the results to dm_ftquery_subscription.
public IDfResultsSet getResults() throws DfException
Gets the results that are saved in dm_ftquery_subscription.
public void setSearchTimeOut(long timeout)
Sets the number of milliseconds that the search runs before it times out.
public long getSearchTimeOut()
Gets the number of milliseconds that the search runs before it times out.
public void setMaxResult(int max)
Sets the maximum number of query results that can be returned as well as the maximum number
that can be saved in the subscription object.
public int getMaxResult()
Gets the maximum number of query results that can be returned as well as the maximum number
that can be saved in the subscription object.
public void setResultStrategy(int resultStrategy)
An integer that indicates whether existing results that are saved in the dm_smart_list are replaced
with the new results (0, the default), merged with the new results (1), or the new results are
discarded (2).
Note: doSave() updates the last_exec_date of the subscription based on this value.
Notes
Extending this TBO is not supported.
QuerySubscriptionAdminTool
Class name
com.documentum.server.impl.fulltext.qbs.admin.QuerySubscriptionAdminTool
Usage
You use com.documentum.server.impl.fulltext.qbs.admin.QuerySubscriptionAdminTool to subscribe
to, unsubscribe from, and list query subscriptions from the command line, as shown in the
following examples.
Required JARs
qbs.jar
qbsAdmin.jar
dfc.jar
log4j.jar
commons-lang-2.4.jar
aspectjrt.jar
-subscribe example
C:\Temp\qbsadmin>"%JAVA_HOME%\bin\java"
-classpath "C:\Documentum\config;.\lib\qbs.jar;.\lib\qbsAdmin.jar;
.\lib\dfc.jar;.\lib\log4j.jar;.\lib\commons-lang-2.4.jar;
.\lib\aspectjrt.jar"
com.documentum.server.impl.fulltext.qbs.admin.QuerySubscriptionAdminTool
-subscribe D65SP2M6DSS user1 password password1 080000f28002ef2c daily
-subscribe output
subscribed 080000f28002ef2c for user user1 succeeded
with subscription id 080000f28002f115
-unsubscribe example
C:\Temp\qbsadmin>"%JAVA_HOME%\bin\java" -classpath
"C:\Documentum\config;.\lib\qbs.jar;.\lib\qbsAdmin.jar;
.\lib\dfc.jar;.\lib\log4j.jar;.\lib\commons-lang-2.4.jar;
.\lib\aspectjrt.jar"
com.documentum.server.impl.fulltext.qbs.admin.QuerySubscriptionAdminTool
-unsubscribe D65SP2M6DSS user1 password password1 080000f28002ef2c
-unsubscribe output
User user1 has no subscriptions on dm_smart_list object
(080000f28002ef2c)
-listsubscription example
C:\Temp\qbsadmin>"%JAVA_HOME%\bin\java" -classpath
"C:\Documentum\config;.\lib\qbs.jar;.\lib\qbsAdmin.jar;
.\lib\dfc.jar;.\lib\log4j.jar;.\lib\commons-lang-2.4.jar;
.\lib\aspectjrt.jar"
com.documentum.server.impl.fulltext.qbs.admin.QuerySubscriptionAdminTool
-listsubscription D65SP2M6DSS user1 password password1
-listsubscription output
Subscriptions for user1 are:
smartList: 080000f28002ef2c frequency: DAILY workFlowID: 0000000000000000
smartList: 080000f28002ef2f frequency: 5 MINUTES workFlowID: 0000000000000000
Troubleshooting search
When you set the search service log level to WARN, queries are logged. Auditing queries, page 244
describes how to view or customize reports on queries.
Clear the JBoss tmp and work directories for the index agent application, and restart the Index Agent.
With this change, the index agent saves the dftxml in the data directory.
Auditing queries
Auditing is enabled by default. Audit records are purged on a configurable schedule (default: 30 days).
To enable or disable query auditing, open System Overview in the xPlore administrator left pane.
Click Global Configuration and choose the Auditing tab. Click search to enable query auditing.
For information on configuring the audit record, see Configuring the audit record, page 39.
Audit records are saved in an xDB collection named AuditDB. You can view or create reports on the
audit record. Query auditing provides the following information:
The XQuery expression in a CDATA element.
The user name and whether the user is a superuser.
The application context, in an application_context element. The application context is supplied by
the search client application.
The query options in name/value pairs set by the client application, in the QUERY_OPTION
element.
The instance that processed the query, in the NODE_NAME element.
The xDB library in which the query was executed, in the LIBRARY_PATH element.
The number of hits, in the TOTAL_HITS element.
The number of items returned, in the FETCH_COUNT element.
The amount of time in msec to execute the query, in the EXEC_TIME element.
The time in msec elapsed to fetch results, in the FETCH_TIME element.
The following security events are recorded for user-generated queries. The audit record reports how
many times these caches were hit for a query. For details on configuring the caches, see Configuring
the security cache, page 54.
How many times the group-in cache was probed for a query, in the GROUP_IN_CACHE_HIT element.
How many times the group-out cache was probed for a query, in the GROUP_OUT_CACHE_HIT
element.
How many times the query added a group to the group-in cache, in the GROUP_IN_CACHE_FILL
element.
How many times the query added a group to the group-out cache, in the GROUP_OUT_CACHE_FILL
element.
How many hits a query had before security filtering, in the TOTAL_INPUT_HITS_TO_FILTER
element.
The number of hits filtered out by security because the user did not have sufficient permission, in the
HITS_FILTERED_OUT element.
The username contains an illegal character for the xPlore host code page.
The wrong query plugin is in use. See Query plugin configuration (dm_ftengine_config), page 222.
The Content Server query plugin properties of the dm_ftengine_config object are set during xPlore
configuration. If you have changed one of the properties, like the primary xPlore host, the plugin can
fail. Verify the plugin properties, especially the qrserverhost, with the following DQL:
1> select param_name, param_value from dm_ftengine_config
2> go
and DFC Search service queries always use the index unless there is a NOFTDQL hint. Some
IDfXQuery-based queries might not use the index.
To detect this issue with query auditing, find the query using the TopNSlowestQueries report (with
user name and day). Click the query ID to get the query text in XML format. Obtain the query plan to
determine which indexes were probed, if any. (Provide the query plan to EMC technical support for
evaluation.) Rewrite the query to use the index.
Test the query in the xDB admin tool
Test the query in xDB admin and check whether "Using query plan: Index(dmftdoc)/child::dmftkey"
appears in the query debug output. If it does not, the query is NOFTDQL (evaluated in the database).
To detect queries that do not use the index (NOFTDQL queries), turn on full-text tracing in the
Content Server:
API>apply,c,NULL,MODIFY_TRACE,SUBSYSTEM,S,fulltext,VALUE,S,all
Look for temp table creation and inserts like the following:
Thu Feb 09 15:50:12 2012 790000: 6820[7756] 0100019f80023909
process_ftquery_to_temp --- will populate temp table in batch size 20000
Thu Feb 09 15:50:12 2012 790000: 6820[7756] 0100019f80023909
build_fulltext_temp --- begin: create the fulltext temporary table.
Thu Feb 09 15:50:13 2012 227000: 6820[7756] 0100019f80023909
BuildTempTbl --- temporary table dmft80023909004 was created successfully.
Thu Feb 09 15:50:13 2012 430000: 6820[7756] 0100019f80023909
Inserting row at index 0 into the table
This property value is false by default. When set to true, blacklist caches are refreshed during non-final
merges.
Slow warmup
Recent queries are run at startup to warm up the system. Some queries by testers can slow the system.
There are two properties to eliminate slow queries from warmup. You can add either property or both
to query.properties, which is located in xplore_home/dsearch/xhive/admin.
query_response_time: Specify a value in msec for maximum query response time (fetch +
execution). Set to 60000 (60 sec) or less.
exclude_users: Specify a comma-delimited list of users whose queries time out.
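For example, a query.properties containing both settings might look like this (the values shown are illustrative):

```
# xplore_home/dsearch/xhive/admin/query.properties
# Exclude slow queries and test users from warmup (illustrative values)
query_response_time=30000
exclude_users=loadtester1,loadtester2
```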
Use the ENABLE(fds_collection collectionname) hint or the IN COLLECTION clause in DQL. See
Routing a query to a specific collection, page 257
Workaround: Queries can generally be made more selective. If you cannot modify the query, organize
the repository so that the user has access to documents in certain containers such as rooms or cases.
Append the container IDs to the user query.
Make sure that counter.xml has not been deleted from the collection
domain_name/Data/ApplicationInfo/group. If it has been deleted, restart xPlore.
Try the query with Content Server security turned on. (See Changing search results security,
page 51.)
Summary can be blank if the summary security mode is set to BROWSE. (See Configuring
summary security, page 214.)
3. Set the save-tokens option to true for the target collection and restart xPlore, then reindex the
document. Check the tokens in the Tokens library to see whether the search term was properly
indexed.
To see the language that was identified in the query, view the query in dsearch.log. For example:
<message>
<![CDATA[QueryID=primary$f20cc611-14bb-41e8-8b37-2a4f1e135c70,
query-locale=en,...>
Search results differ when searching with different locales, especially compound terms that have
associated components. For example, a search for Stollwerk returned many more results when using
the German than the English locale. Stollwerk is lemmatized as stollwerk in English but as stoll and
werk in German. You can turn off lemmatization. See Configuring indexing lemmatization, page 105.
Debugging queries
You can debug queries for the following problems:
Query does not return expected results.
Query is very slow (reported in Top N Slowest Queries report).
No results are returned, because the searched value is a Documentum integer attribute. When you
execute the query with the get query debug option, you see that the value is treated as a string:
query:1:20:for expression .../child::dmftdoc[. contains text 9001001]
You must stop the xPlore instances and add a subpath for the non-string attribute. In this example, the
following subpath was added to the dmftdoc category. Note that partial paths are supported, in case the
metadata value is found in more than one path:
<sub-path leading-wildcard="false" compress="true"
boost-value="1.0" description="award number"
include-descendants="false" returning-contents="true"
value-comparison="true" full-text-search="true"
enumerate-repeating-elements="true" type="integer"
path="dmftmetadata//award_no"/>
For the non-string value to be found, we must reindex the domain (or the specific collection, if known).
After reindexing, we have a different result in Test Search:
query:1:99:Using query plan:
query:1:99:index(dmftdoc)
Copy the query into a text editor and remove every declare option xhive... phrase, that is, everything
before let $libs. In the example above, remove the following:
declare option xhive:fts-analyzer-class
com.emc.documentum.core.fulltext.indexserver.core.index.xhive.IndexServerAnalyzer;
declare option xhive:ignore-empty-fulltext-clauses true;
declare option xhive:index-paths-value ...";
NewIACollection
query:1:353:for expression .../child::dmftdoc[(((child::dmftmetadata/
descendant-or-self::node()/child::a_is_hidden[. = "false"] and
child::dmftversions/child::iscurrent[. = "true"]) and . contains
text award) and (child::dmftmetadata/descendant-or-self::node(
)/child::r_modify_date[. >= ...] and child::dmftmetadata/
descendant-or-self::node()/child::r_modify_date[. <= ...]))]
query:1:353:Found index "dmftdoc"
query:1:353:Using query plan:
query:1:353:index(dmftdoc)
query:1:353:Looking up "(false, true, award, 1980-01-01T00:00:00Z,
2010-01-01T00:00:00Z)" in index "dmftdoc"
query:1:643:Found an index to support all order specs. No sort required.
retrieve:
IDfXQuery.getExecutionPlan(session)
Using iAPI
save:
apply,c,NULL,MODIFY_TRACE,SUBSYSTEM,S,fulltext,VALUE,S,ftengine
retrieve: The query execution plan is written to dsearch.log, which is located in the logs subdirectory
of the JBoss deployment directory.
Using xPlore search API
save:
IDfXQuery.setSaveExecutionPlan(true)
retrieve:
IFtSearchSession.fetchExecutionPlan(requestId)
let $libs :=
(/TechPubsGlobal/dsearch/Data) let $results :=
for $dm_doc score $s in collection($libs)/dmftdoc[
(dmftmetadata//a_is_hidden = "false") and (
dmftversions/iscurrent = "true")
and (. ftcontains "award" with stemming)]
order by $s descending
return $dm_doc return (for $dm_doc in subsequence($results,1,351)
return <r>{for $attr in $dm_doc/dmftmetadata//*[local-name()=(
"object_name","r_modify_date","r_object_id","r_object_type",
"r_lock_owner","owner_name","r_link_cnt","r_is_virtual_doc",
"r_content_size","a_content_type","i_is_reference","r_assembled_from_id",
"r_has_frzn_assembly","a_compound_architecture","i_is_replica",
"r_policy_id")]
return <attr name="{local-name($attr)}" type="{$attr/@dmfttype}">{
string($attr)}</attr>}{xhive:highlight(($dm_doc/dmftcontents/
dmftcontent/dmftcontentref,$dm_doc/dmftcustom))}
<attr name="score" type="dmdouble">{string(dsearch:get-score($dm_doc))}
</attr></r>
)
Note: The XQuery portion of the query is almost identical to the query retrieved through xPlore
administrator. These queries were issued separately, which accounts for differences.
To debug the Webtop query, edit the query from View Source and enter it in the Execute XQuery
dialog in xPlore administrator.
Route all queries that meet specific criteria using a DQL hint in dfcdqlhints.xml:
enable(fds_query_collection_collectionname), where collectionname is the collection name. If
you use a DQL hint, you do not need to change the application or DFC query builder. You must
turn off XQuery generation. (See Turning off XQuery generation to support DQL, page 259.)
For more information on the hints file, refer to EMC Documentum Search Development Guide.
For example:
select r_object_id from dm_document search document contains benchmark
enable(fds_query_collection_custom)
Use DQL
You can route a DQL query to a specific collection in the following ways. By default, DFC does
not generate DQL, but you can turn off XQuery generation. (See Turning off XQuery generation
to support DQL, page 259.)
Route an individual query using the DQL in collection clause to specify the target of a SELECT
statement. Use one of the two following syntaxes. Collection names are separated by underscores.
select attr from type SDC where enable(
fds_query_collection_collection1_collection2_...)
select attr from type SDC in collection
(collection1,collection2,...)
Route all queries that meet specific criteria using a DQL hint in dfcdqlhints.xml
enable(fds_query_collection_collectionname) where collectionname is the collection name.
For more information on the hints file, refer to EMC Documentum Search Development Guide.
The following hints route queries for a specific type to a known target collection appended to
FDS_QUERY_COLLECTION_.
<RuleSet>
<Rule>
<Condition>
<From condition="any">
<Type>my_type</Type>
</From>
</Condition>
<DQLHint>ENABLE(FDS_QUERY_COLLECTION_MYTYPECOLLECTION)</DQLHint>
</Rule>
</RuleSet>
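The underscore-joined hint string used in both approaches can be built mechanically. For example (an illustrative helper, not part of any shipped API):

```java
import java.util.Arrays;
import java.util.List;

// Joins collection names into the fds_query_collection_ hint form,
// with names separated by underscores as documented above.
public class CollectionHint {
    public static String build(List<String> collections) {
        return "fds_query_collection_" + String.join("_", collections);
    }

    public static void main(String[] args) {
        // prints fds_query_collection_collection1_collection2
        System.out.println(build(Arrays.asList("collection1", "collection2")));
    }
}
```

Note that because underscores both separate and may appear inside collection names, keep collection names free of underscores when you rely on this hint form.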
Debugging queries
You can debug queries by clicking a collection in xPlore administrator. Choose Execute XQuery for
the target collection or the top-level collection for the repository.
CAUTION: Do not use xhadmin to rebuild an index or change files that xPlore uses. If
you remove segments, your backups cannot be restored. This tool is not aware of xPlore
configuration settings in indexserverconfig.xml.
expressionSet2.addExpression(new PropertyExpression(
"object_name", Condition.CONTAINS, new SimpleValue("test")));
structuredQuery.setRootExpressionSet(expressionSet2);
return structuredQuery;
}
Options:
Debugging:
Get and set client application name for logging
Get and set save execution plan to see how the query was executed
Query execution:
Get and set result batch size. For a single batch, set to 0.
Get and set target collection for query
Get and set query text locale
Get and set parallel execution of queries
Get and set timeout in ms
Security:
Get and set security filter fully qualified class name
Get and set security options used by the security filter
Get and set native security (false sets security evaluation in the Content Server)
Results:
Get and set results streaming
Get and set results returned as XML nodes
Get and set spooling to a file
Get and set synchronization (wait for results)
Get and set caching
Summaries:
Get and set return summary
Get and set return of text for summary
Get and set summary calculation
Get dynamic summary maximum threshold
2. Get a search session using IDSearchClient. The following example connects to the search service
and creates a session.
public void connect() throws Exception
{
String bootStrap = BOOT_STRAP;
DSearchServerInfo connection = new DSearchServerInfo(m_host, m_port);
IDSearchClient client = DSearchClient.newInstance(
"MySearchSession", connection);
m_session = client.createFtSearchSession(m_domain);
}
3. Create an XQuery statement. The following example creates a query for a string in the contents:
public void testQuery()
{
String xquery = "for $doc in doc('/DSS_LH1/dsearch/Data/default') where "
+ "$doc/dmftdoc[dmftcontents ftcontains 'strange'] return string(<R> <ID>{"
+ "string($doc/dmftdoc/dmftmetadata//r_object_id)}</ID></R>)";
executeQuery(xquery, options); //see "Executing a query"
}
4. Set query options. When you use an xPlore API to set options, the settings override the global
configuration settings in the xPlore administration APIs. See the javadocs for IFtQueryOptions in
the package com.emc.documentum.core.fulltext.common.search and Set the query target, page 261.
Add options like the following, and then provide the options object to the executeQuery method
of IFtSearchSession. For example:
IFtQueryOptions options = new FtQueryOptions();
options.setSpooled(true);
5. Set query debug options. The enumeration FtQueryDebugOptions can be used to set debug options
for IDfXQuery in DFC version 6.7 or higher. To set options, use the following syntax:
public String getDebugInfo(IDfSession session, FtQueryDebugOptions
debugOption)
throws DfException;
For example:
String queryid = xquery.getDebugInfo(m_session,
IDfXQuery.FtQueryDebugOptions.QUERY_ID);
6. Execute the query. See Execute the query, page 263. Provide the query options and XQuery
statement to your instance of IFtSearchSession.executeQuery, like the following:
requestId = m_session.executeQuery(xquery, options);
7. Retrieve results. The method executeQuery returns an instance of IFtQueryRequest from which
you can retrieve results. See Retrieve the results, page 263.
The following example sets the query options, executes the query by implementing the
IFtSearchSession method executeQuery, and iterates through the results, printing them to the
console.
private void executeQuery (String xquery)
{
String requestId = null;
try
{
IFtQueryOptions options = new FtQueryOptions();
options.setSpooled(true);
options.setWaitForResults(true);
options.setResultBatchSize(5);
options.setAreResultsStreamed(false);
requestId = m_session.executeQuery(xquery, options);
Iterator<IFtQueryResultValue> results = m_session.getResultsIterator(
requestId);
while (results.hasNext())
{
IFtQueryResultValue r = results.next();
System.out.print("results = ");
//printQueryResult(r); See next step
System.out.println();
}
}
catch (FtSearchException e)
{
System.out.println("Failed to execute query");
}
}
List<IFtQueryResultValue> children = (
List<IFtQueryResultValue>) v.getValue();
for (IFtQueryResultValue child : children)
{
printQueryResult(child);
}
}}
DFC
Use the IDfQueryProcessor method setApplicationContext(DfApplicationContext context).
DfApplicationContext can set the following context:
setApplicationName(String name)
setQueryType(String type). Valid values include AUTO_QUERY, as in the example below.
setApplicationAttributes(Map<String,String> attributesMap). Set user-defined attributes in a Map
object.
DFC example
The following example sets the query subscription application context and application name. This
information is used to report subscription queries.
Instantiate a query processor from the search service, set the application name and query type, and add
your custom attributes to the application context object:
IDfQueryProcessor processor = m_searchService.newQueryProcessor(
queryBuilder, true);
DfApplicationContext anApplicationContext = new DfApplicationContext();
anApplicationContext.setApplicationName("QBS");
anApplicationContext.setQueryType("AUTO_QUERY");
Map<String,String> aSetOfApplicationAttributes =
new HashMap<String,String>();
aSetOfApplicationAttributes.put("frequency","300");
aSetOfApplicationAttributes.put("range","320");
anApplicationContext.setApplicationAttributes(
aSetOfApplicationAttributes);
processor.setApplicationContext(anApplicationContext);
The event data is used to create a report. For example, a report that gets failed subscribed queries has
the following XQuery expression. This expression gets queries for which the app_name is QBS and
the queries are not executed:
let $lib := '/SystemData/AuditDB/PrimaryDsearch/'
let $failingQueries := collection($lib)//event[name = 'AUTO_QUERY'
and application_context[app_name = 'QBS' and app_data[attr[
@name = 'frequency']/@value < attr[@name = 'range']/@value]]]/QUERY_ID
return $failingQueries
IDfXQuery
Use the API FtQueryOptions in the package com.emc.documentum.core.fulltext.common.search.
Call setApplicationName(String applicationName) to log the name of the search client application,
for example, webtop.
Call setQueryType(FtQueryType queryType) with the FtQueryType enum.
dfc.search.xquery.option.parallel_execution.enable = false
You can also use one of the following APIs to execute a query across several collections in parallel:
DFC API: IDfXQuery FTQueryOptions.PARALLEL_EXECUTION
xPlore API: IFtQueryOptions.setParallelExecution(true)
Parallel queries are not supported in DQL.
CAUTION: Parallel queries may not perform better than a query that probes each collection
in sequence. To probe all collections in parallel, set the option and compare performance with a
sequential query (the default).
Use the input terms from the query to probe the thesaurus.
You can use the optional XQuery relationship and levels parameters of FTThesaurusOption to specify
special processing. For information on these parameters, see FTThesaurusOption.
In the following example, the relationship value is RT (related term), and minLevelValue and
maxLevelValue are 2:
using thesaurus at "thesaurusURI" relationship "RT" exactly 2 levels
Package the class in a jar file and put it into the library
xplore_home/jboss5.1.0/server/DctmServer_PrimaryDsearch/deploy/dsearch.war/WEB-INF/lib.
The path in the jar file must match the package name. For example:
jar cvf dsearch-thesaurus.jar com\emc\documentum\core\fulltext\
common\search\impl\SimpleThesaurusHandler.class
Modify indexserverconfig.xml to specify the custom thesaurus. Define a new thesaurus element
under the domain that will use the custom thesaurus. Restart the xPlore instances after making this
change. The following example indicates a thesaurus URI to a custom-defined class. When a query
specifies this URI, the custom class is used to retrieve related terms.
<domain storage-location-name="default" default-document-category="dftxml" name=... >
  <collection ... >
    ...
    <thesaurus uri="my_thesaurus" class-name=
      "com.emc.documentum.core.fulltext.common.search.impl.FASTThesaurusHandler"/>
  </collection>
</domain>
You can access one thesaurus for full-text and one thesaurus for metadata. For example, you may have a
metadata thesaurus that lists various forms of company names. The following example uses the default
thesaurus to expand the full-text lookup and a metadata thesaurus to expand the metadata lookup:
IDfExpressionSet rootSet = queryBuilder.getRootExpressionSet();
//full-text expression uses default thesaurus
IDfFullTextExpression aFullTextExpression = rootSet.addFullTextExpression(
fulltextValue);
aFullTextExpression.setThesaurusSearchEnabled(true);
//simple attribute expression uses custom metadata thesaurus
IDfSimpleAttributeExpression aMetadataExpression =
rootSet.addSimpleAttrExpression("companyname", IDfValue.DF_STRING,
IDfSimpleAttrExpression.SEARCH_OP_CONTAINS, false, false,
companyNameValue);
aMetadataExpression.setThesaurusSearchEnabled(true);
aMetadataExpression.setThesaurusLibrary(
    "http://search.emc.com/metadatathesaurus");
Chapter 11
Facets
This chapter contains the following topics:
About Facets
Facet datatypes
Tuning facets
Logging facets
Troubleshooting facets
About Facets
Faceted search, also called guided navigation, enables users to explore large datasets to locate items
of interest. You can define facets for the attributes that are used most commonly for search. Facets
are presented in a visual interface, removing the need to write explicit queries and avoiding queries
that do not return desired results. After facets are computed and the results of the initial query are
presented in facets, the user can drill down to areas of interest. At drilldown, the query is reissued
for the selected facets.
A facet represents one or more important characteristics of an object, represented by one or more
object attributes in the Documentum object model. Multiple attributes can be used to compute a facet,
for example, r_modifier or keywords. Faceted navigation permits the user to explore data in a large
dataset. It has several advantages over a keyword search or explicit query:
The user can explore an unknown dataset by restricting values suggested by the search service.
The data set is presented in a visual interface, so that the user can drill down rather than constructing
a query in a complicated UI.
Faceted navigation prevents dead-end queries by limiting the restriction values to results that are
not empty.
Facets are computed on discrete values, for example, authors, categories, tags, and date or numeric
ranges. Facets are not computed on text fields such as content or object name. Facet results are not
localized; the client application must provide localization.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
Before you create facets, create indexes on the facet attributes. See Configuring facets in xPlore, page
274. Some facets are already configured by default.
For very specific use cases, if the out-of-the-box facet handlers do not meet your needs, you can define
custom facet handlers for facet computation. For example, if a facet potentially includes many distinct
values, you can define ranges to group the values.
API overview
Your search client application can define a facet using the DFC query builder API or DFS search
service. For information on using the DFC query builder API, see Building a query with the DFC
search service, page 259. For information on using the DFS search service, see Building a query with
the DFS search service, page 260. Define custom facet handlers using the xPlore facet handler API,
see Defining a facet handler, page 281. In most cases, the out-of-the-box facet handlers are sufficient.
Facets are computed in the following process. The APIs that perform these operations are described
fully in the following topics. For facets javadocs, see the DFC or DFS javadocs.
1. DFC or DFS search service evaluates the constraints and returns an iterator over the results.
2. Search service reads through the results iterator until the number of results specified in
query-max-result-size has been read (default: 10000).
3. For each result, the search service gets the attribute values and increments the corresponding facet
values. Subpath indexes speed this lookup, because the values are found in the index, not in
the xDB pages.
4. The search service performs the following on the list of all facet values:
   a. ...
   b. Keeps only the top facet values according to setMax (DFC) or setMaxFacetValues (DFS).
      Default: 10.
   c. ...
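The counting described in steps 3 and 4 can be sketched in plain Java. This is an illustration only, not the xPlore implementation; the class and method names are invented:

```java
import java.util.*;
import java.util.stream.*;

public class FacetCountSketch {
    // Count each attribute value across the result window, then keep the
    // top maxValues by frequency, mirroring steps 3 and 4 above.
    static LinkedHashMap<String, Long> computeFacet(List<String> values, int maxValues) {
        Map<String, Long> counts = values.stream()
            .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(maxValues)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<String> modifiers = Arrays.asList("user2", "user1", "user2",
            "user2", "user1", "user3");
        System.out.println(computeFacet(modifiers, 2)); // {user2=3, user1=2}
    }
}
```

The real search service reads attribute values from the subpath index rather than from a list in memory, but the grouping and truncation follow the same shape.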
Facet datatypes
Each facet datatype requires a different grouping strategy. You can set the following parameters for
each datatype in the specified DfFacetDefinition method (DFC) or FacetDefinition object (DFS). The
supported facet datatypes are string, date, and numeric.
keepDuplicateValues: Documents with repeating value attributes can have duplicate attribute
values. By default, duplicate entries are removed.
alpharange: Group by range. Set a property named range that specifies the ranges, for example:
a:m,n:r,s:z. Specify ranges using ASCII characters. Ranges use Unicode order, not
language-dependent order. For example:
myFacetDefinition.setProperty("range", "a:m,n:r,s:z");
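To illustrate how alpharange grouping assigns a value to a range by Unicode order, here is a plain-Java sketch. It is an illustration only, not xPlore code; bucketFor is an invented name, and lowercasing the input before comparison is an assumption:

```java
import java.util.*;

public class AlphaRangeSketch {
    // Assign a value to the first range whose bounds contain its first
    // character, comparing by Unicode code point.
    static String bucketFor(String value, String rangeSpec) {
        String v = value.toLowerCase(Locale.ROOT); // case handling is an assumption
        for (String range : rangeSpec.split(",")) {
            String[] bounds = range.split(":");
            char c = v.charAt(0);
            if (c >= bounds[0].charAt(0) && c <= bounds[1].charAt(0)) {
                return range;
            }
        }
        return null; // outside all configured ranges
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("Paris", "a:m,n:r,s:z"));  // n:r
        System.out.println(bucketFor("Boston", "a:m,n:r,s:z")); // a:m
    }
}
```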
Facets are returned as IDfFacetValue. Following is an example of the XML representation of returned
facet string values:
<facet name="r_modifier">
  <element count="5" value="user2"/>
  <element count="3" value="user1"/>
</facet>
Following is an example of the XML representation of returned facet numeric values for
range=0:10,10:100,100:
<facet name="r_full_content_size">
  <elem count="5" value="0:10">
    <prop name="lowerbound">0</prop>
    <prop name="upperbound">10</prop>
  </elem>
  <elem count="3" value="10:100">
    <prop name="lowerbound">10</prop>
    <prop name="upperbound">100</prop>
  </elem>
  <elem count="0" value="100:">
    <prop name="lowerbound">100</prop>
  </elem>
</facet>
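The grouping behind this output can be sketched in plain Java. This is an illustration only (rangeFor is an invented name); it assumes lower-inclusive, upper-exclusive boundaries, which matches the example where the value 10 falls in the 10:100 bucket, and an open-ended final range such as 100::

```java
public class NumericRangeSketch {
    // Map a numeric value to a range label such as "0:10", "10:100",
    // or "100:" (open-ended upper range).
    static String rangeFor(long value, String rangeSpec) {
        for (String range : rangeSpec.split(",")) {
            String[] bounds = range.split(":", -1);
            long lower = Long.parseLong(bounds[0]);
            boolean openEnded = bounds[1].isEmpty();
            if (value >= lower && (openEnded || value < Long.parseLong(bounds[1]))) {
                return range;
            }
        }
        return null; // outside all configured ranges
    }

    public static void main(String[] args) {
        System.out.println(rangeFor(7, "0:10,10:100,100:"));   // 0:10
        System.out.println(rangeFor(250, "0:10,10:100,100:")); // 100:
    }
}
```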
FacetValue
A FacetValue object groups results that have attribute values in common. The FacetValue has a label
and count for number of results in the group. For example, a facet on the attribute r_modifier could
have these values, with count in parentheses:
Tom Terrific (3)
Mighty Mouse (5)
A FacetValue object can also contain a list of subfacet values and a set of custom properties. For
example, a facet on the date attribute r_modify_date has a value of a month (November). The facet
has subfacet values of weeks in the specific month (Week from 11/01 to 11/08). xPlore computes
the facet, subfacet, and custom property values.
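A minimal stand-in for this structure can be sketched as follows. The Value class here is invented for illustration; it is not the DFC/DFS FacetValue API, only a model of the label, count, and subfacet values described above:

```java
import java.util.*;

public class FacetValueSketch {
    // A facet value: a label, a result count, and optional subfacet
    // values (for example, a month with its weeks).
    static class Value {
        final String label;
        final int count;
        final List<Value> subValues = new ArrayList<>();
        Value(String label, int count) { this.label = label; this.count = count; }
    }

    public static void main(String[] args) {
        Value november = new Value("November", 12);
        november.subValues.add(new Value("Week from 11/01 to 11/08", 5));
        november.subValues.add(new Value("Week from 11/08 to 11/15", 7));
        System.out.println(november.label + " (" + november.count + "), "
            + november.subValues.size() + " subfacet values");
        // November (12), 2 subfacet values
    }
}
```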
FacetDefinition
A FacetDefinition object contains the information used by xPlore to build facet values. The facet name
is required. If no attributes are specified, the name is used as the attribute. Facet definitions must be
specified when the query is first executed. A facet definition can hold a subfacet definition.
FacetSort is an enumeration that specifies the sort order for facet values. It is a field of the
FacetDefinition object. The possible sort orders include the following: FREQUENCY (default) |
VALUE_ASCENDING | VALUE_DESCENDING | NONE. A date facet must set the sort order
to NONE.
Facet results
A Facet object holds a list of facet values that xPlore builds.
A QueryFacet object contains a list of facets that have been computed for a query, as well as the query
ID and QueryStatus. This object is like a QueryResult object; a call to getFacets returns a QueryFacet.
The getFacets method of the SearchService object calculates facets on the entire set of query results
for a specified Query. The method has the following signature:
public QueryFacet getFacets(
Query query, QueryExecution execution, OperationOptions options)
throws SearchServiceException
This method executes synchronously by default. The OperationOptions object contains an optional
SearchProfile object that specifies whether the call is blocking. For a query on several repositories that
support facets, the client application can retrieve facets asynchronously by specifying a SearchProfile
object as the OperationOptions parameter. Refer to EMC Documentum Enterprise Content Services for
more information on Query, StructuredQuery, QueryExecution, and SearchProfile.
You can call this method after a call to execute, using the same Query and queryId. Paging information
in QueryExecution has no impact on the facets calculation.
<facet-handlers>
<facet-handler class-name="my.package.MyFacetFactory1"/>
<facet-handler class-name="my.package.MyFacetFactory2"/>
</facet-handlers>
</search-config>
b.
If not already set, modify the subpath configuration for the facet as described in Configuring
your own facets, page 275.
Reindexing is only required if you modify the subpath.
6. To use it in your application, reference the custom handler in the grouping strategy (GroupBy
parameter) of the facet.
// Constant declarations (values elided in the original):
static final String USER = "...";
static final String PASSWORD = "...";
static final String DOCBASE = "...";
Get a session and instantiate the search service and query builder:
IDfClient client = DfClient.getLocalClient();
IDfSessionManager m_sessionManager = client.newSessionManager();
DfLoginInfo identity = new DfLoginInfo(USER, PASSWORD);
m_sessionManager.setIdentity(DOCBASE, identity);
IDfSearchService m_searchService = client.newSearchService(
m_sessionManager, DOCBASE);
IDfQueryManager queryManager = m_searchService.newQueryMgr();
IDfQueryBuilder queryBuilder = queryManager.newQueryBuilder("dm_sysobject");
Start building the root expression set by adding the result attributes:
IDfExpressionSet exprSet = queryBuilder.getRootExpressionSet();
final String DATE_FORMAT = "yyyy-MM-dd'T'HH:mm:ss";
queryBuilder.setDateFormat(DATE_FORMAT);
exprSet.addSimpleAttrExpression("r_modify_date", IDfAttr.DM_TIME,
    IDfSimpleAttrExpression.SEARCH_OP_GREATER_EQUAL, false, false,
    "1980-01-01T00:00:00");
exprSet.addSimpleAttrExpression("r_modify_date", IDfAttr.DM_TIME,
    IDfSimpleAttrExpression.SEARCH_OP_LESS_EQUAL, false, false,
    "2010-01-01T00:00:00");
The previous code builds a query without facets. Now add a facet definition for the
person who last modified the document:
DfFacetDefinition definitionModifier = new DfFacetDefinition("r_modifier");
queryBuilder.addFacetDefinition(definitionModifier);
Another facet definition adds the last modification date and sets some type-specific options for the date:
DfFacetDefinition definitionDate = new DfFacetDefinition("r_modify_date");
definitionDate.setMax(-1);
definitionDate.setGroupBy("year");
queryBuilder.addFacetDefinition(definitionDate);
Keywords facet:
DfFacetDefinition definitionKeywords = new DfFacetDefinition("keywords");
queryBuilder.addFacetDefinition(definitionKeywords);
To submit the query and process the results, instantiate IDfQueryProcessor, which is described in the
following topic.
Tuning facets
Limiting the number of facets to save index space and
computation time
Every facet requires a special index, and every query that contains facets requires computation time
for the facet. As the number of facets increases, the disk space required for the index increases; the
amount depends on how frequently the facet attributes occur in indexed documents. As the number
of facets in an individual query increases, the computation time also increases, depending on whether
the indexes are spread out on disk.
Logging facets
To turn on logging for facets, use xPlore administrator and open the dsearch-search family. Set
com.emc.documentum.core.fulltext.indexserver.services.facets to DEBUG. Output is like the following:
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10"
logger="com.emc.documentum.core.fulltext.indexserver.services.facets.
impl.CompositeFacetsProcessor" timeInMilliSecs="1249475838953">
<message ><![CDATA[Begin facet computation]]></message>
</event>
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10...
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10"
logger="com.emc.documentum.core.fulltext.indexserver.services.facets.
impl.CompositeFacetsProcessor" timeInMilliSecs="1249475838953">
<message ><![CDATA[Facets computed using 13 results.]]></message>
</event>
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10"
logger="com.emc.documentum.core.fulltext.indexserver.services.facets.
impl.CompositeFacetsProcessor"
timeInMilliSecs="1249475838953">
<message ><![CDATA[Facet handler string(r_modifier) returned 11 values.]]>
</message>
</event>
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10"
logger="com.emc.documentum.core.fulltext.indexserver.services.facets.
impl.CompositeFacetsProcessor"
timeInMilliSecs="1249475838953">
<message ><![CDATA[Facet handler string(r_modify_date) returned 4 values.]]>
</message>
</event>
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10"
logger="com.emc.documentum.core.fulltext.indexserver.services.facets.
impl.CompositeFacetsProcessor"
timeInMilliSecs="1249475838953">
<message ><![CDATA[Sort facets]]></message>
</event>
<event timestamp="2009-08-05 14:37:18,953" level="DEBUG" thread="pool-3-thread-10"
logger="com.emc.documentum.core.fulltext.indexserver.services.facets.
impl.CompositeFacetsProcessor"
timeInMilliSecs="1249475838953">
<message ><![CDATA[End facet computation]]></message>
</event>
Troubleshooting facets
A query returns no facets
Check the security mode of the repository. Use the following IAPI commands:
API> retrieve,c,dm_ftengine_config
...
0800007580000916
API> get,c,l,ftsearch_security_mode
...
0
If the command returns 0, as in this example, set the security mode so that security is evaluated
in xPlore, not in the Content Server. Use the following IAPI commands:
retrieve,c,dm_ftengine_config
set,c,l,ftsearch_security_mode
1
save,c,l
reinit,c
Chapter 12
Using reports
This chapter contains the following topics:
About reports
Types of reports
Indexing reports
Search reports
Editing a report
Report syntax
Troubleshooting reports
About reports
Reports provide indexing and query statistics, and they are also a troubleshooting tool. The
troubleshooting sections for CPS, indexing, and search describe how to use reports for
troubleshooting.
Statistics on content processing and indexing are stored in the audit database. Use xPlore administrator
to query these statistics. Auditing supplies information to reports on administrative tasks or queries
(enabled by default). For information on enabling and configuring auditing, see Auditing collection
operations, page 167.
To run reports, choose Diagnostic and Utilities and then click Reports. To generate Documentum
reports that compare a repository to the index, see Using ftintegrity, page 73.
Types of reports
The following types of reports are available in xPlore administrator.
Table 35
List of reports (columns: Report title, Description). The reports include User activity,
described below.
Indexing reports
To view indexing rate, run the report Documents ingested per month/day/hour. The report shows
Average processing latency. The monthly report covers the current 12 months. The daily report covers
the current month. The hourly report covers the current day. From the hourly report, you can determine
your period of highest usage. You can divide the bytes processed by the document count to find the
average size of content ingested. For example, 2,822,469 bytes for 909 documents yields an average
size of about 3105 bytes. This size does not include non-indexable content.
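The arithmetic, using the figures from the example above:

```java
public class AvgSizeSketch {
    public static void main(String[] args) {
        long bytesProcessed = 2_822_469L; // from the report example
        long documentCount = 909;
        // Integer division gives the approximate average content size in bytes.
        System.out.println(bytesProcessed / documentCount); // 3105
    }
}
```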
Search reports
Enable auditing in xPlore administrator to view query reports (enabled by default).
To examine a slow or failed query by a user, get the query ID from Top N slowest queries and then
enter the query ID into Get query text. Examine the query text for possible problems. The following
example shows a query with a slow response time. The user searched in Webtop for the string "xplore"
(line breaks added here):
declare option xhive:fts-analyzer-class
"com.emc.documentum.core.fulltext.indexserver.core.index.xhive.IndexServerAnalyzer";
for $i score $s in collection(
'/DSS_LH1/dsearch/Data')/dmftdoc[( ( ( (dmftmetadata//a_is_hidden = 'false') ) )
and ( (dmftinternal/i_all_types = '030a0d6880000105') )
and ( (dmftversions/iscurrent = 'true') ) )
and ( (. ftcontains ( (('xplore') with stemming) ) )) ]
order by $s descending return
<dmrow>{if ($i/dmftinternal/r_object_id) then $i/dmftinternal/r_object_id
else <r_object_id/>}{if ($i/dmftsecurity/ispublic) then $i/dmftsecurity/ispublic
else <ispublic/>}{if ($i/dmftinternal/r_object_type) then
$i/dmftinternal/r_object_type
else <r_object_type/>}{if ($i/dmftmetadata/*/owner_name)
then $i/dmftmetadata/*/owner_name
else <owner_name/>}{if ($i/dmftvstamp/i_vstamp) then $i/dmftvstamp/i_vstamp
else <i_vstamp/>}{if ($i/dmftsecurity/acl_name) then $i/dmftsecurity/acl_name
else <acl_name/>}{if ($i/dmftsecurity/acl_domain) then $i/dmftsecurity/acl_domain
else <acl_domain/>}<score dmfttype="dmdouble">{$s}</score>{xhive:highlight(
$i/dmftcontents/dmftcontent/dmftcontentref)}</dmrow>
Use the xDB admin tool to debug the query. For instructions on using xhadmin, see Debugging
queries, page 259.
User activity
Use User activity to display queries by the specified user for the specified time period. Data can be
exported to Microsoft Excel. Click a query link to see the XQuery.
Note: This report can take a very long time to run. If you enter a short date range or a user name,
the report runs much faster.
Editing a report
You can edit any of the xPlore reports. Select a report in xPlore administrator and click Save as.
Specify a unique file name and title for the report. Alternatively, you can write a new copy of the report
and save it to xplore_home/jboss5.1.0/server/primary_instance/deploy/dsearchadmin.war/reports.
To see the new report in xPlore administrator, click somewhere else in xPlore administrator and
then click Reports.
Reports are based on the W3C XForms standard. For a guide to the syntax in a typical report, see
Report syntax, page 292.
Adding a variable
Reports require certain variables. The XForms processor substitutes the input value for the variable in
the query.
1. Declare it.
2. Reference it within the body of the query.
3. Define the UI control and bind it to the data.
These steps are highlighted in the syntax description, Report syntax, page 292.
Report syntax
xPlore reports conform to the W3C XForms specification. The original report XForms are located in
xplore_home/jboss5.1.0/server/DctmServer_PrimaryDsearch/deploy/dsearchadmin.war/reports. You
can edit a report in xPlore administrator and save it with a new name. Alternatively, you can copy the
XForms file and edit it in an XML editor of your choice.
These are the key elements that you can change in a report:
Table 36
Report elements
Element
Description
xhtml:head/input
xhtml:head/query
xforms:action
xforms:setvalue
xforms:bind
xhtml:body
The following example highlights the use of the input field startTime in the report Query Counts By
User (rpt_QueryByUser.xml). The full report is line-numbered for reference in the example (some
lines deleted for readability):
1 ...<xforms:model><xforms:instance><ess_report xmlns="">
2 <input>
3   <startTime/><endTime/>...
4 </input>
1 <query><![CDATA[ ...
2 let $u1 := distinct-values(collection(/SystemData/AuditDB)//
    event[@component = "search"...
3 and START_TIME[ . >= $startTime]...
4 return <report ...>...<rowset>...
5 for $d in distinct-values(collection(/SystemData...
    and START_TIME[ . >= $startTime] and START_TIME[ . <= $endRange]]...
6 return let $k := collection(AuditDB)...and
    START_TIME[ . >= $startTime] and START_TIME[ . <= $endRange]
7 ... return ...
</rowset></report> ]]></query></ess_report></xforms:instance>
1 xhtml:head/xforms:model/xforms:instance/ess_report/query: Specifies the XQuery for the report.
The syntax conforms to the XQuery specification.
3 References the start time and end time variables and sets criteria for them in the query: as greater
than or equal to the input start time and less than or equal to the input end time:
and START_TIME[ . >= $startTime]
and START_TIME[. <= $endRange]]/USER_NAME)
4 return report/rowset: The return is an XQuery FLWOR expression that specifies what is returned
from the query. The transform plain_table.xsl, located in the same directory as the report, processes
the returned XML elements.
5 This expression iterates over the rows returned by the query. This particular expression evaluates
all results, although it could evaluate a subset of results.
6 This expression evaluates various computations such as average, maximum, and minimum
query response times.
7 The response times are returned as row elements (evaluated by the XSL transform).
1 <xforms:action ev:event="xforms-ready">
2   <xforms:setvalue ref="input/startTime" value="seconds-to-dateTime(
      seconds-from-dateTime(local-dateTime()) - 24*3600)"/>...
  </xforms:action>...
3 <xforms:bind nodeset="input/startTime" constraint="seconds-from-dateTime(
      .) <= seconds-from-dateTime(../endTime)"/>
4 <xforms:bind nodeset="input/startTime" type="xsd:dateTime"/>...
  </xforms:model>...</xhtml:head>
5 <xhtml:body>...<xhtml:tr class="">
6 <xhtml:td>Start from:</xhtml:td>
  <xhtml:td><xforms:group>
7   <xforms:input ref="input/startTime" width="100px" ev:event="DOMActivate">
8     <xforms:message ev:event="xforms-invalid" level="ephemeral">
        The "Start from" date should be no later than the "to" date.
      </xforms:message>
9     <xforms:action ev:event="xforms-invalid">
        <xforms:setvalue ref="../endTime" ev:event="xforms-invalid"
          value="../startTime"/><xforms:rebuild/>
      </xforms:action>
    </xforms:input></xforms:group></xhtml:td></xhtml:tr>...
5. xhtml:body: Defines the UI presentation in xhtml. The body contains elements that conform to
XForms syntax. The browser renders these elements.
6. The first table cell in this row contains the label Start from:
7. xforms:input contains elements that define the UI for this input control. Attributes on this element
define the width and the event that is fired.
8. xforms:message contains the message that is displayed when the entry does not conform to the
constraint.
9. xforms:action ev:event="xforms-invalid" defines the invalid state for the input control. Entries
after the end date are invalid.
5. This step finds failed queries. Locate the variable definition for successful queries (for $j ... let $k
...) and add your new query. To get the failed queries, find the nodes in a QUERY element whose
TOTAL_HITS value is equal to zero:
6. Define a variable for the count of failed queries and add it after the variable for successful query
count (let $queryCnt...):
let $failedCnt := count($z)
7. Return the failed query count cell, after the query count cell (<cell> { $queryCnt } ...):
<cell> { $failedCnt } </cell>
8. Redefine the failed query variable to get a count for all users. Add this line after <rowset...>let $k...:
let $z := collection(AuditDB)//event[@component = "search" and @name = "QUERY"
and START_TIME[ . >= $startTime and . <= $endRange] and USER_NAME
and TOTAL_HITS = 0]
9. Add the total count cell to this second rowset, after <cell> { $queryCnt } </cell>:
<cell> { $failedCnt } </cell>
10. Save and run the report. The result is like the following:
Figure 18
If your query has a syntax error, you get a stack trace that identifies the line number of the error. You
can copy the text of your report into an XML editor that displays line numbers, for debugging.
If the query runs slowly, it will time out after about one minute. You can run the same query in
the xDB admin tool.
Troubleshooting reports
If you update Internet Explorer or turn on enforced security, reports no longer contain content. Open
Tools > Internet Options and choose the Security tab. Click Trusted sites and then click Sites. Add
the xPlore administrator URL to the Trusted sites list. Set the security level for the Trusted sites zone
by clicking Custom level. Reset the level to Medium-Low.
Chapter 13
Logging
This chapter contains the following topics:
Configuring logging
CPS logging
Configuring logging
Note: Logging can slow the system and consume disk space. In a production environment, run the
system with minimal logging.
Basic logging can be configured for each service in xPlore administrator. Log levels can be set for
indexing, search, CPS, xDB, and xPlore administrator. You can log individual packages within these
services, for example, the merging activity of xDB. Log levels are saved to indexserverconfig.xml and
are applied to all xPlore instances. xPlore uses slf4j (Simple Logging Facade for Java) to perform
logging.
To set logging for a service, choose System Overview in the left panel. Choose Global Configuration
and then choose the Logging Configuration tab to configure logging for all instances. You can open
one of the logging families like xDB and set levels on individual packages.
To customize the instance-level log setting, edit the logback.xml file in each xPlore instance. The
logback.xml file is located in the WEB-INF/classes directory of each deployed instance war file, for
example, xplore_home/jboss5.1.0/server/DctmServer_PrimaryDsearch/deploy/dsearch.war. Levels set
in logback.xml have precedence over log levels in xPlore administrator. Changes to logback.xml
take up to two minutes to take effect.
Each logger logs a package in xPlore or in your custom code. The logger has an appender that specifies
the log file name and location. DSEARCH is the default appender. Other appenders defined in the
primary instance logback configuration are XDB, CPS_DAEMON, and CPS.
You can add a logger and appender for a specific package in xPlore or in your custom code. The
following example adds a logger and appender for the package com.mycompany.customindexing:
<appender name="CUSTOM"
    class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>C:/xPlore/jboss5.1.0/server/DctmServer_PrimaryDsearch/logs/custom.log</file>
  <encoder>
    <pattern>%date %-5level %logger{20} [%thread] %msg%n</pattern>
    <charset>UTF-8</charset>
  </encoder>
  <rollingPolicy
      class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
    <maxIndex>100</maxIndex>
    <fileNamePattern>C:/xPlore/jboss5.1.0/server/DctmServer_PrimaryDsearch/logs/custom.log.%i</fileNamePattern>
  </rollingPolicy>
  <triggeringPolicy
      class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
    <maxFileSize>10MB</maxFileSize>
  </triggeringPolicy>
</appender>
<logger name="com.mycompany.customindexing" additivity="false" level="INFO">
  <appender-ref ref="CUSTOM"/>
</logger>
You can add your custom logger and appender to logback.xml. Add the logger to a logger family if you
want your log entries to go to one of the logs in xPlore administrator. This step is optional: if you do
not add your custom logger to a logger family, it still logs to the file that you specify in your appender.
Logger families are defined in indexserverconfig.xml. They are used to group logs in xPlore
administrator. You can set the log level for the family, or expand the family to set levels on individual
loggers.
The following log levels are available. Levels are shown in increasing severity and decreasing amounts
of information, so that TRACE displays more than DEBUG, which displays more than INFO.
TRACE
DEBUG
INFO
WARN
ERROR
Troubleshooting the index agent, page 85 provides information about the logging configuration for
the index agent.
Enabling logging in a client application, page 308 provides information about logging for xPlore
client APIs.
Tracing, page 304 indicates how to enable and configure tracing.
Viewing logs
You can view indexing, search, CPS, and xDB logs in xPlore administrator. Choose an instance in
the tree and click Logging. Indexing and search messages are logged to dsearch. Click the tab for
dsearch, cps, cps_daemon, or xdb to view the last part of the log. Click Download All Log Files
to get links for each log file.
Query logging
The xPlore search service logs queries. When you turn on query auditing (default is true), additional
information is saved to the audit record and is available in reports. Auditing queries, page 244 provides
more information about query logging.
For each query, the search service logs the following information for all log levels:
Start of query execution including the query statement
Total results processed
Total query time including query execution and result fetching
More query information is logged when native xPlore security (not Content Server security) is
enabled. When query auditing is enabled, you can filter for the following query types in the search
audit records report: interactive, subscription, warmup, test search, report, metrics, ftintegrity,
consistency checker, or all.
Set the log level in xPlore administrator. Open Services in the tree, expand and select Logging, and
click Configuration. You can set the log level independently for administration, indexing, search,
and default. Levels in decreasing amount of verbosity: TRACE, DEBUG, INFO, WARN (default),
and ERROR.
The log message has the following form:
2012-03-28 11:16:45,798 WARN [IndexWorkerThread-6]
c.e.d.c.f.i.core.index.plugin.XhivePlugin - Document id: 090023a380000202,
message: CPS Warning [Unknown error during text extraction(native code: 961,
native msg: access violation)].
To view a log, choose the instance and click Logging. The following examples from dsearch.log
show a query:
2012-03-28 12:19:02,664 INFO [RMI TCP Connection(9)-10.8.47.144]
c.e.d.c.fulltext.indexserver.search.SearchServer QueryID=PrimaryDsearch$6f35b53d-34b8-470d-b699-5b4364ef0815,
query-locale=en,query-string=let $j:= for $i score $s in /dmftdoc
[. ftcontains ASMAgentServer with stemming]
order by $s descending return <d> {$i/dmftmetadata//r_object_id}
{ $i/dmftmetadata//object_name } { $i/dmftmetadata//r_modifier } </d>
return subsequence($j,1,200) is running
...
2012-03-28 12:19:05,117 INFO [pool-14-thread-10]
c.e.d.c.f.i.admin.mbean.ESSAdminSearchManagement QueryID=PrimaryDsearch$6f35b53d-34b8-470d-b699-5b4364ef0815,
Result count=1,bytes count=187
CPS logging
CPS uses the xPlore slf4j logging framework. A CPS instance that is embedded in an xPlore instance
(installed with xPlore, not separately) uses the logback.xml file in WEB-INF/classes of the dsearch
web application. A standalone CPS instance uses logback.xml in the CPS web application, in the
WEB-INF/classes directory.
If you have installed more than one CPS instance on the same host, each instance has its own web
application and logback.xml file. To prevent one instance's log from overwriting another's, make sure
that each file appender in logback.xml points to a unique file path.
Chapter 14
Setting up a Customization Environment
This chapter contains the following topics:
Customization points
Tracing
Customization points
You can customize indexing and searching at several points in the xPlore stack. The following
information refers to customizations that are supported in a Documentum environment.
The following diagram shows indexing customization points.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
Figure 19
1. Using DFC, create a BOF module that pre-filters content before indexing. See Custom content
filters, page 83.
2. Create a TBO that injects data from outside a Documentum repository, either metadata or content.
You can use a similar TBO to join two or more Documentum objects that are related. See Injecting
data and supporting joins, page 80.
3. Create a custom routing class that routes content to a specific collection based on your enterprise
criteria. See Creating a custom routing class, page 151.
The following diagram shows query customization points.
Figure 20
1. Using WDK, modify Webtop search and results UI. See EMC Documentum Search Development
Guide.
2. Using DFS, implement StructuredQuery, which generates an XQuery expression. xPlore processes
the expression directly. See Building a query with the DFS search service, page 260.
3. Using DFC or DFS, create NOFTDQL queries or apply DQL hints (not recommended except for
special cases).
DQL is evaluated in the Content Server. Implement the DFC interface IDfQuery and the DFS query
service. FTDQL queries are passed to xPlore. Queries with the NOFTDQL hint or which do not
conform to FTDQL criteria are not passed to xPlore. See DQL Processing, page 226.
4. Using DFC, modify Webtop queries. Implement the DFC search service, which generates XQuery
expressions. xPlore processes the expression directly. See EMC Documentum Search Development
Guide.
5. Using DFC, create XQueries using IDfXQuery. See Building a DFC XQuery, page 261.
6. Create and customize facets to organize search results in xPlore. See the Facets chapter.
7. Target a specific collection in a query using DFC or DFS APIs. See Routing a query to a specific
collection, page 257.
8. Use xPlore APIs to create an XQuery for an XQuery client. See Building a query using xPlore
APIs, page 263.
4. Place your class in the indexagent.war WEB-INF/classes directory. Your subdirectory path under
WEB-INF/classes must match the fully qualified routing class name.
5. Restart the xPlore instances, starting with the primary instance.
Tracing
xPlore tracing provides configuration settings for various formats of tracing information. You can
trace individual threads or methods. Use the file dsearchclientfull.properties, which is located
in the conf directory of the SDK. The configuration parameters are described within the file
dsearchclientfull.properties.
The xPlore classes are instrumented using AspectJ (tracing aspect). When tracing is enabled and
initialized, the tracing facility uses log4j API to log the tracing information.
Enabling tracing
Enable or disable tracing in xPlore administrator: Expand an instance and choose Tracing. Tracing
does not require a restart. Tracing files named ESSTrace.XXX.log are written to the Java IO temp
directory (where XXX is a timestamp generated by the tracing mechanism).
The tracing facility checks for the existence of a log4j logger and appender in the log4j.properties
file. When a logger and appender are not found, xPlore creates a logger named
com.emc.core.fulltext.utils.trace.IndexServerTrace.
When you enable tracing, a detailed Java method call stack is logged in one file. From that file, you
can identify the methods that are called, with parameters and return values.
Configuring tracing
You can configure the name, location, and format of the log file for the logger and its appender in
indexserverconfig.xml or in the log4j.properties file. The log4j configuration takes precedence. You
can configure tracing for specific classes and methods. A sample log4j.properties file is in the SDK
conf directory. The following example in log4j.properties debugs a specific package:
log4j.logger.com.emc.documentum.core.fulltext.client.common = DEBUG
The following tracing configuration parameters are available:
tracing enable
tracing mode
tracing verbosity: standard | verbose
output dir
output file-prefix
output max-file-size
output max-backup-index
output file-creation-mode: single-file | file-per-thread
print-exception-stack
max-stack-depth
tracing-filters/method-name*
tracing-filters/thread-name
date-output format
date-output column-width: positive integer
date-output timing-style: nanoseconds | milliseconds | milliseconds_from_start | seconds | date; default: milliseconds
* The method-name filter identifies any combinations of packages, classes, and methods to trace. The
property value is one or more string expressions that identify what is traced. Syntax with asterisk as
wild card:
([qualified_classname_segment][*]|*).[.[method_name_segment][*]0]
Key:
[method-duration] Appears only if tracing-config/tracing[@mode="compact"].
[entry_exit_designation] One of the following:
In the following snippet, a CPS worker thread processes the same document:
1263340387580[CPSWorkerThread-1] [ENTER]
....com.emc.documentum.core.fulltext.indexserver.cps.CPSElement@1f1df6b.<init>(
"dmftcontentref","
file:///C:/DOCUME~1/ADMINI~1.EMC/LOCALS~1/Temp/3/In this VM.txt",true,
[Lcom.emc.documentum.core.fulltext.indexserver.cps.CPSOperation;@82efed)
A search on the string "FileZilla" in the document renders this XQuery expression, source repository,
language, and collection in the executeQuery method:
1263340474627 [http-0.0.0.0-9300-1] [ENTER]
.com.emc.documentum.core.fulltext.indexserver.admin.controller.
ESSAdminWebService@91176d.executeQuery("for $i in /dmftdoc[. ftcontains
FileZilla] return
<d> {$i/dmftmetadata//r_object_id} { $i//object_name } { $i//r_modifier } </d>","
DSS_LH1","en","superhot")
You can find all trace statements for the document being indexed by searching on the dmftkey
value. In this example, you search for "In this VM_txt1263340384408" in the trace log. You
can find all trace statements for the query ID. In this example, you search for the query ID
"PrimaryDsearch$ba06863d-7713-4e0e-8569-2071cff78f71" in the trace log.
Figure 21
Chapter 15
Performance and Disk Space
This chapter contains the following topics:
Memory consumption
Measuring performance
Indexing
Indexing performance
Throttling indexing
Search performance
Figure 22
Multiple collections increase the throughput for ingestion. You can create a collection, ingest
documents to it, and then move it to be a subcollection of a parent collection. (See Moving a collection,
page 162.) Fewer collections speed up search.
Use the rough guidelines in the following diagram to help you plan scaling of search. The order of
adding resources is the same as for ingestion scaling.
Figure 24
Disk space use by component:
xDB: dftxml representation of document content and metadata, metrics, audit, and document ACLs
and groups.
Lucene: stores transaction information; sometimes provides a snapshot during retrieval.
Lucene temporary working area: none for search; uncommitted data is stored to the log. Allocate
twice the final index size for merges.
2. Perform a query to return 1000 documents in each format. Specify the average size range, that is,
r_full_content_size greater than (average minus some value) and less than (average plus some value).
Make the plus/minus value a small percentage of the average size. For example:
select r_object_id, r_full_content_size from dm_sysobject
where r_full_content_size > (1792855 - 1000) and
r_full_content_size < (1792855 + 1000) and
a_content_type = 'zip' enable (return_top 1000)
3. Export these documents and index them into a new, clean xPlore installation.
4. Determine the size on disk of the dbfile and lucene-index directories in xplore_home/data.
5. Extrapolate to your production size.
For example, you have ten indexable formats with an average size of 270 KB from a repository
containing 50000 documents. The Content Server footprint is approximately 12 GB. You get a sample
of 1000 documents of each format in the range of 190 to 210 KB. After export and indexing, these
10000 documents have an indexed footprint of 286 MB. Your representative sample was 20% of the
indexable content. Thus your calculated index footprint is 5 x sample_footprint=1.43 GB (dbfile 873
MB, lucene-index 593 MB).
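The extrapolation in this example is simple arithmetic and can be sketched as follows (the helper name is illustrative, not an xPlore API):

```python
def extrapolate_index_footprint(sample_footprint_mb, sample_fraction):
    """Scale a measured sample index footprint up to the full repository."""
    return sample_footprint_mb / sample_fraction

# The sample (10,000 documents) was 20% of the indexable content
# and produced a 286 MB indexed footprint.
full_mb = extrapolate_index_footprint(286, 0.20)
print(round(full_mb / 1000, 2))  # 1.43 (GB), matching the example
```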
Adding storage
The data store locations for xDB libraries are configurable. The xDB data stores and indexes can
reside on a separate data store, SAN or NAS. Configure the storage location for a collection in xPlore
administrator. You can also add new storage locations through xPlore administrator. See Changing
collection properties, page 161.
Storage options compared (functions: used for Content Server content, network, performance, high
availability, xPlore multi-instance):
SAN: used for content: common. Network: Fiber. Performance: best. High availability: requires
cluster technology. xPlore multi-instance: requires network shared drives.
NAS: used for content: common. Network: Ethernet. Performance: slower than SAN, improved with
10GE. High availability: provides shared drives for server takeover. xPlore multi-instance: drives
already shared.
Local disk: used for content: common. Network: local. High availability: requires a complete dual
system. xPlore multi-instance: requires network shared drives.
iSCSI: used for content: rare. Network: Ethernet. Performance: slower than SAN, improved with
10GE. High availability: requires cluster technology. xPlore multi-instance: requires network
shared drives.
CFS: used for content: rare. Network: Fiber. Performance: almost as fast as SAN. High availability:
provides shared drives for server takeover. xPlore multi-instance: drives already shared.
Memory consumption
Following are ballpark estimates for memory consumption by the various components in an xPlore
installation.
Table 40
Index agent: 4 GB RAM
CPS daemon: 2 GB RAM
For best performance, add index agent processes and CPS instances on hosts other than the xPlore host.
Measuring performance
The following metrics are recorded in the metrics database. View statistics in xPlore administrator
to help identify specific performance problems. Select an xPlore instance and then choose Indexing
Service or Search Service to see the metric. Some metrics are available through reports, such as
document processing errors, content too large, and ingestion rate.
Table 41 lists metrics by service (columns: Metric, Service, Problem). Metrics are recorded for the
Indexing Service (including formats and languages) and for the Search Service.
To get a detailed message and count of errors, use the following XQuery in xPlore administrator:
for $i in collection("/SystemData/MetricsDB/PrimaryDsearch")
/metrics/record/Ingest[TypeOfRec="Ingest"]/Errors/ErrorItem
return $i
To get the total number of errors, use the following XQuery in xPlore administrator:
sum(for $i in collection("/SystemData/MetricsDB/PrimaryDsearch")
/metrics/record/Ingest[TypeOfRec="Ingest"]/Errors/ErrorItem/ErrorCnt return $i)
xPlore caches
The query result cache is a temporary cache that buffers results. Using xPlore administrator, change
the value of query-result-cache-size in the search service configuration and restart the search service.
Using compression
Indexes can be compressed to enhance performance. Compression uses more I/O memory. The
compress element in indexserverconfig.xml specifies which elements in the ingested document have
content compression to save storage space. Compressed content is about 30% of submitted XML
content. Compression can slow the ingestion rate by 10-20% when I/O capacity is constrained. See
Configuring text extraction, page 137.
If ingestion starts fast and gets progressively slower, set compress to false for subpath indexes in
indexserverconfig.xml. Modifying indexserverconfig.xml, page 43 describes how to view and update
this file.
Types of merges
To improve indexing performance, the Lucene index database is split into small chunks called segments
that contain one or more indexed documents. Lucene adds segments as new documents are added to
the index. However, to improve query performance and save disk space, you can reduce the number of
segments by merging smaller segments into larger ones and ultimately into a single segment.
There are three levels of index merges that you can fine-tune to achieve an optimal and balanced level
of indexing and query performance.
Non-final merges
In addition to Lucene internal merges, you can configure the system to run non-final merges that merge
segments under a specified size (determined by the nonFinalMaxMergeSize property) into a fresh, new
index at a regular interval (determined by the cleanMergeInterval property).
Final merges
Perform final merges to merge all existing Lucene index entries into a single large Lucene index entry
to maximize query performance. You can manually run a final merge on a collection or schedule final
merges to run at regular intervals. The final merge is very I/O intensive and may cause noticeable
performance drop in both indexing and query performance when running, so you should closely
monitor and carefully schedule the running of final merges and avoid them during performance-critical
hours. You can use the Audit Records for Final Merge report to view detailed final merge log data to
quickly identify performance issues associated with final merges.
During a final merge, the system shrinks existing segments to empty by moving and consolidating
Lucene index entries from them into an empty segment. When an empty segment is not available, the
system does not shrink existing segments and waits for the next final merge to run, which consumes
more disk space. Under such circumstances, you can manually launch a final merge to accelerate
the merging process and free more disk space.
The final merge can require up to two times the size of the final index entries to move things around
during the interim process. For example, you can see disk space usage at 100G at one point but 300G
at another point when a final merge is in progress.
You can define a blackout period as a range of hours during which scheduled final merges do not
start. For example, if you set the value to 8-20, the blackout period is from 8 a.m. to 8 p.m., and
scheduled final merges will not start during this period each day.
Blackout periods do not stop an already running final merge process. A scheduled final merge started
before the blackout start hour will continue to run into or even past defined final merge blackout
periods without being affected. Also, manually started final merges ignore any blackout periods.
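The blackout test described above amounts to a simple hour-range check. The following sketch is illustrative only (xPlore implements this internally); it assumes the setting is a start-end pair of hours:

```python
def in_blackout(hour, start=8, end=20):
    """True if a scheduled final merge must not start at this hour.

    With start=8 and end=20 (the "8-20" example), scheduled merges
    are suppressed from 8 a.m. up to 8 p.m. each day.
    """
    if start <= end:
        return start <= hour < end
    # A range such as 22-6 wraps around midnight.
    return hour >= start or hour < end

print(in_blackout(9))   # True: inside the 8 a.m. to 8 p.m. window
print(in_blackout(21))  # False: scheduled merges may start
```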
If you find that a final merge in progress is causing severe performance drops on your system (slowing
ingestion and queries, or preventing backup), you can stop it immediately.
When a collection is merging, a Merging icon is displayed next to the collection in the Data
Management view of the collection. To stop a final merge manually, click Stop Merging.
You can also start and stop a final merge using CLI commands. See Final merge CLIs, page 192.
Indexing
Instance-specific scheduler
If you select this option, you can adjust the interval by specifying the
xdb.lucene.finalMergingInterval value in xdb.properties.
Collection-specific scheduler
Choose one of the following schedule formats and enter your values:
Fixed interval
Daily
Weekly
Advanced
If you define a schedule in advanced format that equates to one of the simpler formats listed
above, the simpler format will be selected instead after you save the settings. For example, if
you enter 6 in the Day of week field and save the settings, you will see Weekly option selected
with Sat checked when you review the settings.
4. Click Save. The scheduler is effective immediately.
Note: For both schedulers, if a scheduled time falls into a blackout period, the final merge will not start.
Indexing
Documentum index agent performance
Index agent settings
The parameters described in this section can affect index agent performance. Do not change these
values unless you are directed to change them by EMC technical support.
In migration mode, set the parameters in the indexagent.xml located in
index_agent_WAR/WEB-INF/classes/.
In normal mode, also set the corresponding parameters in the dm_ftindex_agent_config object.
In normal mode, index agent configuration is loaded from indexagent.xml and from the
dm_ftindex_agent_config object. If there is a conflict, the settings in the config object override the
settings in indexagent.xml.
exporter.thread_count (indexagent.xml) / exporter_thread_count (dm_ftindex_agent_config)
Number of threads that extract metadata into dftxml using DFC.
connectors.file_connector.batch_size (indexagent.xml) / connectors_batch_size
(dm_ftindex_agent_config)
Number of items picked up for indexing when the index agent queries the repository for queue items.
exporter.queue_size (indexagent.xml) / exporter_queue_threshold (dm_ftindex_agent_config)
Internal queue of objects submitted for indexing.
indexer.queue_size (indexagent.xml) / indexer_queue_threshold (dm_ftindex_agent_config)
Queue of objects submitted for indexing.
indexer.callback_queue_size (only in indexagent.xml, used for both migration and normal mode)
Size of the queue that holds requests sent to xPlore for indexing. When the queue reaches this size, the
index agent waits until the queue has drained to 100 percent of capacity less callback_queue_low_percent.
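As a sketch of the arithmetic (the helper is illustrative; the parameter names follow the text): with a callback_queue_size of 1000 and a callback_queue_low_percent of 10, the index agent pauses at 1000 queued requests and resumes once the queue drains to 900:

```python
def resume_level(callback_queue_size, callback_queue_low_percent):
    """Queue depth at which the index agent resumes submitting requests:
    100 percent of capacity less callback_queue_low_percent."""
    return callback_queue_size * (100 - callback_queue_low_percent) // 100

print(resume_level(1000, 10))  # 900
```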
Indexing performance
Various factors affect the rate of indexing. You can tune some indexing and xDB parameters and
adjust allowable document size.
Throttling indexing
If your environment has periodic, frequent bursts of document updates that slow the system, you
can throttle ingestion based on document count, document size, or both. The throttle mechanism
is disabled by default.
Stop all xPlore instances and edit indexserverconfig.xml on the primary instance. Add the following
properties to the index-config element:
enable-throttle: Set to true to enable the throttle mechanism.
throttle-interval: Time in seconds to allow content up to throttle-threshold size.
throttle-threshold: Sets the total content size in KB that xPlore can process during the
throttle-interval. Content above this size is rejected for the remainder of the interval.
throttle-document-count: Sets the number of documents to process during the throttle-interval.
To throttle by document size only, set throttle-document-count to a high value. To throttle by document
count only, set throttle-threshold to a high value.
Documents that exceed throttle-threshold or throttle-document-count will be processed after the
throttle-interval.
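The exact XML shape of these properties is not shown above. As an illustrative sketch only, assuming the index-config element takes name/value property entries, the result might look like this (values are examples):

```xml
<index-config>
  <properties>
    <!-- enable throttling: at most 100 MB (102400 KB) and 5000 documents
         per 60-second interval; names follow the text above -->
    <property name="enable-throttle" value="true"/>
    <property name="throttle-interval" value="60"/>
    <property name="throttle-threshold" value="102400"/>
    <property name="throttle-document-count" value="5000"/>
  </properties>
</index-config>
```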
Search performance
Creating subcollections
You can ingest documents to a new collection and then move the collection to become a subcollection
for faster search performance. For moving subcollections, see Moving a collection, page 162.
If you have many ACLs, increase the value of acl-cache-size (number of permission sets in the cache).
For information on targeted queries, see Routing a query to a specific collection, page 257. To set
parallel mode for DFC-based search applications, set the following property in dfc.properties to true:
dfc.search.xquery.option.parallel_execution.enable = true
The query execution plan is recorded in dsearch.log, which is located in the logs subdirectory of the
JBoss deployment directory.
cut_off_text: Set to true to cut off the text of large documents that exceed max_text_threshold instead
of rejecting the entire document. Default: false. Documents that are partially indexed are recorded
in cps_daemon.log: docxxxx is partially processed. The dftxml is annotated with the element
partiallyIndexed.
daemon_count: Specifies the number of daemons that handle normal indexing and query requests
(not a dedicated query daemon). Set from 1 to 8. Default: 1. For information about adding CPS
daemons, see Adding CPS daemons for ingestion or query processing, page 114.
daemon_restart_threshold: Specifies how many requests a CPS daemon can handle before it restarts.
Default: 1000.
daemon_restart_memory_threshold: Set a value in bytes for maximum CPS memory consumption.
After this limit, CPS will restart. A maximum of 8 GB is recommended. Default: 4000000000.
daemon_restart_consistently: Specifies whether CPS should restart regularly after it is idle for 5
minutes. Default: true.
dump_context_if_exception: Specifies whether to dump stack trace if exception occurs. Default:
true.
failure_document_id_file: The file that contains IDs of failed documents to
be skipped. You can edit this file. IDs of failed documents are added to it
automatically if add_failure_documents_automatically is set to true. Default:
xplore_home/cps/skip_failure_document.txt.
io_block_unit: Logical block unit of the read/write target device. Default: 4096.
io_chunk_size: Size for each read/write chunk. Default: 4096.
linguistic_processing_time_out: Interval in seconds after which a CPS hang in linguistic processing
forces a restart. Valid values: 60 to 360. Default: 360.
load_content_directly: For internal use only.
query_dedicated_daemon_count: The number of CPS daemons dedicated to query processing.
Other CPS daemons handle ingestion when there is a dedicated query daemon. Valid values: 0
to 3. Default: 1.
retry_failure_in_separate_daemon: Specifies whether to retry failed documents in a newly spawned
CPS daemon. Default: true. A retry daemon is not limited by the value of daemon_count.
skip_failure_documents: Specifies whether CPS should skip documents that fail processing instead
of retrying them, to reduce CPS crashes. Default: true. Failed documents are retried once unless
this property is set to true.
skip_failure_documents_upper_bound: Specifies the maximum number of failed documents that
CPS will record in the failure document. Valid values: integers. Default: -1 (no upper bound)
text_extraction_time_out: Interval in seconds after which a CPS hang in text extraction forces a
restart. Valid values: 60 to 300. Default: 300.
use_direct_io: Requires CPS to read and write staging files to devices directly. Default: false. If
most incoming files are local, use the default caching. If most files are remote, use direct IO.
You can also limit query results set size, which is 12000 results by default. This default value supports
facets. If your client application does not support facets, you can lower the result set size. Open
xdb.properties, which is located in the directory WEB-INF/classes of the primary instance. Set the
value of queryResultsWindowSize to a number smaller than 12000.
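For example, to lower the window to 6000 results (assuming the property is set as a plain key=value line in xdb.properties; the value is illustrative):

```
queryResultsWindowSize=6000
```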
Use the following DQL query to determine the number of documents modified and accessed in
the past two years (change DQL to meet your requirements):
select count(*) from dm_sysobject where
datediff(year, r_creation_date, r_access_date) < 2 and
datediff(year, r_creation_date, r_modify_date) < 2
2. Use the following DQL query to determine the number of documents in the repository:
select count(*) from dm_sysobject
3. Divide the results of step 1 by the results of step 2. If the number is high, for example, 0.8, most
documents (80 percent in this example) were modified and accessed in the past two years.
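The division in the last step can be sketched as follows (the helper name is illustrative):

```python
def recent_activity_ratio(recent_count, total_count):
    """Fraction of documents modified and accessed in the past two years."""
    return recent_count / total_count

# With 40,000 recently active documents out of 50,000 total:
print(recent_activity_ratio(40000, 50000))  # 0.8, that is, 80%
```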
Appendix A
Index Agent, CPS, Indexing, and
Search Parameters
This appendix covers the following topics:
dm_ftengine_config
API Reference
dm_ftengine_config
Attributes
Following are attributes specific to dm_ftengine_config. Some attribute values are set by the index
agent when it creates the dm_ftengine_config object.
Note: Not all attribute values are set at object creation. If you do not set values, the default
values are used. For instructions on changing the attribute values, see Query plugin configuration
(dm_ftengine_config), page 222.
For iAPI syntax to change attributes, see Query plugin configuration (dm_ftengine_config), page 222.
Attribute: Description
acl_check_db
acl_domain
acl_name: dm_fulltext_admin_acl
default_fuzzy_search_similarity
dsearch_config_host
dsearch_config_port
dsearch_domain: Name of the repository
dsearch_override_locale
dsearch_qrserver_host
dsearch_qrserver_port
dsearch_qrserver_protocol
dsearch_qrserver_target
dsearch_qrygen_mode
dsearch_result_batch_size
filter_config_id
folder_cache_limit
ft_collection_id
ftsearch_security_mode
fuzzy_search_enable
group_name: dm_fulltext_admin
object_name
query_plugin_mapping_file: Path on the Content Server host to the mapping file. This file maps
attribute conditions to XQuery subpaths.
query_timeout
security_mode
thesaurus_search_enable
use_thesaurus_on_phrase
Parameter name: Description
acl_exclusion_list
acl_attributes_exclude_list
collection
dsearch_qrserver_host
dsearch_qrserver_port
dsearch_domain: Repository name
group_exclusion_list
group_attributes_exclude_list
index_type_mode
max_requests_in_batch
max_batch_wait_msec
max_pending_requests
max_tries
group_attributes_exclude_list
Parameter
Description
queue_size
queue_low_percent
callback_queue_size
callback_queue_low_percent
wait_time
thread_count
shutdown_timeout
runaway_timeout
content_clean_interval
partition_config
Parameter
Description
contentSizeLimit
index-executor-queue-size: Maximum size of index queue before spawning a new worker thread.
Default: 10.
index-executor-retry-wait-time: Wait time in milliseconds after index queue and worker thread
maximums have been reached. Default: 1000.
status-requests-batch-size: Maximum number of status update requests in a batch. Default: 1000.
status-thread-wait-time: Maximum wait time in milliseconds to accumulate requests in a batch.
Default: 1000.
index-check-duplicate-at-ingestion: Set to true to check for duplicate documents. May slow
ingestion. Default: true.
enable-subcollection-ftindex: Set to true to create a multi-path index to search on specific
subcollections. Ingestion is slower, especially when you have multiple layers of subcollections. If
false, subcollection indexes are not rebuilt when you rebuild a collection index. Default: false.
rebuild-index-batch-size: Sets the number of documents to be reindexed. Default: 1000.
rebuild-index-embed-content-limit: Sets the maximum size of embedded content for language
detection in index rebuild. Larger content is streamed. Default: 2048.
If the results are larger than Result buffer threshold, they are saved in this path. This setting
does not apply to remote CPS instances, because the processing results are always embedded in
the return to xPlore.
Result buffer size threshold: Number of bytes at which the result buffer writes results to a file.
Valid values: 8 - 16 MB. Default: 1 MB (1048576 bytes). A larger value can accelerate processing but
can cause more instability.
Processing buffer size threshold: Specifies the number of bytes of the internal memory chunk used
to process small documents.
If this threshold is exceeded, a temporary file is created for processing. Valid values: 100 KB-10 MB.
Default: 2 MB (2097152 bytes). Increase the value to speed processing. Consumes more memory.
Load file to memory: Check to load the submitted file into memory for processing. Uncheck to pass
the file to a plug-in analyzer for processing (for example, the Documentum index agent).
Batch in batch count: Average number of batched requests within a batch request.
Range: 1-100. Default: 5. CPS assigns connection pool threads for each
batch_in_batch count. For example, defaults of batch_in_batch of 5 and connection_pool_size of 5
result in 25 threads.
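The thread arithmetic can be sketched as follows (the helper is illustrative; the names mirror the parameters above):

```python
def cps_connection_threads(batch_in_batch, connection_pool_size):
    """Total connection pool threads CPS allocates:
    connection_pool_size threads for each batch_in_batch count."""
    return batch_in_batch * connection_pool_size

print(cps_connection_threads(5, 5))  # 25, matching the defaults example
```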
Thread pool size: Number of threads used to process a single incoming request such as text
extraction and linguistic processing.
Range: 1-100. Default: 10. A larger size can speed ingestion when the CPU is not under heavy load
but causes instability at heavy CPU load.
System language: ISO 639-1 language code that specifies the language for CPS.
Max text threshold: Sets the size limit, in bytes, for the text within documents. Range: 5 MB - 2 GB,
expressed in bytes. Default: 10485760 (10 MB). Above this size, only the document metadata is
tokenized. The size evaluation includes expanded attachments; for example, if an email has a zip
attachment, the zip file is expanded to evaluate document size. Larger values can slow the ingestion
rate and cause more instability, and ingestion performance can degrade under heavy load.
Illegal char file: Specifies the URI of a file that defines illegal characters.
xPlore replaces illegal characters with white space to create token separators. This list is
configurable.
Request time out: Number of seconds before a single request times out.
Range: 60-3600. Default: 600.
Daemon standalone: Check to stop daemon if no manager connects to it. Default: unchecked.
IP version: Internet Protocol version of the host machine. Values: IPv4 or IPv6. Dual stack is not
supported.
Use express queue: This queue processes admin requests and query requests. Queries are processed
for language identification, lemmatization, and tokenization. The express queue has priority over the
regular queue. Set the maximum number of requests in the queue. Default: 128.
The regular queue processes indexing requests. Set the maximum number of requests in the queue.
Default: 1024.
When the token count is zero and the extracted text is larger than the configured threshold, a
warning is logged.
EMC Documentum xPlore Version 1.3 Administration and Development Guide
You can configure the following additional parameters in the CPS configuration file
PrimaryDsearch_local_configuration.xml, which is located in the CPS instance directory
xplore_home/dsearch/cps/cps_daemon. If these properties are not in the file, you can add them. These
settings apply to all CPS instances.
detect_data_len: The number of bytes used for language identification. The bytes are analyzed
from the beginning of the file. A larger number slows the ingestion process. A smaller number
increases the risk of language misidentification. Default: 65536.
max_batch_size: Limit for the number of requests in a batch. Valid values: 2 - 65535 (default:
65535).
Note: The index agent also has batch size parameters.
max_data_per_process: The upper limit in bytes for a batch of documents in CPS processing.
Default: 30 MB. Maximum setting: 2 GB.
normalize_form: Set to true to remove accents in the index, which allows search for the same
word without the accent.
slim_buffer_size_threshold: Sets memory buffer for CPS temporary files. Increase to 16384 or
larger for CenterStage or other client applications that have a high volume of metadata.
temp_directory: Directory for CPS temporary files. Default:
xplore_home/dsearch/cps/cps_daemon/temp.
temp_file_folder: Directory for temporary format and language identification. Default:
xplore_home/dsearch/cps/cps_daemon/temp.
daemon_restart_memory_threshold: Maximum memory consumption at which CPS is restarted.
use_direct_io: Requires CPS to read and write to devices directly.
io_block_unit: Logical block unit of the read/write target device.
io_chunk_size: Size for each read/write chunk.
cut_off_text: Set to true to truncate the text of large documents that exceed max_text_threshold
instead of rejecting the entire document.
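As an illustration only, the following sketch shows one way these properties might appear in the configuration file. The element and attribute names here are assumptions for illustration, not confirmed syntax; copy the structure of an existing entry in your PrimaryDsearch_local_configuration.xml rather than this fragment.

```xml
<!-- Hypothetical sketch only: the real file's element names and nesting
     may differ. Verify against existing entries before editing. -->
<cps_daemon_config>
  <properties>
    <property name="detect_data_len" value="65536"/>
    <property name="max_batch_size" value="65535"/>
    <property name="normalize_form" value="true"/>
    <property name="temp_directory"
              value="xplore_home/dsearch/cps/cps_daemon/temp"/>
    <property name="cut_off_text" value="true"/>
  </properties>
</cps_daemon_config>
```

Because these settings apply to all CPS instances, restart the CPS daemon after editing so the changes take effect.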
API Reference
CPS APIs
Content processing service APIs are available in the interface IFtAdminCPS in the package
com.emc.documentum.core.fulltext.client.admin.api.interfaces. This package is in the SDK jar file
dsearchadmin-api.jar.
To add a CPS instance using the API addCPS(String instanceName, URL url, String usage), the
following values are valid for usage: all, index, or search. If the instance is used for CPS alone, use
index. For example:
addCPS("primary", "http://1.2.3.4/services", "index")
Search APIs
Search service APIs are available in the following packages of the SDK jar file dsearchadmin-api.jar:
IFtAdminSearch in the package com.emc.documentum.core.fulltext.client.admin.api.interfaces
IFtSearchSession in the package com.emc.documentum.core.fulltext.client.search
IFtQueryOptions in the package com.emc.documentum.core.fulltext.common.search
Auditing APIs
Auditing APIs are available in the interface IFtAdminAudit in the package
com.emc.documentum.core.fulltext.client.admin.api.interfaces. This package is in the SDK jar file
dsearchadmin-api.jar.
Appendix B
Documentum DTDs
To view the dftxml representation of a document that has been indexed, open xPlore administrator
and click the document in the collection view.
To find the path of a specific attribute in dftxml, use a Documentum client to look up the object ID of a
custom object. Using xPlore administrator, open the target collection and paste the object ID into the
Filter word box. Click the resulting document to see the dftxml representation.
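Once you know the dftxml path of an attribute, you can query it directly. The following is a hedged XQuery sketch; the collection path /XX/dsearch/Data, the object ID, and the attribute chosen are placeholders for illustration, not values from your system.

```xquery
for $d in collection("/XX/dsearch/Data")/dmftdoc
where $d/dmftinternal/r_object_id = "090a0d6880008848"
return $d/dmftmetadata//r_creator_name/string()
```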
DTD
This DTD is subject to change. Table B.46 lists the top-level elements under dmftdoc:
dmftkey
dmftmetadata
dmftvstamp
dmftsecurity
dmftinternal
dmftversions
dmftfolders
dmftcontents
dmftcustom
dmftsearchinternals
<i_branch_cnt dmfttype="dmint">0</i_branch_cnt>
<i_direct_dsc dmfttype="dmbool">false</i_direct_dsc>
<r_immutable_flag dmfttype="dmbool">false</r_immutable_flag>
<r_frozen_flag dmfttype="dmbool">false</r_frozen_flag>
<r_has_events dmfttype="dmbool">false</r_has_events>
<acl_domain dmfttype="dmstring">Administrator</acl_domain>
<acl_name dmfttype="dmstring">dm_450a0d6880000101</acl_name>
<i_is_reference dmfttype="dmbool">false</i_is_reference>
<r_creator_name dmfttype="dmstring">Administrator</r_creator_name>
<r_is_public dmfttype="dmbool">true</r_is_public>
<r_policy_id dmfttype="dmid">0000000000000000</r_policy_id>
<r_resume_state dmfttype="dmint">0</r_resume_state>
<r_current_state dmfttype="dmint">0</r_current_state>
<r_alias_set_id dmfttype="dmid">0000000000000000</r_alias_set_id>
<a_is_template dmfttype="dmbool">false</a_is_template>
<r_full_content_size dmfttype="dmdouble">130524</r_full_content_size>
<a_is_signed dmfttype="dmbool">false</a_is_signed>
<a_last_review_date dmfttype="dmdate"/>
<i_retain_until dmfttype="dmdate"/>
<i_partition dmfttype="dmint">0</i_partition>
<i_is_replica dmfttype="dmbool">false</i_is_replica>
<i_vstamp dmfttype="dmint">0</i_vstamp>
<webpublish dmfttype="dmbool">false</webpublish>
</dm_sysobject>
</dmftmetadata>
<dmftvstamp>
<i_vstamp dmfttype="dmint">0</i_vstamp>
</dmftvstamp>
<dmftsecurity>
<acl_name dmfttype="dmstring">dm_450a0d6880000101</acl_name>
<acl_domain dmfttype="dmstring">Administrator</acl_domain>
<ispublic dmfttype="dmbool">true</ispublic>
</dmftsecurity>
<dmftinternal>
<docbase_id dmfttype="dmstring">658792</docbase_id>
<server_config_name dmfttype="dmstring">DSS_LH1</server_config_name>
<contentid dmfttype="dmid">060a0d688000ec61</contentid>
<r_object_id dmfttype="dmid">090a0d6880008848</r_object_id>
<r_object_type dmfttype="dmstring">techpubs</r_object_type>
<i_all_types dmfttype="dmid">030a0d68800001d7</i_all_types>
<i_all_types dmfttype="dmid">030a0d6880000129</i_all_types>
<i_all_types dmfttype="dmid">030a0d6880000105</i_all_types>
<i_dftxml_schema_version dmfttype="dmstring">5.3</i_dftxml_schema_version>
</dmftinternal>
<dmftversions>
<r_version_label dmfttype="dmstring">1.0</r_version_label>
<r_version_label dmfttype="dmstring">CURRENT</r_version_label>
<iscurrent dmfttype="dmbool">true</iscurrent>
</dmftversions>
<dmftfolders>
<i_folder_id dmfttype="dmid">0c0a0d6880000105</i_folder_id>
</dmftfolders>
<dmftcontents>
<dmftcontent>
<dmftcontentattrs>
<r_object_id dmfttype="dmid">060a0d688000ec61</r_object_id>
<page dmfttype="dmint">0</page>
<i_full_format dmfttype="dmstring">crtext</i_full_format>
</dmftcontentattrs>
<dmftcontentref content-type="text/plain" islocalcopy="true" lang="en"
encoding="US-ASCII" summary_tokens="dmftsummarytokens_0">
<![CDATA[...]]>
</dmftcontentref>
</dmftcontent>
</dmftcontents>
<dmftdsearchinternals dss_tokens="excluded">
<dmftstaticsummarytext dss_tokens="excluded"><![CDATA[mylog.txt ]]>
</dmftstaticsummarytext>
<dmftsummarytokens_0 dss_tokens="excluded"><![CDATA[1Tkns ...]]>
</dmftsummarytokens_0></dmftdsearchinternals></dmftdoc>
Note: The attribute islocalcopy indicates whether the content was indexed. If true, only the metadata
was indexed, and no copy of the content exists in the index.
Appendix C
XQuery and VQL Reference
For example:
for $i in collection("dsearch/SystemInfo")
return count($i//trackinginfo/document)
Get object count in library. For example:
count(//trackinginfo/document[library-path="<LibraryPath>"])
Find documents
Find collection in which a document is indexed
//trackinginfo/document[@id="<DocumentId>"]/collection-name/string(.)
For example:
for $i in collection("dsearch/SystemInfo")
where $i//trackinginfo/document[@id="TestCustomType_txt1276106246060"]
return $i//trackinginfo/document/collection-name
Status information
Get operations and status information for a document
//trackinginfo/operation[@doc-id="<DocumentId>"]
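The same tracking elements support other lookups. For example, the following sketch, composed only from the elements shown above, returns the collection name for each document in a given library:

```xquery
for $d in collection("dsearch/SystemInfo")//trackinginfo/document
where $d/library-path = "<LibraryPath>"
return $d/collection-name/string()
```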
The following entries map DQL full-text operators to their XQuery equivalents.

DQL: IN
XQuery:
for $i in collection("/XX/dsearch/Data")/dmftdoc[
(dmftcontents/dmftcontent ftcontains (test1))]

DQL: NEAR/N
XQuery:
for $i in collection("/XX/dsearch/Data")/dmftdoc[
(dmftcontents/dmftcontent ftcontains (test1 ftand test2
distance exactly N words))]

DQL: ORDERED
XQuery:
for $i in collection("/XX/dsearch/Data")/dmftdoc[
(dmftcontents/dmftcontent ftcontains (test1 ftand test2 ordered))]

DQL: ENDS, STARTS
XQuery (shown for STARTS):
for $i in collection("/XX/dsearch/Data")/dmftdoc[
(dmftcontents/dmftcontent ftcontains (test1)) and
starts-with(dmftinternal/r_object_type, "dm_docu")]
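The bracketed expressions above are predicate fragments; a complete query also needs a return clause. For example, a sketch for the IN case that returns matching object IDs (the collection path /XX/dsearch/Data is the same placeholder used above):

```xquery
for $i in collection("/XX/dsearch/Data")/dmftdoc[
  dmftcontents/dmftcontent ftcontains ("test1")]
return $i/dmftinternal/r_object_id/string()
```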
xPlore Glossary

category: A category defines a class of documents and their XML structure.

collection: A collection is a logical group of XML documents that is physically stored in an xDB library. A collection represents the most granular data management unit within xPlore.

content processing service (CPS): The content processing service (CPS) retrieves indexable content from content sources and determines the document format and primary language. CPS parses the content into index tokens that xPlore can process into full-text indexes.

domain: A domain is a separate, independent group of collections within an xPlore deployment.

DQL: Documentum Query Language, used by many Content Server clients.

FTDQL: Full-text Documentum Query Language.

ftintegrity: A standalone Java program that checks index integrity against Content Server repository documents. The ftintegrity script calls the state of index job in the Content Server.

full-text index: Index structure that tracks terms and their occurrence in a document.

index agent: Documentum application that receives indexing requests from the Content Server. The agent prepares and submits an XML representation of the document to xPlore for indexing.

ingestion: Process in which xPlore receives an XML representation of a document and processes it into an index.

instance: An xPlore instance is one deployment of the xPlore WAR file to an application server container. You can have multiple instances on the same host (vertical scaling), although it is more common to have one xPlore instance per host (horizontal scaling). The following processes can run in an xPlore instance: CPS, indexing, search, and xPlore administrator.

lemmatization: Lemmatization is a normalization process in which the lemmatizer finds a canonical or dictionary form for a word, called a lemma. Content that is indexed is lemmatized unless lemmatization is turned off. Terms in search queries are also lemmatized unless lemmatization is turned off.

Lucene: Apache open-source, Java-based full-text indexing and search engine.

node: In xPlore and xDB, node is sometimes used to denote instance. It does not denote host.

persistence library: Saves CPS, indexing, and search metrics. Configurable in indexserverconfig.xml.

state of index job: Content Server configuration installs the state of index job dm_FTStateOfIndex. This job is run from Documentum Administrator. The ftintegrity script calls this job, which reports on index completeness, status, and indexing failures.

status library: A status library reports on indexing status for a domain. There is one status library for each domain.

stop words: Stop words are common words filtered out of queries to improve query performance. Stop words can be searched when used in a phrase.

text extraction: Identification of terms in a content file.

token: Piece of an input string defined by semantic processing rules.

tracking library: An xDB library that records the object IDs and location of content that has been indexed. There is one tracking database for each domain.

transactional support: Small in-memory indexes are created in rapid transactional updates, then merged into larger indexes. When an index is written to disk, it is considered clean. Committed and uncommitted data before the merge is searchable along with the on-disk index.

watchdog service: Installed by the xPlore installer, the watchdog service pings all xPlore instances and sends an email notification when an instance does not respond.

xDB: xDB is a database that enables high-speed storage and manipulation of many XML documents. In xPlore, an xDB library stores a collection as a Lucene index and manages the indexes on the collection. The XML content of indexed documents can optionally be stored.

XQFT: W3C full-text XQuery and XPath extensions described in XQuery and XPath Full Text 1.0. Support for XQFT includes logical full-text operators, wildcard option, anyall option, positional filters, and score variables.

XQuery: W3C standard query language that is designed to query XML data. xPlore receives XQuery expressions that are compliant with the XQuery standard and returns results.
Index

A
ACL replication
  job, 53
  script, 53
aclreplication, 53
ACLs
  large numbers of, 66
aspects, 61
attach, 189
B
backup
  incremental, 179
batch_hint_size, 343
C
capacity
  allocate and deallocate, 14
categories
  Documentum, 26
  overview, 23
collection
  configure, 161
  global, 24
  move to another instance, 162
  overview, 159
connectors_batch_size, 323
Content Server
  indexing, 27, 29
D
detach, 189
dm_FTStateOfIndex, 76
dm_fulltext_index_user, 28
dmi_registry, 28
document
  maximum size for ingestion, 341
document size
  maximum, 98
Documentum
  categories, 26
  domains, 25
domain
  create, 156
  Documentum, 25
  overview, 23
  reset state, 180
  restore with xDB, 180
DQL
  using IDfQueryBuilder, 259
DQL, compared to DFC/DFS, 224
E
events
  register, 28
exporter queue_size, 323
exporter_thread_count, 323
F
facets
  date, in DFC, 277
  numeric, in DFC, 278
  out of the box, 275
  results from IDfQueryProcessor, 283
  string, in DFC, 276
FAST
  migration, sizing, 316
federation
  restore with xDB, 180
force
  detach and attach, 189
freshness-weight, 202
ftintegrity
  running, 74
full-text indexing
  Content Server documents, 27, 29
  index server, 29
  overview, 27
  software installation, 27, 29
  verifying indexes, 74
  xPlore, 29
G
Get query text, 290
H
highlighting
  in results summary, 214
I
IDfQueryBuilder, 259
incremental backup, 179
index agent
  role in indexing process, 27, 29
index server
  role in indexing process, 27, 29
indexer callback_queue_size, 323
indexer queue_size, 323
indexing
  queue items, 28
indexserverconfig.xml
  Documentum categories, 26
installing indexing software, 27, 29
instance
  deactivate, 34
J
jobs
  state of index, 76
M
metadata
  boost in results, 202
P
password
  change, 53
performance
  language identification, 342
  query summary, 214
primary instance
  replace, 35
Q
query
  counts by user, 291
query definition, 259
queue
  items, 28
R
recent documents
  boost in results, 202
reindexing, 28
report
  Get query text, 290
  Query counts by user, 291
  Top N slowest queries, 290
reset
  domain state, 180
restore
  domain, with xDB, 180
  federation, with xDB, 180
S
save-tokens, 106
security
  manually update, 53
  view in Content Server, 57
  view in log, 55
size
  content size limit, in Documentum, 66
  maximum, for ingestion, 341
sizing
  CPS, 316
  migration from FAST, 316
  search, 316
spare instance
  deactivate, 34
  replace primary, 35
state of index, 76
summary
  dynamic, 213
  performance, 214
T
text size
  maximum, 99
Top N slowest queries, 290
W
watchdog service, 37