TREX Architecture and Functionality
TREX is the one search technology in SAP solutions.
TREX is deployed in over a dozen SAP products.
TREX searches and analyzes both unstructured documents and structured business data.
TREX in knowledge management provides search access to an extensible number of document repositories.
TREX will provide the backend technology for Enterprise Search.
TREX Architecture
TREX Anatomy
Queue server: manages asynchronous indexing
Preprocessor: document retrieval, filtering, linguistic processing
Name Server
Example
When a service sends the name server the request GetServer (IndexServer, SearchMode, MyIndex), the name server answers with the address <host>:<port> of the index server to which the request should be sent.
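The lookup in this example can be pictured with a small sketch. It is an illustration only, not the TREX implementation: the table `TOPOLOGY`, the function `get_server`, and the host names and port numbers are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical topology table: (service type, index name) -> "<host>:<port>".
# Host names and port numbers are made up for illustration.
TOPOLOGY: Dict[Tuple[str, str], str] = {
    ("IndexServer", "MyIndex"): "trexhost01:30003",
    ("QueueServer", "MyIndex"): "trexhost01:30002",
}


def get_server(service_type: str, mode: str, index_name: str) -> str:
    """Return the address of a service that can handle the request.

    `mode` (for example "SearchMode") mirrors the request in the text;
    this sketch accepts it but does not use it for routing.
    """
    try:
        return TOPOLOGY[(service_type, index_name)]
    except KeyError:
        raise LookupError(
            f"no {service_type} registered for index {index_name!r}"
        ) from None


address = get_server("IndexServer", "SearchMode", "MyIndex")
host, port = address.split(":")
```

The caller then opens its connection to `host` and `port` and sends the actual request there.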
sapprofile.ini
Read by all TREX services and clients
Specifies:
- Port number of local name server
- Host and port numbers of all master name servers
- Amount of shared memory used by topology.ini data
- System ID
- Path information to where each service saves its data
Queue Server
Preprocessor 1
TREX Preprocessor
Delivers documents that the engines can use directly
Supports almost any data type
Gets documents via HTTP from source
Converts documents to HTML
Keeps the document structure
Extracts attributes
- Metadata from DOC, PDF, ...
- Names from a lexicon
- Application-specific attributes
[Figure: source documents in various formats (.zip, .ppt, .pdf, .doc, and others) are converted to HTML]
Preprocessor 2
TREX Preprocessor
Reduces workload on the other engines
Works independently of the indexes
Is stateless
[Figure labels: Java Client, ABAP Client, Index Server, Preprocessor, Python Extensions, Lexicon, Highlighting, Extensions]
How Search Works: An Example
BooksOnline, an online bookstore, offers a range of books with the special feature that a customer can search the full text of the books online before purchase.
Auditor Jane wants to buy a book about invoice verification and decides to evaluate the suggestions offered by the BooksOnline search service.
The following slides describe how the SAP NetWeaver search service used by BooksOnline answers her search request.
Search Example 1
Jane enters invoice verification in the BooksOnline search field in the Web browser on her office desktop PC.
The business application forwards her search request, together with information about the kind of search and which index to use, as an HTTP/XML packet via the Java client to the Web server.
[Diagram: Java client and the TREX components: name server, preprocessor, queue server, attribute engine, and indexes]
Search Example 2
The Web server converts the HTTP message into the format used inside TREX and asks the name server for the name and address of a service that can handle the request.
The name server checks its list of available servers and tells the Web server the address of an index server that has received the fewest calls so far and can handle the request.
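The "fewest calls so far" selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the name server's actual logic: the class `IndexServer`, the function `pick_least_called`, and the addresses and call counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexServer:
    address: str            # "<host>:<port>", hypothetical values
    available: bool = True
    calls: int = 0          # number of requests routed to this server so far


def pick_least_called(servers):
    """Return the available server with the fewest calls and count the new call."""
    candidates = [s for s in servers if s.available]
    if not candidates:
        raise RuntimeError("no index server available")
    chosen = min(candidates, key=lambda s: s.calls)
    chosen.calls += 1
    return chosen


servers = [
    IndexServer("hostA:30003", calls=12),
    IndexServer("hostB:30003", calls=7),
    IndexServer("hostC:30003", available=False, calls=0),
]
first = pick_least_called(servers)
```

Here `hostC` has the lowest count but is skipped because it is not available, so the request goes to `hostB`.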
Search Example 3
The Web server passes the search request to the index server as a TCP/IP packet.
The index server sees that the request is for a phrase search and therefore forwards the phrase to the preprocessor for language identification, tokenization, tagging, and stemming.
[Diagram callout: Do a phrase search for invoice verification in the BooksOnline index]
Search Example 4
The preprocessor performs linguistic processing. It parses the phrase into two words, invoice and verification, tags them as nouns, reduces the words to their stem forms (in this case the words themselves), and sends the result back to the index server.
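A toy version of these three steps (tokenize, tag, stem) might look like the sketch below. The lexicon `LEXICON` and the function `preprocess_phrase` are hypothetical; a real preprocessor relies on full language resources rather than a three-entry lookup table.

```python
import re

# Tiny hypothetical lexicon: word -> (part of speech, stem).
LEXICON = {
    "invoice": ("NOUN", "invoice"),
    "invoices": ("NOUN", "invoice"),
    "verification": ("NOUN", "verification"),
}


def preprocess_phrase(phrase):
    """Tokenize a phrase, then tag and stem each token via the lexicon."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    result = []
    for token in tokens:
        # Unknown words keep their surface form as the stem.
        tag, stem = LEXICON.get(token, ("UNKNOWN", token))
        result.append({"token": token, "tag": tag, "stem": stem})
    return result


terms = preprocess_phrase("invoice verification")
```

For Jane's phrase, both tokens are tagged as nouns and their stems equal the words themselves, as described above.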
Search Example 5
The index server sends the preprocessed request to the search engine for optimization and result retrieval.
The query optimizer in the search engine analyzes the query and builds the query tree, which in this case has three nodes: one for each word and one for AND. It then optimizes the tree based on index statistics so that the term that appears less frequently is evaluated first.
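The three-node query tree and its reordering can be sketched like this. The node classes, the `optimize` function, and the document frequencies in `DOC_FREQ` are assumptions made for illustration; the source does not describe the optimizer's data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermNode:
    term: str


@dataclass
class AndNode:
    children: List[TermNode] = field(default_factory=list)


# Hypothetical index statistics: number of documents containing each term.
DOC_FREQ = {"invoice": 1200, "verification": 300}


def optimize(node: AndNode, doc_freq) -> AndNode:
    """Order the children of an AND node so the rarest term is evaluated first."""
    ordered = sorted(node.children, key=lambda child: doc_freq.get(child.term, 0))
    return AndNode(children=ordered)


tree = AndNode([TermNode("invoice"), TermNode("verification")])
plan = optimize(tree, DOC_FREQ)
order = [child.term for child in plan.children]
```

Evaluating the rarer term first keeps the intermediate result set small, which is why verification is placed ahead of invoice.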
[Diagram callout: The index listing for invoice is longer than the index listing for verification, so select verification first]
SAP AG 2006, Title of Presentation / Speaker Name / 18
Search Example 6
The search engine finds the row for the term verification in the BooksOnline index and selects the set of books containing the term. It then checks this set of books against the row for the term invoice and selects just the books that contain both terms.
Next, it reads the addresses of the terms in each book, calculates rank values, sorts the results, and takes the top ten (or more).
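The retrieval steps above can be sketched with a small positional index. The data in `INDEX`, the function `search_all`, and the rank value (a plain occurrence count) are assumptions for illustration; the source does not say how TREX calculates rank values.

```python
# Hypothetical positional index: term -> {book id -> positions of the term}.
INDEX = {
    "verification": {"book1": [5, 40], "book3": [9]},
    "invoice": {"book1": [4, 39, 80], "book2": [1], "book3": [2]},
}


def search_all(terms, index, top_n=10):
    """Return up to top_n books that contain every term, best ranked first."""
    if not terms:
        return []
    # Start with the rarest term so the candidate set stays small.
    ordered = sorted(terms, key=lambda t: len(index.get(t, {})))
    candidates = set(index.get(ordered[0], {}))
    for term in ordered[1:]:
        candidates &= set(index.get(term, {}))
    # Toy rank value: total number of occurrences of all query terms.
    scored = [
        (sum(len(index[t][book]) for t in terms), book) for book in candidates
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [book for _, book in scored[:top_n]]


hits = search_all(["invoice", "verification"], INDEX)
```

In this sample data, `book2` drops out because it contains only invoice, and `book1` ranks ahead of `book3` because the terms occur more often in it.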
[Diagram callouts: 1. Find set of books with verification. 2. Find subset with invoice. 3. Find addresses of both terms.]
Search Example 7
The search engine reads all the requested attributes for the selected books, including titles, authors, and keys to the documents.
The engine uses the keys to load the document contents and scans the texts for the first occurrences of the search phrase (or linguistic variants of the phrase) to create a brief summary text.
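A simplified version of the summary step is sketched below. It matches only case-insensitive exact occurrences of the phrase, not linguistic variants, and the function `make_snippet`, the `<b>` highlighting markup, and the sample text are assumptions made for illustration.

```python
import re


def make_snippet(text, phrase, max_sentences=2):
    """Return the first sentences containing the phrase, with the phrase marked."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text)
    matching = [s for s in sentences if pattern.search(s)][:max_sentences]
    # Mark each occurrence with <b>...</b>; the markup choice is an assumption.
    marked = [pattern.sub(lambda m: f"<b>{m.group(0)}</b>", s) for s in matching]
    return " ... ".join(marked)


# Made-up sample text standing in for a book's contents.
text = (
    "Auditing covers many tasks. Invoice verification is the next step. "
    "Payments follow later. The invoice verification in the ledger is logged."
)
snippet = make_snippet(text, "invoice verification")
```

The result joins the first two matching sentences with an ellipsis and keeps the original capitalization of each highlighted occurrence.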
[Diagram callout: Scans through the texts to find the first few sentences containing the phrase invoice verification]
Search Example 8
The search engine passes the result set back via the index server for merging with results from any other engines (here none).
The index server passes the result set back via the Web server and the Java client to the graphical user interface.
Jane sees a ranked list of books about invoice verification less than a second after she launched the search.
Internal Auditing
by First Author, Second Author
Economic Publishers, New York
Invoice verification is the next step ... The invoice verification in the ...
375 pages
First edition
ISBN 0-3XX-XXXXX-X
Callouts: Browse full text; Document attributes; Link to document; Sample phrases with search terms highlighted
How Indexing Works: An Example
BooksOnline worked hard to give Jane such a rewarding search experience.
Before Jane could see a ranked list of books about invoice verification and browse the books, BooksOnline had to index the full texts of all the books.
The following slides describe how the SAP NetWeaver search service used by BooksOnline indexes the full texts of the books shown on its website.
Indexing Example 1
The BooksOnline indexing administrator opens the SAP queue and index administration tool and sends a request to TREX to create an index called BooksOnline.
The ABAP client forwards the index request as a Remote Function Call via the SAP Gateway to the RFC server.
[Diagram: ABAP client, gateway, and the TREX components: RFC server, name server, preprocessor, queue server, attribute engine, and indexes]
Indexing Example 2
The name server tells the RFC server the address of an index server that can create the index.
In a one-box implementation of TREX, this step is straightforward unless the index server is down for some reason.
The name server uses a round-robin procedure to select an index server.
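Round-robin selection simply hands out the servers in turn. The sketch below shows the idea; the class `RoundRobinSelector` and the addresses are hypothetical and do not reflect how the name server stores its server list.

```python
from itertools import cycle


class RoundRobinSelector:
    """Hand out server addresses one after another, wrapping around at the end."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError("at least one index server is required")
        self._cycle = cycle(addresses)

    def next_server(self):
        return next(self._cycle)


# Hypothetical addresses of two index servers.
selector = RoundRobinSelector(["hostA:30003", "hostB:30003"])
picks = [selector.next_server() for _ in range(3)]
```

With a single index server, as in a one-box installation, every request gets the same address.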
[Diagram callout: So go to <host>:<port>]
Indexing Example 3
The RFC server sends the request to the index server.
The index server creates a new index called BooksOnline.
The new index is still empty, but any documents to be indexed can now be assigned to it.
Indexing Example 4
The administrator sends a request to index the new books in a specified folder and write the results to the BooksOnline index.
The digital files for the books are in a variety of formats, but TREX can handle all standard formats, such as Microsoft Word (.doc), Adobe Portable Document Format (.pdf), and plain text (.txt).
The name server directs the request to an available queue server.
[Diagram callout: Please put this indexing request in your queue and have the documents indexed as soon as TREX finds the time to do it]
Indexing Example 5
The queue server receives the list of URLs for the documents from the specified folder and persists them in a queue for the index until a preprocessor is available.
Indexing a large collection of documents can be a long job, so the administrator can hold or flush the queue manually at any time.
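The queueing behaviour can be pictured with the following sketch. It keeps the queue in memory only, whereas the queue server persists it, and the meaning given here to hold (stop handing out documents) and flush (release everything queued at once) is an assumption. The class `IndexQueue` and the URLs are hypothetical.

```python
from collections import deque


class IndexQueue:
    """In-memory stand-in for one index's queue of document URLs."""

    def __init__(self, index_name):
        self.index_name = index_name
        self._pending = deque()
        self.on_hold = False

    def add(self, urls):
        self._pending.extend(urls)

    def hold(self):
        # Assumed meaning: stop handing documents to preprocessors.
        self.on_hold = True

    def release(self):
        self.on_hold = False

    def flush(self):
        """Return all queued URLs at once and empty the queue."""
        batch = list(self._pending)
        self._pending.clear()
        return batch

    def next_document(self):
        """Return the next URL for a free preprocessor, or None."""
        if self.on_hold or not self._pending:
            return None
        return self._pending.popleft()


queue = IndexQueue("BooksOnline")
queue.add(["http://example.com/books/a.pdf", "http://example.com/books/b.doc"])
queue.hold()
held = queue.next_document()      # nothing is handed out while on hold
flushed = queue.flush()           # manual flush releases everything queued
```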
[Diagram callout: Queue server receives document URLs (.htm, .xls, .pdf, .doc, .ppt, .txt) and adds them to the BooksOnline queue for indexing]
BooksOnline has all its books available in digital form (either as author files or scanned and OCR'd) ready for indexing and browsing
Indexing Example 6
The queue server sends the documents to a free preprocessor.
The preprocessor fetches documents via URLs, filters them from their original format to HTML, identifies their language, tokenizes them into sequences of terms, tags the terms with their part of speech (for example, as nouns), and stems the terms as appropriate.
The preprocessed documents are then sent to the index server.
Indexing Example 7
The index server forwards the documents to the search engine.
For each document, the search engine writes a list of all its terms, and for each term it writes a list of positions in the document where the term appears.
The engine merges the term list for each document into the existing term-document matrix that forms the BooksOnline index.
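The per-document term lists and the merge step can be sketched as follows. The nested dictionary is only a stand-in for the term-document matrix, and the function names and sample terms are hypothetical.

```python
from collections import defaultdict


def term_positions(terms):
    """Map each term of one document to the positions where it appears."""
    positions = defaultdict(list)
    for pos, term in enumerate(terms):
        positions[term].append(pos)
    return dict(positions)


def merge_into_index(index, doc_id, terms):
    """Merge one document's term list into the index (term -> doc -> positions)."""
    for term, positions in term_positions(terms).items():
        index.setdefault(term, {})[doc_id] = positions
    return index


index = {}
merge_into_index(index, "book1", ["invoice", "verification", "invoice"])
merge_into_index(index, "book2", ["verification"])
```

After both merges, the row for verification lists both books, while the row for invoice lists only `book1` with positions 0 and 2.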
Indexing Example 8
The BooksOnline indexing administrator can use the TREX queue and index administration tool to display the status of the indexing process at any time during the process
[Diagram callout: The tool lets you follow the progress of queued documents from left to right]
ABAP
- Restricted feature set
- Easy access on customer systems
Java
- Highly restricted feature set
- Browser access via Portal
Landscape Example
Alert area
TREX Traces
Trace files are available in the following TREX path: cd /usr/sap/SID/TRX02/ldtr01<sid>/trace
THANK YOU