SQL Server Full Text Search
1. Introduction
Over the last decade, the commercial database management community has focused primarily on structured data, and the industry as a whole has been fairly effective at addressing the needs of structured storage applications. However, only a small fraction of the data stored and managed each year is fully structured; the vast preponderance is either wholly unstructured or only semi-structured, in the form of documents, web pages, spreadsheets, email, and other weakly structured formats. This work investigates the features and capabilities of the full text search access method in Microsoft SQL Server 2000 and the follow-on release of this product, and how these search capabilities are integrated into the query language. We first outline the architecture of the full text search support, then describe the full text query features in more detail, and finally show examples of how this support allows single SQL queries over structured, unstructured, and semi-structured data.
[Figure: full text search architecture diagram. Recoverable labels: MS Search Process (Protocol Handler, Filter(s), Word Breaker, Keywords Indexer, Catalog, Project), Shared Memory, SQL QP, Table Search QP, Query, Ready Queue, In-Progress Map, Batch Object, Light Transaction Object (LTO).]
forms of distributed and all words meaning the same as databases (thesaurus support).
3. Weighted Terms: Query terms can be assigned relative weights to influence the rank of matching documents when one wants to favor one term over another. In the following query, the spread search term is given twice the weight of the sauces search term, which is, in turn, given twice the weight of the relishes search term:
SELECT a.CategoryName, a.Description, b.rank
FROM Categories a,
     ContainsTable(Categories, description,
       'ISABOUT (spread weight (.8), sauces weight (.4), relishes weight (.2))') b
WHERE a.categoryId = b.[key]
4. Phrase and Proximity Query: One can specify queries over phrases, and more generally using proximity (NEAR) between terms in a matching document, e.g. distributed NEAR databases matches items in which the term distributed appears close to the term databases.
5. Prefix match: Search conditions can specify a query for matching the prefix of a term. For example, data* matches all terms data, databases, datastore, datasource, etc.
6. Composition: Terms can be composed using conjuncts (AND), disjuncts (OR), and conjuncted complementation (AND NOT).
This query finds information on documents authored by Linda Chapman where the title includes the term child close to the term development.
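A minimal sketch of a query matching this description, assuming the Documents table used in the other examples has full-text indexes on its Title and Content columns. The two further statements are hypothetical variations that exercise the prefix and composition forms from the list above against the Content column.

```sql
-- Assumed schema: Documents(DocumentId, Title, Author, PublishedDate, Content),
-- with full-text indexes on Title and Content.

-- Proximity: title contains "child" near "development", combined with a
-- structured predicate on Author.
SELECT a.Title, a.Author, a.PublishedDate, b.rank
FROM Documents a,
     ContainsTable(Documents, Title, 'child NEAR development') b
WHERE a.DocumentId = b.[key]
  AND a.Author = 'Linda Chapman'
ORDER BY b.rank DESC;

-- Prefix match: any term starting with "data" (prefix terms are double-quoted).
SELECT Title
FROM Documents
WHERE CONTAINS(Content, '"data*"');

-- Composition: two terms required, a third excluded.
SELECT Title
FROM Documents
WHERE CONTAINS(Content, 'distributed AND databases AND NOT spreadsheets');
```

The first statement uses the table-valued form so that the rank is available for ordering; the other two use the CONTAINS predicate directly in the WHERE clause, which is sufficient when only a match test is needed.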
SELECT a.Title, a.Author, a.PublishedDate, b.rank
FROM Documents a,
     FreetextTable(Documents, Content, 'child development AND insomnia') b
WHERE a.DocumentId = b.[key]
  AND a.Author = 'Linda Chapman'
ORDER BY b.rank DESC
This query finds all documents authored by Linda Chapman on child development and insomnia. The result is presented in descending order of rank.
Scenario 2. We want to search for information distributed in heterogeneous sources. Data is stored in an Exchange mail server in email format, in the filesystem in the form of locally-authored documents, and in SQL Server in the form of published documents. The SQL Server content schema is the same as above in Scenario 1. The filesystem content index is provided by the filesystem indexing service. The following query gets all documents related to marketing and cosmetics from the email store, the filesystem, and the SQL Server document store.
--Get qualifying email docs
SELECT DisplayName, hRef, MailFrom, Subject
FROM openquery(exchange,
  'SELECT "DAV:displayname" as DisplayName, "DAV:href" as hRef,
          "urn:schemas:mailheader:from" as MailFrom,
          "urn:schemas:mailheader:subject" as subject
   FROM "manager\Inbox"
   WHERE contains(*, ''marketing AND cosmetics'')')
UNION ALL
--Get qualifying filesystem data
SELECT filename, vpath, docauthor, doctitle
FROM OpenQuery(Monarch,
  'SELECT vpath, Filename, size, doctitle, docauthor
   FROM SCOPE(''deep traversal of "c:\tapasnay\My Documents" '')
   WHERE contains(''marketing AND cosmetics'')')
UNION ALL
--Get qualifying SQL Server data
SELECT Title, 'SQLServer:Documents:' + cast(DocumentId as varchar(20)) as docref, author, title
FROM Documents
WHERE contains(*, 'marketing AND cosmetics')
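The OpenQuery calls above presuppose that exchange and Monarch are already registered as linked servers. A minimal sketch of one such registration follows, assuming Monarch points at an Indexing Service catalog through the MSIDXS OLE DB provider. The catalog name System and the product string are assumptions, not taken from the setup described here, and the Exchange store would need an analogous registration with its own OLE DB provider.

```sql
-- Register an Indexing Service catalog as a linked server named Monarch.
-- 'System' is an assumed catalog name; substitute the catalog that indexes
-- the directories being searched.
EXEC sp_addlinkedserver
     @server     = 'Monarch',
     @srvproduct = 'Index Server',
     @provider   = 'MSIDXS',
     @datasrc    = 'System';
```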
5. Conclusions
In this paper we motivate the integration of a native full text search access method into the Microsoft SQL Server product, describe the architecture of the access method, and discuss some of the trade-offs and advantages of the engineering approach taken. We explore the features and functions of full text search and provide example SQL queries showing query integration over structured, semi-structured, and unstructured data. For further SQL Server 2000 full text search usage and feature details, see Inside Microsoft SQL Server [1] or SQL Server 2000 Books Online [2]. On the implementation side, we are just completing a major architectural overhaul of the indexing engine and its integration with SQL Server for the next release of the product, and this paper is the first description of this work.
6. References
[1] Delaney, Kalen; Inside Microsoft SQL Server 2000, Microsoft Press, 2001
[2] SQL Server 2000 Books Online