Index Server Faq
Answers to questions asked on the SSWUG web cast 1/26/05.
Q. Paschal what do you think about the search approach done by blinkx.com
A. Interesting. Videos.google.com is somewhat similar
Q. Shaun Is there an online resource that would give a demo of how to set FTS up?
Q. Grant how would it do against SQL server profiler textdata to find occurrences of objects, for example?
A. You can index SQL Server traces stored in tables. You would index the textdata column. This will work well.
Q. Vinay Is FTS aware of dictionary words?
A. I am unsure what you mean by this question. SQL FTS will index the content in your tables. If this content contains words which are in a dictionary - yes it will. It will also index character sequences, i.e. XXX, qwerty, which may not exist in a dictionary.
Q. Ben I have had a hard time finding information about using iFilters with SQL Server and FTS. I am hoping to use FTS with Word, PDF, Word Perfect, MSWorks, XML, and other document types. Can you speak to this issue and explain how to use different iFilters?
A. As a dba all you have to do is install the iFilter on the OS, and make sure you indicate the type of data or document which is stored in the table. The last iFilter from Corel for Word Perfect was for Word Perfect 8. I am unaware of an iFilter for MSWorks. The XML iFilter does work somewhat with XML docs stored in the image datatype columns. With the XML iFilter only the element values are indexed, not element names, attributes, or comments.
Q. Paschal Is it possible to create remotely a catalog (think for example if you are hosting your SQL database with an ISP) and maintain the catalog remotely?
A. Yes, many ISP's offer this, for instance webhost4life.com, maximumasp.com, etc. If you are talking about building your catalog in one location, and then exporting it to second location and then querying it there consult:
for more information.
Q. Dylan Would full text indexing be useful for identifying duplicate data, and/or finding items (serial numbers) within note fields (text) (varchar)
A. Not really. You could query on unique identifiers, or terms like serial numbers to identify rows containing these tokens/terms. If these terms guarantee uniqueness it would work.
Q. Bob Is it possible to perform full text search on zipped data in an image datatype column in SQL Server?
A. Yes, with a zip iFilter installed this should work. I have tested this on SQL 2000 running on Win2k and Win2003.
Q. Paschal In blinkx they have ranking by percentage :-)
A. It is easy enough to generate your own percentage on the fly. Whether rank is expressed as a percentage or in terms of 10, or 1000 should not matter. Keep in mind that what ranking is really useful for is returning ordered results in blocks or 10, 20, or 25 results at a time; in other words; results in pages. The result you are looking for should be in the top 10, top 20, or top 25. The row which contains the data you are looking for might not be ranked number 1, or 3, but should be on the first page. If it is not then the search algorithm is not good. Google has the ???I???m feeling lucky??? button which banks of the fact that with their search algorithm the result which is ranked highest is the one you are looking for. Well ??? this option is not often used, as for many people the result they are looking for is not ranked number 1 on Google; however it is frequently on the first page.
Pay attention to which hit is the one which satisfies your query. Chances are its not number 1, but it will be on the first page.
Indexing Services allows you to return an unordered first page of results which offers better performance and capitalizes on the fact that ranking is relative as opposed to exact.
Q. Da Do you know of a good resource, hands-on tutorial resource for Fulltext indexing?
A. I find SQL Server 2000 Bible pretty good (I reviewed this chapter for Paul Neilson). Other than that look at http://www.indexserverfaq.com in the SQL FTS section.
Q. Paschal Where can I found an open source Thesaurus I can use with FTS
A. Can't think of one. A Google search, and others are fruitless.
Q. Paschal Is it possible to see in a close future a solution to include grammar at least for English?
A. I'm not sure what you mean here. There are a variety of spell checkers available.
Q. Paschal Can we have a preview of your book before the release, maybe few chapters
A. A Troubleshooting chapter will be available for download.
Q. Ray The new Google "mini" appliance is $5,000
A. Thanks for this info. I was unaware of this. We evaluated Google search appliance at a company I was working for and decided against it. If you are considering using it you should test carefully to see using a document collection seeded with known numbers of terms to evaluate which search solution will work best for you.
Q. John The use of the Thesaurus XML files in SQL Server 2000 is not directly supported by Microsoft. See KB article: "FIX: Thesaurus Support is Not Available for Full-Text Search" at: http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q309678
A. This is true. I have other options for doing a thesaurus search on my web site http://www.indexserverfaq.com/Thesaurus.htm
Q. Deepa you can not do prefix searches only Suffix searches. For example, can search for words starting with go* but not words that end with go
A. Thanks - I got the terms mixed up. Again, IBM does do both prefix and suffix.
Q. Chris Does anyone have a document or a link to a document that describes in detail how to create a full-text catalog off of a folder of PDF documents?
Q. Erik Your web site indserverfaq.com has a lot of bad links, especially under Querying section of SQL FTS Programming
A. Yes, I have updated all of the querying sections. This site is a work in progress.
Q. Gary please put the MSDN article URL back up again.
Q. John SQL FTS is not case sensitive. All of the following SQL FTS queries in the pubs database retruns the same results:
select * from pub_info where contains(*,'books')
select * from pub_info where contains(*,'Books')
select * from pub_info where freetext(*,'books')
select * from pub_info where freetext(*,'Books')
A. Did I say it was case sensitive - if I did I misspoke. In some of the other languages there are casing language rules which make it sometimes case sensitive. But in general it is case insensitive. For instance marie-claire is considered to be two words with the French word breaker, whereas in Marie-Claire is considered to be one word.
Q. Paschal Can you find some noise words list ready to use?
A. I'm not sure what you mean by this question. All noise word lists are ready to use.
Here is a script that will import the noise word list into a table which you can query. This uses the US English noise word list and expects SQL FTS to be in the default location:
set nocount on
Q. Gary Right, you can index large blobs up to SQL's blob size.
A. BOL states the first 256 of text content emitted. This is my experience as well - no matter what the setting of FilterProcessMemoryQuota. See http://support.microsoft.com/default.aspx?scid=kb;en-us;308771 from more information on this. I have opened a support incident on this question with Microsoft and will send the results to Steve when I get a response.
Q. Bob Hilary - I did not quite catch the URL for help with iFilters with regard to search zip files stored as BLOBs - after the presentation, could you send the link to XXX? Thanks Bob
Q.Richard repeat the link on the iFilters
Q. Ben Can we get the URL again for those iFilters?
Q. Ben Do you recommend using Full Text indexing on XML data? If so, what is the best approach as I know there are issues with the XML tags and the cataloging of the data.
A. No, I don???t. While SQL FTS 2000 will index portions of XML documents when they are stored in the image data type columns, you may find it more useful to index them as text or char (although there are some problems with the tags here), or to shred them and store them in tabular format. The reason for this is the element name, comments and attributes are not indexed. Element values and only element values will be indexed. If you are only interested in the element values, it will work for you. Tested on SQL 2000, Win2k and Win2003.
Q. Jason What does the special rank of 1000 mean?
A. It seemed that prior to SQL SP 3, recently indexed tokens would have a higher rank than tokens which had been indexed in the past. I suspect that there were anomalies in ranking calculations done on a shadow index which has less tokens, than on your master index which has many tokens.
Q. Michael Are searches done while an incremental update in process at least as accurate as the searches done just prior to starting the incremental update?
A. Since SP3 yes. But performance is worse.
Q. Stephanie Would you mind giving an example of linguistic searching again? Thank you.
A. A search on the term book would match with books, book???s, booked, book, booking, etc.
Q. Walter What account should start the MSSearch service: domain entity account? Same account as SQL server?
A. Never change the MSSearch service account. It should always run under the local system account. Changing it to another account will send CPU utilization through the roof. Check out this kb article for more information.
Q. Paschal What you can say about the reliability of the schedule to index a catalog? I had many times the bad surprise to see the schedule not firing up without reasons
A. I have never had a problem with this. I would look at the gatherer logs or the job history to try to diagnose this one.
Q. Paschal The problem with hit highlighting is that I want an extract of a long text with the results highlighted
A. The hit highlighting functionality which ships with Indexing Services allows you to click on a hyperlink that will display only portions of your text marked up, or the full text.
Q. Rick I am not sure if you answered my question - I was interrupted for a few minutes. When trying to start FTS, I get the following error: "Error 7635 The Microsoft Search Service cannot be administered under the present under account." I am log in as administrator.
A. The most frequent cause of this problem is:
Q. Paschal Do you know a way of highlighting the words searched in a sentence (on a web page for example)
Q. David You mention using a text datatype as opposed to others, does that include varchar? Or is varchar faster for building indexes?
A. Varchar is far easier to work with than text. So if possible, I use char or varchar. However, this been said, varchar can cause performance problems on your tables.
Consider these two tables:
Create Table VarcharTable
PK int not null identity(1,1),
Constraint pkvarchar primary key clustered (pk)
And this table
Create Table TextTable
PK int not null identity(1,1),
Constraint pkTxt primary key clustered (pk)
If you were to look at how the data is stored on the pages you will find that with the first table, all the rows are stored on a single page. This means that you can only fit 1 row per page. So if you were doing this query
Select PKint, Charcol1, intcol1, Charcol2, Intcol2, Charcol3, Intcol3 From VarcharTable
And you had 10,000 rows in your table, you would have to pull 10,000 pages off disk.
For the text table, the non text rows (the int and char columns) and a text pointer would be stored on one page, the text column would be stored on another page. This means that you could fit many more pages on a single page this way.
So if you were doing this query
Select PKint, Charcol1, intcol1, Charcol2, Intcol2, Charcol3, Intcol3 From TextTable
And you had 10,000 rows in your table, you would have to pull several hundred pages off disk.
With the text in row option, if the majority of your data can be stored in page you are essentially making your table wider than it should be. But this depends on the distribution of size of your text data. If the majority of the data in the text column is very small text in row is a good option.
The datatype has little significance on indexing speed due to the peculiar way MSSearch extracts rows from the database and then waits a predetermined length of time. With SQL 2005 this does change, and indexing speed is much faster.
Q. Ray You briefly mentioned the Google Search appliance. Could you compare using the techniques you're describing with the Google Search Appliance?
A. If you are asking about accuracy of searching - I have decompiled MSDN and BOL into individual html documents and further converted them to text documents. I have done a word count analysis and can tell you that there are 22 occurrences of the word party occurs in MSDN (actually I am not sure how many times it occurs, but I do have the exact word distribution). I have compared this against various search engines, one of them being Google search appliance. There are inconsistencies with the Google search appliance. SQL FTS and Indexing Service???s numbers matched what the counts I did on the content. Google wasn???t. This was several years ago. Verity similarly leaves something to be desired.
One of the more interesting aspects of comparing Google with MSSearch is that Google is tuned to index many more pages than there are words or tokens. Whereas with MSSearch the ratio of indexed pages, documents or rows to the number of words or tokens indexed tends to be 1 to 1, or .1 to 1, or some fraction thereof. Because of this difference in ratios, different algorithms are used and you get different results. I am unsure if Google is using the same algorithm they use to search the internet in their search appliance.
Google works best when 1) there is no Google bombing (search on Liv Tyler nude), 2) the links pointing to a site don???t come from community sites or large sites (search on go to hell today Microsoft Google to understand this one), 3) there are links ???voting??? on a site containing the correct information. Sometimes they can be heavily skewed and the result you are looking for is on page 2 or 3. It is difficult to assess this as a term may be on page 1 one day, or page 3 or even page 7 the other day. I spoke with a Google Search engineer about this and he claimed it was due to index instabilities.
Q. Glenn MSSearch service is tied to the OS. In Windows 2003 Microsoft made the ampersand (&) a noise word. How can you search on words like AT&T without using the like statement?
A. After removing the a, t, and at from the noise word list, I find that searches on at&t on Win2k work as you observer. While on win2003, I get matches to at&t, at t, at- t, at-t.
Q. Paschal Do you have an idea about searching easily plurals?
A. FreeText search, use wildcarding (you may have mixed results with this) or use FormsOf(Inflectional with Contains
Q. Dennis It appears the searches are CASE sensitive. Are searches sensitive to the OS active font table?
A. Searches are for the most part not case sensitive, in other words they are case insensitive. There are exceptions.
Q. Jeff. Is it possible to get an example or how to: Allows you to search for words or phrases in documents stored in image columns in their native data type, i.e. Word, native data type, i.e. Word, PDFs, etc???
Q. Robin Have you ever combined MSSearch with IIS full text searching? Would it be better to add a column in the table for the document rather than combine these?
A. The performance hit of using a linked server to Indexing Services is so expensive it is better to dump your database content to disk, and having Indexing Services query and index it, or store it all in your database and have SQL FTS query and index it.
Q. Kerry Does stemming work with irregular verbs like going, gone?
A. Yes it works very well. It knows that the past tense of sit is sat and also the past tense of fit is fit (and not fat). What is really interesting is feeding some of the irregular French verbs at it. Again, it is highly accurate here as well.
Q. John How do you obtain the ranking value for each of the records returned in a select query?
A. Use FreeTextTable or ContainsTable. Consult
for more information on how to use this.
Q. Vinay How can we compare FTS to something like Amazon's search or can we even compare them?
A. Amazon???s search engine is tuned to sell books. Its results are skewed accordingly.
Q. Phil Is there any way to search of partial words, e.g. retrieve all occurrences of ".jpg" where ".jpg" might be part of a filename.
A. In Win2003 searches on filename.jpg will match with filename.jpg, filename jpg, and filename,jpg. In Win2k searches on filename.jpg will match with filename jpg, and filename,jpg.
You should replace tokens like filename.jpg should be replaced with filenameDOTjpg for accuracy of your searches.
Q. Dennis Are searches performance sensitive to servers with "hyper threading" capability is the feature is active (CMOS)?
A. No, in fact SQL FTS has a sweet spot on 8 way boxes, so using SQL FTS on a Quad with hyper-threading is a good way of getting to the advantages of an 8 way on a Quad.
Q. Steven If you upload a Microsoft Word document into a blob, can you search against the document properties (metadata) such as title or keywords?
A. In SQL 2005 you can, but not in SQL 2000. The way it works in SQL 2005 is that the search is done on the content and properties, so you can???t restrict the search to only search the DocAuthor property (for example). With XQuery you can search on a specific property if you were to store an XML document (saved perhaps as WordXML) in the XML datatype.
Q. Mira Is here some tips or any way to increase the speed of population of FT catalogs?
A. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqlpro04/html/sp04f9.asp and also have a look at the SQL FTS Search white paper. http://support.microsoft.com/default.aspx?scid=kb;en-us;323739
Q. Max Can you expect any server performance issues when installing and using SQL FTS?
A. Yes, while you install SQL FTS your server will be shut down. When indexing table, SQL FTS does impact performance. The amount of the impact is a factor of the number of rows in your tables, the language of the documents you are indexing (German and Asian languages impact performance considerably), the type of content, and your hardware. Even though the performance hit while indexing is noticeable, the cost of this performance degradation should be weighed against the performance gains made while querying which are staggering, especially when contrasted with a LIKE search.
Q. Bob Is this feasible for searching very large BLOBs (2 Gigs or more)? Or would it take a very long time to do a search in these files?
A. No, SQL FTS can only index the first 256k of extracted text. Query ???Filtering Supported File Types??? in BOL for more information on this limitation.
Q. Prabhakar What is a Master Index. What is a Shadow Index
A. When SQL FTS builds an index it creates multiple shadow indexes which are merged into a single master index after 50,000 rows are indexed. All shadow indexes are merge into a master index at midnight. If you had a single index, updates to this index would hamper querying and indexing performance, just like updates cause locking on a regular index on a table. By having the updates occur on a separate index (the shadow index) the impact to of the indexing and updating on performance is mitigated.
Q. Kerry Does linguistic stemming apply to verbs as well as nouns?
A. Linguistic stemming applies language rules to each token. It looks at the word and makes a decision based on several factors on how to stem it. It has its own internal dictionaries (peek at them using a text editor have a look at %windir%\system32\wbdbase.enu to see the suffixes.
Q. Cedric can you run MSSearch on across multiple instances of SQL Server on the same box?
A. MSSearch is cluster aware, so you can use SQL FTS on a cluster. If you have two or more instances of SQL Server on a single server (clustered or unclustered) it will share the same instance of MSSearch. With SQL 2005, both instances will have their own MSSearch service (called MSFTESQL, and MSFTESQL$InstanceName)
Q. Erik Is there any Script or UDF that will clean string before a FTS is done so that errors will not be thrown? Customers search our knowledgebase, and they can enter any type of word, phrase or characters. Customer entering error messages containing a weird mix of characters will usually throw an error. The only one I really found was on: http://www.sqlservercentral.com/columnists/ckempster/fulltextindexingtextparsingroutine_printversion.asp
and this still throws several errors per day, down from maybe 30 or 40 errors per day, even after I additional some additional text parsing.
A. Please consult http://www.indexserverfaq.com/noise.htm
I don???t really recommend doing this unless you also remove the noise words from your content. Consider this search.
Select * from TableName where contains(*,??????University of California??????)
If you strip out the noise words like this:
Select * from TableName where contains(*,??????University California??????)
You will not get hits to content containing the phrase ???University of California???. So you have to remove the of from your content as well. Or you could do this:
Select * from TableName where contains(*,??????University??? and ???California??????)
But again this is not satisfactory as you will then get hits to documents contains the phrase ???University of San Francisco, located in San Francisco, California???.???
So, I recommend replacing your noise word list with a single space. There are performance benefits with using very large numbers of noise words, for example for one client we had noise word lists of 14000, but you must know in advance everything your searchers might search on.
You can use a FreeText search will in most cases will work well (except if the noise word is at the end of your search phrase, i.e. select * from tablename where FreeText (*,??????the??????). The problem with using FreeText is that performance isn???t always the best, and then some users will find it returns too many hits. On the other hand FreeText search is considered to be a ???natural??? way to search and is the best way to search when your users are searching on phrases as opposed to exact words.
Q. Timothy I know you're going to cover SQL 2005 new features, but I've just got to ask -- Will we be able to index VIEWs in 2005?
A. Yes you can index indexed views. There are a few caveats though.
Here is an example of how to do this (courtesy of David Poole of Microsoft.com ??? I am sure David extends the standard disclaimer to this).
drop database viewexample
create database viewexample
create table t1(id int identity, ftcol nvarchar(100))
insert t1(ftcol) values('hello')
insert t1(ftcol) values('world')
insert t1(ftcol) values('hello again')
create fulltext catalog cat1
-- now create the view, must be schemabound view so it must explicitly
-- specify columns and 2 part name for table
CREATE VIEW myview with SCHEMABINDING
select id,ftcol from dbo.t1 where id<=2 go
-- create the index on the view to be used as fulltext key index
create unique clustered index idx on myview(id)
create fulltext index on myview(ftcol) key index idx on cat1
while fulltextcatalogproperty('cat1','populatestatus') <> 0
waitfor delay '00:00:01'
go -- this will only hit row 1 as row 3 does not fall into the view definition
select * from myview where contains(ftcol,'hello')
go < /P>
Q. Michael Has Hillary published his book yet? I've been hoping for it for some time.
A. It should be released to coincide with the release of Yukon.
Q. David Any thoughts on replicating full text index catalogs across servers? Not sure if it will be in the presentation.
A. use the post snapshot command to create the catalog on the subscriber and to full text index the tables. You can start populations on the subscriber by using the sp_addscriptexec command which works on all subscribers replicated through UNCs.
designed by :: smilla group