Expand News
IS Administration
Expand MMC
Expand Com Objects
Expand IDA/IDQ
IS Programming
Expand IS Programming
IS TroubleShooting
Expand Indexing
Expand Querying
IS Optimizing
Expand Querying
Expand Indexing
IS Links
Expand MSDN
Expand TechNet
Expand Other MS
Expand Other Links
iFilters
Expand iFilters
SQL 2000 FTS
Expand Loading Data
Expand Indexing
Expand Querying
Expand Blobs
Blobs
Expand SQL FTS Links
SQL 2005 FTS
Expand What's New

Index Server Faq

Indexing Blobs

Blobs stand for Binary Large ObjectS (also known as LOBS (Large binary OBjectS). Blobs are typically stored in Image data type columns although they can be stored in text or the char family (char, nchar, varchar, nvarchar) data type columns (this is not recommended, but I have run across people doing this). Examples of blobs are Word documents, jpgs, PDFs, txt, html, XML documents, etc, although txt, html, and XML and not binary in nature. If you are wondering if your document is binary or not, open it in a text editor. If the vast majority of it shows up as readable text, it is probably text, if not, it is binary.

What follows is simple instructions on how to load, index, and search blobs. This is followed by some FAQs on blobs.

SQL Server 2000 allowed you to index and search blobs if they are stored in the image data type and you have a column which contains the extention the document would have if it was stored in the file system. IE, if you have a PDF stored in an image data type column the extension would be .PDF; Word docs, .doc.

Please refer to the list of the iFilters (click here to display list) to determine which types of documents SQL FTS can index and search. SQL FTS 2000 can only index the contents or body of a document, it can't index the document's properties. If the iFilters for your document you are indexing only emit properties (i.e., the jpg, or mp3 iFilters) you will not be able to search these documents.

Requirements to index blobs.

  • SQL 2000 Developer, Standard, or Enterprise Editions

  • SQL FTS installed on your machine

  • The documents you wish to index are stored in the Image data type column

  • You have used the type_colname parameter of sp_fulltext_column to indicate which column will contain the document extension value.

  • The document type column is char(3), char(4), or varchar(X), and contains the file extension your document would have if it is stored in the file system.

  • You have an iFilter for your document type. Note that if you don't have an iFilter for your document type, you can enter txt as the value for the type_colname column to use the text iFilter. You will have unpredicable results using this, but if your document is largely text (ie log files, etc) it will work.

    An Example

    Let's plunge right in with an example.

    The database and table:

    CREATE DATABASE blobs

    GO

    USE blobs

    GO

    CREATE TABLE BlobTable ( PK int IDENTITY (1, 1) NOT NULL,

    BlobColumn image NULL ,

    DocumentTypeColumn char (4),

    filepath char (200),

    CONSTRAINT PrimaryKey PRIMARY KEY CLUSTERED(pk) )

     GO

    exec sp_fulltext_database 'enable'

    GO

    exec sp_fulltext_catalog 'blobs', 'create'

    GO

    exec sp_fulltext_table 'BlobTable', 'create', 'blobs', 'PrimaryKey'

    GO

    exec sp_fulltext_column 'BlobTable', 'BlobColumn', 'add', 1033, 'DocumentTypeColumn'

    GO

    exec sp_fulltext_table 'BlobTable', 'activate'

    GO

    Loading the table

    This vbscript will load the table with all the documents stored in the c:\blobs directory

    Then run is like this cscript store this cscript c:\loadblobs.vbs (where loadblobs.vbs is the name of my vbs script). Note this is good for smaller blobs, for large blobs use textcopy (yes it works for loading documents into the Image data type as well). Save this file as a with the vbs extension.

    Set objConn = CreateObject("ADODB.Connection")

    Set objRS = CreateObject("ADODB.RecordSet")

    Set objStream=CreateObject("ADODB.Stream")

    objConn.Open "Provider=SQLOLEDB;data Source=ServerName;Initial Catalog=blobs;User Id=non_sa;Password=password"

    Set objFileSystem=createobject("Scripting.FileSystemObject")

    Set objDir=objFileSystem.GetFolder("c:\blobs")

    for each objFile in objDir.Files

    count=count+1

    wscript.echo objFile.name

    'a hack to handle .'s occuring in the file name

    objConn.Execute "insert into blobTable (DocumentTypeColumn,FilePath) values ('" & strreverse(left(strreverse(objFile.name),instr(strreverse(objFile.name),"."))) & "','" & Replace(objFile.path,"'","''")&"')"

    objRs.Open "select BlobColumn from blobTable where pk=" & count, objConn, 1, 3

     objStream.Type = 1 '1 indicates binary, 2 text

    objStream.Open

    objStream.LoadFromFile

    objFile.Path

    objRs.Fields("BlobColumn").Value=objStream.Read

    objRs.Update objRs.Close

    objStream.Close

    next

    objConn.Close

    Set objStream=nothing

    Set objShell=nothing

    Set objConn=nothing

    Set objFileSystem=nothing

    Set objDir=nothing

    That's it. Now you have to run your population.

    Running your population

    The best way to proceed is to use change tracking. To get this working, issue the following command using Query Analyzer:

    exec sp_fulltext_table 'BlobTable', 'start_change_tracking'

    EXEC sp_fulltext_table 'BlobTable', 'Start_background_updateindex'

    Querying blobs

    Query your blobs in the same way you query your regular full-text indexed tables.

    Here is an example:

    select * from blobtable where contains(*,'microsoft')

    The problem is that your data will come back in binary format and it will be difficult to interpret your hits. You are best to use ContainsTable for performance, and sluff your blobs off into another table. I advocate leaving a copy of the blob in the file system and a path or virtual path to the document's location in the file system. I further advocate converting your document to text and storing it in text or char columns - again in a child table. Please refer to the hit highlight sample for an example of how to do this. Also consult the FAQ for a more indepth discussion of the benefits of storing your data in a child table.

    Troubleshooting

    There are several kb  articles on problems related to indexing blobs.

    Blobs are extracted and stored in the file system while being indexed. You can change this location using the following kbarticle. http://support.microsoft.com/default.aspx?scid=kb;en-us;817296

    Full-Text Query Does Not Return the Expected Result Set. http://support.microsoft.com/?id=308841

    Problems with indexing html documents (specifically if Data That is Enclosed in Parenthesis). http://support.microsoft.com/default.aspx?scid=kb;en-us;321332

    FIX: Full-Text Search Population of a Word Document Stored in SQL Server Causes CPU Spin with SQL Server 2000 Service Pack 2. http://support.microsoft.com/default.aspx?scid= kb;en-us;323039

    A Full-Text Search May Not Return Any Hits If It Fails to Index a File http://support.microsoft.com/default.aspx?scid=kb;en-us;308771

    FIX: Non-OCR/Non-Display TIFF Data Indexed by SQL Server Full-Text. http://support.microsoft.com/default.aspx?scid=kb;en-us;321820

    FIX: SQL Server Full-Text Population by Using a Single-Threaded Filter DLL or a PDF Filter DLL May Not Succeed. http://support.microsoft.com/default.aspx?scid=kb;en-us;323040

    Full-Text Queries Using Adobe PDF IFilter Do Not Return Any Search Results. http://support.microsoft.com/default.aspx?scid=kb;en-us;323324

    FIX: Infinite Loop During Full-Text Catalog Population. http://support.microsoft.com/default.aspx?scid=kb;en-us;32303

    Here are some troubleshooting FAQs

    Q. I am getting the following message in my application log (use event viewer view the application log).

    Full-text indexing a document stored in an image column failed with error '0x8007007b'.

    (http://support.microsoft.com/default.aspx?scid=kb;en-us;308841fixed with SP 2 ).

    One or more documents stored in image columns with extension '' did not get full-text indexed because loading the filter failed with error '0x1'. Note: These documents will not be passed to MSSearch for indexing, and therefore this failure will not be reflected in the end of crawl summary statistics.

    A. A NULL values exists for one or more rows for the document type column.

    Q. Full-text indexing a document stored in an image column failed with error '0x80070003'.

    A. I am currently researching this error. It seems to be transient.

    Q. I got the following message in my gatherer log.

    7/29/2001 12:07:04 PM Add Started Full crawl

    7/29/2001 12:08:20 PM MSSQL75://SQLServer/6fe99f9f Add Error fetching URL,

    (800705b4 ‑ This operation returned because the timeout period expired.)

    I also have the message in the application log of my event viewer:

    One or more warnings or errors for Gatherer project <SQLServer SQL0000500005>

    were logged to file <C:\Program Files\Microsoft SQL Server\MSSQL\FTData\SQLServer\GatherLogs\SQL0000500005.3.gthr.

    A. Change your data_timeout value to something larger.

    Sp_fullTextServer ???datatimeout???, 600 --the default is 300

    Q. I am getting the following message in the application log - Source SQLFTHNDLR -Event 2001. It reads "One or more documents stored in image columns with extension '.doc ' did not get full-text indexed because loading the filter failed with error '0x1'. Note: These documents will not be passed to MSSearch for indexing, and therefore this failure will not be reflected in the end of crawl summary statistics." What should I do? 

    A. What is important to note from the above message is that there are spaces after the .doc. So SQL FTS is trying to load an iFilter for the extension 'doc   ', no iFilter exists for this extension, but one does exist for '.doc'. To solve this problem use varchar(10), or char(3).

    Q. I am getting the following message in the application log - Source SQLFTHNDLR -Event 2001. It reads "One or more documents stored in image columns with extension 'xxx' did not get full-text indexed because loading the filter failed with error '0x1'. Note: These documents will not be passed to MSSearch for indexing, and therefore this failure will not be reflected in the end of crawl summary statistics." What should I do?

    A. No iFilter exists for the document type 'xxx'. Locate the iFilter for this document type or use the txt iFilter by making the document column type column values for this row - txt.

    Q. What are the performance considerations I should be thinking about when indexing blobs.

    A. If at all possible convert your data to text, and store it in columns of the text or char (or varchar, nchar, or nvarchar)  data types. This is because it takes longer to index documents in their native formats than the text formats.

    Q. Should I then use text or varchar?

    A. When you use the text data type, the text data is stored in a seperate page. So when you query columns which aren't of the text or image data type (in our blobtable example - select Pk, DocumentType, FilePath from blobtable) less database pages have to be pulled off disk. If you store the text data in the varchar columns, these columns would also be pulled off disk to satisfy queries which do not involve these columns. In essence your table is wider when you use varchar.  How much wider, depends on how much data you are storing in your varchar columns as they collapse to accomodate the amount of data in them. Other than this important performance consideration, use char wherever possible. To get around this added width when using varchar, use a child table.

    Q. What about text in row?

    A. Text in row has the effect of making the table wider when your data is can fit in the page. It will save on storage space when the larger portion of your data can fit on the page, but this can impact your queries as they will be executed on wider tables. The effective width of course depends on how much data is stored in the text column. 

    Q. What is text in row?

    A. If the amount of data is small enough to fit in available space in a data page (8k - the length consumed by the other columns), the text data can be stored in page. If it is not, it will be saved in a seperate data page(s). Using the text in row option can save storage space on disk, but may cause performance problems under certain circumstances. 

    For more information on this option consult the following links:

    http://www.windowsitpro.com/SQLServer/Article/ArticleID/37635/37635.html

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/architec/8_ar_da2_5f1d.asp

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_sp_ta-tz_1cfi.asp

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/architec/8_ar_da2_5f1d.asp

    http://www.sqlserverfaq.com/content/kbase/store/Tutorials/61.htm

    http://www.winnetmag.com/SQLServer/Article/ArticleID/26852/26852.html

    Q. Do you recommend indexing XML documents using the method discussed above.

    A. If you have the XML parser and XML iFilter installed on your SQL Server you will be unable to index and query node comments, element names, attributes. You will be able to index and query element values.

    You can index XML documents as text by storing then in the varchar, Text, or Image columns (with a value o fthe corresponding document type column of txt) but your results will be unpredictable as the iFilter will not process the XML tags and their correctly. If you want to index comments, element names and attributes (as well as element values), you should shred the XML document to a relational representation of the XML document and store it in a table(s).

  •  

    designed by :: smilla group