Friday, April 4, 2008

What is a "searchable" PDF?

ADVANCED FUNCTIONS- create searchable PDF documents with this application.

What is a “searchable PDF”?
In simple terms, a searchable PDF is an image (picture) of each page with the “text” held in a separate layer (usually behind the image and not visible). A scanned document (a PDF in image format) is NOT searchable until an OCR (optical character recognition) process is performed on it. Some scanning hardware can deliver a searchable PDF directly (the OCR process is performed during delivery). A searchable PDF can also be created by PDF distiller software, which “converts” an electronic file, such as an MS Word document, to PDF format. Because the original document already contains text, the OCR process is not required and the resulting PDF is searchable. An easy way to determine whether a PDF is searchable is to open the document with Adobe Reader or Acrobat and perform a “find” for a word that appears on the page. If the word is found and highlighted, the document is a searchable PDF.
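The “find” test can also be approximated in a script. The following is only a rough sketch in Python, not how Adobe Reader works internally: it assumes the page content streams are either uncompressed or Flate (zlib) compressed, and it treats any text object (BT ... ET) containing a show-text operator (Tj or TJ) as evidence of a text layer. The function name has_text_layer is made up for this illustration, and PDFs that use other filters or encryption will be reported as not searchable.

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the next "endstream" keyword.
_STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) that contains a show-text operator (Tj or TJ).
_TEXT_OBJECT = re.compile(rb"\bBT\b.*?\bT[jJ]\b.*?\bET\b", re.DOTALL)


def _decode(raw: bytes) -> bytes:
    """Inflate a Flate stream; fall back to the raw bytes only if they are plain ASCII."""
    try:
        # decompressobj tolerates the end-of-line bytes that trail the stream data.
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        # Binary data we cannot decode (for example JPEG images) is ignored.
        return raw if raw.isascii() else b""


def has_text_layer(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any decodable stream draws text."""
    return any(
        _TEXT_OBJECT.search(_decode(match.group(1)))
        for match in _STREAM.finditer(pdf_bytes)
    )
```

To try it on a file, read the file in binary mode and pass the bytes to has_text_layer. An image-only scan should return False; the same scan after OCR should return True.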

Benefits of a “searchable PDF”

Searchable PDFs are very useful for retrieving documents from a document repository (full content management) and for finding the location of a word or words within a document.
Adobe Systems provides a free downloadable tool known as an iFilter. The iFilter provides a link between the “text” layer of the searchable PDF and an “indexing” engine. This connection provides for retrieval of the document by any word(s) contained in the “text” layer or in the metadata (Title, Subject, Author, Keywords) of a PDF. Indexing engines include:

A) The catalog feature of Adobe Acrobat- a very powerful engine that provides advanced searching functionality.
B) MS Indexing Services- an “index” maintained at the server level with “load” processing options built into MS server platforms. Note: this application has an unlimited user retrieval tool that leverages this free MS service.
C) MS Desktop Search- a free, powerful, downloadable MS tool that maintains an “index” on the desktop of desktop files, server files, or both.
D) MS SharePoint- searchable PDFs can be retrieved with the built-in query tool.
E) Other DMS- most document management systems can retrieve searchable PDFs.
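To illustrate what an “indexing” engine does with the words it receives, here is a minimal sketch in Python. It is not the iFilter interface or any of the products listed above; the names tokenize, build_index and search, and the layout of the documents dictionary, are assumptions made for this example. Each document contributes the words from its “text” layer and from its metadata fields, and a query returns the documents that contain every query word.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents: dict[str, dict]) -> dict[str, set[str]]:
    """Map each word to the names of the documents that contain it.

    Each document is assumed to look like
    {"text": "...", "metadata": {"Title": "...", "Author": "..."}}.
    """
    index: dict[str, set[str]] = defaultdict(set)
    for name, doc in documents.items():
        fields = [doc.get("text", "")] + list(doc.get("metadata", {}).values())
        for field in fields:
            for word in tokenize(field):
                index[word].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

A document whose Title metadata is “Invoice” and whose text layer mentions “Acme” would be returned for the query “acme invoice”, even though the two words come from different places in the PDF.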



The requirements are:
1) A full license of TOCR (the OCR engine)
2) The “captured” document should be a Group 3 or Group 4 B/W TIFF.

There are two components that are used to create “searchable PDFs”.

The OCR Processor must be set as Full Page; however, you may OCR all pages, the first page only, or identify the specific pages to OCR.
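The page choice described above (all pages, the first page, or specific pages) can be sketched as a small Python helper. This is an illustration only and not the application's actual setting: the function name pages_to_ocr, the mode names "all", "first" and "selected", and the "1, 3-5" list syntax are all assumptions.

```python
def pages_to_ocr(mode: str, page_count: int, pages: str = "") -> list[int]:
    """Return the 1-based page numbers to OCR for a document of page_count pages."""
    if mode == "all":
        return list(range(1, page_count + 1))
    if mode == "first":
        return [1] if page_count >= 1 else []
    if mode == "selected":
        chosen: set[int] = set()
        for part in pages.split(","):
            part = part.strip()
            if not part:
                continue
            # "3-5" is a range; a bare "7" is a single page.
            start, _, end = part.partition("-")
            first = int(start)
            last = int(end) if end else first
            chosen.update(range(first, last + 1))
        # Ignore page numbers that fall outside the document.
        return sorted(p for p in chosen if 1 <= p <= page_count)
    raise ValueError(f"unknown mode: {mode!r}")
```

For a ten-page document, pages_to_ocr("selected", 10, "1, 3-5, 12") keeps pages 1, 3, 4 and 5 and drops page 12 because it does not exist.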

1 comment:

Unknown said...

One noteworthy item relating to OCR: creating searchable text should not alter the original formatting of the document. This is typically only an issue when saving to a word processing application such as Word, but it is something to look out for with OCR engines.