No Documents Left Behind: The Case for Universal Searchability

By Dean Sappey, President and CEO, DocsCorp

Despite investment in document management systems (DMS) and search technology, 30% of documents in content repositories may be non-searchable and therefore “invisible” to search.

We now live in a “Google World,” one in which we search for anything—anytime, anywhere! Using the firm’s enterprise search, our expectation is that we should be able to find any document and do so instantly—even if we don’t know the client, matter or any other key metadata. Surely, it is just a matter of typing a few words that are likely to be in the document, and hey presto, the document appears by magic!

Most law firms accept the importance of universal searchability, but just in case, let’s recap why it is so important.

Courts require documents provided during eDiscovery to be fully text-searchable. This is key to reducing costs on both sides and reducing the overall cost of litigation to the client.

Lawyers know that document search is essential to managing risk. Conflict checking often involves searching document archives for potential conflicts, using a specific word or phrase. Firms affected by HIPAA (Health Insurance Portability and Accountability Act) regulations for the storage and searchability of medical records must be able to search for any word or phrase in any document. Documents are also often misfiled, and searching for words ‘in’ the document is frequently the only way of finding that long-lost document.

How many times have IT departments heard the refrain, “I know I saved the document, but now I can’t find it!”

Most law firms assume that having invested heavily in DMS and Knowledge Management (KM) systems means that all documents are searchable. The IT department also assures them that the scanners use optical character recognition (OCR) on paper documents as part of the scanning process so they are searchable. Problem solved, right?

In law firms, the reality is often very different. After conducting hundreds of audits across North America, Europe and Asia in the past year, I can attest that this is a global issue. Most firms, even those with powerful DMS and high-tech scanning technologies in place, have at least 30 percent of their documents in non-searchable formats.

Non-searchable content generally consists of scanned PDFs, TIFF images or emails with PDF or TIFF attachments, and is usually received from a client. PDF documents are unique in that they can store both the ‘image’ of a document, complete with graphics, signatures and coffee cup stains, and an invisible layer of searchable ‘text.’ This text is generated using OCR software so that the document can be searched. TIFF images, by contrast, cannot store any text. The file format only allows for an image to be stored in the file.

Of course, if you save your email with attachments directly to your DMS, there is no place in the process for your scanner to OCR the document. In this case, you have just added a non-searchable document to your system.

It’s a little-known fact that search engines supplied with all commercial DMS do not make image-based documents searchable. They assume that the document was searchable when it was saved.

So, having determined that it’s imperative that all your documents are searchable, what should you do? Should your scanners OCR all documents? What about emails with attachments you save to your DMS? This requires setting up controls over all the ‘entry points’ to your enterprise.

If that’s too hard and unreliable, as is nearly always the case, consider setting up a background process to find image-based documents in the DMS and add the text layer after they are saved. That way, you will know that all your documents are searchable. This is the simplest way of ensuring universal document searchability for both new and legacy documents.
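
To make the idea of a background process concrete, here is a minimal Python sketch of its first half: walking a folder and flagging files that are probably image-only. Everything in it is an assumption for illustration. The function names are hypothetical, a plain folder stands in for the DMS, and the PDF check is a crude heuristic that only looks for a font reference in the raw file. A real process would use a proper PDF library to test for text, and an OCR engine to add the text layer to each flagged file.

```python
import os
from typing import Callable, Iterator

# Assumption: TIFF files are treated as image-only, as described in the article.
IMAGE_ONLY_EXTENSIONS = {".tif", ".tiff"}


def pdf_probably_has_text(path: str) -> bool:
    """Crude heuristic (an assumption, not a reliable test): a PDF with a
    text layer normally references at least one font, so look for the
    '/Font' marker in the raw bytes. PDFs that compress their object
    structure can hide this marker, so expect false alarms."""
    with open(path, "rb") as handle:
        return b"/Font" in handle.read()


def find_unsearchable(
    root: str,
    has_text: Callable[[str], bool] = pdf_probably_has_text,
) -> Iterator[str]:
    """Yield paths under `root` that are likely image-only documents.

    `has_text` is pluggable so the heuristic can be swapped for a real
    PDF text-extraction check.
    """
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            extension = os.path.splitext(name)[1].lower()
            if extension in IMAGE_ONLY_EXTENSIONS:
                yield path  # image format: nothing to search
            elif extension == ".pdf" and not has_text(path):
                yield path  # PDF with no sign of a text layer
```

Run on a schedule, a loop such as `for path in find_unsearchable(export_folder): ...` would hand each flagged file to whatever OCR step the firm uses, which covers both newly saved and legacy documents without policing every entry point.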

