Universal Document Searchability: The Case for Shifting the OCR Goalposts

The dictionary defines the term Achilles' heel as "a seemingly small but actually crucial weakness," and the term fits indexing and searching in document and enterprise content management systems well. This article looks at why it would be unwise to assume that every document in a content repository is fully searchable, despite the OCR hardware, software and workflows, and the advanced search technology, that many law firms have at their disposal.

The Risks Are Great

Law firms have invested significantly in document management systems and search technologies over the years to store and manage all their client documents. The idea is that these technologies will deliver efficiency and complete access to each and every document related to any case or matter. Simply type a series of keywords into a search query field, and all the documents that meet the search criteria will be displayed. Sounds great--in theory.

The reality, however, is very different. Research indicates that as much as 30% of documents in a content repository are actually "invisible" to search. In practice, that could mean nearly a third of the documents relating to your case never appear in your search results. The culprit: image-based documents.

Image-based documents are JPGs, TIFs, PNGs and image PDFs. These documents are profiled into your document management system in a variety of ways. While many of them get OCR'ed, many do not, and since they are image files with no text, they do not get indexed. Instead, they become invisible to your search technology. This has enormous implications for law firms and law departments.
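To make the "invisible document" idea concrete, here is a minimal Python sketch of how a repository audit might flag files that give the search index nothing to read. It is an illustration under stated assumptions, not a description of any particular document management system: the function names needs_ocr and invisible_share are hypothetical, and the text of each PDF is assumed to have been pulled out beforehand by whatever extractor the firm already uses.

```python
from pathlib import PurePath

# Image formats named in the article; files of these types carry no text.
RASTER_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def needs_ocr(filename: str, extracted_text: str = "") -> bool:
    """Return True when a document has nothing for the search index to read.

    Hypothetical helper: `extracted_text` is whatever a text extractor
    returned for the file (an assumption; no specific extractor is implied).
    """
    suffix = PurePath(filename).suffix.lower()
    if suffix in RASTER_EXTENSIONS:
        return True
    if suffix == ".pdf":
        # An image PDF yields no text (or only whitespace) when extracted.
        return not extracted_text.strip()
    return False


def invisible_share(documents) -> float:
    """Fraction of (filename, extracted_text) pairs that search cannot see."""
    documents = list(documents)
    if not documents:
        return 0.0
    hidden = sum(1 for name, text in documents if needs_ocr(name, text))
    return hidden / len(documents)
```

Running invisible_share over a sample of a repository would give a firm its own estimate of non-searchable content to set beside the 30% figure quoted above.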

Failure to produce documents on demand hurts the bottom line, workplace efficiency, regulatory compliance and productivity. It also exposes a firm to unnecessary risks, which can lead to sanctions, dismissal of claims and the ultimate loss of a case, as well as damage to the firm's reputation.

OCR – Wrong Time, Wrong Place

So what is the answer? Perhaps the answer is not so much what, but when.

Certainly, the answer is still OCR'ing, i.e., converting image-based documents to text-searchable documents so that they can be indexed when profiled into the document management system. The bigger question is when and where the OCR'ing process should take place.

Multifunction devices, scanners and OCR'ing software are commonplace in a modern law firm. In most firms, OCR'ing happens at the point of entry, i.e., paper and electronic documents get OCR'ed as soon as they are received by the firm. This, however, is inefficient, costly and unreliable. Consider how much time staff spend either OCR'ing documents at their desks or feeding documents into the scanner or multifunction device. Or consider how many documents bypass the OCR'ing process altogether: documents ingested from acquisitions and imported litigation files; documents saved into the document management system using mobile technology and portable devices; and legacy documents that are bulk imported into these systems.

Now the 30% of non-searchable content starts to make sense.

OCR – Right Time, Right Place

So if OCR'ing documents at the point of entry is not the answer, what is? OCR'ing documents at the "end point," i.e., once the documents have been saved into the document management system.

Shifting the OCR goalposts from a frontend to a backend process will deliver huge benefits to law firms in terms of efficiency, productivity, searchability and cost savings. More importantly, a backend approach to OCR'ing will ensure that every document is made searchable once it is saved into the content repository, irrespective of the entry point.

The system will work in two modes: one will monitor newly profiled documents so that they are OCR'ed and made available for indexing immediately; the other will OCR all the legacy documents already in the system. This approach provides law firms with significant benefits:

  • 100% searchability – all image-based documents in the document management system are OCR'ed, adding an invisible layer of text to each document. This will ensure that the document is indexed by the system. Law firms can be certain that all documents are 100% searchable.
  • Increased organizational productivity – staff members do not need to OCR documents. Instead, they can concentrate on more important tasks. By ensuring that every document is text-searchable, firms will be able to eliminate the productivity losses and downtime spent looking for misfiled documents.
  • Increased efficiency through automation – firms will be able to automate the entire process so that processing can take place 24/7.
  • Simplified management of image-based documents – firms will be able to do away with multiple OCR'ing processes and workflows in favour of a single, centralized approach.
  • Reduced costs – firms will be able to reduce OCR’ing hardware and software requirements.
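
The two modes described above can be sketched in a few lines of Python. This is only an illustrative outline under assumptions, not how any vendor's product works: the class BackendOcrService, the Document record and the ocr callable are all hypothetical names, and the repository is modelled as a plain in-memory dictionary rather than a real document management system.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: int
    searchable: bool  # True once the document carries a text layer


class BackendOcrService:
    """Hypothetical backend OCR worker with a 'new' mode and a 'legacy' mode."""

    def __init__(self, repository, ocr):
        self.repository = repository   # dict of doc_id -> Document (assumed)
        self.ocr = ocr                 # callable that OCRs one Document (assumed)
        self.new_queue = deque()       # mode 1: newly profiled documents
        self.legacy_queue = deque(     # mode 2: backlog already in the system
            d.doc_id for d in repository.values() if not d.searchable
        )

    def on_document_saved(self, doc):
        """Called whenever a document is saved, whatever its entry point."""
        self.repository[doc.doc_id] = doc
        if not doc.searchable:
            self.new_queue.append(doc.doc_id)

    def process_next(self):
        """OCR one document, preferring new arrivals over the legacy backlog."""
        queue = self.new_queue or self.legacy_queue
        if not queue:
            return None
        doc = self.repository[queue.popleft()]
        if not doc.searchable:
            self.ocr(doc)
            doc.searchable = True
        return doc.doc_id
```

Giving new arrivals priority is one possible design choice: it keeps current matters searchable right away while the legacy backlog is worked through whenever the queue of new documents is empty.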


DocsCorp LLC provides solutions that enable legal organizations to improve workflow, reduce risk and increase the efficiency of managing business-critical documents. pdfDocs offers PDF and PDF/A creation, collation, annotation, redaction, Bates numbering, file-splitting, electronic filing with the USPTO, OCR, Closing Book creation and document comparison capabilities.

Thomson Reuters Elite Headquarters
800 Corporate Pointe, Suite 150, Culver City, CA 90230
© 2015 Thomson Reuters