A PDF document can have an image layer and a text layer. Alfresco can index the content of the text layer, so a PDF with a text layer is searchable in Alfresco.
But what happens with a PDF document that has no text layer, such as a scanned PDF? Such documents are not indexed, and the search will never retrieve them. This behaviour can be confusing for users, who won’t see the same behaviour from two documents with the same MIME type (PDF): one will show up in the search, while the other won’t.
We created an open-source OCR solution to address this problem. The goal is to identify all PDFs with no text layer in the repository and run the following actions on each of them:
- split each document into multiple images: one for each page.
- run an OCR engine on each image to extract the text (and layout) from it. The input is a page image, the output is an hOCR file.
- merge each page image and its corresponding hOCR file into a PDF. The result contains the visual content of the input image, with a hidden text layer built from the hOCR file.
- merge all the PDFs created for each page back into a single PDF.
In short, we take a multi-page PDF with only an image layer and transform it into another multi-page PDF that looks the same but has a hidden text layer containing the OCR output.
hOCR is an open format based on HTML. It represents OCR output by combining layout and style information with the recognized text itself.
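To make the hOCR format more concrete, here is a minimal Python sketch that reads words and their bounding boxes from a tiny hOCR fragment. The fragment is hand-written for illustration (it is not real OCR output), and `HocrWords` and `parse_bbox` are hypothetical helper names. The `ocrx_word` class and the `bbox` entry in the `title` attribute follow the usual hOCR conventions.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment (illustrative only, not real OCR output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 30 90 300 120'>
    <span class='ocrx_word' title='bbox 30 90 110 120; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 120 90 210 120; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # The first non-blank text after an ocrx_word start tag is the word.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


parser = HocrWords()
parser.feed(SAMPLE_HOCR)
for text, bbox in parser.words:
    print(text, bbox)
```

The recognized text and its position on the page travel together, which is what makes it possible to place an invisible text layer exactly over the scanned image.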
Here are the open-source tools that we chose for each step:
- splitting PDF pages: PDFtk
- OCR: Tesseract-ocr
- merging image & hOCR: hOcr2Pdf
- merging PDF pages: PDFJoin
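The article does not include the script itself, so the following Python sketch only shows how the four tools could be chained. It builds the command lines without running them. Several details are assumptions: the file names are hypothetical, `pdftoppm` is used to turn each single-page PDF into an image (the article does not name the tool used for that), and the `.hocr` output extension depends on the Tesseract version (older releases write `.html`). Exact flags may differ between tool versions.

```python
from pathlib import Path


def plan_ocr_commands(input_pdf, page_count, workdir):
    """Return the pipeline as a list of (argv, stdin_path) pairs.

    Nothing is executed here; the function only builds command lines.
    stdin_path is the file to feed on standard input, or None.
    """
    work = Path(workdir)
    steps = []

    # 1. Split the PDF into one single-page PDF per page.
    steps.append(
        (["pdftk", str(input_pdf), "burst", "output", str(work / "pg_%04d.pdf")], None)
    )

    page_pdfs = []
    for n in range(1, page_count + 1):
        stem = work / f"pg_{n:04d}"
        # Assumption: pdftoppm rasterises the page (tool not named in the article).
        steps.append(
            (["pdftoppm", "-r", "300", "-png", "-singlefile", f"{stem}.pdf", str(stem)], None)
        )
        # 2. OCR the page image; the "hocr" config asks Tesseract for hOCR output.
        steps.append((["tesseract", f"{stem}.png", str(stem), "hocr"], None))
        # 3. Merge image and hOCR; hocr2pdf reads the hOCR on standard input.
        out_pdf = f"{stem}_ocr.pdf"
        steps.append((["hocr2pdf", "-i", f"{stem}.png", "-o", out_pdf], f"{stem}.hocr"))
        page_pdfs.append(out_pdf)

    # 4. Join the per-page PDFs back into a single document.
    steps.append((["pdfjoin", "--outfile", str(work / "result.pdf"), *page_pdfs], None))
    return steps


for argv, stdin_path in plan_ocr_commands("scan.pdf", 2, "work"):
    suffix = f" < {stdin_path}" if stdin_path else ""
    print(" ".join(argv) + suffix)
```

Keeping the plan separate from its execution makes it easy to inspect or log the commands before handing them to `subprocess.run`.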
We wrote a Linux script to run the whole process, and we call it from Alfresco through a custom ContentTransformer. This transformer is unusual because its source and target MIME types are identical. We don’t want Alfresco to pick it up on its own in an uncontrolled way, so we created it as “unregistered”, which means that it cannot be found through the Transform service and can be called only by direct reference.
As the OCR process can be quite demanding for the server, we chose to run it at night. We built a job that runs every night, looks for new PDF documents in the repository with no text layer, and explicitly calls the custom transformer on each of them. The job then creates a new version of the document in the repository from the ContentTransformer output.
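The control flow of such a job can be sketched in a few lines of Python. This is a simplified model, not Alfresco code: `Document`, `run_nightly_ocr`, and the two callables passed in are hypothetical stand-ins for the repository node, the scheduled job, the text-layer check, and the custom transformer. Restricting the scan to documents added since the last run is left out for brevity.

```python
from dataclasses import dataclass, field

PDF_MIMETYPE = "application/pdf"


@dataclass
class Document:
    """Hypothetical stand-in for a repository node with a version history."""
    name: str
    mimetype: str
    versions: list = field(default_factory=list)  # contents, oldest first


def run_nightly_ocr(documents, has_text_layer, ocr_transform):
    """OCR every PDF whose current version has no text layer.

    Returns the names of the documents that received a new version.
    """
    processed = []
    for doc in documents:
        if doc.mimetype != PDF_MIMETYPE:
            continue  # only PDFs are candidates
        current = doc.versions[-1]
        if has_text_layer(current):
            continue  # already searchable, nothing to do
        # Store the transformer output as a new version of the same document.
        doc.versions.append(ocr_transform(current))
        processed.append(doc.name)
    return processed


# Toy run: contents are plain strings, "text:" marks a text layer.
docs = [
    Document("scan.pdf", PDF_MIMETYPE, ["image-only"]),
    Document("report.pdf", PDF_MIMETYPE, ["text:report"]),
    Document("photo.png", "image/png", ["pixels"]),
]
done = run_nightly_ocr(
    docs,
    has_text_layer=lambda content: content.startswith("text:"),
    ocr_transform=lambda content: "text:" + content,
)
print(done)
```

Because the new version has a text layer, a document processed one night is skipped by the check on the following nights.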
In Alfresco, it’s very easy to tell a PDF with a text layer from one without. We use the PDFBox library included in Alfresco for this purpose.
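The article does not show the check itself. One plausible approach, assumed here, is to extract the text of every page (for example with PDFBox's `PDFTextStripper`) and to treat the document as having no text layer when nothing but whitespace comes out. The Python sketch below only illustrates that decision rule; the function name and the `min_chars` threshold are hypothetical.

```python
def has_text_layer(page_texts, min_chars=1):
    """Decide whether extracted page texts indicate a text layer.

    page_texts: one string per page, as returned by a text extractor.
    min_chars: minimum number of non-whitespace characters required.
    """
    # Count characters after dropping every kind of whitespace.
    count = sum(len("".join(text.split())) for text in page_texts)
    return count >= min_chars


print(has_text_layer(["", " \n"]))          # scanned PDF: extraction yields nothing
print(has_text_layer(["", "Invoice 42"]))   # at least one page carries text
```

Raising `min_chars` is one way to ignore scans that contain only a stray stamp or page number as real text.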
In conclusion, it would be easy to customize this example to fit other requirements. For instance, we could create a policy to call the transformation on the fly instead of at night, take an image directly as input, or create a new document in a specific folder instead of a new version.
This shows how flexible Alfresco development and open-source solutions can be.