Wednesday, June 23, 2010

Google Docs Now Provides Free PDF Conversion

Just last week, I had a request from someone who wanted to convert an existing PDF to an MS Word file so that students could complete the forms in the document on the computer rather than with pen and paper.

As luck would have it, Google just announced today that users will now have the choice to fully convert PDFs to Google Docs when they upload them.  Furthermore, they have added the ability to perform Optical Character Recognition on a PDF which began as a scanned document.  This is a huge announcement.

Until now, anyone who wanted to convert a PDF document to a Word document had to either do so one or two pages at a time, painfully slowly, on one of several free conversion sites, or purchase overpriced software.  Now they can quickly perform the conversion to Google Docs, and then, if necessary, save the Google Doc as a Word document, all for free.

But the even bigger news is that they can do this on scanned documents.  Even most commercial PDF conversion software balks at converting documents that have their origins as scanned images, because these files contain no text to convert; they are just a series of images or pictures.  Now anyone can convert these files to editable text absolutely for free.  Google Docs will perform the necessary Optical Character Recognition (OCR) to manage the conversion.

Couple this news with my previous post on using your school photocopier to scan to PDF, and you have a readily available system for taking any printed page and converting it to a digital file which can be edited and re-distributed.  This is just one more example of how online applications are rapidly replacing desktop software for so many everyday activities.

Now, before I sign off, I am obliged to point out, for those who have never used OCR software, that Optical Character Recognition is an imperfect science.  Inevitably the user will need to clean up the resulting text to arrive at a final copy.  Also, the quality of the OCR depends greatly on the quality of the original copy.  These things are true of any OCR software.  However, in my initial examination of the Google Docs implementation I was suitably impressed.  I tried three different documents: a scanned magazine article formatted in columns, a scanned fax, and a text PDF.  Both scans were done in black and white at 600 DPI.

  • In the case of the magazine article, Google Docs was able to recognize the column format and processed each column separately.  There were certainly some errors, but I was impressed that the software wasn't fooled by the columns.
  • The faxed letter turned out remarkably well, although because of the original formatting, the result would have taken some work to re-format.  Google Docs chose to entirely ignore the letterhead, which was in an italic/script font (not easy for most software to recognize).
  • The text document came out fine.  Fonts were, for the most part, preserved, and the text was, of course, accurate, since the conversion was not subject to the vagaries of OCR.  The document I chose was a form, so it needed some re-formatting to adjust the length of the blank lines provided.

Certainly both scanned documents were far from perfect, but anyone who has used OCR software knows this is most often the case.  Google does make editing the final product of an OCR scan somewhat easier by interleaving the page images from the original with the converted text.  It also does this with text PDFs which it converts.  When they are no longer needed, the images can be deleted.

All in all, this development can save many users considerable time, and even money.

