OCR-Team-Report

Technical sketch for an IT-System

 

Basically, the following steps have to be performed:

 

  • Digitization of the literary journals. The existing corpus of texts can be digitized by the digitization service at the university library of Tübingen. The result will be high-quality scans in image form.
  • Optical Character Recognition (OCR) will be needed to transform the digitized images into text.

◦     First experiments with Devanagari OCR proved to be very successful.

◦     The Perso-Arabic script used for Urdu, in particular the Nasta’liq style, will be a more challenging OCR task. Additional research will have to be carried out during the project, and one project result will be a production-ready solution. To prevent an asymmetry between the text corpora, Urdu text will at first be typed manually.

◦     In both cases, intelligent algorithms from computational linguistics will be applied to improve the results.

  • Text-to-speech systems will be applied to generate audio files from the OCRed text. Again, state-of-the-art technology will be deployed to achieve intelligible results.
  • Additionally, a full-text index and a database of all materials will be created.
  • All materials, i.e. scanned images, OCRed texts and audio files, will be published on the web site. Via a retrieval interface, the corpora can be searched using the database and the full-text index.
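
As a rough illustration of the full-text index step above, the following minimal Python sketch builds an inverted index over OCRed page texts and answers simple all-terms queries. It is only a sketch under stated assumptions: the document IDs, the sample sentences and the function names (`tokenize`, `build_index`, `search`) are hypothetical and not part of the project plan, and a production system would more likely rely on an established search engine. Tokens are split on whitespace rather than with a `\w` regular expression, because Devanagari vowel signs are combining marks that Python's `\w` does not match.

```python
import unicodedata
from collections import defaultdict


def _is_punct(ch):
    """True for punctuation and symbol characters (Unicode categories P*, S*)."""
    return unicodedata.category(ch)[0] in ("P", "S")


def tokenize(text):
    """Split text on whitespace and strip punctuation from both token edges.

    This also strips the Devanagari danda and the Arabic full stop,
    since both belong to a Unicode punctuation category.
    """
    tokens = []
    for raw in unicodedata.normalize("NFC", text).split():
        start, end = 0, len(raw)
        while start < end and _is_punct(raw[start]):
            start += 1
        while end > start and _is_punct(raw[end - 1]):
            end -= 1
        if start < end:
            tokens.append(raw[start:end].casefold())
    return tokens


def build_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample pages; real input would be the OCRed journal texts.
sample_documents = {
    "hindi-001": "यह एक पत्रिका है।",
    "urdu-001": "یہ ایک رسالہ ہے۔",
}
sample_index = build_index(sample_documents)
print(search(sample_index, "पत्रिका"))
print(search(sample_index, "رسالہ"))
```

The same index could point from each hit back to the corresponding scan image and audio file, so that one query in the retrieval interface serves all three kinds of material.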
