Reconocimiento de Escritura


Page 1: Reconocimiento de Escritura

Reconocimiento de Escritura

Daniel Keysers

Image Understanding and Pattern Recognition

German Research Center for Artificial Intelligence (DFKI)

Kaiserslautern, Germany

Mar-2007

Keysers: RES-07 1 Mar-2007

Page 2: Reconocimiento de Escritura

Outline

OCR - Introduction

OCR fonts

Tesseract

Sources of OCR Errors


Page 3: Reconocimiento de Escritura

Outline

OCR - Introduction

OCR fonts

Tesseract

Sources of OCR Errors


Page 4: Reconocimiento de Escritura

OCR

- OCR = Optical Character Recognition

- steady progress in OCR since the mid-fifties

- 1975: the IBM Optical Page Reader cost over 3,000,000 US$ and displaced several dozen keypunch operators

- applications:
  - home or office use
  - forms processing
  - address reading
  - conversion of large archives of text to computer-readable form


Page 5: Reconocimiento de Escritura

OCR

Although researchers have worked on the problem of OCR for at least thirty years, there has been a renewed interest in OCR technology in the recent years. This is partly due to

- the increasing need for efficient information storage and retrieval,

- the increasing need for cross-language information access, and

- the dramatic drop in scanner prices

- [large scale digitization projects]

(Kanungo, 1998)


Page 6: Reconocimiento de Escritura

OCR Paradigms


Page 7: Reconocimiento de Escritura

OCR

building blocks of an OCR system:

- noise removal

- skew estimation

- page segmentation & line detection (= layout analysis)

- segmentation

- character classifier

- language modeling

What are the main differences to a handwriting recognition system?


Page 8: Reconocimiento de Escritura

OCR

even 99% accuracy = 30 errors on a typical printed page of 3000 characters

“in almost every application, either the OCR results must be corrected by a human operator or a significant fraction of the documents are rejected in favor of operator entry” (Nagy+ 2000)
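The error count on this slide is simple arithmetic: errors = characters × (1 − accuracy). A minimal sketch (the function name `expected_errors` is made up for illustration):

```python
def expected_errors(num_chars: int, accuracy: float) -> int:
    """Expected number of misrecognized characters at a given character accuracy."""
    return round(num_chars * (1.0 - accuracy))


# The slide's example: 3000 characters per page at 99% accuracy.
print(expected_errors(3000, 0.99))   # 30
# Even at 99.9% accuracy a few errors per page remain.
print(expected_errors(3000, 0.999))  # 3
```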


Page 9: Reconocimiento de Escritura

Commercial OCR


Page 10: Reconocimiento de Escritura

Commercial OCR


Page 11: Reconocimiento de Escritura

Commercial OCR


Page 12: Reconocimiento de Escritura

Commercial OCR


Page 13: Reconocimiento de Escritura

Systems

commercial:

- Abbyy (FineReader)

- Nuance/Scansoft (OmniPage)

- Océ (RecoStar)

- Iris (ReadIris)

- many smaller vendors

open source:

- ocrad

- gocr

- Tesseract

- OCRopus


Page 14: Reconocimiento de Escritura

Outline

OCR - Introduction

OCR fonts

Tesseract

Sources of OCR Errors


Page 15: Reconocimiento de Escritura

OCR fonts

If you can, try to make the problem easier. This is usually easier than improving the classifier.


Page 16: Reconocimiento de Escritura

OCR fonts

two groups of OCR fonts

- Magnetic Ink Character Recognition (MICR)

- Optical Character Recognition (OCR)

- artificial distinction

history:

- financial world

- bar-code not easily readable by humans

- magnetic ink preferred

(source: D. Winter)


Page 17: Reconocimiento de Escritura

OCR fonts - E13-B

- first font used in automated banking

- digits and four special symbols used in banking: "dash", "amount", "on-us" and "transit"

- The font was specially designed so that magnetic pulses would be read unambiguously. That is the reason for some of the heavy black features of some symbols. For this purpose a grid of 7 by 11 squares was used, each of which had to be either white or black. The filling was done so that a magnetic scan would give a pulse signal that was very distinctive.


Page 18: Reconocimiento de Escritura

OCR fonts - CMC-7

- digits, letters, and five special symbols

- can also be seen as a barcode: each symbol is encoded by seven vertical bars separated by small or large spaces


Page 19: Reconocimiento de Escritura

OCR fonts - OCR-A

- end of the sixties: full character recognition

- first font: OCR-A


Page 20: Reconocimiento de Escritura

OCR fonts - OCR-B

- accompanies EAN barcodes

- more symbols exist


Page 21: Reconocimiento de Escritura

Output Representation: hOCR Format

- proposed open standard for representing OCR results

- motivation: existing formats have limitations in
  - multilingual capabilities
  - typographic phenomena
  - separate formats for intermediate and final results

- goal: reuse as much existing technology as possible

- main idea: represent various aspects of OCR output
  - logical structuring
  - typesetting
  - character information
  - etc.

- HTML microformat
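Because hOCR is embedded in ordinary HTML, existing HTML tooling can read it. The following sketch uses Python's standard `html.parser` to pull words and bounding boxes out of a small fragment. The fragment itself is made up for illustration. The class names (`ocr_page`, `ocr_line`, `ocrx_word`) and the `bbox x0 y0 x1 y1` property in the `title` attribute follow the hOCR conventions, but the helper names are assumptions.

```python
from html.parser import HTMLParser

# Made-up hOCR fragment, for illustration only.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 600 800">
  <span class="ocr_line" title="bbox 50 100 550 130">
    <span class="ocrx_word" title="bbox 50 100 200 130">Optical</span>
    <span class="ocrx_word" title="bbox 210 100 400 130">Character</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:5])
    return None


class WordExtractor(HTMLParser):
    """Collects (text, bbox) pairs for every element of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # Words without a bbox are skipped in this sketch.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


extractor = WordExtractor()
extractor.feed(SAMPLE)
for text, bbox in extractor.words:
    print(text, bbox)
```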


Page 22: Reconocimiento de Escritura

Assessment of OCR Error Rate

- usually: edit distance or Levenshtein distance (cf. speech recognition)

- in the presence of reading order errors: allow block movements at certain cost (cf. machine translation)

- interesting problem: predict OCR accuracy from a set of document image measurements
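The edit-distance measure can be sketched as follows: the standard dynamic-programming Levenshtein distance, plus a character error rate normalized by the length of the reference text. The function names are chosen for illustration, and block movements for reading-order errors are not modeled here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete r
                cur[j - 1] + 1,          # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by the number of reference characters."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Two substitutions (o -> 0, i -> l) in an 11-character word.
print(levenshtein("recognition", "rec0gnitlon"))  # 2
```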


Page 23: Reconocimiento de Escritura

OCRopus Open Source OCR System

- Layout Analysis → see previous lecture

- Character Recognition Engine:
  - based on Tesseract → discussed here
  - based on segmentation → see lecture on handwriting recognition

- flexible architecture


Page 24: Reconocimiento de Escritura

OCRopus Open Source OCR System

- ‘OCRopus’ is the DFKI open source OCR system

- framework for layout analysis and OCR integration

- DFKI layout analysis + DFKI or HP-labs ‘Tesseract’ OCR

- preliminary evaluation on 18 documents, 15K words

- goals: improving adaptivity to new fonts and layouts


Page 25: Reconocimiento de Escritura

OCRopus Open Source OCR System — Release

- first version (technology preview) to be released soon

- other application: screen OCR


Page 26: Reconocimiento de Escritura

screen OCR challenges


Page 27: Reconocimiento de Escritura

Screen OCR

- motivation:
  - image-based HTML analysis
  - image-based cut and paste

- character recognizers:
  - HMMs
  - Tesseract


Page 28: Reconocimiento de Escritura

Outline

OCR - Introduction

OCR fonts

Tesseract

Sources of OCR Errors


Page 29: Reconocimiento de Escritura

History

R. Smith: ‘An overview of the Tesseract OCR Engine’, submitted to ICDAR 2007, personal communication.

- developed at HP between 1984 and 1994

- obtained good results at the 1995 UNLV Annual Test of OCR Accuracy

- then, development was stopped while other commercial OCR engines improved

- in late 2005, HP released Tesseract as open source


Page 30: Reconocimiento de Escritura

Architecture

- Layout Analysis was separate, therefore not included

- (OCRopus closes this gap by including the possibility to use DFKI layout analysis with the Tesseract recognition engine)

- only supports US-ASCII

- connected component analysis

- stored as outlines (‘blobs’) → makes it easy to recognize inverse text

- blobs → text-lines


Page 31: Reconocimiento de Escritura

Chopping into Words

distinguish fixed pitch and proportional text

fixed-pitch: chopping simple

proportional: measure gaps in a limited vertical range between the baseline and mean line. Spaces close to the threshold are made fuzzy, so that a final decision can be made after word recognition.
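The idea of fuzzy spaces can be illustrated with a small sketch. The slide does not say how Tesseract picks the threshold or how wide the fuzzy band is, so the `threshold` argument and the `fuzz` margin below are assumptions made for illustration.

```python
def classify_gaps(gaps, threshold, fuzz=0.2):
    """Label each horizontal gap as 'no_space', 'space' or 'fuzzy'.

    Gaps within +/- fuzz (relative) of the threshold are left undecided,
    so that word recognition can make the final call later.
    """
    low = threshold * (1.0 - fuzz)
    high = threshold * (1.0 + fuzz)
    labels = []
    for gap in gaps:
        if gap < low:
            labels.append("no_space")
        elif gap > high:
            labels.append("space")
        else:
            labels.append("fuzzy")
    return labels


# Gap widths in pixels, with a hypothetical threshold of 10 pixels.
print(classify_gaps([2, 9, 11, 20], threshold=10))
# ['no_space', 'fuzzy', 'fuzzy', 'space']
```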


Page 32: Reconocimiento de Escritura

Adaptation Using Two Passes

first pass:

- attempt to recognize each word

- ‘good’ words are used as adaptation data

second pass:

- re-recognize the words that were not recognized well in the first pass

- use adaptation data
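The two-pass scheme can be sketched with a toy recognizer. Everything here is an assumption made for illustration: each character is reduced to a single number, the classifier is a nearest-prototype rule, and a word counts as ‘good’ when its worst character distance is below a fixed limit. Tesseract's actual adaptive classifier is more elaborate.

```python
def recognize(word, prototypes):
    """Label each feature value with its nearest prototype.

    Returns (text, worst_distance); a small worst distance means a 'good' word.
    """
    text, worst = "", 0.0
    for x in word:
        label = min(prototypes, key=lambda c: abs(prototypes[c] - x))
        worst = max(worst, abs(prototypes[label] - x))
        text += label
    return text, worst


def two_pass(words, static_prototypes, good=0.15):
    # First pass: recognize every word with the static prototypes.
    first = [recognize(w, static_prototypes) for w in words]

    # 'Good' words supply adaptation data: per-character samples from this document.
    samples = {}
    for w, (text, worst) in zip(words, first):
        if worst <= good:
            for x, c in zip(w, text):
                samples.setdefault(c, []).append(x)
    adapted = dict(static_prototypes)
    for c, xs in samples.items():
        adapted[c] = sum(xs) / len(xs)

    # Second pass: only poorly recognized words are tried again.
    final = []
    for w, (text, worst) in zip(words, first):
        if worst > good:
            text, _ = recognize(w, adapted)
        final.append(text)
    return final


static = {"a": 0.0, "b": 1.0}
words = [[0.1, 1.1], [0.12, 0.1], [0.58, 1.1]]
print(two_pass(words, static))  # ['ab', 'aa', 'ab']
```

In this toy run the third word is read as "bb" in the first pass and corrected to "ab" once the prototypes have shifted toward the document's own character shapes.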


Page 33: Reconocimiento de Escritura

Line Finding

- does not need de-skewing

- filter large and small blobs out

- fit remaining blobs to parallel text line model

- re-assign left out blobs

- fit baselines as splines
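The blob filtering step can be sketched as a size filter relative to the median blob height, which approximates the text size. The cut-off fractions `low` and `high` are made-up values, and the line fitting itself is not shown.

```python
from statistics import median


def filter_blobs(heights, low=0.5, high=2.0):
    """Split blob heights into (kept, left_out) relative to the median height.

    Small blobs (punctuation, noise) and large blobs (drop caps, touching
    lines) are left out of line fitting and re-assigned afterwards.
    """
    med = median(heights)
    kept, left_out = [], []
    for h in heights:
        if low * med <= h <= high * med:
            kept.append(h)
        else:
            left_out.append(h)
    return kept, left_out


# Heights in pixels: one speck of noise (5) and one drop cap (60).
print(filter_blobs([20, 22, 21, 5, 60, 19]))
# ([20, 22, 21, 19], [5, 60])
```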


Page 34: Reconocimiento de Escritura

Segmentation of Touching Characters

While the result from a word is unsatisfactory, Tesseract attempts to improve the result by chopping the blob with worst confidence.

Candidate chop points are found from concave vertices of a polygonal approximation of the outline, and may have either another concave vertex opposite, or a line.
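Finding concave vertices of a polygon can be sketched with a cross-product test. This assumes a simple polygon with vertices listed counter-clockwise in a y-up coordinate system; it covers only the first step, not the pairing of chop points or the chopping itself.

```python
def concave_vertices(polygon):
    """Indices of concave vertices of a counter-clockwise simple polygon.

    At a convex vertex of a counter-clockwise polygon the outline turns left
    (positive cross product); a right turn marks a concave vertex.
    """
    n = len(polygon)
    result = []
    for i in range(n):
        ax, ay = polygon[i - 1]
        bx, by = polygon[i]
        cx, cy = polygon[(i + 1) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross < 0:
            result.append(i)
    return result


# Outline with a notch at (2, 1), as where two characters touch.
outline = [(0, 0), (4, 0), (4, 3), (2, 1), (0, 3)]
print(concave_vertices(outline))  # [3]
```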


Page 35: Reconocimiento de Escritura

Combine Broken Characters

A* search of the segmentation graph of possible combinations of the maximally chopped blobs into candidate characters without actually building the segmentation graph, but instead maintaining a hash table of visited states

character classifier can recognize broken characters directly
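The search over combinations can be sketched as a best-first search in which successor states are generated on demand and a dictionary records visited states, so the full graph is never built. This sketch uses a zero heuristic (uniform-cost search, a special case of A*). The state encoding, the `cost` function and the `max_span` limit are assumptions; in a real system the cost would come from the character classifier.

```python
import heapq


def best_segmentation(n_pieces, cost, max_span=3):
    """Group consecutive pieces into characters at minimum total cost.

    A state is the number of pieces consumed so far. cost(start, end) is the
    cost of treating pieces start..end-1 as one character.
    """
    frontier = [(0.0, 0, ())]  # (cost so far, state, groups chosen)
    visited = {}               # hash table of expanded states
    while frontier:
        g, pos, groups = heapq.heappop(frontier)
        if pos == n_pieces:
            return g, list(groups)
        if pos in visited:
            continue
        visited[pos] = g
        # Successors are generated lazily instead of building the graph.
        for end in range(pos + 1, min(pos + max_span, n_pieces) + 1):
            step = cost(pos, end)
            heapq.heappush(frontier, (g + step, end, groups + ((pos, end),)))
    return None


# Made-up classifier costs for four pieces; unlisted groupings are expensive.
table = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.2, (2, 3): 0.1, (3, 4): 0.7, (2, 4): 0.9}
total, groups = best_segmentation(4, lambda s, e: table.get((s, e), 5.0))
print(groups)  # [(0, 2), (2, 3), (3, 4)]
```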


Page 36: Reconocimiento de Escritura

Character Classifier

- features of the test (unknown) outline are matched many-to-one to prototype features

- test features are 3-dimensional (x position, y position, angle), typically 50-100 per character

- prototype features are 4-dimensional (x position, y position, angle, length), typically 10-20 features in a prototype

- hierarchical classification and use of lookup-tables for speed-up


Page 37: Reconocimiento de Escritura

Normalization

baseline and moment normalization used in adaptive and static classifier, respectively
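Moment normalization can be sketched as follows: the first moments (centroid) fix the position and the second moments fix the size, separately in x and y. This is a generic illustration on a list of outline points, not Tesseract's implementation.

```python
from math import sqrt


def moment_normalize(points):
    """Translate points to zero mean and scale each axis to unit standard deviation."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Fall back to 1.0 for degenerate shapes with no extent along an axis.
    sx = sqrt(sum((x - cx) ** 2 for x, _ in points) / n) or 1.0
    sy = sqrt(sum((y - cy) ** 2 for _, y in points) / n) or 1.0
    return [((x - cx) / sx, (y - cy) / sy) for x, y in points]


# A 2-by-4 rectangle becomes a square centered on the origin.
print(moment_normalize([(0, 0), (2, 0), (2, 4), (0, 4)]))
# [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
```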


Page 38: Reconocimiento de Escritura

Training Data

- no need for broken characters in training

- 20 samples of 94 characters from 8 fonts in a single size, but with 4 attributes (normal, bold, italic, bold italic), making a total of 20 × 94 × 8 × 4 = 60,160 training samples

- other classifiers: often more than 1,000,000 training samples


Page 39: Reconocimiento de Escritura

Linguistic Analysis

very basic linguistic analysis: look-up in different dictionaries
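A dictionary look-up of this kind can be sketched as a re-ranking of candidate words. The slide does not say how Tesseract combines dictionary hits with classifier scores, so the rule below (prefer any candidate found in a dictionary, then take the highest score) and all names are assumptions.

```python
def pick_word(candidates, dictionaries):
    """Choose a word from (word, score) candidates; higher scores are better.

    Candidates found in any dictionary are preferred; if none is found,
    fall back to the best-scoring candidate overall.
    """
    in_dictionary = [
        (word, score)
        for word, score in candidates
        if any(word.lower() in d for d in dictionaries)
    ]
    pool = in_dictionary or candidates
    return max(pool, key=lambda ws: ws[1])[0]


words = {"classifier", "character"}
numbers = {"1994", "2005"}
candidates = [("c1assifier", 0.9), ("classifier", 0.8)]
print(pick_word(candidates, [words, numbers]))  # classifier
```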


Page 40: Reconocimiento de Escritura

Outline

OCR - Introduction

OCR fonts

Tesseract

Sources of OCR Errors


Page 41: Reconocimiento de Escritura

Major Classes of OCR Error Sources

(Nagy+ 2000)


Page 42: Reconocimiento de Escritura

Examples


Page 43: Reconocimiento de Escritura

Examples


Page 44: Reconocimiento de Escritura

Examples


Page 45: Reconocimiento de Escritura

Examples


Page 46: Reconocimiento de Escritura

Potential Sources of Improvement

(Nagy+, 2000)

- improved image processing

- adaptation of the classifier to the current document

- multi-character recognition

- increased use of context

- [combination of multiple OCR results]


Page 47: Reconocimiento de Escritura

Adaptation

- OCR systems generally optimized for average performance over large sets of characters of different
  - fonts
  - sizes
  - scan qualities

- improve performance of character recognizer by using style information


Page 48: Reconocimiento de Escritura

Adaptation Approaches

- modeling the sample distribution [Breuel 2001]
  - each document contains only a single style
  - modeling sample distribution as a mixture of Gaussians

- hierarchical Bayesian approach [Mathis et al. 2002]
  - each document contains a small number of fonts
  - estimation of prior distributions of a style variable

- style constrained classifiers [Sarkar et al. 2005]
  - style consistency constraint is a hidden variable
  - combination of class and style represented by Gaussian mixture model
  - each mixture component trained by estimating it directly from the set of samples from a specific class and style

- maximum likelihood linear regression [Senior et al. 1997]
  - based on work on speaker adaptation [Leggetter 1995]
  - adaptation using linear transformations
