A Probabilistic Approach to Printed Document Understanding

MEDVET, Eric
•
BARTOLI, Alberto
•
DAVANZO, GIORGIO
2011
  • journal article

Journal
INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION
Abstract
We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results: for example, a success rate often greater than 90% even for classes with just two samples.
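The two steps named in the abstract (maximum likelihood estimation from a few labelled samples, then picking the candidate that maximizes the class probability) can be illustrated with a deliberately simplified sketch. Everything specific here is an assumption for illustration and not the model from the article: the `Block` attributes, the choice of independent Gaussians per feature, the variance floor `MIN_VARIANCE`, and the restriction to a single block instead of a sequence of blocks.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """An OCR block; this attribute set is a hypothetical simplification."""
    x: float
    y: float
    width: float
    text: str = ""


FEATURES = ("x", "y", "width")
# Assumed variance floor, so that near-identical samples do not yield zero variance.
MIN_VARIANCE = 1.0


def fit(samples):
    """Maximum likelihood estimate of one Gaussian per feature from labelled blocks."""
    n = len(samples)
    params = {}
    for name in FEATURES:
        values = [getattr(b, name) for b in samples]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n  # the ML estimate divides by n
        params[name] = (mean, max(var, MIN_VARIANCE))
    return params


def log_likelihood(block, params):
    """Sum of per-feature Gaussian log densities (features assumed independent)."""
    total = 0.0
    for name, (mean, var) in params.items():
        v = getattr(block, name)
        total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
    return total


def extract(blocks, params):
    """Return the block of a new document with the highest log-likelihood."""
    return max(blocks, key=lambda b: log_likelihood(b, params))


# Two samples of a hypothetical class: the blocks the operator clicked for one field.
samples = [Block(400, 50, 80), Block(410, 54, 90)]
params = fit(samples)

# Blocks of a new document of the same class (made-up values).
document = [
    Block(30, 40, 200, "ACME Corp"),
    Block(405, 52, 85, "2011-03-14"),
    Block(400, 700, 60, "Total"),
]
print(extract(document, params).text)  # prints 2011-03-14
```

Working in log space keeps the product of densities numerically stable, and the variance floor is one simple way to cope with classes that have only two samples.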
DOI
10.1007/s10032-010-0137-1
WOS
WOS:000297247700003
Archive
http://hdl.handle.net/11368/2303667
info:eu-repo/semantics/altIdentifier/scopus/2-s2.0-81355135319
Rights
closed access
license: digital rights management not defined
FVG url
https://arts.units.it/request-item?handle=11368/2303667
Subjects
  • Document understandin...
  • Automatic model upgra...
  • Invoice analysis
  • Maximum likelihood

Web of Science© citations
21
Date acquired
Mar 25, 2024