Logo del repository
  1. Home
 
Opzioni

Machine Learning Techniques for Document Processing and Web Security

Sorio, Enrico
2013-03-13
  • Controlled Vocabulary...

Abstract
The task of extracting structured information from documents that are unstructured or whose structure is unknown is of uttermost importance in many application domains, e.g., office automation, knowledge management, machine-to-machine interactions. In practice, this information extraction task can be automated only to a very limited extent or subject to strong assumptions and constraints on the execution environment. In this thesis work I will present several novel application of machine learning techniques aimed at extending the scope and opportunities for automation of information extraction from documents of different types, ranging from printed invoices to structured XML documents, to potentially malicious documents exposed on the web. The main results of this thesis consist in the design, development and experimental evaluation of a system for information extraction from printed documents. My approach is designed for scenarios in which the set of possible documents layouts is unknown and may evolve over time. The system uses the layout information to define layout-specific extraction rules that can be used to extract information from a document. As far as I know, this is the first information extraction system that is able to detect if the document under analysis has an unseen layout and hence needs new extraction rules. In such case, it uses a probability based machine learning algorithm in order to build those extraction rules using just the document under analysis. Another novel contribution of our system is that it continuously exploits the feedback from human operators in order to improve its extraction ability. I investigate a method for the automatic detection and correction of OCR errors. The algorithm uses domain-knowledge about possible misrecognition of characters and about the type of the extracted information to propose and validate corrections. I propose a system for the automatic generation of regular expression for text-extraction tasks. The system is based on genetic programming and uses a set of user-provided labelled examples to drive the evolutionary search for a regular expression suitable for the specified task. As regards information extraction from structured document, I present an approach, based on genetic programming, for schema synthesis starting from a set of XML sample documents. The tool takes as input one or more XML documents and automatically produces a schema, in DTD language, which describes the structure of the input documents. Finally I will move to the web security. I attempt to assess the ability of Italian public administrations to be in full control of the respective web sites. Moreover, I developed a technique for the detection of certain types of fraudulent intrusions that are becoming of practical interest on a large scale.
Archivio
http://hdl.handle.net/10077/8533
Diritti
open access
Soggetti
  • machine learning

  • document understandin...

  • web security

  • cloud computing

Visualizzazioni
1
Data di acquisizione
Apr 19, 2024
Vedi dettagli
google-scholar
Get Involved!
  • Source Code
  • Documentation
  • Slack Channel
Make it your own

DSpace-CRIS can be extensively configured to meet your needs. Decide which information need to be collected and available with fine-grained security. Start updating the theme to match your nstitution's web identity.

Need professional help?

The original creators of DSpace-CRIS at 4Science can take your project to the next level, get in touch!

Realizzato con Software DSpace-CRIS - Estensione mantenuta e ottimizzata da 4Science

  • Impostazioni dei cookie
  • Informativa sulla privacy
  • Accordo con l'utente finale
  • Invia il tuo Feedback