GROPUS
GROPUS - an Adaptive Rule-Based Approach to Information Extraction
Internet, mass media, scientific literature are the source of huge, continuously growing amount of information that is comprised by natural language texts and stored in digital form. This information can hardly be immediately accessed and processed by computers while human access is often connected with a time-consuming search. Extracting and storing it in a formal representation (e.g. in form of relations in databases) allows efficient querying and easy administration of the extracted data.
Moreover, information stored and queried in a canonical way can be processed and interpreted by computers without human interaction; it can serve for establishing ontologies, creation of knowledge bases and data analysis. The area of IE comprises techniques, algorithms and methods performing two important tasks: finding (identifying) the desired, relevant data in natural language texts and storing it in a structured representation suitable for automatic processing.
First IE systems relied on domain-specific extraction rules written by a domain expert requiring large human effort and lacking portability to other domains. To compensate the insufficiencies of the classical rule-based approach human effort should be adequately replaced by a learning component. The main goal of this project has been the development of an adaptive, rule-based algorithm for IE that autonomously learns the extraction rules. The algorithm is based on induction learning deriving general extraction rules from a set of sample extractions annotated by a human in a training corpus. Requiring only an annotated training corpus and no additional resources the approach is portable to different application domains and even languages (in the dissertation its effectiveness for English and German text corpora has been examined).
The extraction rules incorporate linguistic patterns that capture typical expression forms of extracted information in a given text corpus. We introduce a higher-order formal pattern specification language that supports regular expressions, permutation, negation and hierarchical XML structures significantly extending common pattern models. Linguistic patterns are not restricted to a fix context window, but encode whole sentences as primary semantic units of natural language. The proposed pattern language is powerful and expressive enough to capture non-trivial kinds of phrases and sentences containing relevant information.
The linguistic patterns are matched with linguistically preprocessed texts that have a valid XML markup. Regarding linguistic patterns as XML queries we reduce the problem of IE to XML query evaluation. Having developed formal semantics and an efficient query evaluation algorithm for the pattern language we create a new XML query language, which is especially suitable for querying XML annotated texts.
As a part of semantic text preprocessing we propose a new method for determination of synonymy. We construct a lexical graph connecting lexical items in a way corresponding to the sentence structure building an implicit context representation. We demonstrate that the synonymy metric based on the length of paths between two lexical items, number and specificity of shared neighbors achieves satisfactory results evaluating it on a test corpus of 200 German synonyms. Identified synonyms are used for abstraction of lexical items during the rule induction.
Beginning with the rules generated from training instances, which were extracted by the human, rules are generalized to account for different kinds of information expression in the texts. The generalization of rules is formally specified and involves beside rule merging abstraction of single rules and substitution of extracted parts in context of different rules. For establishing a similarity measure for extraction rules and rule merging an algorithm for determination of optimal alignment of two sequences with minimum runtime (which is an extension of the LCS problem) has been designed and its correctness proved. To achieve a gradual generalization of extraction rules the rule learning algorithm includes validation of induced rules and rule correction.
We demonstrate the effectiveness of our approach comparing its performance with other state of the art approaches achieving comparable or even best results depending on the kind of texts and assess its potential comparing its results with the human performance. Based on varying performance of different approaches on different corpora conclusions about the efficiency of statistical and rule-based approaches for different kinds of text are made. The quantitative investigation is supplemented by the analysis what factors influence the extraction quality, what are the sources of errors etc. Finally, we draw a conclusion in what conditions application of IE in general is expedient, what kinds of text can be managed and characterize the range of environments where the presented approach can be usefully utilized.