The CORE project
The work reported here was carried out within the project Computer-based methods for coreference resolution in Polish texts (PL: Komputerowe metody identyfikacji nawiązań w tekstach polskich), financed by the Polish National Science Centre (contract number 6505/B/T02/2011/40) and carried out between April 2011 and July 2014 at the Institute of Computer Science, Polish Academy of Sciences. The project aimed to create innovative methods and tools for automated coreference resolution in Polish, with target quality comparable to that of state-of-the-art tools available for other languages.
A monograph published by Walter de Gruyter. The book presents work on coreference understanding, annotation and resolution for a Slavic language, which can also be applied to natural language processing software for English and other languages. It presents the specificities of reference, anaphora and coreference in Polish, establishes an identity-of-reference annotation model and presents the methodology used to create the corpus of Polish general nominal coreference. Various resolution approaches are presented, followed by their evaluation. By presenting the subsequent steps of building a coreference-related component of the natural language processing toolset, the volume also serves as a reference book on state-of-the-art methods for carrying out coreference projects for new languages and as a tutorial for NLP practitioners.
Earlier publications and project presentations
PolTAL 2014 (September 17-19, 2014, Warsaw) conference paper describing the results of creating a shallow grammar of Polish capable of detecting multi-level nested nominal phrases, intended to be used as mentions in coreference resolution tasks. The work is based on an existing grammar developed for the National Corpus of Polish and is evaluated on the manually annotated Polish Coreference Corpus.
A Peter Lang book chapter and Cognitive Linguistics in the Year 2012 (September 17-18, 2012) conference paper presenting problems related to coreference annotation in the Polish Coreference Corpus. There are three main causes of annotator errors: grammatical (e.g. the lack of an article system in Polish), semantic (the so-called co-extension, involving lexical relations between words) and cognitive (the annotators’ insufficient real-world knowledge about certain relationships). Apart from providing examples of different kinds of annotation problems, the paper analyzes how coreference relates to identity in extralinguistic reality and in discourse. It also discusses the distinction between coreference and anaphora, as well as the dependence of coreference on specific properties of Polish grammar. It questions M. Recasens’ theory of near-identity and the need for its detailed classification.
LREC 2014 (May 26-31, 2014, Reykjavik) conference paper offering a preliminary interpretation of the occurrence of different types of linguistic constructs in the manually annotated Polish Coreference Corpus, based on analyses of various statistical properties related to mentions, clusters and near-identity links. Among others, the frequencies of mentions, zero subjects and singleton clusters are presented, as well as the average mention and cluster size. We also show that some coreference clustering constraints, such as gender or number agreement, frequently do not hold in the case of Polish. The need for lemmatization in automatic coreference resolution is supported by an empirical study. The correlation between cluster and mention count within a text is investigated, with a short characterization of outlier cases. We also examine this correlation in each of the 14 text domains present in the corpus and show that none of them has an abnormal frequency of outlier texts with regard to the cluster/mention ratio. Finally, we report on our negative experiences concerning the annotation of the near-identity relation. In the conclusion we put forward some guidelines for future research in the area.
EACL 2014 (April 26-30, 2014, Gothenburg) paper on the first machine learning experiments on the detection of null subjects in Polish. It emphasizes the role of zero subject detection as a part of mention detection – the initial step of end-to-end coreference resolution. Anaphora resolution is not studied in this article.
EACL 2014 demo session paper presenting major modifications in the MMAX2 manual annotation tool, which were implemented for the coreference annotation of Polish texts. Among other things, a new feature of adjudication is described, as well as some general insight into the manual annotation tool selection process for the natural language processing tasks.
ACIIDS 2014 (April 7-9, 2014, Bangkok) conference paper discussing different methods of estimating the inter-annotator agreement in manual annotation of Polish coreference and proposing a new BLANC-based annotation agreement metric. The commonly used agreement indicators are calculated for mention detection, semantic head annotation, near-identity markup and coreference resolution.
Presentation at the Natural Language Processing Seminar (January 27, 2014, Warsaw) discussing automated null subject detection (in Polish).
LTC 2013 (December 7-9, 2013, Poznań) conference paper describing the composition, annotation process and availability of the newly constructed Polish Coreference Corpus – a large Polish corpus of general nominal coreference. The tools used in the process and final linguistic representation formats are also presented.
LTC 2013 conference paper describing evaluation of a set of surface, syntactic and anaphoric features proposed in Uryupina 2007 and their usefulness for coreference resolution in Polish texts.
MIKE 2013 (December 18-20, 2013, Virudhunagar) conference paper reporting on a preliminary experiment aimed at verifying whether nominal facts corresponding to world knowledge could be effectively extracted from both structured and unstructured data and whether the results could be used as a source of pragmatic knowledge for coreference resolution in Polish. Being a proof of concept only, this approach is work in progress and is intended to be further validated in a full-scale project.
CCL 2013/NLP-NABD 2013 (October 10-12, 2013, Suzhou) conference paper reporting on linguistic features and decisions that we find vital in the process of annotation and resolution of coreference for highly inflectional languages. The presented results have been collected during preparation of a corpus of general direct nominal coreference of Polish. Starting from the notion of a mention, its borders and potential vs. actual referentiality, we discuss the problem of complete and near-identity, zero subjects and dominant expressions. We also present interesting linguistic cases influencing the coreference resolution such as the difference between semantic and syntactic heads or the phenomenon of coreference chains made of indefinite pronouns.
NLDB 2013 (June 19-21, 2013, Manchester) conference paper presenting a new implementation of the multipurpose set of NLP tools for Polish, made available online in a common web service framework. The tool set comprises a morphological analyzer, a tagger, a named entity recognizer, a dependency parser, a constituency parser and a coreference resolver. Additionally, a web application offering chaining capabilities and a common BRAT-based presentation framework is presented.
LP&IIS 2013 (June 17-18, 2013, Warsaw) conference paper describing a test of a translation- and projection-based method of implementing a coreference resolver for an inflectional language. The paper also presents an evaluation of the result on a corpus of general coreference and compares the results with state-of-the-art solutions of this type for other languages.
CICLING 2013 (March 24–30, 2013, Samos) conference paper commenting on the experience gained in preparing a coreference corpus for an inflectional and free-word-order language, carried out in an ongoing project aiming at creating tools for coreference resolution. Starting with a clarification of the relation between noun groups and mentions, through the definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building the first, to the best of our knowledge, corpus of general coreference of Polish.
Presentation at the Natural Language Processing Seminar (December 3, 2012, Warsaw) discussing methodology of the construction of the Polish Coreference Corpus (in Polish).
HLT Days 2012 (September 27-28, 2012, Warsaw) poster presented in Language Resources and Tools Hackathon session.
KI 2012 (September 24-27, 2012, Saarbrücken) conference paper confronting the idea of the continuous nature of identity with experimental data for Polish, resulting in a new approach to this notion. It extends the definition of coreference with the speaker/recipient relation, believed to be valid for all languages, and explains near-identity with lexical and conceptual means. The theory is supported with Polish-English examples presenting difficulties in coreference interpretation.
LREC 2012 (May 21-27, 2012, Istanbul) conference paper presenting the results of the first attempt at coreference resolution for Polish using statistical methods. It presents the conclusions from the process of adapting the Beautiful Anaphora Resolution Toolkit (BART; a system primarily designed for the English language) for Polish and compares its evaluation results with those of the previously implemented rule-based system. Finally, we describe our plans for the future usage of the tool and highlight the upcoming research to be conducted, such as larger-scale experiments and a comparison with other machine learning tools.
Presentation at the Natural Language Processing Seminar (March 5, 2012, Warsaw) discussing typology of coreference and strategies of its annotation (in Polish).
LTC 2011 (November 25-27, 2011, Poznań) conference paper presenting the results of the first attempt at coreference resolution for Polish, intended to create a useful baseline for future experiments on this topic. The resulting implementation is designed to run either on true mention boundaries (discovering coreference chains between them) or in an end-to-end manner, performing their automatic detection as the first step. The system uses a few rich rules, corresponding to syntactic constraints (elimination of nested nominal groups), syntactic filters (elimination of syntactically incompatible heads), semantic filters (wordnet-derived compatibility) and selection (weighted scoring). Results are evaluated against human annotation for two commonly used baseline variants of the resolver (all-singletons/head-match) and two target rule-based settings. The best working method is analysed, showing simple statistics about the two classes of errors made by the system.
DAARC 2011 (October 6-7, 2011, Faro) conference paper presenting the results of the first attempt at coreference resolution for Polish, running on true mention boundaries and using a few rich rules, corresponding to syntactic constraints (elimination of nested nominal groups), syntactic filters (elimination of syntactically incompatible heads), semantic filters (wordnet-derived compatibility) and selection (weighted scoring). The results are compared to human annotation and presented in four sets: two common baselines (all-singletons and head-match) and two slightly more complex settings with four and five rules.
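The two baselines named in the LTC 2011 and DAARC 2011 entries (all-singletons and head-match) can be illustrated with a minimal sketch. This is not the project's implementation: the `Mention` type, the function names and the assumption that each mention already carries a lemmatized head are all hypothetical, chosen only to show the idea.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical mention record; head_lemma is assumed to be precomputed."""
    text: str        # surface form of the mention
    head_lemma: str  # lemma of the mention's head word


def all_singletons(mentions):
    """All-singletons baseline: every mention forms its own cluster."""
    return [[m] for m in mentions]


def head_match(mentions):
    """Head-match baseline: mentions sharing a head lemma form one cluster."""
    clusters = defaultdict(list)
    for m in mentions:
        # Case-insensitive comparison of head lemmas (an illustrative choice).
        clusters[m.head_lemma.lower()].append(m)
    # Clusters are returned in order of first occurrence of each head lemma.
    return list(clusters.values())
```

For example, given the mentions "nowy prezydent" and "prezydenta" (both with head lemma "prezydent") and "miasto", `all_singletons` yields three clusters, while `head_match` yields two, grouping the first two mentions together.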
Parts of the work described here were also contributed by other externally funded projects, carried out simultaneously with CORE:
- works on the new version of the Polish grammar for Spejd by Alicja Wójcicka and Katarzyna Głowińska were co-funded by the Polish Ministry of Science and Higher Education as an Investment in CLARINPL Research Infrastructure and by the European Union from resources of the European Social Fund
- works related to linguistic evaluation of usefulness of Uryupina’s coreference features for Polish by Piotr Batko and development of adaptation of BART (Beautiful Anaphora Resolution Toolkit) for Polish by Bartłomiej Nitoń were co-funded by the European Union from financial resources of the European Social Fund, project PO KL Information technologies: Research and their interdisciplinary applications
- works related to coreference-based approach to summarization were carried out within PhD studies of Mateusz Kopeć at the Institute of Computer Science, Polish Academy of Sciences
- help with adaptation of coreference tools to Multiservice, a Web service framework for Polish NLP tools, was offered by Michał Lenart taking part in CESAR project (Central and South-east European Resources, part of META-NET) financed from a European Competitiveness and Innovation framework Programme, Information and Communication Technologies Policy Support Programme (CIP ICT-PSP, grant agreement 271022)
- projection-based experiments were made possible by the University Research Program for Google Translate
- contacts established with the parallel French coreference annotation project ANCOR were also beneficial for some of our scientific results and helped relate the CORE project more deeply to the international coreference community.
The core CORE project team consisted of (almost alphabetically):
- Maciej Ogrodniczuk — principal investigator
- Barbara Dunin-Kęplicz — formalization of coreference rules
- Maria Głąbska — coreference annotation
- Katarzyna Głowińska — linguistic expertise related to anaphora, coreference and Polish syntax
- Anna Grzeszak — coreference annotation
- Mateusz Kopeć — technical leadership, implementation and IT design, development of the annotation environment and project tools
- Emilia Kubicka — coreference annotation
- Barbara Masny — coreference annotation
- Paulina Rosalska — coreference annotation
- Agata Savary — coreference annotation and annotation work expertise
- Magdalena Zawisławska — linguistic and semantic expertise, annotation management, adjudication of the annotation of Polish Coreference Corpus
- Sebastian Żurowski — coreference annotation
but there were numerous other people, mainly colleagues from the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences, who contributed to various stages of the project with their selfless help:
- Piotr Batko — coreference annotation, verification of coreference features for Polish (linguistic part)
- Łukasz Degórski — help related to processing NKJP data
- Łukasz Dębowski — statistical expertise
- Michał Lenart — help related to processing NKJP data, hardware expertise, Multiservice integration assistance
- Małgorzata Marciniak — HPSG anaphora expertise
- Bartłomiej Nitoń — verification of coreference features for Polish (implementation part)
- Adam Przepiórkowski — linguistic and natural language processing expertise, management of co-operation with the National Corpus of Polish
- Filip Skwarski — translation and proofreading
- Jakub Waszczuk — expertise related to annotation and named entity-related tools, versioning system management
- Joanna Wierucka — translation and proofreading.