ParlaMint and CLARIN

ParlaMint

ParlaMint is a CLARIN flagship project that has produced comparable corpora of parliamentary debates from 29 European countries and autonomous regions, covering at least the period from 2015 to 2022 and containing over 1 billion words. [1]

The corpora are structured and linguistically annotated, and are intended to facilitate comparative research in political and social sciences, linguistics, and computational text analysis. They are encoded in TEI XML and contain metadata such as speaker information, party affiliation, and timestamps. The linguistic annotation includes named entity recognition (NER), part-of-speech (POS) tagging, and lemmatization. ParlaMint corpora are openly available under the CC BY license and can be browsed and analyzed through noSketch Engine and TEITOK. The latest version of the corpora is 4.1 [1]. Key features are standardization, linguistic annotation, comparability, and multilingual data.
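
As a rough illustration of what working with TEI XML encoded debates can look like, the following Python sketch extracts the speaker reference and text of each utterance. The sample document is invented for this example and is not taken from any ParlaMint corpus. The element and attribute names (u, seg, who) follow general TEI conventions for transcribed speech and are assumed here; the actual corpus files should be checked before relying on them.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the sample below is hypothetical, not real corpus data.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="debateSection">
        <u who="#SpeakerA" xml:id="u1">
          <seg>Thank you, Madam President.</seg>
          <seg>I will be brief.</seg>
        </u>
        <u who="#SpeakerB" xml:id="u2">
          <seg>I agree with the previous speaker.</seg>
        </u>
      </div>
    </body>
  </text>
</TEI>"""


def utterances(xml_text):
    """Return a list of (speaker_id, text) pairs, one per <u> element."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # The who attribute is assumed to be a local reference such as "#SpeakerA".
        speaker = u.get("who", "").lstrip("#")
        # Collect the text of each segment and normalize its whitespace.
        segments = [
            " ".join("".join(seg.itertext()).split())
            for seg in u.iterfind("tei:seg", TEI_NS)
        ]
        result.append((speaker, " ".join(segments)))
    return result


for speaker, text in utterances(SAMPLE):
    print(f"{speaker}: {text}")
```

In a real corpus the speaker reference would typically be resolved against the speaker metadata to obtain details such as party affiliation; that lookup is omitted here because its layout is not described in this article.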

CLARIN

CLARIN (Common Language Resources and Technology Infrastructure) is a European digital research infrastructure offering data, tools, and services to support research based on language resources. [2] It provides access to digital linguistic datasets and processing tools for computational linguistics, digital humanities, and social sciences.

Relationship between ParlaMint and CLARIN

ParlaMint is hosted and supported within the CLARIN infrastructure, making it easily accessible to researchers through CLARIN’s digital repositories. This integration ensures that the parliamentary corpora are standardized, well-documented, and reusable for computational analysis.