KIParla Documentation

KIParla is a corpus of spoken Italian built from recordings collected in university settings across Italy. This site documents the corpus modules, derived projects, and maintainer guides.

Corpus modules

The KIParla corpus is split into four modules, each maintained as an independent repository.

Module    Description
KIP       Spoken Italian collected at the Universities of Bologna and Turin (2016–2019). ~70 hours, 121 conversations, 184 speakers.
KIPasti   KIP sub-corpus recorded in Asti.
ParlaBO   Spoken Italian collected in Bologna.
ParlaTO   Spoken Italian collected in Turin.

All modules share the same file format and metadata schema. See any module’s documentation for a full description of the data structure.

The KIParla-collection aggregates all four modules into a single downloadable release.

Derived projects

Projects that use or extend KIParla data:

  • Lemmatization project — guidelines and tools for manual lemmatization and POS-tagging correction using Universal Dependencies.

  • KIParla tools — standalone maintenance and transformation scripts for operational corpus workflows.

Maintainer guides

  • Release guide — how to release a new version of a module, including the automated tagging and collection rebuild workflows.

  • Collection automation — how the KIParla-collection rebuild pipeline works.

  • Tools workflows — where the standalone maintenance scripts fit in day-to-day and release operations.