KIParla Documentation
KIParla is a corpus of spoken Italian built from recordings collected in university settings across Italy. This site documents the corpus modules, derived projects, and maintainer guides.
Corpus modules
The KIParla corpus is split into four modules, each maintained as an independent repository.
| Module | Description |
|---|---|
| | Spoken Italian collected at the Universities of Bologna and Turin (2016–2019). ~70 hours, 121 conversations, 184 speakers. |
| | KIP sub-corpus recorded in Asti. |
| | Spoken Italian collected in Bologna. |
| | Spoken Italian collected in Turin. |
All modules share the same file format and metadata schema. See any module’s documentation for a full description of the data structure.
The KIParla-collection aggregates all four modules into a single downloadable release.
Derived projects
Projects that use or extend KIParla data:
- Lemmatization project: guidelines and tools for manual lemmatization and POS-tagging correction using Universal Dependencies.
- KIParla tools: standalone maintenance and transformation scripts for operational corpus workflows.
Maintainer guides
- Release guide: how to release a new version of a module, including the automated tagging and collection rebuild workflows.
- Collection automation: how the KIParla-collection rebuild pipeline works.
- Tools workflows: where the standalone maintenance scripts fit in day-to-day and release operations.