KIParla-collection
The KIParla-collection aggregates all available KIParla corpus modules into a single repository. It is rebuilt automatically whenever any module publishes a new release.
This is a read-only reference repository — do not edit files here directly. If you find an inconsistency in the data, correct it in the originating module.
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, contact the corpus coordinators through the KIParla website.
Repository organization
| Path | Contents |
|---|---|
| | Speaker metadata merged across all modules |
| | Conversation metadata merged across all modules |
| | Time-aligned Jefferson-style transcriptions (open with ELAN) |
| | Linearized Jefferson-style transcriptions, one TU per line |
| | Linearized transcriptions retaining orthographic words only |
| | Verticalized, token-level data with Jefferson features as columns |
See KIP documentation for a full description of the metadata schema and TSV format.
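As a rough illustration of working with the verticalized, token-level files, the sketch below reads a tab-separated table and counts token rows per speaker. The column names (`speaker`, `token`) and the sample values are made up for the example and are not taken from the KIParla schema; check the KIP documentation for the real column names before adapting it.

```python
import csv
import io
from collections import Counter


def read_vertical_tsv(stream):
    """Read a tab-separated token table into a list of dicts keyed by header."""
    # QUOTE_NONE keeps quote characters in transcriptions as literal text.
    return list(csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE))


def tokens_per_speaker(rows, speaker_col="speaker"):
    """Count token rows per speaker; `speaker_col` is a hypothetical column name."""
    return Counter(row[speaker_col] for row in rows)


# Tiny in-memory sample with invented columns and values, for illustration only.
sample = "speaker\ttoken\nSPK1\tciao\nSPK2\tsì\nSPK1\tbene\n"
rows = read_vertical_tsv(io.StringIO(sample))
counts = tokens_per_speaker(rows)
```

To run the same functions on a real file, pass an open file handle (opened with `encoding="utf-8"` and `newline=""`) instead of the `io.StringIO` sample.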
How it is built
The collection is rebuilt automatically via GitHub Actions whenever a module publishes a new release. The build workflow:
- Clones the latest `main` of all four modules.
- Syncs TSV and EAF files.
- Regenerates linear formats using `tools/tsv2formats.py`.
- Merges metadata using `tools/merge_metadata.py`.
- Bumps the collection patch version and publishes a new GitHub Release.
See Collection automation for technical details.
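The last two workflow steps can be pictured with a minimal sketch. This is not the code in `tools/merge_metadata.py` or in the release workflow; it only assumes that per-module metadata are tab-separated tables whose columns may differ, and that the collection version follows a `MAJOR.MINOR.PATCH` pattern. The function names, the `source_module` column, and the module labels `ModA`/`ModB` are hypothetical.

```python
import csv
import io


def merge_metadata(tables):
    """Concatenate per-module TSV tables, keeping the union of their columns.

    `tables` maps a module label to the TSV text of its metadata table.
    A hypothetical `source_module` column records where each row came from.
    """
    columns = ["source_module"]
    rows = []
    for module, text in tables.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            rows.append({"source_module": module, **row})
        # Extend the header with any column not seen in earlier modules.
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # columns missing from a module are left empty
    return out.getvalue()


def bump_patch(version):
    """Return `version` with its patch component incremented by one."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return f"{major}.{minor}.{patch + 1}"


# Invented sample data, for illustration only.
merged = merge_metadata({
    "ModA": "code\tage\nA01\t25\n",
    "ModB": "code\tregion\nB01\tPiemonte\n",
})
next_version = bump_patch("1.4.9")
```

Keeping the union of columns, rather than the intersection, means no module loses a field during the merge; rows simply have empty cells where a module does not provide that field.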
How to cite
To cite the full KIParla collection:
@article{Caterina_KIParla_corpus_a_2019,
  author = {Mauri, Caterina and Ballarè, Silvia and Cerruti, Massimo
            and Goria, Eugenio and Suriano, Francesco},
  journal = {Proceedings of the 6th Italian Conference on Computational Linguistics CLiC-it.},
  title = {{KIParla corpus: a new resource for spoken Italian}},
  year = {2019}
}
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.