KIPasti
The KIPasti corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface.
The KIPasti corpus was compiled within the framework of the DiverSIta – Diversity in spoken Italian project, funded by the Italian Ministry of University and Research (MUR) under the PRIN 2022 PNRR Call.
It consists of over 40 hours of spoken data collected in thirteen different Italian regions (Abruzzo, Basilicata, Calabria, Campania, Emilia-Romagna, Lazio, Lombardy, Marche, Apulia, Sardinia, Tuscany, Umbria, Veneto) during mealtime conversations, generally within family settings. The interactions, recorded between 2020 and 2024, involved 145 speakers with different origins, ages, education levels, and occupations. Italian is predominantly used in all interactions, but in most of them (78%), various passages in dialect are also present.
The transcriptions have been anonymized.
Overall, the module comprises 63 conversations and 145 speakers.
Repository organization
This repository contains:
- metadata for both speakers and conversations, in the metadata subfolder (see Metadata below)
- descriptions of the transcription conventions used for this module (Transcription conventions)
For each conversation you will find:
- .eaf file in eaf/: time-aligned Jefferson-style transcriptions (open with ELAN).
- .txt file in linear-jefferson/: linearized Jefferson-style transcription.
- .txt file in linear-orthographic/: linearized transcription retaining only orthographic words.
- .tsv file in tsv/: verticalized, token-level data with Jefferson features as columns (see Verticalized content).
Linear files contain one Transcription Unit (TU) per line with two columns: speaker code and transcription. TUs are sorted by start time.
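As a minimal sketch of how a linear file could be read, the snippet below splits each line into a speaker code and a transcription. The tab separator, the function name `read_linear`, and the sample speaker codes and utterances are assumptions made for illustration; check an actual file in linear-jefferson/ or linear-orthographic/ before relying on them.

```python
from collections import Counter

def read_linear(lines, sep="\t"):
    """Return (speaker, transcription) pairs, one per Transcription Unit.

    Assumes the two columns are separated by `sep` (tab here, which is
    an assumption, not something stated by the corpus documentation).
    """
    units = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        speaker, _, text = line.partition(sep)
        units.append((speaker, text))
    return units

# Hypothetical sample; real files would be opened with open(path, encoding="utf-8").
sample = [
    "AB123\tciao come stai\n",
    "CD456\tbene grazie\n",
    "AB123\tprego\n",
]
units = read_linear(sample)
turns_per_speaker = Counter(speaker for speaker, _ in units)
```

Because TUs are sorted by start time, the order of `units` follows the order in which the units begin in the recording.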
Metadata
Each participant and each conversation has associated metadata, in metadata/participants.tsv and metadata/conversations.tsv respectively.
- Participants metadata:
  - code: unique anonymized 5-character identifier. Unknown or occasional participants use the special code ???.
  - gender: M for masculine or F for feminine.
  - age-range: 5-year range including the participant’s age.
  - birth-region: Italian region[1] where the participant was born. If outside Italy, the label estero is used.
  - occupation: occupation according to ISTAT categories.
  - study-level: highest completed level of education.[2]
  - L1: participant’s first language (italian, dialect, or other).
  - conversations: summary of conversations in which the participant appears.
- Conversations metadata:
  - code: unique conversation identifier.
  - type: type of interaction; for this module, all conversations are free-conversations.
  - duration: duration in hh:mm:ss format.
  - participants-number: number of participants.
  - languages: languages spoken (italian, dialect, or both).
  - participants-relationship: symmetric for this module.
  - moderator: presence of a moderator.
  - topic: free for this module.
  - year: year of collection.
  - collection-point: two-letter code of the collection area.
  - collection-region: Italian region where the collection point is located.
  - macro-region: NORTH, CENTRE, or SOUTH.
  - participants: codes of participants in the conversation.
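The metadata tables can be loaded with Python's standard csv module. The sketch below assumes each file starts with a header row whose column names match the field names listed above; the helper name `load_tsv` and the sample rows (codes, age ranges) are invented for illustration and are not taken from the corpus.

```python
import csv
import io
from collections import Counter

def load_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

# Hypothetical excerpt standing in for metadata/participants.tsv.
# With the real file: open("metadata/participants.tsv", encoding="utf-8", newline="")
sample = (
    "code\tgender\tage-range\tL1\n"
    "AB123\tF\t26-30\titalian\n"
    "CD456\tM\t61-65\tdialect\n"
    "EF789\tF\t31-35\titalian\n"
)
participants = load_tsv(io.StringIO(sample))

# Count participants by declared first language.
by_l1 = Counter(row["L1"] for row in participants)
```

The same helper would apply to metadata/conversations.tsv, for example to group conversations by macro-region.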
Verticalized content
Conversations are also available in a vertical, pseudo-tokenized version in tsv/.
Tokenization is obtained by validating the Jefferson transcription with custom tools and then splitting on whitespace, prosodic links (=), and apostrophes used for elision.
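The splitting step can be approximated as follows. This is only a rough sketch, not the custom tooling used to build the corpus: the function name `pseudo_tokenize` is hypothetical, the choice to keep the apostrophe attached to the elided word is an assumption, and all other Jefferson markup is ignored.

```python
import re

def pseudo_tokenize(tu):
    """Split a transcription unit on whitespace, '=' and after apostrophes."""
    tokens = []
    for chunk in tu.split():            # whitespace
        for part in chunk.split("="):   # prosodic links
            # Split right after each apostrophe, so "l'amico" -> "l'", "amico".
            # Splitting on an empty match requires Python 3.7 or later.
            tokens.extend(p for p in re.split(r"(?<=')", part) if p)
    return tokens

tokens = pseudo_tokenize("l'amico va=bene")
```

In the verticalized files, the information lost by this splitting is recorded in jefferson_feats (SpaceAfter=No and ProsodicLink=Yes).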
Each token is represented as 13 columns:
- token_id: unique token identifier within the conversation.
- speaker: speaker code as found in metadata/participants.tsv.
- tu_id: progressive identifier of the transcription unit.
- span: portion of the original Jefferson transcription containing the token.
- form: orthographic form. Short pauses ((.)) are represented as [PAUSE]; unintelligible tokens as x.
- type: one of linguistic, nonverbalbehavior, shortpause, unknown, error.
- variation: some if the TU contains dialect code-mixing; none otherwise.
- jefferson_feats: word-level features from the Jefferson transcription:
  - SpaceAfter=No: no whitespace between this token and the next.
  - ProsodicLink=Yes: prosodic link (=) to the following token.
  - Intonation: Falling, Rising, or WeaklyRising.
  - Interrupted=Yes: word interrupted in speech (transcribed with ~).
  - Truncated=Yes: truncated form.
  - Volume: High or Low.
- align: AlignBegin and AlignEnd for the first and last token of each TU, in milliseconds.
- prolongations: sound prolongations encoded as <char_id>x<count> pairs (e.g., 2x2,6x1).
- pace: fast or slow paced spans, Fast=<start>-<end> or Slow=<start>-<end>, zero-based over form.
- guesses: uncertain character spans, <start>-<end>, zero-based, inclusive, over form.
- overlaps: simultaneous speech spans, <start>-<end>(<overlap_id>) per group.
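The span-like columns can be decoded with a few regular expressions. The sketch below is illustrative only: the function names are hypothetical, and the patterns deliberately make no assumption about how multiple values are separated or how an empty cell is written, since neither is stated above. Only guesses is documented as inclusive, so that is the only column sliced here.

```python
import re

def parse_prolongations(value):
    """'<char_id>x<count>' pairs, e.g. '2x2,6x1' -> [(2, 2), (6, 1)]."""
    return [(int(c), int(n)) for c, n in re.findall(r"(\d+)x(\d+)", value)]

def parse_guesses(value):
    """'<start>-<end>' spans as (start, end) integer pairs."""
    return [(int(s), int(e)) for s, e in re.findall(r"(\d+)-(\d+)", value)]

def parse_pace(value):
    """'Fast=<start>-<end>' or 'Slow=<start>-<end>' as (kind, start, end)."""
    pattern = r"(Fast|Slow)=(\d+)-(\d+)"
    return [(k, int(s), int(e)) for k, s, e in re.findall(pattern, value)]

def parse_overlaps(value):
    """'<start>-<end>(<overlap_id>)' as (start, end, overlap_id)."""
    pattern = r"(\d+)-(\d+)\(([^)]*)\)"
    return [(int(s), int(e), oid) for s, e, oid in re.findall(pattern, value)]

def guessed_text(form, guesses):
    """Substrings of `form` marked uncertain (zero-based, inclusive spans)."""
    return [form[s:e + 1] for s, e in parse_guesses(guesses)]

uncertain = guessed_text("casa", "0-1")
```

Each function returns an empty list when the cell contains no matching pattern, so placeholder values need no special handling.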
Data access
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, contact the corpus coordinators through the KIParla website and follow the provided procedure.
How to cite
To cite this module please include:
Mauri, C., Ballarè, S., & Zucchini, E. (2024). Modulo KIPasti [Data set]. https://doi.org/10.60760/unibo/kipasti
@misc{Mauri_Modulo_KIPasti_2024,
author = {Mauri, Caterina and Ballarè, Silvia and Zucchini, Eleonora},
doi = {10.60760/unibo/kipasti},
title = {{Modulo KIPasti}},
url = {https://github.com/KIParla/KIPasti},
year = {2024}
}
If you use the KIPasti module in your research, please also reference this repository (commit/tag) in your data statement or appendix.
Changelog
- 2025-10-07 v1.0.0
  - First release
- 2025-11-28 v1.1.0
  - Major fix: wrong speaker attribution in linear-jefferson and linear-orthographic
  - Minor fix: empty turns in linear-orthographic were removed
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
[1] abruzzo, basilicata, calabria, campania, emilia-romagna, friuli-venezia-giulia, lazio, liguria, lombardia, marche, molise, piemonte, puglia, sardegna, sicilia, toscana, trentino-alto-adige, umbria, valle-d-aosta, veneto
[2] elementary-school, liceo-diploma, middle-school, phd, technical-vocational-diploma, university-degree, university-degree-ongoing