KIPasti

CC BY-NC-SA 4.0

The KIPasti corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface.

The KIPasti corpus was compiled within the framework of the DiverSIta – Diversity in spoken Italian project, funded by the Italian Ministry of University and Research (MUR) under the PRIN 2022 PNRR Call.

It consists of over 40 hours of spoken data collected in thirteen Italian regions (Abruzzo, Basilicata, Calabria, Campania, Emilia-Romagna, Lazio, Lombardy, Marche, Apulia, Sardinia, Tuscany, Umbria, Veneto) during mealtime conversations, generally within family settings. The interactions, recorded between 2020 and 2024, involved 145 speakers of different origins, ages, education levels, and occupations. Italian predominates in all interactions, but most of them (78%) also contain passages in dialect.

The transcriptions have been anonymized.

Overall, the module comprises 63 conversations and 145 speakers.

Repository organization

This repository contains:

  • metadata for both speakers and conversations, in the metadata subfolder (see Metadata below)

  • descriptions of the transcription conventions used for this module (Transcription conventions)

For each conversation you will find:

  • .eaf file in eaf/: time-aligned Jefferson-style transcriptions (open with ELAN).

  • .txt file in linear-jefferson/: linearized Jefferson-style transcription.

  • .txt file in linear-orthographic/: linearized transcription retaining only orthographic words.

  • .tsv file in tsv/: verticalized, token-level data with Jefferson features as columns (see Verticalized content).

Linear files contain one Transcription Unit (TU) per line with two columns: speaker code and transcription. TUs are sorted by start time.
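
As an illustration of the two-column layout described above, here is a minimal sketch for reading a linear file into (speaker, transcription) pairs. The tab separator and the sample speaker codes are assumptions made for this example; check an actual file in linear-jefferson/ or linear-orthographic/ before relying on them.

```python
from typing import Iterable, Iterator, Tuple


def read_linear(lines: Iterable[str], sep: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (speaker, transcription) pairs, one per Transcription Unit.

    The separator is an assumption (tab); adjust `sep` if the files differ.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split only on the first separator so the transcription stays intact.
        speaker, _, text = line.partition(sep)
        yield speaker, text


# Hypothetical sample lines; real files hold one TU per line, sorted by start time.
sample = ["AB001\tciao come stai\n", "CD002\tbene grazie\n"]
turns = list(read_linear(sample))
```

To read a real file, pass an open file handle instead of the sample list, for example `read_linear(open(path, encoding="utf-8"))`, where `path` points to one of the .txt files.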

Metadata

Each participant and each conversation has associated metadata, stored in metadata/participants.tsv and metadata/conversations.tsv respectively.

  1. Participants metadata:

    • code: unique anonymized 5-char identifier. Unknown or occasional participants use the special code ???.

    • gender: M for masculine or F for feminine.

    • age-range: 5-year range including the participant’s age.

    • birth-region: Italian region[1] where the participant was born. If outside Italy, the label estero is used.

    • occupation: occupation according to ISTAT categories.

    • study-level: highest completed level of education.[2]

    • L1: participant’s first language — italian, dialect, or other.

    • conversations: summary of conversations in which the participant appears.

  2. Conversations metadata:

    • code: unique conversation identifier.

    • type: type of interaction — for this module, all conversations are free-conversations.

    • duration: duration in hh:mm:ss format.

    • participants-number: number of participants.

    • languages: languages spoken — italian, dialect, or both.

    • participants-relationship: symmetric for this module.

    • moderator: presence of a moderator.

    • topic: free for this module.

    • year: year of collection.

    • collection-point: two-letter code of the collection area.

    • collection-region: Italian region where the collection point is located.

    • macro-region: NORTH, CENTRE, or SOUTH.

    • participants: codes of participants in the conversation.
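
The metadata tables can be loaded with Python's standard csv module. The sketch below assumes that each TSV file starts with a header row using the field names listed above; the sample rows and conversation codes are invented for illustration only.

```python
import csv
import io
from datetime import timedelta


def parse_duration(hhmmss: str) -> timedelta:
    """Convert a duration in hh:mm:ss format into a timedelta."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def load_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Invented sample standing in for metadata/conversations.tsv (subset of columns).
sample = io.StringIO(
    "code\tduration\tmacro-region\n"
    "XX001\t00:35:10\tNORTH\n"
    "XX002\t01:02:05\tSOUTH\n"
)
conversations = load_tsv(sample)

# Total recorded time and the codes of conversations collected in the South.
total = sum((parse_duration(row["duration"]) for row in conversations), timedelta())
south = [row["code"] for row in conversations if row["macro-region"] == "SOUTH"]
```

For the real table, replace the sample with `open("metadata/conversations.tsv", encoding="utf-8", newline="")`.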

Verticalized content

Conversations are also available in a vertical, pseudo-tokenized version in tsv/. Tokenization is obtained by validating the Jefferson transcription using custom tools and splitting on whitespace, prosodic links (=), and apostrophes used for elision. Each token is represented as 13 columns:

  1. token_id: unique token identifier within the conversation.

  2. speaker: speaker code as found in metadata/participants.tsv.

  3. tu_id: progressive identifier of the transcription unit.

  4. span: portion of the original Jefferson transcription containing the token.

  5. form: orthographic form. Short pauses ((.)) are represented as [PAUSE]; unintelligible tokens as x.

  6. type: one of linguistic, nonverbalbehavior, shortpause, unknown, error.

  7. variation: some if the TU contains dialect code-mixing; none otherwise.

  8. jefferson_feats: word-level features from the Jefferson transcription:

    • SpaceAfter=No: no whitespace between this token and the next.

    • ProsodicLink=Yes: prosodic link (=) to the following token.

    • Intonation: Falling, Rising, or WeaklyRising.

    • Interrupted=Yes: word interrupted in speech (transcribed with ~).

    • Truncated=Yes: truncated form.

    • Volume: High or Low.

  9. align: AlignBegin and AlignEnd for the first and last token of each TU, in milliseconds.

  10. prolongations: sound prolongations encoded as <char_id>x<count> pairs (e.g., 2x2,6x1).

  11. pace: fast- or slow-paced spans, encoded as Fast=<start>-<end> or Slow=<start>-<end>, zero-based over form.

  12. guesses: uncertain character spans — <start>-<end>, zero-based, inclusive, over form.

  13. overlaps: simultaneous speech spans — <start>-<end>(<overlap_id>) per group.
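
Two of the span encodings above can be decoded as in the following sketch. It relies only on the formats stated in the column descriptions (comma-separated <char_id>x<count> pairs for prolongations; zero-based, inclusive <start>-<end> spans for guesses). The function names are hypothetical, the example token is invented, and the sketch assumes a non-empty cell holding a single guess span; how empty cells or multiple spans are written is not covered here.

```python
from typing import List, Tuple


def parse_prolongations(value: str) -> List[Tuple[int, int]]:
    """Turn a prolongations cell such as '2x2,6x1' into [(2, 2), (6, 1)]."""
    pairs = []
    for item in value.split(","):
        char_id, count = item.split("x")
        pairs.append((int(char_id), int(count)))
    return pairs


def guessed_text(form: str, span: str) -> str:
    """Return the characters of `form` covered by one '<start>-<end>' span.

    The span is zero-based and inclusive, so the end index is shifted by one
    to match Python's exclusive slicing.
    """
    start, end = (int(n) for n in span.split("-"))
    return form[start:end + 1]


pairs = parse_prolongations("2x2,6x1")
uncertain = guessed_text("mangiare", "0-2")
```

Here `pairs` holds one (char_id, count) tuple per prolonged sound, and `uncertain` holds the substring of form that the transcriber marked as a guess.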

Data access

Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, contact the corpus coordinators through the KIParla website and follow the provided procedure.

How to cite

To cite this module, please include:

Mauri, C., Ballarè, S., & Zucchini, E. (2024). Modulo KIPasti [Data set]. https://doi.org/10.60760/unibo/kipasti

@misc{Mauri_Modulo_KIPasti_2024,
  author = {Mauri, Caterina and Ballarè, Silvia and Zucchini, Eleonora},
  doi    = {10.60760/unibo/kipasti},
  title  = {{Modulo KIPasti}},
  url    = {https://github.com/KIParla/KIPasti},
  year   = {2024}
}

If you use the KIPasti module in your research, please also reference this repository (commit/tag) in your data statement or appendix.

Changelog

  • 2025-10-07 v1.0.0

    • First release

  • 2025-11-28 v1.1.0

    • Major fix: wrong speaker attribution in linear-jefferson and linear-orthographic

    • Minor fix: empty turns in linear-orthographic were removed



1. abruzzo, basilicata, calabria, campania, emilia-romagna, friuli-venezia-giulia, lazio, liguria, lombardia, marche, molise, piemonte, puglia, sardegna, sicilia, toscana, trentino-alto-adige, umbria, valle-d-aosta, veneto
2. elementary-school, liceo-diploma, middle-school, phd, technical-vocational-diploma, university-degree, university-degree-ongoing