ParlaBO

The ParlaBO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface.

The ParlaBO corpus was compiled within the framework of the DiverSIta – Diversity in spoken Italian project, funded by the Italian Ministry of University and Research (MUR) under the PRIN 2022 PNRR Call.

It consists of over 65 hours of spoken data collected in Bologna and its province through semi-structured interviews. The interviews, conducted between 2021 and 2024, involved 155 speakers with different origins, ages, education levels, and occupations, and covered a variety of topics (study, work, leisure activities, retirement, memories of the past, life in the city, traditions, local customs, etc.).

The transcriptions have been anonymized.

Overall, the module is made up of 86 conversations and includes 155 speakers.

Repository organization

This repository contains:

- metadata for both speakers and conversations, in the metadata subfolder (see Metadata below)
- descriptions of the transcription conventions used for this module (Transcription conventions)

For each conversation you will find:

- .eaf file in eaf/: time-aligned Jefferson-style transcriptions (open with ELAN).
- .txt file in linear-jefferson/: linearized Jefferson-style transcription.
- .txt file in linear-orthographic/: linearized transcription retaining only orthographic words.
- .tsv file in tsv/: verticalized, token-level data with Jefferson features as columns (see Verticalized content).

Linear files contain one Transcription Unit (TU) per line with two columns: speaker code and transcription. TUs are sorted by start time.
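
As a minimal sketch, a linear file could be read in Python as follows. The column separator is not stated above, so a tab is assumed here; `parse_linear` is a hypothetical helper name, and the speaker codes in the example are made up.

```python
def parse_linear(lines):
    """Return a list of (speaker, transcription) pairs, one per TU.

    Assumption: the two columns are separated by a single tab.
    """
    units = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Skip blank lines, if any.
            continue
        # Split on the first tab only, so the transcription is kept whole.
        speaker, _, text = line.partition("\t")
        units.append((speaker, text))
    return units


# Made-up sample in the assumed layout (not taken from the corpus).
sample = ["AB123\tallora (.) cominciamo\n", "CD456\tsì certo\n"]
units = parse_linear(sample)
```

To read a real file, pass an open file handle instead of a list, for example one opened with `encoding="utf-8"` on a file in linear-orthographic/. Because TUs are already sorted by start time, the returned list preserves the order of the conversation.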

Metadata

Each participant and each conversation has associated metadata, stored in metadata/participants.tsv and metadata/conversations.tsv respectively.

Participants metadata:
- code: unique anonymized 5-char identifier. Unknown or occasional participants use the special code ???.
- gender: M (male) or F (female).
- age-range: 5-year range including the participant’s age.
- birth-region: Italian region^[1] where the participant was born. If outside Italy, the label estero is used.
- occupation: occupation according to ISTAT categories.
- study-level: highest completed level of education.^[2]
- mothertongue: participant’s first language — italian, other, or both.
- conversations: summary of conversations in which the participant appears.

Conversations metadata:
- code: unique conversation identifier.
- type: type of interaction — for this module, all conversations are semistructured-interview.
- duration: duration in hh:mm:ss format.
- participants-number: number of participants.
- languages: languages spoken — italian, dialect, or both.
- participants-relationship: asymmetric for this module.
- moderator: presence of a moderator.
- topic: fixed for this module.
- year: year of collection.
- collection-point: two-letter code of the collection area — BO for Bologna.
- participants: codes of participants in the conversation.
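
The metadata tables can be loaded with Python's standard csv module. The sketch below assumes that each file starts with a header row whose column names match the field names listed above; `load_by_code` and `count_by` are hypothetical helper names, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def load_by_code(handle):
    """Read a TSV table and index its rows by the 'code' column."""
    # QUOTE_NONE keeps any quote characters in the data as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["code"]: row for row in reader}


def count_by(rows, field):
    """Count how many rows share each value of the given field."""
    return Counter(row[field] for row in rows.values())


# Made-up sample with a subset of the participant fields.
sample = (
    "code\tgender\tmothertongue\n"
    "AB123\tF\titalian\n"
    "CD456\tM\tboth\n"
)
participants = load_by_code(io.StringIO(sample))
by_tongue = count_by(participants, "mothertongue")
```

For the real tables, pass a handle opened with `open("metadata/participants.tsv", encoding="utf-8", newline="")` in place of the `io.StringIO` sample.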

Verticalized content

Conversations are also available in a vertical, pseudo-tokenized version in tsv/. Tokens are obtained by validating the Jefferson transcription with custom tools and then splitting on whitespace, prosodic links (=), and apostrophes used for elision. Each token is a row with 13 columns:

- token_id: unique token identifier within the conversation.
- speaker: speaker code as found in metadata/participants.tsv.
- tu_id: progressive identifier of the transcription unit.
- span: portion of the original Jefferson transcription containing the token.
- form: orthographic form. Short pauses ((.)) are represented as [PAUSE]; unintelligible tokens as x.
- type: one of linguistic, nonverbalbehavior, shortpause, unknown, error.
- variation: some if the TU contains dialect code-mixing; none otherwise.
- jefferson_feats: word-level features from the Jefferson transcription:
  - SpaceAfter=No: no whitespace between this token and the next.
  - ProsodicLink=Yes: prosodic link (=) to the following token.
  - Intonation: Falling, Rising, or WeaklyRising.
  - Interrupted=Yes: word interrupted in speech (transcribed with ~).
  - Truncated=Yes: truncated form.
  - Volume: High or Low.
- align: AlignBegin and AlignEnd for the first and last token of each TU, in milliseconds.
- prolongations: sound prolongations encoded as <char_id>x<count> pairs (e.g., 2x2,6x1).
- pace: fast- or slow-paced spans, encoded as Fast=<start>-<end> or Slow=<start>-<end>, zero-based over form.
- guesses: uncertain character spans, encoded as <start>-<end>, zero-based, inclusive, over form.
- overlaps: simultaneous speech spans, encoded as <start>-<end>(<overlap_id>) per group.
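
The span-based columns can be decoded with a few regular expressions. The sketch below covers prolongations, guesses, and overlaps as described above. The helper names are hypothetical, the way multiple entries are separated within a cell is not assumed (the patterns simply find every match), and any cell without a match, including an empty placeholder, yields an empty list.

```python
import re

PROLONGATION_RE = re.compile(r"(\d+)x(\d+)")
SPAN_RE = re.compile(r"(\d+)-(\d+)")
OVERLAP_RE = re.compile(r"(\d+)-(\d+)\(([^)]*)\)")


def parse_prolongations(value):
    """'2x2,6x1' -> [(2, 2), (6, 1)] as (char_id, count) pairs."""
    return [(int(c), int(n)) for c, n in PROLONGATION_RE.findall(value)]


def parse_guesses(value):
    """'0-2' -> [(0, 2)] as zero-based, inclusive (start, end) spans."""
    return [(int(s), int(e)) for s, e in SPAN_RE.findall(value)]


def parse_overlaps(value):
    """'0-3(12)' -> [(0, 3, '12')]; the overlap id is kept as a string."""
    return [(int(s), int(e), oid) for s, e, oid in OVERLAP_RE.findall(value)]


def slice_inclusive(form, span):
    """Return the characters of form covered by an inclusive span."""
    start, end = span
    return form[start:end + 1]


# Made-up values in the documented encodings.
prolongations = parse_prolongations("2x2,6x1")
guessed = [slice_inclusive("allora", s) for s in parse_guesses("0-2")]
overlaps = parse_overlaps("0-3(12)")
```

Since the guesses spans are inclusive, the end index is shifted by one before slicing. The same `slice_inclusive` helper applies to overlaps only if those spans are inclusive as well, which the description above does not state.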

Data access

Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, contact the corpus coordinators through the KIParla website and follow the provided procedure.

How to cite

To cite this module, please include:

Mauri, C., Ballarè, S., & Zucchini, E. (2024). Modulo ParlaBO [Data set]. https://doi.org/10.60760/unibo/parlabo

@misc{Mauri_Modulo_ParlaBO_2024,
  author = {Mauri, Caterina and Ballarè, Silvia and Zucchini, Eleonora},
  doi    = {10.60760/unibo/parlabo},
  title  = {{Modulo ParlaBO}},
  url    = {https://github.com/KIParla/ParlaBO},
  year   = {2024}
}

If you use the ParlaBO module in your research, please also reference this repository (commit/tag) in your data statement or appendix.

Changelog

2025-10-07 v1.0.0
- First release

2025-11-28 v1.1.0
- Major fix: corrected wrong speaker attribution in linear-jefferson and linear-orthographic
- Minor fix: removed empty turns from linear-orthographic

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1. abruzzo, basilicata, calabria, campania, emilia-romagna, friuli-venezia-giulia, lazio, liguria, lombardia, marche, molise, piemonte, puglia, sardegna, sicilia, toscana, trentino-alto-adige, umbria, valle-d-aosta, veneto

2. elementary-school, liceo-diploma, middle-school, phd, technical-vocational-diploma, university-degree, university-degree-ongoing