ParlaTO

The ParlaTO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface.

The ParlaTO corpus was funded by the CRT Foundation (ParlaTO – Corpus del Parlato di Torino project).

It consists of about 50 hours of interactions collected in Turin and its province through semi-structured interviews. The interviews, conducted between 2018 and 2020, involved 100 speakers with different origins, ages, education levels, and types of occupation, and addressed personal life experiences in the city (study, work, leisure activities, retirement, memories of the past, etc.).

The transcriptions have been anonymized.

Overall, the module comprises 68 conversations involving 100 speakers.

Repository organization

This repository contains:

- metadata for both speakers and conversations, in the metadata subfolder (see Metadata below)
- descriptions of the transcription conventions used for this module (Transcription conventions)

For each conversation you will find:

- an .eaf file in eaf/: time-aligned Jefferson-style transcription (open with ELAN).
- a .txt file in linear-jefferson/: linearized Jefferson-style transcription.
- a .txt file in linear-orthographic/: linearized transcription retaining only orthographic words.
- a .tsv file in tsv/: verticalized, token-level data with Jefferson features as columns (see Verticalized content).

Linear files contain one Transcription Unit (TU) per line with two columns: speaker code and transcription. TUs are sorted by start time.
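
As a minimal sketch of how a linear file could be read, the snippet below assumes that the two columns are separated by a tab character; the separator is not stated above, so check it against the actual files. The function name read_linear and the sample speaker codes are hypothetical and used only for illustration.

```python
import io
from collections import Counter


def read_linear(lines):
    """Yield (speaker, transcription) pairs, one per Transcription Unit.

    Assumption: columns are tab-separated (not confirmed by the description).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        speaker, _, text = line.partition("\t")
        yield speaker, text


# Hypothetical content; a real file would be opened with
# open("linear-orthographic/<conversation>.txt", encoding="utf-8")
sample = io.StringIO(
    "AAA01\tbuongiorno\n"
    "BBB02\tbuongiorno a lei\n"
    "AAA01\tcome va\n"
)

units = list(read_linear(sample))
tu_per_speaker = Counter(speaker for speaker, _ in units)
```

Because TUs are already sorted by start time, iterating over the pairs reproduces the conversation order without any extra sorting.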

Metadata

Each participant and each conversation has associated metadata, stored in metadata/participants.tsv and metadata/conversations.tsv respectively.

Participants metadata:
- code: unique anonymized 5-char identifier. Unknown or occasional participants use the special code ???.
- gender: M for male or F for female.
- age-range: 5-year range including the participant’s age.
- birth-region: Italian region^[1] where the participant was born. If outside Italy, the label estero is used.
- occupation: occupation according to ISTAT categories.
- study-level: highest completed level of education.^[2]
- conversations: summary of conversations in which the participant appears.

Conversations metadata:
- code: unique conversation identifier.
- type: type of interaction — for this module, all conversations are semistructured-interview.
- duration: duration in hh:mm:ss format.
- participants-number: number of participants.
- languages: languages spoken — italian, dialect, or both.
- participants-relationship: relationship among speakers — symmetric or asymmetric.
- moderator: presence of a moderator.
- topic: fixed for this module.
- year: year of collection.
- collection-point: two-letter code of the collection area — TO for Turin.
- participants: codes of participants in the conversation.
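
The metadata tables can be loaded with the Python standard library. The sketch below assumes that each file has a header row whose column names match the field names listed above, which is not stated explicitly; the helper names and the sample rows are hypothetical. It also shows how a duration in hh:mm:ss format can be converted to seconds.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated table into a list of dicts keyed by column name.

    Assumption: the first row is a header.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def duration_to_seconds(value):
    """Convert a duration in hh:mm:ss format to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical rows; a real file would be opened with
# open("metadata/conversations.tsv", encoding="utf-8", newline="")
sample = io.StringIO(
    "code\tduration\tparticipants-number\n"
    "CONV_A\t00:45:30\t2\n"
    "CONV_B\t01:02:05\t3\n"
)

conversations = read_tsv(sample)
total_seconds = sum(duration_to_seconds(row["duration"]) for row in conversations)
```

Every value is read as a string, so numeric fields such as participants-number need an explicit conversion before use.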

Verticalized content

Conversations are also available in a vertical, pseudo-tokenized version in tsv/. Tokenization is obtained by validating the Jefferson transcription using custom tools and splitting on whitespace, prosodic links (=), and apostrophes used for elision. Each token is a row with 13 columns:

token_id: unique token identifier within the conversation.
speaker: speaker code as found in metadata/participants.tsv.
tu_id: progressive identifier of the transcription unit.
span: portion of the original Jefferson transcription containing the token.
form: orthographic form. Short pauses ((.)) are represented as [PAUSE]; unintelligible tokens as x.
type: one of linguistic, nonverbalbehavior, shortpause, unknown, error.
variation: some if the TU contains dialect code-mixing; none otherwise.
jefferson_feats: word-level features from the Jefferson transcription:
- SpaceAfter=No: no whitespace between this token and the next.
- ProsodicLink=Yes: prosodic link (=) to the following token.
- Intonation: Falling, Rising, or WeaklyRising.
- Interrupted=Yes: word interrupted in speech (transcribed with ~).
- Truncated=Yes: truncated form.
- Volume: High or Low.
align: AlignBegin and AlignEnd for the first and last token of each TU, in milliseconds.
prolongations: sound prolongations encoded as <char_id>x<count> pairs (e.g., 2x2,6x1).
pace: fast- or slow-paced spans, encoded as Fast=<start>-<end> or Slow=<start>-<end>, with zero-based indices over form.
guesses: uncertain character spans — <start>-<end>, zero-based, inclusive, over form.
overlaps: simultaneous speech spans — <start>-<end>(<overlap_id>) per group.
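
The span-based columns described above can be decoded with a few regular expressions. This is only a sketch: the function names are hypothetical, the sample values are made up, and the patterns simply pick out every match in a cell, so they do not depend on how multiple entries are separated or on how empty cells are marked (neither is specified here). Overlap identifiers are kept as strings because their type is not documented.

```python
import re

PROLONGATION_RE = re.compile(r"(\d+)x(\d+)")
PACE_RE = re.compile(r"(Fast|Slow)=(\d+)-(\d+)")
SPAN_RE = re.compile(r"(\d+)-(\d+)")
OVERLAP_RE = re.compile(r"(\d+)-(\d+)\(([^)]*)\)")


def parse_prolongations(cell):
    """'2x2,6x1' -> [(2, 2), (6, 1)] as (char_id, count) pairs."""
    return [(int(c), int(n)) for c, n in PROLONGATION_RE.findall(cell)]


def parse_pace(cell):
    """'Fast=0-3' -> [('Fast', 0, 3)] as (kind, start, end) triples."""
    return [(k, int(s), int(e)) for k, s, e in PACE_RE.findall(cell)]


def parse_guesses(cell):
    """'0-2' -> [(0, 2)] as (start, end) pairs."""
    return [(int(s), int(e)) for s, e in SPAN_RE.findall(cell)]


def parse_overlaps(cell):
    """'1-4(7)' -> [(1, 4, '7')] as (start, end, overlap_id) triples."""
    return [(int(s), int(e), oid) for s, e, oid in OVERLAP_RE.findall(cell)]


def span_text(form, start, end):
    """Return the characters of form covered by a zero-based, inclusive span."""
    return form[start:end + 1]


# Made-up values for illustration
prolongations = parse_prolongations("2x2,6x1")
guessed = span_text("esempio", *parse_guesses("0-2")[0])
```

The inclusive end index is documented for guesses only; whether pace and overlaps spans are also inclusive should be verified against the data before reusing span_text for them.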

Data access

Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, contact the corpus coordinators through the KIParla website and follow the provided procedure.

How to cite

To cite this module please include:

Cerruti, M., & Ballarè, S. (2020). ParlaTO: corpus del parlato di Torino. Bollettino dell’Atlante Linguistico Italiano (BALI), 44, 171–196.

@article{Cerruti_ParlaTO_corpus_del_2020,
  author  = {Cerruti, Massimo and Ballarè, Silvia},
  journal = {Bollettino dell'Atlante Linguistico Italiano (BALI)},
  pages   = {171--196},
  title   = {{ParlaTO: corpus del parlato di Torino}},
  volume  = {44},
  year    = {2020}
}

If you use the ParlaTO module in your research, please also reference this repository (commit/tag) in your data statement or appendix.

Changelog

2025-10-07 v1.0.0
- First release

2025-11-28 v1.1.0
- Major fix: wrong speaker attribution in linear-jefferson and linear-orthographic
- Minor fix: empty turns in linear-orthographic were removed

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

1. abruzzo, basilicata, calabria, campania, emilia-romagna, friuli-venezia-giulia, lazio, liguria, lombardia, marche, molise, piemonte, puglia, sardegna, sicilia, toscana, trentino-alto-adige, umbria, valle-d-aosta, veneto

2. elementary-school, liceo-diploma, middle-school, phd, technical-vocational-diploma, university-degree, university-degree-ongoing