# Scripts
This page documents the standalone scripts currently maintained in the tools repository.
## eaf2csv.py

Converts ELAN `.eaf` files into tab-separated transcript CSV files.
Typical use cases:

- export time-aligned annotations into a simpler tabular format
- remap annotation IDs used by per-file YAML overlap exceptions
Key behaviors:

- rows are sorted by start time
- `id:N` prefixes in annotation text are stripped from the exported `text` column
- `ignorepairs` in YAML annotations are remapped to the new sequential `tu_id`
Example:

```shell
python eaf2csv.py \
  --input-dir data/eaf \
  --annotations-dir data/annotations \
  -o data/csv
```
You can also convert explicit files:

```shell
python eaf2csv.py \
  --input-files data/eaf/CONV001.eaf data/eaf/CONV002.eaf \
  -o data/csv
```
Output columns:

- `tu_id`
- `speaker`
- `start`
- `end`
- `duration`
- `text`
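The sorting, sequential `tu_id` assignment, and `id:N`-stripping behaviors can be illustrated with a minimal sketch. The input tuple layout and helper names here are illustrative, not the script's actual internals:

```python
import re

# Hypothetical annotation rows as pulled from an .eaf tier:
# (speaker, start_ms, end_ms, text)
rows = [
    ("S1", 1200, 1800, "id:2 okay"),
    ("S2", 300, 900, "id:1 hello"),
]

ID_PREFIX = re.compile(r"^id:\d+\s*")

def to_transcript_rows(rows):
    """Sort by start time, assign sequential tu_id, strip id:N prefixes."""
    out = []
    for tu_id, (speaker, start, end, text) in enumerate(
        sorted(rows, key=lambda r: r[1]), start=1
    ):
        out.append({
            "tu_id": tu_id,
            "speaker": speaker,
            "start": start,
            "end": end,
            "duration": end - start,
            "text": ID_PREFIX.sub("", text),
        })
    return out

for row in to_transcript_rows(rows):
    print(row)
```

Because `tu_id` is reassigned after sorting, any YAML annotations keyed to the old IDs must be remapped in the same pass, which is why the script handles both together.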
## tsv2eaf.py

Rebuilds `.eaf` files from corpus `.vert.tsv` exports.
Key behaviors:

- reads `Begin=` and `End=` values from the `align` column
- reconstructs Jefferson-style TU text by joining token spans and prosodic links
- exits with an error if a transcription unit is missing `Begin=` or `End=`
Process a single file:

```shell
python tsv2eaf.py -i tsv/BOA1007.vert.tsv -o eaf/
```

Process a whole directory:

```shell
python tsv2eaf.py -i tsv/ -o eaf/
```
Output goes to `eaf/` by default. One `.eaf` file is written per input file, named after the file stem (e.g. `BOA1007.eaf`).
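The `Begin=`/`End=` extraction and the missing-alignment error described above might look roughly like this sketch. The `align` value format and the function name are assumptions, not the actual `.vert.tsv` schema:

```python
import re

ALIGN_RE = re.compile(r"Begin=(\d+)|End=(\d+)")

def tu_bounds(align_values):
    """Collect Begin=/End= times from a TU's align column values.

    Raises ValueError if either bound is missing, mirroring the
    script's exit-with-error behavior.
    """
    begin = end = None
    for value in align_values:
        for m in ALIGN_RE.finditer(value):
            if m.group(1) is not None:
                begin = int(m.group(1))
            else:
                end = int(m.group(2))
    if begin is None or end is None:
        raise ValueError("transcription unit is missing Begin= or End=")
    return begin, end

print(tu_bounds(["Begin=1200", "", "End=1980"]))  # (1200, 1980)
```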
This script is also invoked automatically by the regenerate-derived workflow whenever TSV files are pushed to `dev`. See Automatic regeneration on TSV push.
## tsv2formats.py
Generates two linear text outputs from `.vert.tsv` files:

- Jefferson-style linear transcription
- orthographic linear transcription
These outputs are used in downstream publication and maintenance workflows.
Example for a whole folder:

```shell
python tsv2formats.py \
  --input tsv \
  --out_jefferson linear-jefferson \
  --out_orthographic linear-orthographic
```
Example for a single file:

```shell
python tsv2formats.py --input tsv/CONV001.vert.tsv
```
Output shape:

- one `.txt` file per input `.vert.tsv`
- one transcription unit per line
- two columns per line: speaker code and linearized transcription
Important details:

- the Jefferson output keeps Jefferson spans, pauses, and error token punctuation
- the orthographic output strips non-alphabetic material from `error` tokens
- `unknown` tokens are rendered as repeated `x` characters in the orthographic output
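The orthographic rendering rules for special tokens could be sketched as follows. The token-type labels and the rule that the number of `x` characters matches the form length are assumptions, not the script's documented behavior:

```python
def orthographic_token(form, token_type):
    """Render one token for the orthographic output (sketch)."""
    if token_type == "unknown":
        # unknown tokens become x's (length-matching is an assumption)
        return "x" * len(form)
    if token_type == "error":
        # strip non-alphabetic material; keeps accented letters
        return "".join(ch for ch in form if ch.isalpha())
    return form

print(orthographic_token("???", "unknown"))   # xxx
print(orthographic_token("ciao::", "error"))  # ciao
```

The Jefferson output, by contrast, would pass `error` forms through unchanged so their punctuation survives.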
This script is also invoked automatically by the regenerate-derived workflow whenever TSV files are pushed to `dev`. See Automatic regeneration on TSV push.
## merge_metadata.py

Merges participant and conversation metadata from module repositories into the collection-level metadata tables.
Key behaviors:

- normalizes columns before merging
- fills missing target columns with empty values
- deduplicates by `code`, keeping the first occurrence
Example:

```shell
python merge_metadata.py \
  --modules ../KIP ../KIPasti ../ParlaBO ../ParlaTO \
  --output-dir ../KIParla-collection/metadata
```
Expected module layout:

- each module path must contain `metadata/participants.tsv`
- each module path must contain `metadata/conversations.tsv`
Normalization currently handled by the script:

- in KIP, `school-region` is renamed to `birth-region`
- extra columns are dropped
- missing target columns are added as empty strings
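A minimal sketch of the normalize-then-dedupe logic, using in-memory rows instead of the module TSV files. The target column set and helper names are illustrative; only the KIP rename is taken from the list above:

```python
TARGET_COLUMNS = ["code", "birth-region"]  # illustrative subset
RENAMES = {"KIP": {"school-region": "birth-region"}}

def normalize(row, module):
    """Rename module-specific columns, drop extras, fill missing with ''."""
    renames = RENAMES.get(module, {})
    renamed = {renames.get(k, k): v for k, v in row.items()}
    return {col: renamed.get(col, "") for col in TARGET_COLUMNS}

def merge(rows_by_module):
    """Concatenate normalized rows; dedupe by code, first occurrence wins."""
    seen, merged = set(), []
    for module, rows in rows_by_module:
        for raw in rows:
            row = normalize(raw, module)
            if row["code"] not in seen:
                seen.add(row["code"])
                merged.append(row)
    return merged

merged = merge([
    ("KIP", [{"code": "P01", "school-region": "Piemonte", "notes": "x"}]),
    ("ParlaTO", [{"code": "P01", "birth-region": "Liguria"},
                 {"code": "P02", "birth-region": ""}]),
])
print(merged)
```

Note how the first occurrence of `P01` wins, the extra `notes` column is dropped, and the missing value stays an empty string.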
## make_patch.py

Generates a unified diff `.patch` file from a lemmatization CSV, capturing transcription-level corrections to be applied back to a corpus `.vert.tsv`.

Typical use case: during lemmatization work, errors in the original transcription are spotted (wrong form, wrong Jefferson notation, wrong tokenisation). This script extracts those corrections from the CSV and produces a patch that can be applied with `git apply` on the corpus module's `dev` branch.
What is patched:

- `span`, `form`, `jefferson_feats` — values are stripped of leading/trailing whitespace before comparison
- token deletions — TSV tokens absent from the CSV are removed (e.g. the second half of a merged token); `Begin=`/`End=` alignment is transferred automatically to the neighboring token
What is not patched:

- `lemma`, `upos` — annotation-only columns, not present in the source TSV
- `align`, `prolongations`, `pace`, `guesses`, `overlaps` — TSV-only columns; preserved from the original except for the automatic alignment transfer
- sub-token rows (`token_id` ending in a letter, e.g. `4-7a`)
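The whitespace-stripped comparison and the unified-diff output can be sketched with `difflib` from the standard library. The real script's CSV-to-TSV merge logic is omitted here, and the helper names are hypothetical:

```python
import difflib

def values_differ(tsv_value, csv_value):
    """Compare patched-column values with surrounding whitespace ignored."""
    return tsv_value.strip() != csv_value.strip()

def render_patch(original_lines, corrected_lines, path):
    """Render a git-apply-able unified diff over the corrected TSV."""
    return "".join(difflib.unified_diff(
        original_lines, corrected_lines,
        fromfile=f"a/{path}", tofile=f"b/{path}",
    ))

# 'casa ' vs 'casa' is not a correction once whitespace is stripped
print(values_differ("casa ", "casa"))  # False
print(values_differ("kasa", "casa"))   # True

original = ["la\tDET\n", "kasa\tNOUN\n"]
corrected = ["la\tDET\n", "casa\tNOUN\n"]
print(render_patch(original, corrected, "tsv/BOA1007.vert.tsv"))
```

Using `a/`/`b/` prefixes keeps the diff in the form `git apply` expects when run from the corpus module root.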
Example:

```shell
python make_patch.py \
  ../lemmatization-project/wip/KIP/BOA1007.csv \
  ../KIP/tsv/BOA1007.vert.tsv
```
Output:

```
Patch written: ../lemmatization-project/patches/KIP/BOA1007.vert.tsv.patch (+6 / -9 lines)
Recap written: ../lemmatization-project/patches/KIP/BOA1007.vert.tsv.recap.md (3 items)
```
The output paths are inferred automatically from the CSV location (`wip/CORPUS/FILE.csv` → `patches/CORPUS/FILE.vert.tsv.patch`); a custom output path can be passed as a third argument.

The `.recap.md` file lists every structural change (drops, alignment transfers, columns needing manual attention). Check it before applying the patch.
To apply:

```shell
cd /path/to/KIP
git checkout dev
git apply --check ../lemmatization-project/patches/KIP/BOA1007.vert.tsv.patch
git apply ../lemmatization-project/patches/KIP/BOA1007.vert.tsv.patch
git add tsv/BOA1007.vert.tsv
git commit -m "fix: transcription corrections from lemmatization of BOA1007"
```
See the lemmatization-project documentation for the full workflow, including how to handle recap warnings.
## Design note
These scripts are intentionally lightweight and file-oriented. When behavior becomes complex enough to need a richer data model or reusable pipeline code, it lives in a dedicated module instead.
`normalize.py` is the first such module in this repository: it provides the validation and normalization pipeline for transcription unit text. See Normalization pipeline for details.