Normalization pipeline (normalize.py)

normalize.py is a library module (no CLI) that validates and normalizes the annotation text of transcription units before further processing.

It replaces the hardcoded validation logic that previously lived inside kiparla_tools.TranscriptionUnit.__post_init__, making rules explicit, independently testable, and easy to extend.

Architecture

The pipeline is built around a simple registry pattern.

ValidationRule

Each rule is a ValidationRule dataclass with three fields:

from dataclasses import dataclass
from typing import Callable

@dataclass
class ValidationRule:
    name: str                    # used in warnings/errors output and config
    function: Callable           # the function that implements the rule
    enabled_by_default: bool     # whether the rule is active without explicit config

Two registries

Rules are organized into two ordered lists:

WARNING_RULES

Auto-fix rules. Each function has signature (str) → tuple[int, str], returning the number of substitutions made and the modified annotation. Rules are applied in order — each one receives the output of the previous.

ERROR_RULES

Check-only rules. Each function has signature (str) → bool, returning True if the annotation is valid and False if a problem is detected. Error rules run after all warning rules, on the final normalized text.

validate_and_normalize

validate_and_normalize(
    annotation: str,
    config: dict[str, bool] | None = None,
) -> tuple[str, dict[str, int], dict[str, bool]]

Runs all enabled rules on annotation and returns:

  • normalized: the (possibly modified) annotation string

  • warnings: {rule_name: substitution_count} for every warning rule that made changes

  • errors: {rule_name: True} for every error rule that detected a problem

If config is None or a key is absent, each rule falls back to its enabled_by_default value.
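The flow described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the two rule functions (collapse_spaces, dots_balanced) are hypothetical stand-ins, and the registries here hold only one rule each. It assumes that a warning rule is recorded only when its substitution count is non-zero and that a missing config key falls back to enabled_by_default.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidationRule:
    name: str
    function: Callable
    enabled_by_default: bool


# Hypothetical stand-in for a warning rule: returns (count, new_text).
def collapse_spaces(annotation: str) -> tuple[int, str]:
    new, count = re.subn(r"\s{2,}", " ", annotation)
    return count, new


# Hypothetical stand-in for an error rule: True means "valid".
def dots_balanced(annotation: str) -> bool:
    return annotation.count("°") % 2 == 0


WARNING_RULES = [ValidationRule("MULTIPLE_SPACES", collapse_spaces, True)]
ERROR_RULES = [ValidationRule("UNBALANCED_DOTS", dots_balanced, True)]


def validate_and_normalize(
    annotation: str,
    config: Optional[dict[str, bool]] = None,
) -> tuple[str, dict[str, int], dict[str, bool]]:
    config = config or {}
    warnings: dict[str, int] = {}
    errors: dict[str, bool] = {}

    # Warning rules run in registry order, each on the previous rule's output.
    for rule in WARNING_RULES:
        if not config.get(rule.name, rule.enabled_by_default):
            continue
        count, annotation = rule.function(annotation)
        if count:
            warnings[rule.name] = count

    # Error rules only inspect the final normalized text.
    for rule in ERROR_RULES:
        if not config.get(rule.name, rule.enabled_by_default):
            continue
        if not rule.function(annotation):
            errors[rule.name] = True

    return annotation, warnings, errors


print(validate_and_normalize("ciao   °mondo"))
# ('ciao °mondo', {'MULTIPLE_SPACES': 1}, {'UNBALANCED_DOTS': True})
print(validate_and_normalize("ciao   °mondo", {"MULTIPLE_SPACES": False}))
# ('ciao   °mondo', {}, {'UNBALANCED_DOTS': True})
```

Passing {"MULTIPLE_SPACES": False} disables only that rule; UNBALANCED_DOTS is absent from the dict, so it keeps its default.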

Rule reference

Warning rules (applied in order)

| Name | What it fixes | On by default |
| --- | --- | --- |
| SYMBOL_NOT_ALLOWED | Removes characters outside the Jefferson transcription alphabet | yes |
| META_TAG | Converts {…} and (.) → {P} | yes |
| UNEVEN_SPACES | Fixes spacing around brackets and punctuation; removes stray spaces inside °…° and <…> markers; removes spaces around = prosodic-link markers | yes |
| TRIM_PAUSES | Strips leading and trailing {P} pause markers | yes |
| TRIM_PROSODICLINKS | Strips leading and trailing = prosodic-link markers | yes |
| OVERLAP_PROLONGATION | Fixes malformed overlap+prolongation sequences (word:[: → [word::) | yes |
| MULTIPLE_SPACES | Collapses tabs, newlines, and repeated spaces into a single space | yes |
| ACCENTS | Normalizes Italian accent errors: -chè → -ché family, po', però' → però, può' → può | yes |
| NUMBERS | Converts digit sequences to Italian words (2 → due, 23 → ventitré) | yes |
| SWITCHES | Moves intonation markers (. , ?) before prosodic/interruption symbols (: - ~) when they appear in the wrong order; moves NVB tags {x} outside bracket spans | yes |
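To make the warning-rule contract concrete, here is a rough sketch of two of the rules listed above and of how they chain. The regular expressions are assumptions written for illustration; the real TRIM_PAUSES and MULTIPLE_SPACES implementations may match differently and may count substitutions differently. run_chain is a hypothetical helper, not part of normalize.py.

```python
import re


def trim_pauses(annotation: str) -> tuple[int, str]:
    """Strip leading and trailing {P} markers (assumed TRIM_PAUSES behaviour)."""
    new, count = re.subn(r"^(?:\s*\{P\})+\s*|\s*(?:\{P\}\s*)+$", "", annotation)
    return count, new


def collapse_spaces(annotation: str) -> tuple[int, str]:
    """Turn tabs, newlines, and runs of whitespace into one space."""
    new, count = re.subn(r"\s{2,}|[\t\n]", " ", annotation)
    return count, new


def run_chain(annotation, rules):
    """Apply (name, function) pairs in order; each sees the previous output."""
    counts = {}
    for name, function in rules:
        count, annotation = function(annotation)
        counts[name] = count
    return annotation, counts


text, counts = run_chain(
    "{P}  ciao   mondo {P}",
    [("TRIM_PAUSES", trim_pauses), ("MULTIPLE_SPACES", collapse_spaces)],
)
print(text)    # ciao mondo
print(counts)  # {'TRIM_PAUSES': 2, 'MULTIPLE_SPACES': 1}
```

Both functions lean on re.subn, which returns the new string together with the number of replacements, so the (count, text) return value comes almost for free.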

Error rules (checked after normalization)

| Name | What it checks | On by default |
| --- | --- | --- |
| UNBALANCED_DOTS | ° count is even (all low-volume spans are closed) | yes |
| UNBALANCED_PACE | <…> and >…< pace markers are balanced and non-nested | yes |
| UNBALANCED_GUESS | (…) uncertain-transcription markers are balanced and non-nested | yes |
| UNBALANCED_OVERLAP | […] overlap markers are balanced and non-nested | yes |
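A "balanced and non-nested" check can be sketched with a single open/closed flag. The helper below is hypothetical and covers only single-character pairs such as ( ) and [ ]; the pace markers, where < and > each act as both opener and closer, would need extra handling that this sketch does not attempt. Following the error-rule contract, True means the annotation is valid.

```python
def balanced_non_nested(annotation: str, opener: str, closer: str) -> bool:
    """Return True if every opener is closed before the next one starts."""
    is_open = False
    for char in annotation:
        if char == opener:
            if is_open:          # a second opener before a closer: nested
                return False
            is_open = True
        elif char == closer:
            if not is_open:      # closer with nothing open
                return False
            is_open = False
    return not is_open           # False if a span was left open


def dots_balanced(annotation: str) -> bool:
    """Assumed UNBALANCED_DOTS logic: an even number of ° characters."""
    return annotation.count("°") % 2 == 0


print(balanced_non_nested("[ciao] bene", "[", "]"))    # True
print(balanced_non_nested("[ciao [bene]]", "[", "]"))  # False
print(dots_balanced("°piano° forte°"))                 # False
```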

NVB tags and bracket spans

Non-verbal behaviour tags ({laugh}, {ride}, {toss}, …) must appear outside bracket spans, not inside them. The SWITCHES warning rule enforces this automatically by relocating any NVB tag found immediately after an opening bracket or immediately before a closing bracket.

Exception: {P} (short pause) inside overlap brackets […] is left in place. A pause that coincides with an overlap is transcriptionally significant and must be preserved.

[{P}]        ← valid, no change
[{laugh}]    ← becomes {laugh} []
({P})        ← becomes {P} ()   (exception applies only to [ ])
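The relocation and its exception could look roughly like the sketch below. It is not the module's switch_NVB code: relocate_nvb and both patterns are hypothetical, and the sketch assumes that only [ ] and ( ) count as bracket spans and that a tag is any {…} group without nested braces.

```python
import re

# Assumed patterns: a tag right after an opener, or right before a closer.
OPEN_TAG = re.compile(r"([\[(])\s*(\{[^{}]+\})\s*")
TAG_CLOSE = re.compile(r"\s*(\{[^{}]+\})\s*([\])])")


def relocate_nvb(annotation: str) -> tuple[int, str]:
    """Move NVB tags outside brackets, keeping {P} inside [ ] untouched."""
    moved = 0

    def after_open(match: re.Match) -> str:
        nonlocal moved
        bracket, tag = match.group(1), match.group(2)
        if bracket == "[" and tag == "{P}":
            return match.group(0)      # exception: pause inside an overlap
        moved += 1
        return f"{tag} {bracket}"

    def before_close(match: re.Match) -> str:
        nonlocal moved
        tag, bracket = match.group(1), match.group(2)
        if bracket == "]" and tag == "{P}":
            return match.group(0)      # exception: pause inside an overlap
        moved += 1
        return f"{bracket} {tag}"

    annotation = OPEN_TAG.sub(after_open, annotation)
    annotation = TAG_CLOSE.sub(before_close, annotation)
    return moved, annotation


print(relocate_nvb("[{P}]"))          # (0, '[{P}]')
print(relocate_nvb("[{laugh}]"))      # (1, '{laugh} []')
print(relocate_nvb("({P})"))          # (1, '{P} ()')
print(relocate_nvb("[ciao {laugh}]")) # (1, '[ciao] {laugh}')
```

A manual counter is used instead of re.subn because matches that hit the {P} exception are returned unchanged and should not count as substitutions.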

Accent maps

The substitution tables used by the accent normalization rules are defined as module-level constants:

ACCENT_CHE_MAP

Words where -chè should become -ché (e.g. perchè → perché). The patterns tolerate Jefferson markers interspersed between letters (e.g. per[chè → per[ché).

ACCENT_PERO_MAP

Words where a trailing apostrophe-accent should become the proper accent (e.g. però' → però, può' → può).

Both maps are plain Python dictionaries and can be replaced or extended when YAML configuration is added (planned).
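One way a dictionary-driven rule can tolerate markers between letters is to build a pattern per entry. The sketch below is an assumption about the approach, not a copy of ACCENT_CHE_MAP: the map name ACCENT_CHE_SKETCH, its two entries, the MARKERS character set, and both helper functions are illustrative, and it only handles words whose wrong and right forms differ in the final letter.

```python
import re

# Assumed set of Jefferson marker characters allowed between letters.
MARKERS = r"[\[\]()<>°:~\-]*"

# Illustrative entries only; the real map's contents are not shown here.
ACCENT_CHE_SKETCH = {"perchè": "perché", "poichè": "poiché"}


def tolerant_pattern(word: str) -> re.Pattern:
    """Match `word` even when markers sit between its letters."""
    letters = [re.escape(char) for char in word]
    prefix = MARKERS.join(letters[:-1]) + MARKERS
    # Group 1 keeps the letters and markers before the final character.
    return re.compile(rf"\b({prefix}){letters[-1]}")


def fix_accents(annotation: str, mapping=ACCENT_CHE_SKETCH) -> tuple[int, str]:
    total = 0
    for wrong, right in mapping.items():
        pattern = tolerant_pattern(wrong)
        # Keep the matched prefix (markers included), swap the last letter.
        annotation, count = pattern.subn(
            lambda match: match.group(1) + right[-1], annotation
        )
        total += count
    return total, annotation


print(fix_accents("per[chè no"))       # (1, 'per[ché no')
print(fix_accents("perchè e poichè"))  # (2, 'perché e poiché')
```

Because the replacement reuses the captured prefix, any marker inside the word survives the correction in its original position.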

Planned: YAML configuration

Rule enable/disable and accent map overrides will be driven by a YAML config file in a future step. For now, all rules are controlled via the config dict argument to validate_and_normalize.

Adding a new rule

  1. Write the function in normalize.py, following the appropriate signature:

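    # Warning rule: returns (number of substitutions, modified annotation)
    def check_ends_with_symbol(annotation: str) -> tuple[int, str]:
        ...

    # Error rule (alternative to the above): returns True if the annotation is valid
    def check_ends_with_symbol(annotation: str) -> bool:
        ...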
  2. Append a ValidationRule entry to the appropriate registry. For warning rules, position in the list matters — place it where it fits logically in the normalization sequence:

    WARNING_RULES.append(
        ValidationRule("ENDS_WITH_SYMBOL", check_ends_with_symbol, enabled_by_default=False)
    )
    # or for error rules:
    ERROR_RULES.append(
        ValidationRule("ENDS_WITH_SYMBOL", check_ends_with_symbol, enabled_by_default=False)
    )
  3. Add tests in tests/test_normalize.py, then update this page with a row in the rule reference table.

That is the complete change surface. No other code needs to be modified.

Tests

The full test suite for this module lives in tests/test_normalize.py. See Testing for setup and how to run tests.

The tests cover:

  • pipeline infrastructure (empty registries, config enable/disable, rule ordering, warning count accumulation)

  • each normalization and validation function individually

  • the {P}-in-overlap exception for switch_NVB

  • the "space moves to the correct side" behaviour of check_spaces