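A defining feature of Taiwanese is the Tone Sandhi Circle, the cycle of tone changes a syllable undergoes when it is not in phrase-final position. Unlike non-tonal languages such as English and Japanese, where intonation is the primary carrier of affect, Taiwanese already uses pitch to carry lexical tone, so expressing emotion requires “Tonal-Safe Emotional Mapping.”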
MTL is a phonetically consistent, multi-syllable orthography based on POJ. MTL is sandhi-aware, meaning the written form of a multi-syllable word reflects the tone of the front syllable after tone sandhi has applied. This overcomes the homonym ambiguity of character-based writing systems.
Historically, this system functioned as a structural checksum; our research suggests that native speakers utilize a form of “Perfect Pitch” to decode these melodic headers. When universal AI models fail to respect this melodic continuity, intelligibility drops significantly.
By treating Taiwanese as a formal signal-processing system rather than a collection of statistical probabilities, we enable a “Sovereign AI” that outperforms universal models in both accuracy and cultural resonance.
The MTL-TTS model uses new modeling techniques:
Computers don't inherently understand the musicality of Taiwanese. A conventional TTS system needs a robust Part-of-Speech (POS) tagger to label every word in the input text as a noun, adjective, verb, etc., before applying the tone rules. Not only are POS resources for Taiwanese scarce, but the complexity of the language cannot be fully covered by POS tags. In practice, the traditional approach of applying sandhi rules shows its limits (70% accuracy). MTL-Tag instead labels sentences with binary labels: 0 (tone change) and 1 (citation tone). It acts as a "script supervisor" for the AI:
Precision Tone Sandhi: Taiwanese is famous for words changing tones based on context. MTL-Tag pre-calculates these shifts so the AI doesn't have to guess.
Prosody Guidance: It marks where a sentence should breathe, pause, or rise in pitch, moving away from robotic, word-by-word synthesis.
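To make the binary labels concrete, here is a minimal Python sketch of how a 0/1 tag sequence could drive tone sandhi and phrase grouping. It is an illustration under several assumptions, not the MTL-Tag implementation: syllables are written in a numeric-tone romanization (for example `tai5`), there is one label per syllable, the mapping is the commonly described sandhi circle with tone 5 going to 7 (many northern speakers use 3 instead), and a label of 1 is treated as closing a tone group. The function names `sandhi_tone`, `apply_tags`, and `sandhi_groups` are hypothetical, and the tags in the example are hand-written rather than model output.

```python
import re

# Commonly described citation -> sandhi mapping for open syllables.
# Assumption: tone 5 -> 7 (southern style); northern speakers often use 5 -> 3.
SANDHI_OPEN = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7"}
# Checked syllables ending in -p/-t/-k swap tones 4 and 8.
SANDHI_PTK = {"4": "8", "8": "4"}
# Checked syllables ending in -h: 4 -> 2 and 8 -> 3.
# Simplification: the final -h is kept in the spelling.
SANDHI_H = {"4": "2", "8": "3"}

# A syllable is lowercase letters followed by one tone digit, e.g. "lang5".
SYLLABLE = re.compile(r"^([a-z]+)([1-8])$")


def sandhi_tone(syllable: str) -> str:
    """Return the syllable with its sandhi tone; unknown input is returned unchanged."""
    match = SYLLABLE.match(syllable)
    if match is None:
        return syllable
    base, tone = match.groups()
    if base.endswith("h"):
        table = SANDHI_H
    elif base[-1] in "ptk":
        table = SANDHI_PTK
    else:
        table = SANDHI_OPEN
    return base + table.get(tone, tone)


def apply_tags(syllables: list[str], tags: list[int]) -> list[str]:
    """Apply sandhi where the tag is 0 (tone change); keep the syllable where it is 1 (citation tone)."""
    if len(syllables) != len(tags):
        raise ValueError("need exactly one tag per syllable")
    return [s if t == 1 else sandhi_tone(s) for s, t in zip(syllables, tags)]


def sandhi_groups(syllables: list[str], tags: list[int]) -> list[list[str]]:
    """Split syllables into groups, closing a group after each citation-tone (1) label."""
    groups, current = [], []
    for s, t in zip(syllables, tags):
        current.append(s)
        if t == 1:
            groups.append(current)
            current = []
    if current:  # trailing syllables without a closing citation tone
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Hand-written illustrative tags, not model output.
    syllables = ["tai5", "uan5", "lang5", "tsiah8", "png7"]
    tags = [0, 0, 1, 0, 1]
    print(apply_tags(syllables, tags))
    print(sandhi_groups(syllables, tags))
```

The design point is that the synthesizer never has to know why a tone changes: the tagger decides where, and a small lookup table decides what. The group boundaries produced by `sandhi_groups` are only candidate pause points; real prosody marking would likely need more than this.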
Why This Project Matters for the Next Generation
Get Involved: We Need Your Tech Skills!
We have the data, we have the pipeline, and we have a model that is already "impressing the neighbors." But there is more to build.
Are you a Python developer? Help us refine our class-based converters for KIP/POJ.
Are you into ML/NLP? Help us optimize our RoBERTa/XLM tagging and loss functions (`loss_mel`, `loss_kl`).
Are you a heritage speaker? Help us review and "slice" audio to ensure the highest-quality training data. Or better yet, help design a model that can review synthesized audio files for correctness.
"I can't do this alone, but together we can fix it." Let’s ensure the language of our parents and grandparents has a seat at the table in the future of AI.
Note: The best model to date was trained on a pool of four speakers (three female, one male) with 16,000 sentences.