Taiwanese prosody is defined by the Tone Sandhi Circle. Unlike non-tonal languages such as English and Japanese, where intonation is the primary carrier of affect, Taiwanese requires "Tonal-Safe Emotional Mapping."
MTL is a phonetically consistent, multi-syllable orthography based on POJ. MTL is sandhi-aware, meaning the written form of a multi-syllable word reflects the tone of the front syllable after Tone Sandhi has been applied. This overcomes the homonym ambiguity of character-based writing systems.
Historically, this system functioned as a structural checksum; our research suggests that native speakers utilize a form of "Perfect Pitch" to decode these melodic headers. When universal AI models fail to respect this melodic continuity, intelligibility drops significantly.
By treating Taiwanese as a formal signal-processing system rather than a collection of statistical probabilities, we enable a "Sovereign AI" that outperforms universal models in both accuracy and cultural resonance.
The MTL-TTS model employs new modeling technologies:
Computers don't inherently understand the musicality of Taiwanese. MTLTag is our custom labeling system that acts as a "script supervisor" for the AI:
- Precision Tone Sandhi: Taiwanese is famous for words changing tone based on context. MTLTag pre-calculates these shifts so the AI doesn't have to guess (a simplified sketch follows this list).
- Prosody Guidance: It marks where a sentence should breathe, pause, or rise in pitch, moving away from robotic, word-by-word synthesis.
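To make the idea concrete, here is a minimal sketch of sandhi pre-calculation, not the actual MTLTag implementation: the rule table covers only the unchecked tones of the northern sandhi circle, and the function name and data layout are hypothetical.

```python
# Minimal sketch of tone sandhi pre-calculation (NOT the MTLTag code).
# Covers only unchecked tones, northern Taiwanese variant.

# Citation tone -> sandhi tone for every non-final syllable of a tone group.
SANDHI_CIRCLE = {
    1: 7,  # tone 1 surfaces as tone 7
    7: 3,  # tone 7 surfaces as tone 3
    3: 2,  # tone 3 surfaces as tone 2
    2: 1,  # tone 2 surfaces as tone 1, closing the circle
    5: 7,  # tone 5 surfaces as tone 7 in the north (3 in the south)
}

def precalculate_sandhi(word):
    """word: list of (syllable, citation_tone) pairs forming one tone group.
    All syllables except the last shift along the circle; the final
    syllable keeps its citation tone."""
    out = []
    for i, (syl, tone) in enumerate(word):
        is_final = i == len(word) - 1
        out.append((syl, tone if is_final else SANDHI_CIRCLE.get(tone, tone)))
    return out

# "Tai-oan" (Taiwan): citation tones 5 + 5 surface as 7 + 5 in the north.
print(precalculate_sandhi([("tai", 5), ("oan", 5)]))  # [('tai', 7), ('oan', 5)]
```

Pre-computing the surface tones this way turns sandhi from something the acoustic model must infer into an explicit input label.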
Why This Project Matters for the Next Generation
Get Involved: We Need Your Tech Skills!
We have the data, we have the pipeline, and we have a model that is already "impressing the neighbors." But there is more to build.
- Are you a Python developer? Help us refine our class-based converters for KIP/POJ (a hypothetical sketch follows this list).
- Are you into ML/NLP? Help us optimize our RoBERTa/XLM tagging and loss functions (`loss_mel`, `loss_kl`).
- Are you a Heritage Speaker? Help us review and "slice" audio to ensure the highest-quality training data. Or better yet, help design a model that can review synthesized audio files for correctness.
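For contributors wondering what such a converter involves, here is a hypothetical sketch of a class-based POJ-to-KIP (Tâi-lô) converter. It handles only a few well-known grapheme correspondences, ignores tone diacritics, and is not the project's actual code; the class and method names are invented.

```python
# Hypothetical class-based POJ -> KIP (Tai-lo) converter sketch.
# Handles only a few well-known correspondences; tone marks are ignored.

class PojToKipConverter:
    # Longest-match-first: "chh" must be rewritten before "ch".
    RULES = [
        ("chh", "tsh"),
        ("ch", "ts"),
        ("o\u0358", "oo"),  # o + combining dot above right -> oo
        ("eng", "ing"),
        ("ek", "ik"),
        ("oa", "ua"),
        ("oe", "ue"),
        ("\u207f", "nn"),   # superscript n (nasalization) -> nn
    ]

    def convert_syllable(self, syllable: str) -> str:
        for poj, kip in self.RULES:
            syllable = syllable.replace(poj, kip)
        return syllable

    def convert(self, text: str) -> str:
        # Hyphen- and space-separated syllables are converted independently.
        return " ".join(
            "-".join(self.convert_syllable(s) for s in word.split("-"))
            for word in text.split()
        )

converter = PojToKipConverter()
print(converter.convert("chhan-chhiuⁿ"))  # -> tshan-tshiunn
```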
"I can't do this alone, but together we can fix it." Let’s ensure the language of our parents and grandparents has a seat at the table in the future of AI.
Note: This model was trained on a pool of two female speakers and one male speaker, 9,000+ sentences in total, for 36 hours and 160,000 iteration steps (reading and listening to 4 sentences at a time). For VITS, it is said that about 350,000 steps are required to reach a best model. We are now sorting and balancing the next training dataset by changing the speed and pitch of the one female and one male speaker who are under-represented in the current data. Balancing the speakers in multi-speaker training gives us better control; as a result, we can choose whose voice to use for audio synthesis.
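For illustration, here is a minimal sketch of the speed/pitch augmentation described above. The use of librosa and soundfile is an assumption about tooling, and the file names and parameter values are hypothetical.

```python
# Minimal sketch of speed/pitch augmentation for balancing speakers
# (tooling assumed: librosa + soundfile; not the project's actual pipeline).
import librosa
import soundfile as sf

def augment(in_path: str, out_path: str, rate: float, n_steps: float):
    """rate > 1.0 speeds speech up; n_steps shifts pitch in semitones.
    Small perturbations yield extra 'voices' from under-represented
    speakers, helping balance a multi-speaker training set."""
    y, sr = librosa.load(in_path, sr=None)          # keep original sample rate
    y = librosa.effects.time_stretch(y, rate=rate)  # change speaking speed
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # shift pitch
    sf.write(out_path, y, sr)

# Hypothetical usage: two variants of an under-represented male speaker.
augment("male_001.wav", "male_001_fast_up.wav", rate=1.1, n_steps=1.0)
augment("male_001.wav", "male_001_slow_down.wav", rate=0.9, n_steps=-1.0)
```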