Taiwanese prosody is defined by the Tone Sandhi Circle. Unlike non-tonal languages such as English and Japanese, where intonation is the primary carrier of affect, Taiwanese requires "Tonal-Safe Emotional Mapping."
MTL is a phonetically consistent, multi-syllable orthography based on POJ. MTL is sandhi-aware, meaning the written form of a multi-syllable word reflects the tone of the front syllable after Tone Sandhi has been applied. This overcomes the homonym ambiguity of character-based writing systems.
Historically, this system functioned as a structural checksum; our research suggests that native speakers utilize a form of "Perfect Pitch" to decode these melodic headers. When universal AI models fail to respect this melodic continuity, intelligibility drops significantly.
By treating Taiwanese as a formal signal-processing system rather than a collection of statistical probabilities, we enable a "Sovereign AI" that outperforms universal models in both accuracy and cultural resonance.
The MTL-TTS model employs new modeling technologies:
Computers don't inherently understand the musicality of Taiwanese. MTLTag is our custom labeling system that acts as a "script supervisor" for the AI:
- Precision Tone Sandhi: Taiwanese is famous for words changing tone based on context. MTLTag pre-calculates these shifts so the AI doesn't have to guess (a simplified sketch follows this list).
- Prosody Guidance: It marks where a sentence should breathe, pause, or rise in pitch, moving away from robotic, word-by-word synthesis.
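To make the idea concrete, here is a minimal sketch of sandhi pre-calculation, not the actual MTLTag implementation: the rule table covers only the unchecked tones of the northern sandhi circle, and the function name and data layout are hypothetical.

```python
# Minimal sketch of tone sandhi pre-calculation (NOT the MTLTag code).
# Covers only unchecked tones, northern Taiwanese variant.

# Citation tone -> sandhi tone for every non-final syllable of a tone group.
SANDHI_CIRCLE = {
    1: 7,  # tone 1 surfaces as tone 7
    7: 3,  # tone 7 surfaces as tone 3
    3: 2,  # tone 3 surfaces as tone 2
    2: 1,  # tone 2 surfaces as tone 1, closing the circle
    5: 7,  # tone 5 surfaces as tone 7 in the north (3 in the south)
}

def precalculate_sandhi(word):
    """word: list of (syllable, citation_tone) pairs forming one tone group.
    All syllables except the last shift along the circle; the final
    syllable keeps its citation tone."""
    out = []
    for i, (syl, tone) in enumerate(word):
        is_final = i == len(word) - 1
        out.append((syl, tone if is_final else SANDHI_CIRCLE.get(tone, tone)))
    return out

# "Tai-oan" (Taiwan): citation tones 5 + 5 surface as 7 + 5 in the north.
print(precalculate_sandhi([("tai", 5), ("oan", 5)]))  # [('tai', 7), ('oan', 5)]
```

Pre-computing the surface tones this way turns sandhi from something the acoustic model must infer into an explicit input label.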
Why This Project Matters for the Next Generation
Get Involved: We Need Your Tech Skills!
We have the data, we have the pipeline, and we have a model that is already "impressing the neighbors." But there is more to build.
- Are you a Python developer? Help us refine our class-based converters for KIP/POJ (a hypothetical sketch follows this list).
- Are you into ML/NLP? Help us optimize our RoBERTa/XLM tagging and loss functions (`loss_mel`, `loss_kl`).
- Are you a Heritage Speaker? Help us review and "slice" audio to ensure the highest-quality training data. Or better yet, help design a model that can review synthesized audio files for correctness.
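For contributors wondering what such a converter involves, here is a hypothetical sketch of a class-based POJ-to-KIP (Tâi-lô) converter. It handles only a few well-known grapheme correspondences, ignores tone diacritics, and is not the project's actual code; the class and method names are invented.

```python
# Hypothetical class-based POJ -> KIP (Tai-lo) converter sketch.
# Handles only a few well-known correspondences; tone marks are ignored.

class PojToKipConverter:
    # Longest-match-first: "chh" must be rewritten before "ch".
    RULES = [
        ("chh", "tsh"),
        ("ch", "ts"),
        ("o\u0358", "oo"),  # o + combining dot above right -> oo
        ("eng", "ing"),
        ("ek", "ik"),
        ("oa", "ua"),
        ("oe", "ue"),
        ("\u207f", "nn"),   # superscript n (nasalization) -> nn
    ]

    def convert_syllable(self, syllable: str) -> str:
        for poj, kip in self.RULES:
            syllable = syllable.replace(poj, kip)
        return syllable

    def convert(self, text: str) -> str:
        # Hyphen- and space-separated syllables are converted independently.
        return " ".join(
            "-".join(self.convert_syllable(s) for s in word.split("-"))
            for word in text.split()
        )

converter = PojToKipConverter()
print(converter.convert("chhan-chhiuⁿ"))  # -> tshan-tshiunn
```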
"I can't do this alone, but together we can fix it." Let’s ensure the language of our parents and grandparents has a seat at the table in the future of AI.
Note: This model was trained on a pool of two female speakers and one male speaker, 9,000+ sentences in total, for 36 hours and 160,000 iteration steps (reading and listening to 4 sentences at a time). For VITS, it is said that about 350,000 steps are required to reach a best model. We are now sorting and balancing the next training dataset by changing the speed and pitch of the one female and one male speaker who are under-represented in the current data. Balancing the speakers in multi-speaker training gives us better control; as a result, we can choose whose voice to use for audio synthesis.
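For illustration, here is a minimal sketch of the speed/pitch augmentation described above. The use of librosa and soundfile is an assumption about tooling, and the file names and parameter values are hypothetical.

```python
# Minimal sketch of speed/pitch augmentation for balancing speakers
# (tooling assumed: librosa + soundfile; not the project's actual pipeline).
import librosa
import soundfile as sf

def augment(in_path: str, out_path: str, rate: float, n_steps: float):
    """rate > 1.0 speeds speech up; n_steps shifts pitch in semitones.
    Small perturbations yield extra 'voices' from under-represented
    speakers, helping balance a multi-speaker training set."""
    y, sr = librosa.load(in_path, sr=None)          # keep original sample rate
    y = librosa.effects.time_stretch(y, rate=rate)  # change speaking speed
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # shift pitch
    sf.write(out_path, y, sr)

# Hypothetical usage: two variants of an under-represented male speaker.
augment("male_001.wav", "male_001_fast_up.wav", rate=1.1, n_steps=1.0)
augment("male_001.wav", "male_001_slow_down.wav", rate=0.9, n_steps=-1.0)
```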