Why Welsh Voices Matter

When you ask Siri a question or use Google Assistant, you're interacting with sophisticated text-to-speech (TTS) technology powered by neural networks. These modern AI systems need one thing to work well: lots and lots of recorded speech data.

For Welsh—a language with fewer digital resources than major languages like English—this presents a challenge. Welsh speakers often code-switch, mixing Welsh and English mid-sentence for named entities, convenience, or when speaking with learners. Think: "Dwi'n mynd i Tesco heddiw" (I'm going to Tesco today). A truly useful Welsh voice assistant needs to handle both languages seamlessly.

Without adequate voice technology, languages like Welsh risk what researcher Georg Rehm calls "digital language extinction": being left behind as technology advances. In 2018, the Welsh Government published its Welsh Language Technology Action Plan, setting out how it intends to ensure Welsh speakers can use their language across all forms of technology.

This is where BU-TTS comes in.

Standing on Shoulders

We weren't starting from scratch. Welsh language technology has been developing for years at Bangor University and beyond.

The WISPR project, back in 2004, created one of the first Welsh TTS corpora: 3 hours of recordings from a single speaker reading excerpts from the Bible and an undergraduate dissertation. By 2016, the same dataset had been used to create an open-source voice for Macsen, the Welsh digital assistant, using the MaryTTS framework. More recently, the Lleisiwr project has used similar technology to create personalized synthetic voices for people at risk of losing their ability to speak.

These were important steps forward. But there was a problem: modern AI voice systems (the kind that power Alexa, Siri, and Google Assistant) are based on deep neural networks that need substantially more data than traditional TTS architectures. Three hours wasn't going to cut it.

We needed something bigger.

What We Built

BU-TTS (Bangor University Text-to-Speech Corpus) is a bilingual Welsh-English dataset specifically designed to work with modern neural TTS systems. Here's what we created:

  • 12,200 text prompts: 9,500 in Welsh and 2,700 in English, each between 4 and 14 words long
  • 9.8 hours of recordings: From 4 native Welsh speakers (2 female, 2 male)
  • Diverse accents: North and south Welsh accents represented
  • Phonetically balanced: Sentences carefully chosen to include all the sounds in Welsh
  • Fully open: Released under CC0 1.0 (a public-domain dedication, the most permissive option available)

The key difference? This corpus is large enough to train modern neural network architectures—making it the first open-source Welsh corpus to capitalize on advances in deep learning for voice synthesis.
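
To make the shape of the corpus concrete, here's a minimal sketch that walks per-speaker metadata and checks the prompt-length range. The folder layout and the pipe-delimited, LJSpeech-style metadata file are assumptions for illustration; the released corpus may be organized differently:

```python
import csv
from pathlib import Path

# Hypothetical layout: one folder per speaker (F1, F2, M1, M2), each holding
# a pipe-delimited metadata file mapping audio clips to their prompt text.
CORPUS = Path("bu-tts")

for speaker_dir in sorted(p for p in CORPUS.iterdir() if p.is_dir()):
    with open(speaker_dir / "metadata.csv", newline="", encoding="utf-8") as f:
        rows = [(clip, text) for clip, text in csv.reader(f, delimiter="|")]
    lengths = [len(text.split()) for _, text in rows]
    print(f"{speaker_dir.name}: {len(rows)} clips, "
          f"{min(lengths)}-{max(lengths)} words per prompt")  # expect 4-14
```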

How We Did It

The Plan

We started with text from Mozilla's Common Voice project. Our team of linguists and terminologists at Bangor had already curated tens of thousands of Welsh sentences from sources like Wikipedia, Twitter, Welsh language books, and even recipes (to fill gaps in under-represented categories).

They'd done the hard work of:

  • Segmenting longer texts into readable sentences
  • Removing offensive language
  • Updating old-fashioned vocabulary and orthography
  • Ensuring everything was appropriate for all ages

From this master list, we used the MaryTTS toolkit along with the Bangor University Pronunciation Dictionary to select prompts that would give us balanced phonetic coverage. This means we didn't just grab random sentences—we chose them specifically to ensure every Welsh sound would be represented in our dataset.

We divided these into 5 unique subsets, each individually phonetically balanced. This approach gives flexibility for creating different types of voices or using subsets for transfer learning (more on that later).
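
The paper doesn't spell out the selection algorithm, so treat the following as a sketch of the standard greedy-coverage approach, assuming a phonemize() lookup backed by something like the Bangor University Pronunciation Dictionary:

```python
from collections.abc import Callable

def pick_balanced_subsets(sentences: list[str],
                          phonemize: Callable[[str], set[str]],
                          n_subsets: int = 5,
                          size: int = 2440) -> list[list[str]]:
    """Greedily build n disjoint subsets (12,200 prompts / 5 = 2,440 each),
    each covering as many distinct phonemes as possible early on."""
    pool = {s: phonemize(s) for s in sentences}
    subsets = []
    for _ in range(n_subsets):
        covered: set[str] = set()
        chosen: list[str] = []
        while len(chosen) < size and pool:
            # Take the sentence adding the most not-yet-covered phonemes;
            # once coverage saturates, this degrades to arbitrary fill-up.
            best = max(pool, key=lambda s: len(pool[s] - covered))
            chosen.append(best)
            covered |= pool.pop(best)
        subsets.append(chosen)
    return subsets
```

A real pipeline would likely weight rare diphones and sentence length rather than raw phoneme counts, but the greedy shape stays the same.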

Recording Setup

The plan was straightforward: bring voice talents into Bangor University's language laboratories. These labs are specially built to isolate recordings from outside noise—perfect for creating clean voice data.

We built iOS and Android apps to make the recording process smooth. Talents would see prompts on screen, record themselves reading them naturally, and an API service would collect the recordings. A dashboard let us review recordings and flag any that needed re-recording.
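
We haven't reproduced the real service here, but its shape is simple. Below is a minimal sketch of such a collection endpoint using FastAPI; the framework choice, route, and field names are illustrative, not what actually ran in production:

```python
from pathlib import Path

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()
STORE = Path("recordings")

@app.post("/recordings")
async def upload_recording(prompt_id: str = Form(...),
                           speaker_id: str = Form(...),
                           audio: UploadFile = File(...)):
    """Receive one recorded prompt from the mobile app and file it by speaker."""
    dest = STORE / speaker_id
    dest.mkdir(parents=True, exist_ok=True)
    (dest / f"{prompt_id}.wav").write_bytes(await audio.read())
    return {"status": "ok", "prompt_id": prompt_id}
```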

We recruited 4 amateur voice talents from among Bangor University's students, people eager to contribute to Welsh language technology. We auditioned voices to get good diversity: 2 males (one with a north Welsh accent, one with a south Welsh accent) and 2 females (both northern at first, while we searched for a female south Welsh voice).

That was the plan.

Then COVID-19 Happened

In March 2020, the university labs closed. Our carefully designed recording setup became unavailable, and we had to pivot fast.

We sent each volunteer home with a mobile phone and a Shure MV88+ microphone. They could record in their own spaces using our apps and upload recordings remotely.

It... didn't go great.

Without a supervisor present, maintaining consistent quality was tough. Recording rates slowed. Some recordings had background noise; others had inconsistent tone or pacing. Remember, these were students, not professional voice actors; they needed guidance and direction to hit the standard we were after.

We persevered, collecting what data we could. But we also realized these noisy, sparse recordings might actually be useful for Lleisiwr, where users typically record at home without professional equipment. Every challenge is an opportunity, right?

Still, we needed more—and we needed better quality.

Bringing in the Professionals

We decided to hire a professional voice actor and a recording company to complete the dataset. We provided them with all 12,200 prompts, each tagged with an appropriate filename for organization.

The professional talent was instructed to read in a neutral style, giving punctuation (questions, exclamations) its natural emphasis. The recordings were checked for accuracy and trimmed of silence.
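
Silence trimming of this kind is a short post-processing pass; here's a sketch with librosa (the 30 dB threshold is an assumption):

```python
import librosa
import soundfile as sf

def trim_silence(in_path: str, out_path: str, top_db: float = 30.0) -> None:
    """Drop leading/trailing audio quieter than top_db below the peak."""
    y, sr = librosa.load(in_path, sr=None)   # keep the native 44.1 kHz rate
    trimmed, _ = librosa.effects.trim(y, top_db=top_db)
    sf.write(out_path, trimmed, sr)
```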

The difference was night and day. In a relatively short time, we had a complete, high-quality dataset. The professional recordings (speaker F1) form the backbone of BU-TTS, with the amateur recordings (F2, M1, M2) adding valuable accent diversity.

Making It Work

We validated the corpus by training actual TTS models using the VITS architecture (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech), a state-of-the-art neural TTS system. Because VITS is end-to-end, there's no separate acoustic model and vocoder to manage, which makes it easier to train and experiment with than older two-stage architectures.

We trained on an NVIDIA RTX 3090 GPU for about 3 days, keeping audio quality high (44.1 kHz sampling). We used graphemes (letters) rather than phonemes for training, which simplified the process while we focused on data quality.
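
The paper doesn't tie this run to a particular framework, but the open-source Coqui TTS library ships a reference VITS recipe that matches the setup described here; below is a minimal grapheme-based sketch along those lines. The dataset path, metadata format, and hyperparameters are assumptions, and the module paths follow recent Coqui releases:

```python
from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsAudioConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Assumed LJSpeech-style layout for the professional speaker's recordings.
dataset = BaseDatasetConfig(formatter="ljspeech",
                            meta_file_train="metadata.csv",
                            path="bu-tts/F1")

config = VitsConfig(
    audio=VitsAudioConfig(sample_rate=44100),  # keep the full 44.1 kHz quality
    use_phonemes=False,                        # graphemes, as in the paper
    text_cleaner="multilingual_cleaners",
    datasets=[dataset],
    batch_size=32,
    epochs=1000,
    output_path="runs/f1_vits",
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset, eval_split=True)

model = Vits(config, ap, tokenizer, speaker_manager=None)
Trainer(TrainerArgs(), config, config.output_path, model=model,
        train_samples=train_samples, eval_samples=eval_samples).fit()
```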

The results? The synthesized voices successfully demonstrated:

  • Code-switching capability: The model could read Welsh and English in the same sentence naturally
  • Better performance on longer text: News articles worked better than short phrases
  • Transfer learning potential: We showed promise in using the large dataset to bootstrap new voices from smaller amounts of data (see the sketch after this list)
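
Warm-starting a new voice from the big professional model mostly means restoring its weights before fine-tuning on the smaller speaker's data. Continuing the hypothetical Coqui setup above (the paths are illustrative):

```python
# Fine-tune from the F1 checkpoint on a smaller amateur subset.
# config, model, and the sample lists are built as in the sketch above,
# but with the dataset pointing at e.g. the M2 recordings.
from trainer import Trainer, TrainerArgs

args = TrainerArgs(restore_path="runs/f1_vits/best_model.pth")
Trainer(args, config, config.output_path, model=model,
        train_samples=train_samples, eval_samples=eval_samples).fit()
```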

We tested informally at live events and in-house, and the feedback was mostly positive. The voices work best when reading well-formed sentences, like news articles, rather than fragmented text. When a word is spelled identically in both languages, the model sometimes picks the wrong pronunciation, but overall it handles the bilingual challenge well.
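
For a feel of the inference side, here's a sketch of synthesizing the code-switched example from the introduction with Coqui's Synthesizer wrapper, again assuming the hypothetical checkpoint paths above:

```python
from TTS.utils.synthesizer import Synthesizer

synth = Synthesizer(tts_checkpoint="runs/f1_vits/best_model.pth",
                    tts_config_path="runs/f1_vits/config.json")

# The code-switched example from the introduction.
wav = synth.tts("Dwi'n mynd i Tesco heddiw.")
synth.save_wav(wav, "demo.wav")
```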

Open to Everyone

BU-TTS is released under CC0 1.0. That's Creative Commons' "no rights reserved" public-domain dedication, the most permissive option available: anyone can use the corpus for any purpose, whether commercial, research, or personal.

Why so open?

For lesser-resourced languages, open data is essential. When you're working with limited resources, every dataset you don't have to rebuild from scratch is a win. By making BU-TTS freely available, we enable:

  • Researchers to experiment with Welsh TTS without collecting their own data
  • Companies to build Welsh voice products without prohibitive upfront costs
  • Other projects to use subsets for specific purposes (regional voices, specialized domains)
  • The broader ecosystem to build on this foundation

The dataset is available on Bangor University's Language Technologies portal, ready to download and use. It's also available on Hugging Face, making it easy for machine learning practitioners to integrate into their workflows.
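
Loading from the Hub takes a couple of lines with the datasets library; the dataset identifier and column names below are assumptions, so check the Hub page for the canonical ones:

```python
from datasets import load_dataset

# Hypothetical dataset id and columns; see the BU-TTS page on the Hub.
bu_tts = load_dataset("techiaith/bu_tts", split="train")
print(bu_tts[0])  # e.g. {"audio": {...}, "text": "...", "speaker_id": "F1"}
```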

What's Next

This is the "first instalment" of BU-TTS—we plan to expand it further. Future work might include:

  • Training with phonemes instead of graphemes (potentially higher quality)
  • Conducting formal Mean Opinion Score (MOS) tests to quantitatively measure voice quality
  • Adding more speakers and accent variations
  • Creating domain-specific voices for particular applications
  • Improving the model's handling of ambiguous words

The Welsh language technology ecosystem is growing. BU-TTS joins other resources like the CorCenCC corpus (11 million annotated tokens of contemporary Welsh), Mozilla Common Voice (143+ hours of Welsh speech for speech recognition), and the Cysill Ar-lein corpus (400 million tokens of text gathered through the online spelling and grammar checker) in providing the building blocks for Welsh language AI.

Acknowledgments

This work was funded by the Welsh Government as part of the Text, Speech and Translation Technologies for the Welsh Language project. I'm grateful to my co-authors Dewi Bryn Jones and Delyth Prys for their collaboration, to the professional voice talent and amateur contributors who donated their voices, and to the wider Language Technologies Unit team at Bangor University.

If you're interested in the technical details, you can read our full paper: BU-TTS: An Open-Source, Bilingual Welsh-English, Text-to-Speech Corpus, presented at the 4th Celtic Language Technology Workshop in Marseille, June 2022.

The dataset is available for download at Bangor University's Data Portal and on Hugging Face.

Want to hear it in action? You can try voices trained on this dataset at tts.techiaith.cymru.


Building technology for lesser-resourced languages requires creativity, collaboration, and persistence—especially during a pandemic. BU-TTS represents one step forward in ensuring Welsh speakers can use their language across all forms of technology.