voice.py
Code Explained
The provided code defines the PiperVoice class, which is part of a text-to-speech (TTS) system. This class is responsible for converting text into synthesized speech using an ONNX-based machine learning model. It includes methods for loading the model, phonemizing text, mapping phonemes to numerical IDs, and generating audio in WAV format. The class is structured to handle various configurations and supports both CPU and GPU execution.
The load method is a static factory method that initializes a PiperVoice instance by loading an ONNX model and its configuration. It accepts the model path, an optional configuration path, and a flag to enable GPU acceleration (use_cuda). If the configuration path is not provided, it defaults to a JSON file with the same name as the model. The method reads the configuration file, determines the appropriate execution providers (e.g., CPU or CUDA), and creates an ONNX inference session. This session, along with the parsed configuration, is used to instantiate the PiperVoice object.
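As a rough usage sketch (the file name below is only a placeholder for whichever voice model you have, and the import path assumes the surrounding package exposes PiperVoice at its top level):

from piper import PiperVoice

# Placeholder file name; the matching "en_US-example.onnx.json" config is
# located automatically because config_path is omitted.
voice = PiperVoice.load("en_US-example.onnx", use_cuda=False)
print(voice.config.sample_rate)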
The phonemize method converts input text into phonemes, grouped by sentence. When the configured phoneme type is espeak, it calls phonemize_espeak with the configured espeak voice; for Arabic (an espeak_voice of "ar"), the text is first diacritized with tashkeel_run. When the phoneme type is text, it calls phonemize_codepoints, which works directly on the text's codepoints instead of espeak phonemes. Any other phoneme type raises a ValueError. This step breaks the text down into the building blocks of speech.
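A small sketch of the grouping, assuming the voice loaded above uses espeak phonemes (the exact phoneme strings depend on the espeak voice, so the output in the comment is only illustrative):

sentence_phonemes = voice.phonemize("Hello world. How are you?")

# One inner list per sentence, each holding phoneme strings, roughly like
# [['h', 'ə', 'l', 'ˈoʊ', ...], ['h', 'ˌaʊ', ...]]
for sentence in sentence_phonemes:
    print(sentence)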
The phonemes_to_ids method maps phonemes to their numerical IDs using the phoneme_id_map from the configuration. The sequence begins with the beginning-of-sequence (BOS) ids, each phoneme's ids are followed by the padding (PAD) ids, and the end-of-sequence (EOS) ids close the sequence. If a phoneme is missing from the mapping, a warning is logged and the phoneme is skipped, so an incomplete mapping does not halt synthesis.
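The resulting layout can be illustrated with a toy mapping (the symbols and id values below are invented for the example; real values come from the voice's JSON configuration):

# Hypothetical phoneme_id_map excerpt:
#   BOS -> [1], EOS -> [2], PAD -> [0], "a" -> [14], "b" -> [21]
# Under that mapping, the phonemes ["a", "b"] become:
#   [1,        # BOS
#    14, 0,    # "a" followed by PAD
#    21, 0,    # "b" followed by PAD
#    2]        # EOS
ids = voice.phonemes_to_ids(["a", "b"])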
The synthesize method generates WAV audio from input text. It configures the output file's sample rate (taken from the configuration), 16-bit sample width, and single mono channel, then writes the audio frames yielded by synthesize_stream_raw, which produces one chunk of audio per sentence and appends silence between sentences if requested.
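A minimal end-to-end sketch, assuming a voice has already been loaded as above:

import wave

with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize(
        "Hello from Piper. This is a second sentence.",
        wav_file,
        sentence_silence=0.2,  # 0.2 seconds of silence after each sentence
    )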
The synthesize_stream_raw method handles the conversion of text into raw audio data. It first phonemizes the text and then converts the phonemes into numerical IDs. These IDs are passed to the synthesize_ids_to_raw method, which interacts with the ONNX model to generate audio data. The method also adds silence between sentences, as determined by the sentence_silence parameter.
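Because each yielded chunk is raw 16-bit mono PCM for one sentence, the output can be consumed incrementally; this sketch simply concatenates the chunks into a single raw buffer:

raw_chunks = voice.synthesize_stream_raw(
    "First sentence. Second sentence.",
    sentence_silence=0.1,
)

# Each chunk is 16-bit mono PCM at voice.config.sample_rate.
pcm = b"".join(raw_chunks)
with open("output.raw", "wb") as raw_file:
    raw_file.write(pcm)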
Finally, the synthesize_ids_to_raw method is the core of the synthesis process. It prepares the input data for the ONNX model, including phoneme IDs, their lengths, and scaling factors for parameters like noise and length. If the model supports multiple speakers, it includes a speaker ID in the input. The ONNX model processes this input and returns raw audio data, which is converted to 16-bit integers and returned as a byte stream.
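For reference, this sketch mirrors the tensors the method builds internally for a five-id input (the id values are arbitrary placeholders):

phoneme_ids = [1, 14, 0, 21, 2]                       # placeholder ids
raw_audio = voice.synthesize_ids_to_raw(phoneme_ids)  # bytes of 16-bit PCM

# Internally the ONNX session receives:
#   input:         int64 array of shape (1, 5)  - the phoneme ids, batched
#   input_lengths: int64 array of shape (1,)    - [5]
#   scales:        float32 array of shape (3,)  - [noise_scale, length_scale, noise_w]
#   sid:           int64 array of shape (1,)    - only for multi-speaker models
num_samples = len(raw_audio) // 2                     # two bytes per 16-bit sample
duration_seconds = num_samples / voice.config.sample_rate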
Overall, the PiperVoice class provides a modular and extensible framework for text-to-speech synthesis. It supports multiple languages, phoneme types, and hardware configurations, while handling edge cases like missing phonemes or unsupported configurations. This makes it a robust and flexible solution for generating high-quality synthesized speech.
Source Code
import json
import logging
import wave
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

import numpy as np
import onnxruntime
from piper_phonemize import phonemize_codepoints, phonemize_espeak, tashkeel_run

from .config import PhonemeType, PiperConfig
from .const import BOS, EOS, PAD
from .util import audio_float_to_int16

_LOGGER = logging.getLogger(__name__)


@dataclass
class PiperVoice:
    session: onnxruntime.InferenceSession
    config: PiperConfig

    @staticmethod
    def load(
        model_path: Union[str, Path],
        config_path: Optional[Union[str, Path]] = None,
        use_cuda: bool = False,
    ) -> "PiperVoice":
        """Load an ONNX model and config."""
        if config_path is None:
            config_path = f"{model_path}.json"

        with open(config_path, "r", encoding="utf-8") as config_file:
            config_dict = json.load(config_file)

        providers: List[Union[str, Tuple[str, Dict[str, Any]]]]
        if use_cuda:
            providers = [
                (
                    "CUDAExecutionProvider",
                    {"cudnn_conv_algo_search": "HEURISTIC"},
                )
            ]
        else:
            providers = ["CPUExecutionProvider"]

        return PiperVoice(
            config=PiperConfig.from_dict(config_dict),
            session=onnxruntime.InferenceSession(
                str(model_path),
                sess_options=onnxruntime.SessionOptions(),
                providers=providers,
            ),
        )

    def phonemize(self, text: str) -> List[List[str]]:
        """Text to phonemes grouped by sentence."""
        if self.config.phoneme_type == PhonemeType.ESPEAK:
            if self.config.espeak_voice == "ar":
                # Arabic diacritization
                # https://github.com/mush42/libtashkeel/
                text = tashkeel_run(text)

            return phonemize_espeak(text, self.config.espeak_voice)

        if self.config.phoneme_type == PhonemeType.TEXT:
            return phonemize_codepoints(text)

        raise ValueError(f"Unexpected phoneme type: {self.config.phoneme_type}")

    def phonemes_to_ids(self, phonemes: List[str]) -> List[int]:
        """Phonemes to ids."""
        id_map = self.config.phoneme_id_map
        ids: List[int] = list(id_map[BOS])

        for phoneme in phonemes:
            if phoneme not in id_map:
                _LOGGER.warning("Missing phoneme from id map: %s", phoneme)
                continue

            ids.extend(id_map[phoneme])
            ids.extend(id_map[PAD])

        ids.extend(id_map[EOS])

        return ids

    def synthesize(
        self,
        text: str,
        wav_file: wave.Wave_write,
        speaker_id: Optional[int] = None,
        length_scale: Optional[float] = None,
        noise_scale: Optional[float] = None,
        noise_w: Optional[float] = None,
        sentence_silence: float = 0.0,
    ):
        """Synthesize WAV audio from text."""
        wav_file.setframerate(self.config.sample_rate)
        wav_file.setsampwidth(2)  # 16-bit
        wav_file.setnchannels(1)  # mono

        for audio_bytes in self.synthesize_stream_raw(
            text,
            speaker_id=speaker_id,
            length_scale=length_scale,
            noise_scale=noise_scale,
            noise_w=noise_w,
            sentence_silence=sentence_silence,
        ):
            wav_file.writeframes(audio_bytes)

    def synthesize_stream_raw(
        self,
        text: str,
        speaker_id: Optional[int] = None,
        length_scale: Optional[float] = None,
        noise_scale: Optional[float] = None,
        noise_w: Optional[float] = None,
        sentence_silence: float = 0.0,
    ) -> Iterable[bytes]:
        """Synthesize raw audio per sentence from text."""
        sentence_phonemes = self.phonemize(text)

        # 16-bit mono
        num_silence_samples = int(sentence_silence * self.config.sample_rate)
        silence_bytes = bytes(num_silence_samples * 2)

        for phonemes in sentence_phonemes:
            phoneme_ids = self.phonemes_to_ids(phonemes)
            yield self.synthesize_ids_to_raw(
                phoneme_ids,
                speaker_id=speaker_id,
                length_scale=length_scale,
                noise_scale=noise_scale,
                noise_w=noise_w,
            ) + silence_bytes

    def synthesize_ids_to_raw(
        self,
        phoneme_ids: List[int],
        speaker_id: Optional[int] = None,
        length_scale: Optional[float] = None,
        noise_scale: Optional[float] = None,
        noise_w: Optional[float] = None,
    ) -> bytes:
        """Synthesize raw audio from phoneme ids."""
        if length_scale is None:
            length_scale = self.config.length_scale

        if noise_scale is None:
            noise_scale = self.config.noise_scale

        if noise_w is None:
            noise_w = self.config.noise_w

        phoneme_ids_array = np.expand_dims(np.array(phoneme_ids, dtype=np.int64), 0)
        phoneme_ids_lengths = np.array([phoneme_ids_array.shape[1]], dtype=np.int64)
        scales = np.array(
            [noise_scale, length_scale, noise_w],
            dtype=np.float32,
        )

        args = {
            "input": phoneme_ids_array,
            "input_lengths": phoneme_ids_lengths,
            "scales": scales
        }

        if self.config.num_speakers <= 1:
            speaker_id = None

        if (self.config.num_speakers > 1) and (speaker_id is None):
            # Default speaker
            speaker_id = 0

        if speaker_id is not None:
            sid = np.array([speaker_id], dtype=np.int64)
            args["sid"] = sid

        # Synthesize through Onnx
        audio = self.session.run(
            None,
            args,
        )[0].squeeze((0, 1))
        audio = audio_float_to_int16(audio.squeeze())
        return audio.tobytes()