Speak, LLaMA, Speak
A Study of Speech Tokenization and Modeling Approaches
Parth Sarthi
CS224S: Spoken Language Processing
Stanford University
Abstract.
In this paper, we present SpeakLlama, an extension of the Llama 3 language model that understands and outputs speech, and we compare the performance of two speech tokenization methods: HuBERT and VQ-VAE. We train Llama 3 8B on Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and speech continuation tasks using both tokenizations. Our experiments reveal that HuBERT outperforms VQ-VAE in the transformer setting, with Llama 3 8B achieving a Word Error Rate (WER) of 24.7% on ASR, surpassing the baseline HuBERT ASR model's WER of 36.6%. As a further extension, we train diffusion models conditioned on the same tokenizations and find that VQ-VAE achieves better loss values and reconstruction quality in the diffusion setting. Our findings suggest that the best choice of tokenization method depends on the modeling architecture employed.
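The HuBERT tokenization described above is typically realized by clustering frame-level encoder features with k-means and treating cluster IDs as discrete speech tokens appended to the LLM's vocabulary. The following is a minimal sketch under that assumption; the random features stand in for real HuBERT encoder outputs, and the codebook size, vocabulary offset, and helper names are illustrative, not the paper's exact setup.

```python
# Sketch: discretize frame-level speech features into unit tokens for an LLM.
# Real features would come from a pretrained HuBERT encoder; here we use
# random vectors so the sketch is self-contained. k=100 clusters and the
# 128,000 text-vocabulary offset are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def features_to_units(features, kmeans):
    """Assign each frame's feature vector to its nearest centroid (unit ID)."""
    return kmeans.predict(features)

def deduplicate(units):
    """Collapse consecutive repeats (e.g. 5 5 5 9 -> 5 9), as is common
    when forming unit sequences for speech language modeling."""
    return [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

rng = np.random.default_rng(0)
features = rng.normal(size=(250, 768))   # ~5 s of 20 ms frames, 768-dim each
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = deduplicate(features_to_units(features, kmeans))

# Offset unit IDs past the text vocabulary so speech tokens occupy their own
# ID range when interleaved with text tokens.
TEXT_VOCAB_SIZE = 128_000  # illustrative
speech_token_ids = [TEXT_VOCAB_SIZE + u for u in units]
```

The same interface would apply to a VQ-VAE tokenizer, with the learned vector-quantization codebook replacing the k-means step.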
Overview
TTS Samples
| Text | HuBERT + Llama 3 (8B) | VQ-VAE + Llama 3 (8B) |
|---|---|---|
| And I also think about too, like if we attach it to like other things, like, uh, | *(audio sample)* | *(audio sample)* |
| For the past ten years, Conseil had gone with me wherever science beckoned. | *(audio sample)* | *(audio sample)* |
| There were only four stationers of any consequences in the town, and at each Holmes produced his pencil chips, and bid high for a duplicate. | *(audio sample)* | *(audio sample)* |