Understanding BEST-RQ: A Simple Self-Supervised Pre-training Approach for Speech Recognition #
Have you ever wished your voice assistant understood you a little better? Automatic speech recognition (ASR) technology is constantly evolving, but achieving high accuracy can be complex. Today, we’ll explore an approach called BEST-RQ (BERT-based Speech pre-Training with Random-projection Quantizer), introduced by Google researchers in 2022, that has made waves in the field.
This technique borrows the masked-prediction idea behind BERT, a popular language model, and adapts it to speech recognition. BEST-RQ goes a step further by introducing a simple, scalable pre-training method that lets the model learn effectively from vast amounts of unlabelled data, leading to significant improvements in accuracy.
Intrigued? In this short blog post, we’ll break down the key concepts behind BEST-RQ and how it’s transforming speech recognition. We’ll explore how it simplifies the learning process and unlocks the potential for even more accurate and robust ASR systems in the future!
Problem #
Speech recognition aims to convert spoken language into written text. Traditional methods rely on labeled data, where spoken audio is paired with its corresponding written text, but collecting such data at scale is expensive and time-consuming.
One common design principle of self-supervised learning for speech recognition centers around learning representations. Inspired by the success of BERT (Devlin et al., 2018), a powerful technique in natural language processing, one research trend in the speech community is to build BERT-inspired algorithms.
BERT, however, is not directly applicable to speech because speech lacks discrete tokens: we can’t feed a continuous speech signal straight into a BERT-style model. This creates a need to bridge the gap between continuous speech signals and discrete, text-like tokens, and the usual solution is to learn a speech representation that can be discretized.
While speech representation learning is crucial, integrating it with self-supervised learning presents two challenges.
- Model Architecture Limitation
The model must excel at both representation learning and the downstream task, but optimal representation learning might not translate to efficient downstream processing (e.g., accessing future context for representation vs. low-latency requirements for recognition).
- Increased Design Complexity
The misaligned objectives and intricate design process for these combined algorithms can hinder research progress, potentially favoring complex solutions over simpler alternatives.
The core problem BEST-RQ tackles is: how can we train a speech recognition model effectively without relying on a large amount of labeled data or needing to break speech down into discrete units like words, while keeping the whole approach simple enough to scale to new, larger speech model architectures?
Occam’s Razor for ML: Simplest solution often wins [Ref. Image]
BEST-RQ’s Solution #
BEST-RQ tackles the above problem by offering a compelling alternative. It introduces a novel technique of self-supervised training using a combination of Random Projection Quantizer (RPQ) and Masked Language Modeling (MLM).
Fig 1: Overview of BEST-RQ. The approach applies random projections to project the input speech signals to a randomly initialized codebook, and map them to discrete labels through finding the nearest vector in the codebook. The pre-training objective is for the ASR encoder to take the masked input signals and predict the labels corresponding to the masked part provided by the random-projection quantizer. Figure taken from Ref. [1]
Random Projection Quantizer (RPQ) #
This is the heart of BEST-RQ and the core innovation that bridges the gap between continuous speech and the discrete world BERT thrives in. RPQ has two key components, a projection matrix and a codebook, both randomly initialized and never updated during training.
Projection matrix #
This projects the speech features (a numerical representation of the speech signal) into a lower dimension. The framework is described in Figure 1. This matrix is of size h × d, where:
- d is the dimensionality of the original speech features (typically high, like hundreds or thousands).
- h is the target dimensionality after projection (usually much lower than d).
Codebook #
- Put simply, this is a collection of n code vectors, each of size h. These vectors represent the discrete code space.
- The size n of the codebook is a hyperparameter that can be tuned for the specific task and dataset.
The projection matrix A uses Xavier initialization (Glorot & Bengio, 2010) and the codebook C is initialized from a standard normal distribution. Both are fixed during the pre-training process, so the quantization stays consistent throughout training.
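As a concrete illustration, here is a minimal PyTorch sketch of how these frozen quantizer parameters could be set up. The sizes below are illustrative placeholders, not necessarily the paper’s exact configuration.

```python
import torch

# Illustrative sizes; not necessarily the exact configuration used in the paper.
d = 320     # dimensionality of the (stacked) input speech features
h = 16      # projected dimensionality, i.e. the codebook vector size
n = 8192    # number of code vectors in the codebook

# Projection matrix A: Xavier (Glorot) initialization, shape h x d.
A = torch.empty(h, d)
torch.nn.init.xavier_uniform_(A)

# Codebook C: n code vectors drawn from a standard normal distribution.
C = torch.randn(n, h)

# Both are frozen: they never receive gradient updates during pre-training.
A.requires_grad_(False)
C.requires_grad_(False)
```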
Given an input vector x, where x is a d-dimensional vector computed from the speech signal, the random-projection quantizer maps x to a discrete label y through

y = argmin_i ‖ norml2(c_i) − norml2(A x) ‖

where:
- A denotes a randomly initialized h × d matrix,
- C = {c1, …, cn} is a set of randomly initialized h-dimensional vectors,
- norml2() is a function that normalizes a vector to have unit L2 norm.
The input data is normalized to have zero mean and a standard deviation of one. This normalization is critical for preventing the random projection from collapsing to a small subset of codes.
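The quantization step then looks roughly like the following sketch, which assumes the A and C tensors from the previous snippet; rpq_labels is a hypothetical helper name, not something taken from the paper’s codebase.

```python
import torch
import torch.nn.functional as F

def rpq_labels(x, A, C):
    """Map normalized speech feature frames to discrete codebook indices.

    x: (num_frames, d) input features, already normalized per dimension
       to zero mean and unit standard deviation.
    A: (h, d) frozen random projection matrix.
    C: (n, h) frozen random codebook.
    Returns a (num_frames,) tensor of integer labels in [0, n).
    """
    projected = F.normalize(x @ A.T, dim=-1)   # norml2(A x), shape (num_frames, h)
    codes = F.normalize(C, dim=-1)             # norml2(c_i), shape (n, h)
    distances = torch.cdist(projected, codes)  # pairwise L2 distances, (num_frames, n)
    return distances.argmin(dim=-1)            # index of the nearest code vector
```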
As shown in Figure 1, the BEST-RQ algorithm masks the speech signals (explained below) and feeds them to the encoder part of the speech recognition model.
The encoder learns to predict the masked regions based on the unmasked speech signals, where the learning targets are the labels provided by the RPQ.
The RPQ projects the speech signals with a randomly initialized matrix and finds the nearest vector in a randomly initialized codebook; the index of that vector is the target label.
Masked Language Modeling (MLM) #
Similar to how BERT works with text, BEST-RQ uses MLM for training: segments of the speech input are masked and replaced with noise.
The approach applies masks directly on the speech signal, where the masking strategy samples at every frame whether to start a mask with a fixed probability. Each mask spans a fixed length from its starting frame. The masked parts are replaced with noise sampled from a normal distribution with mean 0 and standard deviation 0.1.
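A rough sketch of that masking strategy might look like the following; mask_features is a hypothetical helper, and the probability and span length are placeholder values rather than the paper’s exact settings.

```python
import torch

def mask_features(x, mask_prob=0.01, mask_span=40, noise_std=0.1):
    """Mask random spans of speech frames and replace them with noise.

    x: (num_frames, d) speech features.
    At every frame, a mask start is sampled with probability mask_prob;
    each mask covers mask_span consecutive frames. Masked frames are
    replaced with noise drawn from N(0, noise_std^2).
    Returns the masked features and a boolean tensor of masked positions.
    """
    num_frames, _ = x.shape
    starts = torch.rand(num_frames) < mask_prob          # per-frame Bernoulli mask starts
    masked = torch.zeros(num_frames, dtype=torch.bool)
    for t in starts.nonzero(as_tuple=True)[0].tolist():
        masked[t : t + mask_span] = True                 # fixed-length span from each start
    x_masked = x.clone()
    x_masked[masked] = torch.randn_like(x_masked[masked]) * noise_std
    return x_masked, masked
```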
The model, typically a Transformer architecture, is tasked with predicting the masked parts based on the surrounding context. Instead of predicting words like BERT, the model predicts the labels (codebook indices) of the masked speech using the RPQ predictions as targets.
The pre-training process adds a softmax layer on top of the ASR encoder to learn to predict the quantized speech labels.
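Putting the pieces together, a hedged sketch of this pre-training objective could look like the code below; encoder, encoder_dim, softmax_head, and pretrain_loss are illustrative names, the sizes are placeholders, and the snippet reuses the hypothetical rpq_labels and mask_features helpers from the earlier sketches.

```python
import torch
import torch.nn.functional as F

n = 8192            # codebook size, i.e. number of classes for the softmax head
encoder_dim = 512   # illustrative encoder output dimension
softmax_head = torch.nn.Linear(encoder_dim, n)   # added on top of the ASR encoder

def pretrain_loss(encoder, x, A, C):
    """BEST-RQ-style pre-training objective on a single utterance (sketch)."""
    labels = rpq_labels(x, A, C)            # targets from the frozen quantizer
    x_masked, masked = mask_features(x)     # mask spans of the input features
    hidden = encoder(x_masked)              # (num_frames, encoder_dim) encoder outputs
    logits = softmax_head(hidden)           # (num_frames, n) scores over the codebook
    # Cross-entropy computed only on the masked positions.
    return F.cross_entropy(logits[masked], labels[masked])
```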
Key Point: Frozen and Independent RPQ
Unlike other self-supervised methods, RPQ (projection matrix and codebook) is randomly initialized and not trained during the process. This removes the need for the model to learn the intricacies of the codebook, allowing it to focus solely on capturing meaningful speech representations.
Since the random-projection quantizer is independent of the ASR encoder, the pre-training is flexible and can work with different architectures of the ASR encoder.
Benefits of this approach #
- Performance
BEST-RQ achieves competitive results compared to other methods, even with lower-latency streaming models suited to real-time applications. It shows similar WERs to the existing state-of-the-art results on LibriSpeech with non-streaming models, and it outperforms wav2vec 2.0 and w2v-BERT on LibriSpeech with streaming models. On multilingual tasks with non-streaming models, the approach also provides significant improvements over wav2vec 2.0 and w2v-BERT.
- Flexibility
The model architecture is independent of the RPQ design: neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design is flexible and compatible with a wide range of speech recognition architectures.
- Simplicity and Focus on Core Task
By avoiding a learned quantizer, BEST-RQ spends less time grappling with representation learning and can concentrate on the core task: predicting the masked parts of the speech using the labels provided by the codebook.
Here’s an analogy: Imagine you’re lost in a giant forest (high-dimensional speech data) and need to find a specific landmark (desired speech representation).
Traditional methods: You meticulously learn the entire forest layout (complex representation learning) to navigate and find the landmark.
BEST-RQ approach: You’re given a helicopter (random projection matrix) that takes you high above the forest (lower dimension) and a pre-defined map (codebook) with landmarks marked. You simply find the location that looks most similar to your current view (nearest codebook vector) — faster and with less effort!
Further Exploration #
- Research is ongoing to understand how well RPQ captures speech information compared to learned quantizers.
- Exploring different types of quantizers and experimental designs might yield further improvements.
Thank you very much for reading, and I hope this clarified a few notions for people who are just getting started with the world of speech model pre-training. Please feel free to suggest edits if you find any mistakes!
References #
- Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition by Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu & Yonghui Wu
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee & Kristina Toutanova
- Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser & Illia Polosukhin
- Understanding the Difficulty of Training Deep Feedforward Neural Networks by Xavier Glorot & Yoshua Bengio
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed & Michael Auli
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training by Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang & Yonghui Wu