Researchers are tackling the challenge of accurately translating rare and domain-specific terminology in simultaneous speech translation (SST). Jiaxuan Luo of Johns Hopkins University, together with Siqi Ouyang and Lei Li of Carnegie Mellon University and colleagues, presents a novel framework, RASST, that integrates cross-modal retrieval into the SST pipeline. The research is significant because current speech large language models often falter on specialised vocabulary; RASST addresses this by efficiently supplying relevant terminology hints during the incremental translation process. By training a lightweight speech-text retriever and synthesising targeted training data, the team demonstrate improvements of up to 16% in terminology translation accuracy and 3 BLEU points in overall translation quality on the ACL 60/60 dev set.
This research introduces Retrieval-Augmented Simultaneous Speech Translation, or RASST, a system designed to integrate cross-modal retrieval directly into the SST pipeline.
The team achieved this by training a lightweight speech-text retriever and employing efficient sliding-window retrieval to provide relevant terminology hints to a Speech LLM. Crucially, the researchers also synthesised training data to enable the Speech LLM to effectively utilise these retrieved terms during incremental translation generation.
The study addresses a key limitation of current SST systems, which often struggle with accurate translation of specialised language, including technical jargon and proper nouns. RASST overcomes this by rapidly and accurately matching incoming speech to textual glossary entries, even with partial and continuously arriving input.
Experiments conducted on the ACL 60/60 dev set, across three language directions (English to Chinese, German, and Japanese), demonstrate substantial improvements. Specifically, RASST enhances terminology translation accuracy by up to 16% and boosts overall translation quality by as much as 3 BLEU points.
This breakthrough relies on a novel architecture, as depicted in Figure 1, where a streaming speech encoder processes audio chunks and a lightweight retriever identifies relevant glossary terms. The Speech LLM then leverages both the speech features and retrieved terms, alongside previously generated translations, to produce the next segment of translated text.
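The incremental loop described above can be sketched in a few lines. This is a toy illustration under my own assumptions: the function names and signatures are illustrative stand-ins, not the authors' actual components.

```python
def translate_stream(chunks, encode, retrieve, generate, glossary):
    """Sketch of a RASST-style incremental loop: at each step, encode the
    new fixed-duration speech chunk, retrieve glossary hints for it, then
    let the Speech LLM extend the translation conditioned on the
    accumulated speech features, the hints, and the partial translation.

    encode / retrieve / generate are hypothetical stand-ins for the
    streaming speech encoder, the lightweight retriever, and the Speech LLM.
    """
    speech_features = []   # encoder output accumulated over chunks
    translation = []       # target-side segments generated so far
    for chunk in chunks:
        speech_features.append(encode(chunk))
        hints = retrieve(speech_features[-1], glossary)
        translation.append(generate(speech_features, hints, translation))
    return translation
```

Plugging in trivial stand-ins shows the key property of the loop: exactly one translation segment is emitted per incoming chunk, each conditioned on everything received so far.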
By carefully designing the training data and retrieval mechanism, the researchers ensured the LLM learns to selectively apply retrieved terms at the optimal moment during the incremental translation process. The work opens avenues for more accurate and fluent SST systems, with potential applications in international conferences, cross-lingual conversations, and real-time communication.
Cross-modal terminology retrieval for incremental speech translation remains a challenging research problem
Scientists developed Retrieval-Augmented Simultaneous Speech Translation (RASST) to improve terminology handling in streaming translation scenarios. The study pioneers a tight integration of cross-modal retrieval into the SST pipeline, addressing the challenge of translating rare and domain-specific terms incrementally from partial speech input.
Researchers trained a lightweight speech-text retriever and employed efficient sliding-window retrieval to provide chunkwise terminology hints to a Speech LLM. The team engineered a system where, at each step, a fixed-duration speech chunk is received and processed. A retriever selects a subset of relevant terminology entries from a glossary, conditioned on the accumulated speech input and previously generated translations.
This process leverages a FAISS index built from encoded glossary terms, enabling efficient cosine similarity retrieval. Experiments employed a sliding window of length W with stride δ over incoming speech chunks to compute speech features and identify the top K1 nearest glossary terms. Scientists then aggregated candidates from all windows within a newly arrived chunk, retaining the top K2 terms with the highest cosine similarities.
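The two-stage top-K1/top-K2 aggregation can be sketched as follows. This is a minimal NumPy sketch standing in for the FAISS index (an inner-product index over L2-normalised vectors computes cosine similarity the same way); function names are my own, not the paper's API.

```python
import numpy as np

def build_term_index(term_vectors):
    """L2-normalise glossary-term embeddings so that inner product equals
    cosine similarity (the same trick a FAISS inner-product index relies on)."""
    v = np.asarray(term_vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def retrieve_terms(window_features, index, k1, k2):
    """Take the top-k1 glossary terms for each sliding window in the newly
    arrived chunk, then keep the top-k2 candidates overall by their best
    cosine similarity. Returns glossary indices, best match first."""
    q = np.asarray(window_features, dtype=np.float32)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    sims = q @ index.T                        # (num_windows, num_terms)
    best = {}                                 # term id -> best similarity
    for row in sims:
        for term_id in np.argsort(row)[-k1:]:  # top-k1 in this window
            term_id = int(term_id)
            best[term_id] = max(best.get(term_id, -np.inf), float(row[term_id]))
    ranked = sorted(best, key=lambda t: -best[t])
    return ranked[:k2]
```

In practice the normalised term vectors would live in a FAISS index so each window query is a single nearest-neighbour search rather than a dense matrix product.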
The study synthesised training data to teach the Speech LLM to precisely leverage these retrieved terms during simultaneous translation. The retriever encodes each glossary term e_i into a feature vector f^e_i, while the streaming speech encoder maps each incoming chunk to speech features. This approach enables the model to produce a partial translation y_i at each step, assigning a delay of i × m to all of its tokens, where m is the chunk duration, i.e. the elapsed time at which speech chunk s_i becomes available.
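The delay bookkeeping is simple enough to state as a small helper. The function below is my own sketch (name and signature are assumptions), taking m to be the fixed chunk duration in seconds.

```python
def token_delays(tokens_per_step, chunk_duration):
    """Every token of the partial translation y_i emitted after chunk s_i
    gets delay i * m: the audio time elapsed when s_i became available.

    tokens_per_step[i-1] holds the number of tokens in y_i;
    chunk_duration is m, the fixed chunk length in seconds.
    """
    delays = []
    for i, n_tokens in enumerate(tokens_per_step, start=1):
        delays.extend([i * chunk_duration] * n_tokens)
    return delays
```

These per-token delays are what latency metrics for simultaneous translation are computed from.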
Experiments on three language directions of the ACL 60/60 dev set demonstrated that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablation studies confirming the contribution of each component. This innovative methodology directly addresses the limitations of existing offline translation approaches by integrating retrieval directly into the incremental, streaming translation process.
RASST enhances terminology translation accuracy across English-to-Chinese, German and Japanese directions
Scientists achieved up to a 16% improvement in terminology translation accuracy using a new system called RASST, demonstrated through experiments on three language directions: English to Chinese, English to German, and English to Japanese. The research, conducted on the ACL 60/60 development set, reveals that RASST consistently surpasses the InfiniSST baseline in both terminology recall and overall translation quality across all languages tested.
Data shows that RASST increases overall translation quality by up to 3 BLEU points, indicating a significant enhancement in the fluency and accuracy of translated speech. Experiments measured terminology accuracy using both a tagged glossary and a glossary automatically extracted from research papers, consistently showing RASST’s superior performance.
Results demonstrate that RASST achieves over a 10% absolute gain in terminology translation accuracy at nearly all latency levels, with peak improvements reaching 16%. The team measured BLEU scores to assess general translation quality, recording at least a 2-point increase on English to Chinese translations across various latency settings.
The study details that the retriever component introduces negligible computational overhead compared to Speech LLM inference, with the largest observed overhead, 0.16, occurring at the smallest chunk size. Tests show that a sliding-window length of 1.92 seconds optimises retriever recall: most term spans in the training data are short, averaging approximately 0.6 seconds, with the 99th percentile reaching only 1.5 seconds.
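The window-sizing argument amounts to checking that the window covers nearly all term spans. A quick sketch with synthetic span durations makes this concrete; the sampled numbers below are illustrative only, not the paper's data, though the distribution is chosen to roughly match the reported 0.6 s mean.

```python
import numpy as np

# Synthetic term-span durations in seconds (illustrative only):
# gamma(shape=4, scale=0.15) has mean 0.6 s, similar to the reported average.
rng = np.random.default_rng(0)
spans = rng.gamma(shape=4.0, scale=0.15, size=10_000)

p99 = float(np.percentile(spans, 99))  # ~1.5 s for this distribution
window_length = 1.92                    # seconds, as chosen in the paper

# The 1.92 s window comfortably covers the 99th-percentile span, so almost
# every term falls entirely within at least one retrieval window.
covers = window_length >= p99
```

The same check, run on the real span-duration statistics, is what justifies the 1.92-second window in the study.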
Measurements confirm that RASST outperforms offline speech translation, highlighting the importance of incorporating domain-specific glossaries for reliable terminology translation, as speech context alone is often insufficient. The breakthrough delivers a system capable of fast and accurate cross-modal translation, even with partial and continually arriving speech input.
Terminology accuracy gains via retrieval augmented speech translation are significant
Scientists have developed a retrieval-augmented simultaneous speech translation (RASST) framework to improve the accuracy of domain-specific terminology translation. This new approach tightly integrates cross-modal retrieval into the speech translation pipeline, utilising a lightweight speech-text retriever to provide chunkwise terminology hints to a Speech LLM.
The researchers also synthesised training data designed to help the Speech LLM effectively leverage these terms during incremental translation. Experiments conducted on the ACL 60/60 dev set demonstrate that RASST enhances terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points.
Ablation studies confirmed the contribution of each component within the framework. Furthermore, the system remained effective even when using glossaries automatically extracted from conference papers, suggesting potential for practical application without the need for manually curated term lists. The authors acknowledge that performance diminishes with challenging negative examples, indicating a need for improved supervision techniques. Future work could explore more sophisticated methods for generating hard negatives and refining the noise-robust training process to further enhance the system’s resilience to retrieval errors.