Deploying Speech-to-Speech on Hugging Face

Speech-to-Speech (S2S) is a new project from Hugging Face that chains several models together into one seamless experience: you speak, and the system responds with a synthesized voice.

The project implements a cascaded pipeline built from models available through the Transformers library on the Hugging Face Hub. The pipeline consists of the following components, each feeding its output to the next:

  1. Voice Activity Detection (VAD)
  2. Speech to Text (STT)
  3. Language Model (LM)
  4. Text to Speech (TTS)
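To make the cascade concrete, here is a minimal sketch of how the four stages hand data to one another. It is not the project's actual code: every function name below (detect_speech, transcribe, generate_reply, synthesize, run_pipeline) is hypothetical, and each stage is a trivial placeholder standing in for a real model.

```python
from typing import List, Optional

# All functions below are hypothetical placeholders, not the S2S project's API.
# Each one stands in for a real model so the data flow can be run and inspected.

def detect_speech(audio: List[float], threshold: float = 0.1) -> Optional[List[float]]:
    """VAD stand-in: pass the chunk on only if some sample exceeds the threshold."""
    return audio if any(abs(sample) > threshold for sample in audio) else None

def transcribe(audio: List[float]) -> str:
    """STT stand-in: a real system would run a speech recognition model here."""
    return "hello"

def generate_reply(text: str) -> str:
    """LM stand-in: a real system would prompt a language model here."""
    return f"You said: {text}"

def synthesize(text: str) -> List[float]:
    """TTS stand-in: emits one fake audio sample per character."""
    return [ord(ch) / 255.0 for ch in text]

def run_pipeline(audio: List[float]) -> Optional[List[float]]:
    """Run VAD -> STT -> LM -> TTS; return None when no speech is detected."""
    speech = detect_speech(audio)
    if speech is None:
        return None  # silence: the later stages never run
    text = transcribe(speech)
    reply = generate_reply(text)
    return synthesize(reply)

if __name__ == "__main__":
    print(run_pipeline([0.0, 0.0]))        # silent input is dropped by the VAD
    print(len(run_pipeline([0.0, 0.5])))   # number of fake samples in the reply
```

The point of the sketch is the shape of the cascade: the VAD gates everything downstream, and each later stage only sees the output of the one before it.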

What’s more, S2S has multi-language support! It currently supports English, French, Spanish, Chinese, Japanese, and Korean. You can run the pipeline in single-language mode