Cover image generated by DALL·E 3. Prompt: A person and a machine are engaged in two-way communication through a microphone and speakers. The person, standing on the left, speaks into the microphone while the machine on the right, resembling a sleek, futuristic robot, responds through speakers. The setting is a modern, well-lit room with a professional atmosphere. The person looks focused and engaged, and the machine's digital display shows sound waves indicating speech.
Introduction to Voice Interaction Systems
A voice interaction system consists of three stages: automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech synthesis (TTS). ASR plays the role of the human auditory system, NLP the language centers of the brain, and TTS the vocal apparatus.
How to Build a Voice Dialogue Bot
This article builds a voice dialogue bot entirely from open-source components.
ASR uses OpenAI Whisper, which supports both Chinese and English. For more technical details, see the earlier post 《跟著 Whisper 學(xué)說正宗河南話》;
NLP uses DeepSeek v2. Since a local deployment needs more GPU resources than we have, this step calls the cloud API instead (a minimal sketch of calling it directly follows this list);
TTS uses ChatTTS, a text-to-speech model designed specifically for dialogue scenarios that supports both English and Chinese.
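Since DeepSeek exposes an OpenAI-compatible API, the NLP step can also be called directly with the openai Python SDK (version 1.x, as installed in the environment listed below) instead of the llm_api wrapper used in the WebUI code later on. This is only a minimal sketch; the DEEPSEEK_API_KEY environment variable and the ask_deepseek helper are illustrative names, not part of the original code:

import os
from openai import OpenAI  # openai>=1.x

# DeepSeek's endpoint follows the OpenAI chat-completions protocol
client = OpenAI(api_key=os.environ['DEEPSEEK_API_KEY'],
                base_url='https://api.deepseek.com')

def ask_deepseek(question: str) -> str:
    # Hypothetical helper: one user turn in, one assistant reply out
    resp = client.chat.completions.create(
        model='deepseek-chat',
        messages=[{'role': 'user', 'content': question}],
    )
    return resp.choices[0].message.content

print(ask_deepseek('Describe a voice interaction system in one sentence.'))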
The interactive UI built with Gradio for this article looks like this:
You capture audio from the system microphone, transcribe it to text with Whisper, send it to the DeepSeek v2 API, synthesize the reply with ChatTTS, and click play to hear the bot's voice.
Hardware: RTX 3060 with 12 GB of VRAM
Software environment (Miniconda3 + Python 3.8.19):
pip list
Package Version
----------------------------- --------------
absl-py 2.0.0
accelerate 0.25.0
aiofiles 23.2.1
aiohttp 3.8.6
aiosignal 1.3.1
altair 5.1.2
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
anyio 4.0.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
astunparse 1.6.3
async-lru 2.0.4
async-timeout 4.0.3
attrs 23.1.0
audioread 3.0.1
Babel 2.15.0
backcall 0.2.0
backports.zoneinfo 0.2.1
beautifulsoup4 4.12.3
bitarray 2.8.2
bitsandbytes 0.41.1
bleach 6.1.0
blinker 1.6.3
cachetools 5.3.1
cdifflib 1.2.6
certifi 2023.7.22
cffi 1.16.0
charset-normalizer 2.1.1
click 8.1.7
colorama 0.4.6
comm 0.2.2
contourpy 1.1.1
cpm-kernels 1.0.11
cycler 0.12.1
Cython 3.0.3
debugpy 1.8.1
decorator 5.1.1
defusedxml 0.7.1
distro 1.9.0
dlib 19.24.2
edge-tts 6.1.8
editdistance 0.8.1
einops 0.8.0
einx 0.2.2
encodec 0.1.1
exceptiongroup 1.1.3
executing 2.0.1
face-alignment 1.4.1
fairseq 0.12.2
faiss-cpu 1.7.4
fastapi 0.108.0
fastjsonschema 2.19.1
ffmpeg 1.4
ffmpeg-python 0.2.0
ffmpy 0.3.1
filelock 3.12.4
Flask 2.1.2
Flask-Cors 3.0.10
flatbuffers 23.5.26
fonttools 4.43.1
fqdn 1.5.1
frozendict 2.4.4
frozenlist 1.4.0
fsspec 2023.9.2
future 0.18.3
gast 0.4.0
gitdb 4.0.10
GitPython 3.1.37
google-auth 2.23.3
google-auth-oauthlib 1.0.0
google-pasta 0.2.0
gradio 4.32.2
gradio_client 0.17.0
grpcio 1.59.0
h11 0.14.0
h5py 3.10.0
httpcore 0.18.0
httpx 0.25.0
huggingface-hub 0.23.2
hydra-core 1.0.7
idna 3.4
imageio 2.31.5
importlib-metadata 6.8.0
importlib-resources 6.1.0
inflect 7.2.1
ipykernel 6.29.4
ipython 8.12.3
ipywidgets 8.1.3
isoduration 20.11.0
itsdangerous 2.1.2
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
json5 0.9.25
jsonpointer 2.4
jsonschema 4.19.1
jsonschema-specifications 2023.7.1
jupyter 1.0.0
jupyter_client 8.6.2
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.10.0
jupyter-lsp 2.2.5
jupyter_server 2.14.1
jupyter_server_terminals 0.5.3
jupyterlab 4.2.1
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.2
jupyterlab_widgets 3.0.11
keras 2.13.1
kiwisolver 1.4.5
langdetect 1.0.9
latex2mathml 3.77.0
lazy_loader 0.3
libclang 16.0.6
librosa 0.9.1
llvmlite 0.41.0
loguru 0.7.2
lxml 4.9.3
Markdown 3.5
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.7.3
matplotlib-inline 0.1.7
mdtex2html 1.2.0
mdurl 0.1.2
mistune 3.0.2
more-itertools 10.1.0
mpmath 1.3.0
multidict 6.0.4
nbclient 0.10.0
nbconvert 7.16.4
nbformat 5.10.4
nemo_text_processing 1.0.2
nest-asyncio 1.6.0
networkx 3.1
notebook 7.2.0
notebook_shim 0.2.4
numba 0.58.0
numpy 1.22.4
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.14.1
onnxoptimizer 0.3.13
onnxsim 0.4.33
openai 1.6.1
openai-whisper 20230918
opencv-python 4.8.1.78
opt-einsum 3.3.0
orjson 3.9.9
overrides 7.7.0
packaging 23.2
pandas 2.0.3
pandocfilters 1.5.1
parso 0.8.4
peft 0.7.1
pickleshare 0.7.5
Pillow 10.0.1
pip 24.0
pkgutil_resolve_name 1.3.10
platformdirs 3.11.0
playsound 1.3.0
pooch 1.7.0
portalocker 2.8.2
praat-parselmouth 0.4.3
prometheus_client 0.20.0
prompt_toolkit 3.0.45
protobuf 4.25.1
psutil 5.9.5
pure-eval 0.2.2
pyarrow 13.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
PyAudio 0.2.12
pycparser 2.21
pydantic 2.5.3
pydantic_core 2.14.6
pydeck 0.8.1b0
pydub 0.25.1
Pygments 2.16.1
pynini 2.1.5
pynvml 11.5.0
PyOpenGL 3.1.7
pyparsing 3.1.1
python-dateutil 2.8.2
python-json-logger 2.0.7
python-multipart 0.0.9
pytz 2023.3.post1
PyWavelets 1.4.1
pywin32 306
pywinpty 2.0.13
pyworld 0.3.0
PyYAML 6.0.1
pyzmq 26.0.3
qtconsole 5.5.2
QtPy 2.4.1
referencing 0.30.2
regex 2023.10.3
requests 2.32.3
requests-oauthlib 1.3.1
resampy 0.4.2
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.6.0
rpds-py 0.10.4
rsa 4.9
ruff 0.4.7
sacrebleu 2.3.1
sacremoses 0.1.1
safetensors 0.4.3
scikit-image 0.18.1
scikit-learn 1.3.1
scikit-maad 1.3.12
scipy 1.7.3
semantic-version 2.10.0
Send2Trash 1.8.3
sentencepiece 0.1.99
setuptools 69.5.1
shellingham 1.5.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
sounddevice 0.4.5
SoundFile 0.10.3.post1
soupsieve 2.5
sse-starlette 1.8.2
stack-data 0.6.3
starlette 0.32.0.post1
streamlit 1.29.0
sympy 1.12
tabulate 0.9.0
tenacity 8.2.3
tensorboard 2.13.0
tensorboard-data-server 0.7.1
tensorboardX 2.6.2.2
tensorflow 2.13.0
tensorflow-estimator 2.13.0
tensorflow-intel 2.13.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor 2.3.0
terminado 0.18.1
threadpoolctl 3.2.0
tifffile 2023.7.10
tiktoken 0.3.3
timm 0.9.12
tinycss2 1.3.0
tokenizers 0.19.1
toml 0.10.2
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.0
torch 2.1.0+cu121
torchaudio 2.1.0+cu121
torchcrepe 0.0.22
torchvision 0.16.0+cu121
tornado 6.3.3
tqdm 4.63.0
traitlets 5.14.3
transformers 4.41.2
transformers-stream-generator 0.0.4
trimesh 4.0.0
typeguard 4.3.0
typer 0.12.3
types-python-dateutil 2.9.0.20240316
typing_extensions 4.12.0
tzdata 2023.3
tzlocal 5.1
uri-template 1.3.0
urllib3 2.2.1
uvicorn 0.25.0
validators 0.22.0
vector_quantize_pytorch 1.14.8
vocos 0.1.0
watchdog 3.0.0
wcwidth 0.2.13
webcolors 1.13
webencodings 0.5.1
websocket-client 1.8.0
websockets 11.0.3
Werkzeug 3.0.0
WeTextProcessing 0.1.12
wget 3.2
wheel 0.43.0
widgetsnbextension 4.0.11
win32-setctime 1.1.0
wrapt 1.15.0
yarl 1.9.2
zipp 3.17.0
The WebUI code is shown below (for now it only demonstrates the basic functionality and is fairly rough):
import gradio as gr
from transformers import pipeline
import numpy as np
from ChatTTS.experimental.llm import llm_api
import ChatTTS

# Load the ChatTTS models once at startup
chat = ChatTTS.Chat()
chat.load_models(compile=False)  # set to True for faster inference

API_KEY = 'sk-xxxxxxxx'  # apply for your own key at https://platform.deepseek.com/api_keys
client = llm_api(api_key=API_KEY,
                 base_url='https://api.deepseek.com',
                 model='deepseek-chat')

# Whisper-based ASR pipeline from Hugging Face transformers
transcriber = pipeline('automatic-speech-recognition', model='openai/whisper-base')

def transcribe(audio):
    sr, y = audio
    # Normalize microphone samples to [-1, 1]; guard against silent (all-zero) input
    y = y.astype(np.float32)
    peak = np.max(np.abs(y))
    if peak > 0:
        y /= peak
    # 1) Speech -> text with Whisper
    user_question = transcriber({'sampling_rate': sr, 'raw': y})['text']
    # 2) Text -> reply with the DeepSeek chat API
    text = client.call(user_question, prompt_version='deepseek')
    # 3) Reply -> speech with ChatTTS (24 kHz output)
    wav = chat.infer(text, use_decoder=True)
    audio_data = np.array(wav[0]).flatten()
    sample_rate = 24000
    return (sample_rate, audio_data)

demo = gr.Interface(
    transcribe,
    gr.Audio(sources=['microphone']),
    'audio',
)

demo.launch()
On top of this, more features could be added:
The ASR stage currently uses only openai/whisper-base; a model selector could be offered on the page (see the first sketch after this list);
The DeepSeek v2 API is called with its default settings; extra parameters such as temperature and a system prompt could be exposed on the page (also covered by the first sketch);
ChatTTS offers controls such as speaker identity, pauses, and laughter for richer output (see the second sketch after this list);
Streaming conversation with natural interruption, like GPT-4o.
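For the first two ideas, here is a minimal UI-wiring sketch: a dropdown to pick the Whisper checkpoint, a slider for temperature, and a textbox for the system prompt. The extra parameters of transcribe and the get_transcriber helper are illustrative; the function body is left as a placeholder for the same ASR -> LLM -> TTS flow shown above:

import gradio as gr
from transformers import pipeline
from functools import lru_cache

ASR_CHOICES = ['openai/whisper-base', 'openai/whisper-small', 'openai/whisper-medium']

@lru_cache(maxsize=2)
def get_transcriber(model_name):
    # Build (and cache) one ASR pipeline per selected Whisper checkpoint
    return pipeline('automatic-speech-recognition', model=model_name)

def transcribe(audio, asr_model, temperature, system_prompt):
    # Same ASR -> LLM -> TTS flow as the demo above, except the Whisper
    # checkpoint, temperature, and system prompt now come from the UI.
    ...

demo = gr.Interface(
    transcribe,
    inputs=[
        gr.Audio(sources=['microphone']),
        gr.Dropdown(ASR_CHOICES, value='openai/whisper-base', label='ASR model'),
        gr.Slider(0.0, 1.5, value=1.0, step=0.1, label='temperature'),
        gr.Textbox(value='You are a helpful voice assistant.', label='system prompt'),
    ],
    outputs='audio',
)
demo.launch()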
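For the ChatTTS controls, the ChatTTS README (around the version used here) describes sampling a speaker embedding and passing text-refinement control tokens (oral_x, laugh_x, break_x). The second sketch below adapts the chat.infer call from the WebUI code above; it assumes the chat object and the text variable defined there, and exact method names may differ between ChatTTS versions:

# Sample one speaker embedding so the bot keeps a consistent voice across turns
rand_spk = chat.sample_random_speaker()

params_infer_code = {
    'spk_emb': rand_spk,   # speaker identity
    'temperature': 0.3,    # sampling temperature for the audio code model
}
params_refine_text = {
    # control tokens: oral_(0-9), laugh_(0-2), break_(0-7)
    'prompt': '[oral_2][laugh_0][break_6]',
}

wav = chat.infer(text,
                 params_refine_text=params_refine_text,
                 params_infer_code=params_infer_code,
                 use_decoder=True)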
If you run into trouble setting up the environment, you can send me a private message to get the complete project.