AI Speech Generation: Qwen3-TTS

Preface

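While browsing the web recently, I noticed that Alibaba has released another speech generation model, Qwen3-TTS. My first reaction was mild confusion: Alibaba had already released the CosyVoice speech model, so why put out two models that, as far as I can tell, do the same thing? How good is the new Qwen3-TTS, and how does it differ? With those questions in mind, let's try it out and see.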

OS: Windows 10

GPU: NVIDIA RTX 3060 Ti

Python: conda, Python 3.12

[Image: cartoon-style overview of Qwen3-TTS features]

Installation & First Run

As usual, start by cloning the repository locally:

git clone https://github.com/QwenLM/Qwen3-TTS

This time I'm using conda to get an isolated environment:

Download Anaconda Distribution | Anaconda

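After downloading, click through the installer with the defaults. Once it finishes, add conda to the PATH environment variable as well (I personally don't like using the Anaconda Prompt command line).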

For example:

G:\conda
G:\conda\Library\bin
G:\conda\Scripts

With PATH configured, you can create a dedicated Python environment for Qwen3-TTS:

# Create and activate the environment
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

# Install dependencies
cd Qwen3-TTS
pip install -e .
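
Before going further, it can help to confirm that the editable install actually landed in the active environment. Below is a minimal sketch using only the standard library. The module names `qwen_tts`, `torch`, and `soundfile` are the ones imported later in this post; the helper `is_installed` is my own hypothetical name, not part of the project.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the imports used later in this post.
    for name in ("qwen_tts", "torch", "soundfile"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

If `qwen_tts` shows up as missing, the most likely cause is that `pip install -e .` ran in a different environment than the one currently activated.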

After installation, check whether SoX is available on your machine. If it isn't, download it as well, otherwise you'll get an error at runtime.

https://sourceforge.net/projects/sox/

Once downloaded, add it to the PATH environment variable in the same way.
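
A quick way to confirm that the PATH changes took effect is to ask Python where it finds each executable. This is a small sketch, assuming the SoX binary is named `sox` and that `conda` should also be reachable from a plain command line. `missing_tools` is a hypothetical helper name of my own.

```python
import shutil


def missing_tools(names):
    """Return the subset of executable names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names are assumptions based on the setup steps above.
    missing = missing_tools(["conda", "sox"])
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All tools found on PATH.")
```

Remember to open a new terminal after editing PATH, since an already-open shell keeps the old value.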

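The torch that pip pulls in as a dependency by default is the CPU-only build. To speed up inference, replace it with a CUDA build that works with the 3060 Ti: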

# Uninstall the CPU build
pip uninstall -y torch torchvision torchaudio

# Install the CUDA 12.1 build
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

After installing, check that it works:

import torch
print("torch:", torch.__version__)
print("cuda in torch:", torch.version.cuda)   # None means a CPU-only build
print("is_available:", torch.cuda.is_available())

# Output
torch: 2.5.1+cu121
cuda in torch: 12.1
is_available: True
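
If `is_available` comes back `False`, a hard-coded `cuda:0` will fail when the model loads. The sketch below picks a device string at runtime and falls back to CPU. `pick_device` is a hypothetical helper of my own, and I'm assuming the model-loading call used later in this post accepts `"cpu"` for `device_map` the same way it accepts `"cuda:0"`; I haven't verified that, and CPU inference will be much slower in any case.

```python
def pick_device() -> str:
    """Return "cuda:0" when a CUDA-enabled torch is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed at all; nothing can run on the GPU.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("selected device:", pick_device())
```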

Now let's try generating some speech:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

# single inference
wavs, sr = model.generate_custom_voice(
    text="真的学不动了啊，我好想打游戏呀", # "I really can't study anymore, I just want to play games."
    language="Chinese", # Pass `Auto` (or omit) for auto language adaptive; if the target language is known, set it explicitly.
    speaker="Vivian",
    instruct="用撒娇的语气说", # "Say it in a cute, pleading tone." Omit if not needed.
)
sf.write("output_custom_voice20261.wav", wavs[0], sr)
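
To sanity-check the result without opening an audio player, you can read the file header back with the standard library. This is a rough sketch, assuming the file was written as plain PCM WAV (which I believe is the `soundfile` default for `.wav`, though I haven't checked it for this setup). The `wave` module cannot read float-encoded WAV files. `wav_info` is a hypothetical helper name.

```python
import wave


def wav_info(path):
    """Return channel count, sample rate, and duration in seconds of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }
```

For example, `wav_info("output_custom_voice20261.wav")` should report a duration of a few seconds for the short sentence above. A duration near zero usually means generation failed silently.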