I used to take notes with voice input on iPhone or Android, but the recognition error rate is pretty high, and if I don't fix the text right away, after a while even I can't remember what I actually recorded. Voice memos are a nice fallback since you can always replay the audio, but the only phones with speech-to-text good enough to rely on are the Pixel and Samsung, and even their transcripts aren't easy to export and manage. What I really want is to take notes directly on my Apple Watch and have the transcribed text exported and managed automatically.
Recently OpenAI released Whisper, a speech recognition model that solves exactly this problem. This post was inspired by https://piszek.com/2022/10/23/voice-memos-whisper/, with quite a few changes for my own setup.
First, turn on iCloud sync, so voice memos from the phone are automatically synced to this folder on the Mac:
/Users/USERNAME/Library/Application Support/com.apple.voicememos/Recordings/
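To sanity-check that the sync worked, here is a minimal sketch that lists the synced recordings, newest first. The folder path and the .m4a extension are from above; the helper name `list_memos` is my own:

```python
from pathlib import Path

# Voice Memos recordings synced via iCloud (resolves USERNAME automatically)
RECORDINGS = Path.home() / "Library/Application Support/com.apple.voicememos/Recordings"

def list_memos(folder):
    # Voice Memos stores each recording as an .m4a file; sort newest first
    return sorted(folder.glob("*.m4a"), key=lambda p: p.stat().st_mtime, reverse=True)

for p in list_memos(RECORDINGS):
    print(p.name)
```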
The files in that folder are named by ID rather than by the titles I gave the memos. Fortunately there is a SQLite database in the same folder that stores the mapping between file names and memo titles, which we can read with Python:
import sqlite3

conn = sqlite3.connect('/Users/USERNAME/Library/Application Support/com.apple.voicememos/Recordings/CloudRecordings.db')
c = conn.cursor()
# ZPATH is the recording's file name, ZCUSTOMLABEL is the memo title
c.execute("SELECT ZPATH, ZCUSTOMLABEL FROM ZCLOUDRECORDING")
for (filePath, fileComment) in c.fetchall():
    print(filePath, fileComment)
conn.close()
Now we can transcribe. There are several ways to do this. One is through Hugging Face transformers:
from transformers import WhisperProcessor, pipeline

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
# device = torch.device("mps")  # MPS doesn't work yet, see note at the end
# chunk_length_s lets the pipeline handle audio longer than 30 seconds
speech_recognizer = pipeline("automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30)
# Force Chinese transcription instead of letting the model guess
speech_recognizer.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
result = speech_recognizer("/Users/USERNAME/Desktop/output.aac")["text"]
print(result)
The drawback of this approach is that the Hugging Face library doesn't handle Voice Memos' m4a format, so you first have to convert the file to aac with ffmpeg or a similar tool.
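A minimal conversion sketch, assuming ffmpeg is installed and on PATH. The helper names and flags are my own choices: `-y` overwrites an existing output file, and `-c:a aac` selects the aac encoder:

```python
import subprocess
from pathlib import Path

def ffmpeg_cmd(src, dst):
    # -y: overwrite dst if it exists; -c:a aac: encode audio with the aac codec
    return ["ffmpeg", "-y", "-i", str(src), "-c:a", "aac", str(dst)]

def to_aac(src):
    # Write the converted file next to the original, e.g. memo.m4a -> memo.aac
    src = Path(src)
    dst = src.with_suffix(".aac")
    subprocess.run(ffmpeg_cmd(src, dst), check=True)
    return dst
```

For example, `to_aac("memo.m4a")` writes `memo.aac` alongside the original and returns its path.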
You can also call the Hugging Face Inference API directly. The upside is that it is very fast, running on their CPU/GPU-accelerated servers; the downside is that you can't specify the language, and a Chinese memo gets automatically translated into English (though I have to admit the translation is quite good). Below is the English output followed by the original Chinese transcript; in other words, somewhat choppy spoken Chinese came out as very fluent English, which is remarkable:
"Then brother also recommended that the Lost Coast Trail, but it was to go to the draw in October of each year, and then, yeah, it's very popular. It will also happen in the summer, and then you have to be careful of the tide. Oh, that's good. Then recommend this. We went to the Deep Sea and let's go back to the Deep Raven. There is a very special narrow tree tunnel, and then up there is the nature's ranger station. The view is particularly good."
然后大哥也很推荐那个Lost Coast Trail但就是得每年的10月份去抽气然后对就是很popular夏天也会遇到下雨的情况然后可能对就要注意潮汐什么的别的还好然后推荐这个我们走着Deep Sea让我们回来了走那个Deep Raven上去然后说有个特别纳维的Treat Tunnel然后上面上去有Nation Ranger Station View特别好然后我们要录录回来OK
And it has the same requirement of converting to aac first:
curl --location --request POST 'https://api-inference.huggingface.co/models/openai/whisper-small' \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: audio/x-aac' \
--data-binary '@/Users/USERNAME/Desktop/output.aac'
Also, their hosted model has a "warm-up" period: after sitting idle for a while it will return a 503, so you can add a small hack to retry:
import json
import time

import requests

API_URL = "https://api-inference.huggingface.co/models/openai/whisper-small"
headers = {"Authorization": "Bearer TOKEN"}

def query(filename):
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    if response.status_code == 503:
        # Model is cold; wait for it to load, then retry once
        time.sleep(40)
        response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))
The most reliable method is running the model locally:
import whisper

# filePath is a recording from the Voice Memos folder above
args = {}
temperature = 0

model = whisper.load_model("small")
# Detect the language from the first 30 seconds of audio
audio = whisper.load_audio(filePath)
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
detected_language = max(probs, key=probs.get)
print(f"Detected language: {detected_language}")
# Then transcribe the whole file in the detected language
args["language"] = detected_language
result = model.transcribe(filePath, temperature=temperature, **args)
Personally I think the small model is good enough; the larger models cost much more time but the results aren't much better. Compare:
Small model:
whisper 20220302\ 081656-C887C68F.m4a --language Chinese
[00:00.000 --> 00:07.160] 这个Scotman赛的和那个don't worry and start living life有一些个预计同工之庙就是说
[00:07.160 --> 00:27.160] 就是 make a plan instead of worry
Medium model:
whisper 20220302\ 081656-C887C68F.m4a --language Chinese --model medium
100%|█████████████████████████████████████| 1.42G/1.42G [01:17<00:00, 19.8MiB/s]
[00:00.000 --> 00:11.400] Scoutman said和Don't Worry and Start Living Life有一些遇体同工之妙,就是Make a plan instead of worry
Large model:
whisper 20220302\ 081656-C887C68F.m4a --language Chinese --model large
100%|█████████████████████████████████████| 2.87G/2.87G [02:15<00:00, 22.9MiB/s]
[00:00.000 --> 00:07.000] 这个Scott Monsad和那个Don't Worry and Start Living Life有一些个一体同工之妙
[00:07.000 --> 00:32.000] 就是Make a Plan Instead of Worry
In theory PyTorch supports GPU acceleration on M1 Macs, but in my testing it doesn't actually work here; CUDA is battle-tested, while MPS is only good for running benchmarks for now: https://github.com/pytorch/pytorch/issues/82645
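For when MPS matures, here is a hedged sketch of a device-selection helper I'd use; the function name is my own, and it relies on `torch.cuda.is_available()` and `torch.backends.mps.is_available()` (the latter available since PyTorch 1.12):

```python
import torch

def pick_device():
    # Prefer CUDA; fall back to Apple's MPS backend only if it is present
    # and reports available; otherwise stay on CPU
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```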
The complete code is here: https://github.com/yuntianlong2002/whisper-transcribe/blob/main/transcribe.py
The feature that keeps showing up in Tang Wei's latest movie is exactly this: raise your wrist and dictate a memo into your watch.