第八阶段：多模态 Agent | AI Learning Path

[日期] 预计学习时间：15-20 小时

8.1 视觉理解

现代 LLM 都支持图像输入。Agent 可以"看"截图、文档、UI 界面。

8.2 音频与语音

8.3 实时语音 Agent

[!] 腾讯混元特别关注多模态，招聘"Hunyuan Multimodal Algorithm Researcher"

学习资料清单

DeepLearning.AI《Multimodal RAG》：https://www.deeplearning.ai/short-courses/multimodal-rag-llamaindex/
openai/whisper — 开源语音识别：https://github.com/openai/whisper