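Part 7: The Era of Large Vision Models

The road from multimodal foundation models toward visual AGI

Part Overview

Vision-language models (VLMs) were among the most significant developments in computer vision during 2023-2024. This part covers:

Multimodal foundation models (CLIP, BLIP, LLaVA)
Frontier large vision models (Florence-2, GPT-4V, Gemini)
Recent progress in 3D vision and video understanding

Why study large vision models?

Paradigm shift: from single-task models to unified multimodal models
Zero-shot capability: new tasks can be handled without task-specific training
Industry adoption: these models are reshaping how computer vision applications are built
Research frontier: widely regarded as an important path toward AGI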
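Chapter Organization

Chapter 16: Multimodal Foundation Models

Core topics: the three foundation models CLIP, BLIP, and LLaVA

16.1 CLIP: Vision-Language Contrastive Learning
- Contrastive learning principles and the dual-encoder architecture
- Zero-shot classification and image retrieval
- Hands-on with the transformers library
16.2 The BLIP Family: Visual Question Answering
- BLIP-2 architecture: the Q-Former design
- Image captioning and VQA tasks
- Quantization and deployment
16.3 LLaVA: Large Language Model + Vision
- Visual instruction tuning
- Multimodal dialogue systems
- What is new in LLaVA 1.5/1.6
16.4 In Practice: Multimodal Understanding Applications
- Product image search
- Customer-service chatbot
- Image content moderation

Tech stack: transformers, torch, PIL, accelerate

Code files:

code/clip_zero_shot.py - CLIP zero-shot classification
code/blip2_vqa.py - BLIP-2 visual question answering
code/llava_chat.py - LLaVA multimodal dialogue
code/multimodal_app.py - combined application example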

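Chapter 17: Frontier Large Vision Models

Core topics: production-grade VLMs and commercial APIs

17.1 Florence-2: Microsoft's Vision Foundation Model
- Unified prompt-based paradigm
- More than 10 vision tasks
- Open source and commercially usable (MIT license)
17.2 GPT-4V/GPT-4o: Multimodal GPT
- Calling the Vision API
- Prompt engineering techniques
- Real-world use cases
17.3 Gemini Vision: Google's Multimodal Model
- Gemini 1.5 vs 2.0
- Native multimodality
- Video understanding
17.4 In Practice: Calling VLM APIs
- Document understanding and OCR
- Video analysis
- Cost optimization

Tech stack: openai, google-genai, anthropic

Code files:

code/florence2_demo.py - Florence-2 multi-task demo
code/gpt4v_api.py - GPT-4V API calls
code/gemini_vision.py - using Gemini Vision
code/vlm_comparison.py - VLM performance comparison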

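Chapter 18: 3D Vision and Video Understanding

Core topics: extending from 2D to 3D/4D

18.1 NeRF: Neural Radiance Fields
- Implicit 3D representation
- Volume rendering
- Acceleration with Instant-NGP
18.2 Gaussian Splatting: A New Paradigm for 3D Reconstruction
- Explicit 3D Gaussian representation
- Real-time rendering
- Comparison with NeRF
18.3 Video Understanding: Video Classification and Detection
- TimeSformer, VideoMAE
- Video VLMs (Video-LLaVA)
- Temporal action detection
18.4 In Practice: A 3D Reconstruction Project
- From phone captures to a 3D model
- Scene reconstruction pipeline
- Web-based visualization

Tech stack: torch, trimesh, open3d, gradio

Code files:

code/nerf_basic.py - basic NeRF implementation
code/gaussian_splatting.py - 3DGS demo
code/video_vlm.py - video understanding model
code/3d_reconstruction.py - 3D reconstruction pipeline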

Technology Roadmap

Traditional CV      Multimodal base      Large vision models    Future directions
  |                    |                    |                      |
ImageNet          CLIP(2021)         Florence-2(2024)      Visual AGI
ResNet            BLIP(2022)         GPT-4V(2023)          Embodied AI
Detection    -->  LLaVA(2023)   -->  Gemini(2023)    -->   World models
Segmentation      BLIP-2(2023)       GPT-4o(2024)          Multimodal reasoning

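Environment Setup

Base dependencies

# Core libraries (quote version specifiers so the shell does not treat ">" as a redirect)
pip install "transformers>=4.35.0" torch torchvision
pip install accelerate bitsandbytes  # quantization and acceleration
pip install pillow requests datasets

# API clients
pip install "openai>=1.0.0"
pip install google-genai  # provides the `from google import genai` client used in Chapter 17
pip install anthropic

# 3D / video
pip install open3d trimesh
pip install opencv-python decord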

GPU Requirements

Model	Minimum VRAM	Recommended VRAM	Quantization
CLIP	2GB	4GB	-
BLIP-2	6GB	12GB	int8/int4
LLaVA-7B	14GB	24GB	4-bit quantization
Florence-2	4GB	8GB	float16

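Study Advice

Learning path

Week 1-2: Chapter 16, multimodal foundations
- Understand contrastive learning
- Learn to use CLIP and BLIP
- Complete a zero-shot classification experiment
Week 3: Chapter 17, frontier VLMs
- Learn the unified Florence-2 paradigm
- Practice calling commercial APIs
- Compare the performance of different VLMs
Week 4: Chapter 18, 3D and video
- Understand how NeRF and 3DGS work
- Try a 3D reconstruction project
- Explore video VLM applications

Recommended practice projects

Beginner: CLIP image search engine
Intermediate: LLaVA multimodal customer-service bot
Advanced: Florence-2 general-purpose vision assistant
Expert: 3D scene reconstruction system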

References

Essential papers

CLIP: Learning Transferable Visual Models From Natural Language Supervision (ICML 2021)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (ICML 2023)
LLaVA: Visual Instruction Tuning (NeurIPS 2023)
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (CVPR 2024)
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (ECCV 2020)
3DGS: 3D Gaussian Splatting for Real-Time Radiance Field Rendering (SIGGRAPH 2023)

Open-source projects

Hugging Face Transformers - unified interface for VLMs
LLaVA Official - visual instruction tuning
Florence-2 Demo - Microsoft's official model
Nerfstudio - NeRF toolkit
Gaussian Splatting - official implementation

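Key Technology Comparison

VLM selection guide

Model	Open source	Parameters	Strengths	Typical use
CLIP	✅	0.4B	Strong zero-shot classification	Image search and retrieval
BLIP-2	✅	4B	Strong VQA performance	Visual question answering, captioning
LLaVA	✅	7-13B	Strong dialogue ability	Multimodal assistants
Florence-2	✅	0.77B	Unified multi-task model	General-purpose vision API
GPT-4V	❌	-	Strong overall capability	Complex reasoning, document understanding
Gemini	❌	-	Native multimodality, video understanding	Long-video analysis, multimodal generation

Matching models to scenarios

E-commerce search: CLIP (search by image) + Florence-2 (product attribute extraction)
Customer service: LLaVA (multi-turn dialogue) + GPT-4V (hard questions)
Content moderation: Florence-2 (fast detection) + Gemini (video moderation)
Document understanding: GPT-4V (tables/charts) + Florence-2 (OCR)
3D reconstruction: NeRF/3DGS (scene reconstruction) + VLM (semantic understanding)

What this part offers

Full-stack coverage: from open-source models to commercial APIs
Runnable code: examples target recent library versions
Practice-oriented: every chapter includes a complete application case
Performance comparison: evaluation figures and cost analysis
Up to date: covers developments through 2024

Study tips:

Large vision models move quickly; follow Hugging Face and new arXiv papers
Commercial APIs (GPT-4V/Gemini) are paid; you can start with the open-source models
The 3D vision material is compute-intensive; run it on a GPU

Next: start with Chapter 16: Multimodal Foundation Models and learn the core techniques behind CLIP, BLIP, and LLaVA.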

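Chapter 16: Multimodal Foundation Models

CLIP, BLIP, LLaVA - bridges between vision and language

Chapter Overview

Multimodal foundation models are the bedrock of the large-vision-model era. This chapter covers three milestone models:

CLIP: OpenAI's contrastive learning approach, which made zero-shot vision practical
The BLIP family: Salesforce's models for captioning and visual question answering
LLaVA: an early and influential combination of a large language model with a vision encoder

By the end of the chapter you will understand the core principles of these models, how to use them, and where they apply.

16.1 CLIP: Vision-Language Contrastive Learning

Core Idea

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning to map images and text into a shared semantic space, which enables zero-shot classification.

Key innovations:

Trained on 400 million image-text pairs
Dual-encoder architecture (Image Encoder + Text Encoder)
A contrastive loss that aligns vision and language

Model Architecture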

Image Input                Text Input
     |                          |
Image Encoder              Text Encoder
(ViT-L/14)                (Transformer)
     |                          |
  Image                      Text
Embedding                 Embedding
     |                          |
     +--------Cosine Sim--------+
              (Similarity Score)

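Technical details:

Image Encoder: Vision Transformer (ViT-L/14; roughly 0.3B parameters, about 0.4B for the full model including the text encoder)
Text Encoder: Transformer with masked self-attention
Training objective: InfoNCE contrastive loss
Output dimension: 768-dimensional embedding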

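To make the InfoNCE objective concrete, here is a minimal pure-Python sketch of a symmetric contrastive loss over a small batch. It is an illustration under stated assumptions, not CLIP's training code: the function names are hypothetical, embeddings are plain lists, and the temperature is fixed at 0.07 (in CLIP it is a learned parameter that the paper initializes to 0.07).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one row of logits."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair k of images and texts is the only positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

# Toy batch: matched pairs give a low loss, mismatched pairs a high one
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.6f}, shuffled={shuffled:.3f}")
```

The loss is low only when each image is most similar to its own caption, which is what pushes the two encoders toward a shared embedding space.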
Using the transformers library

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image

# Load the model (weights are downloaded on first use)
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

# Prepare the inputs
image = Image.open("image.jpg").convert("RGB")
text_candidates = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

# Tokenize the text and preprocess the image
inputs = processor(
    text=text_candidates,
    images=image,
    return_tensors="pt",
    padding=True
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image  # shape (1, 3)
    probs = logits_per_image.softmax(dim=1)      # shape (1, 3)

# Results
print(f"Probabilities: {probs[0].tolist()}")
predicted_label = text_candidates[probs.argmax(dim=1).item()]
print(f"Predicted class: {predicted_label}")

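How Zero-Shot Classification Works

CLIP can classify into new categories without any additional training:

1. Prompt engineering: turn each class name into a descriptive sentence

# Basic prompt
texts = [f"a photo of a {label}" for label in class_names]

# Prompt ensembling (usually improves accuracy)
templates = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of a {}",
    # ... the CLIP paper ensembles 80 templates for ImageNet
]

2. Similarity: compute the cosine similarity between the image embedding and every text embedding
3. Softmax normalization: convert the scaled similarities into a probability distribution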
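The three steps above can be sketched end to end with toy vectors standing in for real CLIP embeddings. This is a hedged illustration: the helper names and the 3-dimensional embeddings are made up, and the logit scale of 100 is an assumption that approximates the value CLIP's learned temperature reaches after training.

```python
import math

def normalize(vec):
    """Return the unit-length version of a vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def class_prototype(template_embeddings):
    """Prompt ensembling: average the normalized template embeddings of one class."""
    normed = [normalize(e) for e in template_embeddings]
    dim = len(normed[0])
    mean = [sum(e[d] for e in normed) / len(normed) for d in range(dim)]
    return normalize(mean)

def zero_shot_probs(image_embedding, prototypes, logit_scale=100.0):
    """Cosine similarity to each class prototype, scaled, then softmax."""
    img = normalize(image_embedding)
    sims = [sum(a * b for a, b in zip(img, p)) for p in prototypes]
    return softmax([logit_scale * s for s in sims])

# Hypothetical text embeddings for two templates per class
cat_templates = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
dog_templates = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]
prototypes = [class_prototype(cat_templates), class_prototype(dog_templates)]

probs = zero_shot_probs([1.0, 0.2, 0.0], prototypes)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(["cat", "dog"][predicted], probs)
```

With a real model, the template embeddings would come from `get_text_features` and the image embedding from `get_image_features`; the arithmetic afterwards is the same.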

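Performance

Dataset	Accuracy	Notes
ImageNet	76.2%	zero-shot (top-1)
CIFAR-10	94.9%	zero-shot
CIFAR-100	77.4%	zero-shot
Oxford-Pets	93.8%	zero-shot

For comparison: a supervised ResNet-50 needs the roughly 1.28 million labeled ImageNet training images to reach about 76% top-1.

Use Cases

Visual product search: align images with product descriptions
Content moderation: zero-shot detection of inappropriate content
Image tagging: generate labels automatically
Cross-modal retrieval: find images from text, or text from images

Hands-on Code

See code/chapter16_multimodal/clip_zero_shot.py for a complete zero-shot classification example, including:

Multi-class classification
Custom prompt templates
Batch image processing
Result visualization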

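16.2 The BLIP Family: Visual Question Answering

BLIP-2 Architecture

BLIP-2 (Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models) is a vision-language model released by Salesforce in 2023.

Core innovation: the Q-Former

Frozen Image       Q-Former        Frozen LLM
Encoder          (Learnable)      (OPT-2.7B)
   |                 |                 |
 ViT-g      32 Query Tokens       Language
(~1B)            (188M)            Model
   |                 |                 |
   +-------Cross Attention---------+
                     |
                Text Output

Three main strengths:

Parameter-efficient: only the Q-Former (about 188M parameters) is trained; the image encoder and the LLM stay frozen
General-purpose: supports image captioning, VQA, and dialogue
Strong results: the BLIP-2 paper reports better zero-shot VQAv2 accuracy than Flamingo (80B parameters) with far fewer trainable parameters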

Model Variants

Variant	Parameters	Base LLM	VRAM (approx.)
blip2-opt-2.7b	4B	OPT-2.7B	14GB
blip2-flan-t5-xl	4B	Flan-T5	15GB
blip2-opt-6.7b	8B	OPT-6.7B	26GB

Usage Examples

Image captioning:

from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

# Load the model (float16 recommended)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate a caption
image = Image.open("image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(f"Caption: {caption}")

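Visual question answering (VQA):

# Ask a question; the OPT-based checkpoints work best with the "Question: ... Answer:" prompt format
question = "Question: What is the color of the car? Answer:"
inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, torch.float16)

# Generate the answer
generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(f"Answer: {answer}")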

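Quantization

8-bit quantization (roughly halves weight memory):

from transformers import BitsAndBytesConfig

# Passing a BitsAndBytesConfig is the current way to request 8-bit loading
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
# VRAM: 14GB -> 7GB (approximate)

4-bit quantization (roughly a quarter of the weight memory):

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    quantization_config=quantization_config,
    device_map="auto"
)
# VRAM: 14GB -> 3.5GB (approximate)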

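Benchmarks

Task	Dataset	BLIP-2	Flamingo-80B
VQA	VQAv2 (accuracy, %)	82.2	82.0
Image captioning	COCO (CIDEr)	144.5	138.1
Visual reasoning	NLVR2 (accuracy, %)	85.3	84.0

Hands-on Code

See code/chapter16_multimodal/blip2_vqa.py for a complete BLIP-2 visual question answering example:

Image captioning
Multi-turn question answering
Quantized loading and performance comparison
Batch processing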

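16.3 LLaVA: Large Language Model + Vision

Model Overview

LLaVA (Large Language and Vision Assistant) is an open-source visual dialogue model introduced in 2023 by researchers at the University of Wisconsin-Madison, Microsoft Research, and Columbia University.

Core idea: connect a vision encoder to a large language model through a lightweight projection, then fine-tune on multimodal instruction data generated with GPT-4.

Architecture

Image Input
    |
Vision Encoder ────────> Projection Layer ────> LLM
(CLIP ViT-L/14)         (Linear Layer)      (Vicuna-7B/13B)
    |                         |                    |
1024D Patch Features ──> 4096D Embedding  ──> Text Generation

Key components:

Vision Encoder: pretrained CLIP ViT-L/14 (frozen); its 1024-dimensional patch features are used, not the 768-dimensional pooled CLIP embedding
Projection Layer: a single linear layer in the original LLaVA (1024→4096 for the 7B model); LLaVA 1.5 replaces it with a two-layer MLP
LLM: Vicuna-7B/13B, fully fine-tuned in stage 2 (LoRA is available as a cheaper option)

Training Procedure

Two-stage training:

Stage 1: feature alignment (pre-training)
- Data: 595K image-text pairs (filtered from CC3M)
- Trained: the projection layer only
- Goal: map visual features into the LLM embedding space
Stage 2: instruction tuning
- Data: 158K multimodal instruction samples (generated with GPT-4)
- Trained: projection layer + LLM (full fine-tuning, or LoRA to save memory)
- Goal: improve dialogue and reasoning ability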

Using the transformers library

Official LLaVA 1.5 checkpoint:

from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import torch

# Load the model
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load the image that the {"type": "image"} placeholder refers to
image = Image.open("image.jpg").convert("RGB")

# Build the conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe what is in this image."}
        ]
    }
]

# Generate the reply
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output[0], skip_special_tokens=True)
print(response)

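LLaVA 1.5 vs 1.6

Feature	LLaVA 1.5	LLaVA 1.6 (NeXT)
Release	October 2023	January 2024
Base LLM	Vicuna	Vicuna/Mistral/Nous-Hermes-2-Yi
Image resolution	336×336	up to 672×672
Multi-image support	❌	✅
MMBench score	67.7	72.3

Recommendation: use LLaVA 1.5 in production (more mature tooling) and try 1.6 for research.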

4-bit Quantized Deployment

from transformers import BitsAndBytesConfig

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Load the quantized model
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)
# VRAM: about 14GB in float16 (28GB in float32) -> about 5GB with 4-bit weights
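The memory figures follow from simple arithmetic on the weights: parameters times bits per parameter. The sketch below is a back-of-the-envelope estimate only; the function name is hypothetical, and it ignores the vision tower, activations, the KV cache, and quantization metadata, which is why a 4-bit 7B model needs closer to 5GB in practice than the 3.5GB computed here.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8

# Weight-only estimates for a 7B-parameter language model
estimates = {
    "float32": weight_memory_gb(7, 32),
    "float16": weight_memory_gb(7, 16),
    "int8": weight_memory_gb(7, 8),
    "4-bit": weight_memory_gb(7, 4),
}

for precision, gigabytes in estimates.items():
    print(f"{precision:>8}: {gigabytes:.1f} GB")
```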

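Benchmarks

Task	Dataset	LLaVA 1.5	GPT-4V
Visual question answering	VQAv2	78.5	77.2
Visual reasoning	GQA	62.0	-
Multimodal benchmark	MMBench	67.7	75.1
OCR	TextVQA	58.2	78.0

Highlight: although open source, LLaVA approaches or even exceeds the closed GPT-4V on some tasks, while trailing clearly on OCR-heavy ones.

Application Examples

Image assistant

questions = [
    "What objects are in the image?",
    "How are these objects positioned relative to each other?",
    "Based on the image, what kind of scene is this likely to be?"
]

Visual content moderation

prompt = "Decide whether this image contains inappropriate content and explain why."

Education support

prompt = "This image shows a math problem. Solve it and explain the steps."

Hands-on Code

See code/chapter16_multimodal/llava_chat.py for a LLaVA multimodal dialogue system:

Single-turn and multi-turn dialogue
Image understanding and reasoning
Quantized deployment
Gradio UI integration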

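16.4 In Practice: Multimodal Understanding Applications

Project 1: Product Image Search Engine

Requirement: the user uploads a product photo and the system returns similar products.

Approach:

# 1. Build the image index with CLIP (images = pixel_values from CLIPProcessor)
image_features = model.get_image_features(pixel_values=images)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
# Store the normalized vectors in a vector database (FAISS/Milvus)

# 2. At query time, encode the query image and search
query_features = model.get_image_features(pixel_values=query_image)
query_features = query_features / query_features.norm(dim=-1, keepdim=True)
# FAISS expects float32 NumPy arrays and returns (scores, indices)
query_np = query_features.detach().cpu().numpy().astype("float32")
scores, similar_indices = faiss_index.search(query_np, 10)

# 3. Generate a product description with BLIP-2
# (blip2_inputs is the processor output, as in the captioning example in 16.2)
generated_ids = blip2_model.generate(**blip2_inputs, max_new_tokens=50)

Full code: code/chapter16_multimodal/multimodal_app.py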
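To show what the vector index is doing underneath, here is a brute-force cosine top-k search in plain Python. It is a sketch with hypothetical names and toy 2-dimensional vectors; a library such as FAISS computes the same ranking far more efficiently on real CLIP embeddings.

```python
import math

def normalize(vec):
    """Unit-normalize so that the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_index(vectors):
    """Normalize every catalog embedding once, at indexing time."""
    return [normalize(v) for v in vectors]

def search(index, query, k=3):
    """Return the k (score, item_id) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, v)), item_id) for item_id, v in enumerate(index)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy catalog of four "image embeddings"
catalog = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
index = build_index(catalog)

hits = search(index, [1.0, -0.1], k=2)
top_ids = [item_id for _, item_id in hits]
print(top_ids, [round(score, 3) for score, _ in hits])
```

Normalizing at indexing time means each query costs one dot product per catalog item, which is the same trick an inner-product FAISS index relies on.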

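Project 2: Customer-Service Chatbot

Requirement: the user sends a product photo plus a text question, and the AI answers.

Approach:

# Handle the multimodal input with LLaVA
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How do I use this product?"}
        ]
    }
]
# llava_model.chat is a hypothetical wrapper around the
# apply_chat_template + generate calls shown in 16.3
response = llava_model.chat(conversation)

Enhancements:

Multi-turn conversation memory
Product knowledge-base retrieval (RAG)
Sentiment analysis and intent detection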

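Project 3: Image Content Moderation

Requirement: automatically detect inappropriate content (violent, sexual, or politically sensitive).

Model cascade:

def moderate(image):
    # 1. Fast first pass with CLIP (zero-shot)
    # clip_classify is a helper that returns one probability per label
    labels = ["normal content", "violent content", "sexual content", "politically sensitive content"]
    probs = clip_classify(image, labels)

    # 2. Accept or reject directly when confidence is high
    if probs.max() > 0.95:
        return labels[int(probs.argmax())]

    # 3. Otherwise escalate to LLaVA for a detailed analysis
    return llava_analyze(image, "Analyze in detail whether this image contains inappropriate content.")

Performance estimate:

CLIP pass: 10 ms per image (GPU), run on every image
LLaVA pass: 500 ms per image (needed for only about 5% of images)
Average latency: about 35 ms per image (10 + 0.05 × 500)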
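The average latency of such a cascade is a short expected-value calculation: every image pays for the fast model, and only the escalated fraction also pays for the slow one. The sketch below uses a hypothetical helper name and the figures quoted above (10 ms, 500 ms, 5%); under this simple additive model they work out to 35 ms.

```python
def cascade_latency_ms(fast_ms, slow_ms, escalation_rate):
    """Expected per-image latency when every image runs the fast model
    and a fraction `escalation_rate` additionally runs the slow model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return fast_ms + escalation_rate * slow_ms

average_ms = cascade_latency_ms(fast_ms=10.0, slow_ms=500.0, escalation_rate=0.05)
print(f"average latency: {average_ms:.1f} ms per image")

# Sensitivity check: the escalation rate dominates the average
for rate in (0.01, 0.05, 0.20):
    print(rate, cascade_latency_ms(10.0, 500.0, rate))
```

Because the slow path is 50 times more expensive, lowering the escalation rate (for example by tuning the 0.95 confidence threshold) matters more than speeding up CLIP itself.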

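Key Techniques

1. Batching

# Batched CLIP inference (throughput gains of up to about 10x, depending on hardware)
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

2. Feature caching

# Precompute the text features once when the class list is fixed
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
    # Cache them; later requests only need to compute image features

3. Mixed precision

# Speed up inference with torch.autocast
with torch.autocast(device_type='cuda', dtype=torch.float16):
    outputs = model(**inputs)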

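Deployment Recommendations

Scenario	Recommended model	Hardware	Throughput
Image search	CLIP	1×T4 (16GB)	100 QPS
Visual question answering	BLIP-2 (4-bit)	1×A10 (24GB)	10 QPS
Dialogue system	LLaVA (4-bit)	1×A100 (40GB)	5 QPS
High concurrency	CLIP + API	Serverless	1000+ QPS

Cost comparison:

Self-hosted GPU: $1-3 per hour (cloud instance)
OpenAI GPT-4V: $0.01-0.03 per image
Google Gemini: $0.0025-0.01 per image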

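Chapter Summary

Key points

CLIP: the pioneer of zero-shot classification, well suited to rapid prototyping and image retrieval
BLIP-2: strong at VQA and captioning; the Q-Former makes training parameter-efficient
LLaVA: an open-source dialogue model that approaches the closed GPT-4V on some benchmarks

Model selection guidance

Need	Recommended model	Why
Zero-shot classification	CLIP	Fast, simple, effective
Image captioning	BLIP-2	Purpose-built, high-quality output
Complex visual reasoning	LLaVA	Benefits from the LLM's reasoning ability
Production (quality first)	CLIP + API	Self-hosted CLIP plus calls to GPT-4V
Production (cost first)	LLaVA 4-bit	Open and controllable, low VRAM needs

Learning path

Beginner: start with CLIP zero-shot classification to understand contrastive learning
Intermediate: try VQA with BLIP-2 and learn quantization
Advanced: deploy a LLaVA dialogue system and optimize inference

Further resources

CLIP paper: https://arxiv.org/abs/2103.00020
BLIP-2 paper: https://arxiv.org/abs/2301.12597
LLaVA project: https://github.com/haotian-liu/LLaVA
Hugging Face model hub: https://huggingface.co/models?pipeline_tag=image-text-to-text

Next chapter: Chapter 17: Frontier Large Vision Models explores Florence-2, GPT-4V, and Gemini, and covers commercial API usage and prompt engineering.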

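Chapter 17: Frontier Large Vision Models

Florence-2, GPT-4o, Gemini - production-grade VLMs

Chapter Overview

This chapter covers leading large vision models of 2024-2025:

Florence-2: Microsoft's open-source unified vision foundation model
GPT-4o: OpenAI's natively multimodal GPT
Gemini: Google's natively multimodal model

These models represent the current state of the art in VLMs, and knowing how to use them is essential for real applications.

17.1 Florence-2: Microsoft's Vision Foundation Model

Florence-2 is an open-source vision foundation model released by Microsoft in 2024. It handles more than 10 vision tasks through a single prompt-based interface.

17.1.1 Key Features

Main advantages:

MIT license: commercial use is permitted
Unified task interface: one model covers a wide range of vision tasks
Prompt-driven: the task is selected by the prompt token
Compact: strong results with at most 0.77B parameters

17.1.2 Model Variants

Model	Parameters	Type	Recommended for
Florence-2-base	0.23B	Pretrained base	Resource-constrained environments
Florence-2-large	0.77B	Pretrained base	General use
Florence-2-base-ft	0.23B	Task fine-tuned	Fast deployment
Florence-2-large-ft	0.77B	Task fine-tuned	Best accuracy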
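17.1.3 Supported Tasks and Prompts

Visual understanding tasks:

Task prompt	Function	Output format	Example use
`<CAPTION>`	Basic caption	Text	Image tagging
`<DETAILED_CAPTION>`	Detailed caption	Text	Content analysis
`<MORE_DETAILED_CAPTION>`	Comprehensive description	Text	In-depth understanding
`<OD>`	Object detection	bbox + class	Object localization
`<DENSE_REGION_CAPTION>`	Region-level captions	Region + caption	Fine-grained analysis
`<REGION_PROPOSAL>`	Region proposals	List of bboxes	Detection preprocessing

Grounding and text recognition tasks:

Task prompt	Function	Output format	Example use
`<CAPTION_TO_PHRASE_GROUNDING>`	Phrase grounding	Text → bbox	Visual grounding
`<OCR>`	Text recognition	Text	Document OCR
`<OCR_WITH_REGION>`	OCR with positions	Text + quadrilateral	Receipt recognition
`<REFERRING_EXPRESSION_SEGMENTATION>`	Referring segmentation	Polygon mask	Interactive segmentation
`<OPEN_VOCABULARY_DETECTION>`	Open-vocabulary detection	bbox	Flexible detection

17.1.4 Complete Usage Example

Basic setup: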

from transformers import AutoProcessor, AutoModelForCausalLM
import torch
from PIL import Image

# Load the model (the large-ft checkpoint is recommended)
model_id = "microsoft/Florence-2-large-ft"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_florence(image, task_prompt, text_input=None):
    """Generic Florence-2 inference helper: task token plus optional text input."""
    if text_input:
        prompt = task_prompt + text_input
    else:
        prompt = task_prompt

    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )

    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_result = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_result

Task examples:

image = Image.open("example.jpg").convert("RGB")

# 1. Image captioning
caption = run_florence(image, "<CAPTION>")
print(f"Caption: {caption}")
# Example output: {'<CAPTION>': 'A cat sitting on a red sofa in a living room.'}

detailed = run_florence(image, "<DETAILED_CAPTION>")
print(f"Detailed caption: {detailed}")

# 2. Object detection
detection = run_florence(image, "<OD>")
print(f"Detections: {detection}")
# Example output: {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['cat', 'sofa', ...]}}

# 3. OCR
ocr_result = run_florence(image, "<OCR>")
print(f"Recognized text: {ocr_result}")

# 4. OCR with positions
ocr_with_region = run_florence(image, "<OCR_WITH_REGION>")
print(f"Text + positions: {ocr_with_region}")

# 5. Phrase grounding
grounding = run_florence(image, "<CAPTION_TO_PHRASE_GROUNDING>", "a cat")
print(f"Grounding result: {grounding}")

# 6. Open-vocabulary detection
open_det = run_florence(image, "<OPEN_VOCABULARY_DETECTION>", "cat, dog, person")
print(f"Open-vocabulary detections: {open_det}")

# 7. Region-level captions
dense_caption = run_florence(image, "<DENSE_REGION_CAPTION>")
print(f"Region captions: {dense_caption}")
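Downstream code usually wants one record per detection rather than parallel lists. The sketch below reshapes a result in the `<OD>` format shown above into a list of dictionaries and filters it by label; the helper names are hypothetical and the sample result is hand-written rather than real model output.

```python
def detections_from_od(result, task="<OD>"):
    """Turn {'<OD>': {'bboxes': [...], 'labels': [...]}} into a list of records."""
    data = result.get(task, {})
    return [
        {"label": label, "bbox": [float(v) for v in bbox]}
        for bbox, label in zip(data.get("bboxes", []), data.get("labels", []))
    ]

def filter_by_label(detections, wanted):
    """Keep only detections whose label is in the `wanted` set."""
    return [d for d in detections if d["label"] in wanted]

def bbox_area(bbox):
    """Area in pixels of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

# Hand-written sample in the <OD> output format
sample = {"<OD>": {"bboxes": [[10, 20, 110, 220], [0, 0, 50, 40]], "labels": ["cat", "sofa"]}}

detections = detections_from_od(sample)
cats = filter_by_label(detections, {"cat"})
cat_area = bbox_area(cats[0]["bbox"])
print(cats, cat_area)
```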

17.1.5 Visualizing Results

import matplotlib.pyplot as plt
import matplotlib.patches as patches

def visualize_detection(image, result, task="<OD>"):
    """可视化检测结果"""
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image)

    data = result[task]
    bboxes = data.get('bboxes', [])
    labels = data.get('labels', [])

    colors = plt.cm.Set3(range(len(bboxes)))

    for bbox, label, color in zip(bboxes, labels, colors):
        x1, y1, x2, y2 = bbox
        rect = patches.Rectangle(
            (x1, y1), x2-x1, y2-y1,
            linewidth=2, edgecolor=color, facecolor='none'
        )
        ax.add_patch(rect)
        ax.text(x1, y1-5, label, fontsize=10, color='white',
                bbox=dict(boxstyle='round', facecolor=color, alpha=0.8))

    ax.axis('off')
    plt.tight_layout()
    plt.savefig('detection_result.png', dpi=150, bbox_inches='tight')
    plt.show()

# 使用
detection_result = run_florence(image, "<OD>")
visualize_detection(image, detection_result)

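17.1.6 Benchmarks

Task	Dataset	Florence-2-L	Reference model
Image captioning	COCO	135.6 CIDEr	BLIP-2: 144.5
Object detection	COCO	37.5 mAP	-
VQA	VQAv2	81.7% (ft)	LLaVA: 78.5%
OCR	TextVQA	63.0%	-

Strengths: a single model covers many tasks, is simple to deploy, and needs modest resources.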

17.1.7 Application Scenarios

1. Document processing:

def process_document(image_path):
    """Document processing pipeline."""
    image = Image.open(image_path).convert("RGB")

    # Plain OCR
    text = run_florence(image, "<OCR>")

    # OCR with positions (useful for tables and forms)
    ocr_regions = run_florence(image, "<OCR_WITH_REGION>")

    # Detect figures and embedded images
    objects = run_florence(image, "<OD>")

    return {
        "text": text,
        "regions": ocr_regions,
        "objects": objects
    }

2. E-commerce image analysis:

def analyze_product_image(image_path):
    """Analyze a product photo."""
    image = Image.open(image_path).convert("RGB")

    # Product description
    description = run_florence(image, "<DETAILED_CAPTION>")

    # Detect the main product
    detection = run_florence(image, "<OD>")

    # Extract text printed on the product (brand, specifications, etc.)
    text_info = run_florence(image, "<OCR>")

    return {
        "description": description,
        "objects": detection,
        "text": text_info
    }
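17.2 GPT-4o: Multimodal GPT

GPT-4o ("omni") is a natively multimodal model released by OpenAI in May 2024 that combines text, vision, and audio in a single model.

17.2.1 Key Features

Improvements over GPT-4 Turbo with Vision (GPT-4V), as announced by OpenAI:

Speed: about 2x faster responses
Cost: API price reduced by 50%
Capability: noticeably better visual understanding
Multimodality: native support for text, image, and audio

Choosing a model:

Model	Notes	Input price	Recommended for
gpt-4o	Most capable	$5/1M tokens at launch (later $2.50)	Complex reasoning
gpt-4o-mini	Best value	$0.15/1M tokens	Everyday tasks
gpt-4-turbo	Previous generation	$10/1M tokens	Compatibility

17.2.2 Basic Usage

Option 1: image by URL: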

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def analyze_image_url(image_url, question):
    """使用URL分析图像"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url}
                    }
                ]
            }
        ],
        max_tokens=500
    )
    return response.choices[0].message.content

# 使用示例
result = analyze_image_url(
    "https://example.com/image.jpg",
    "请详细描述这张图片中的内容"
)
print(result)

Option 2: Base64-encoded image:

import base64

def encode_image(image_path):
    """将本地图像编码为base64"""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_local_image(image_path, question):
    """分析本地图像"""
    base64_image = encode_image(image_path)

    # 自动检测图像格式
    if image_path.lower().endswith('.png'):
        media_type = "image/png"
    elif image_path.lower().endswith('.gif'):
        media_type = "image/gif"
    elif image_path.lower().endswith('.webp'):
        media_type = "image/webp"
    else:
        media_type = "image/jpeg"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

# 使用
result = analyze_local_image("product.jpg", "这个商品的主要特点是什么?")

17.2.3 Multi-Image Analysis

def compare_images(image_paths, question):
    """多图像对比分析"""
    content = [{"type": "text", "text": question}]

    for path in image_paths:
        base64_image = encode_image(path)
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            }
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1500
    )
    return response.choices[0].message.content

# 对比两张图片
result = compare_images(
    ["before.jpg", "after.jpg"],
    "请对比这两张图片的差异,描述发生了什么变化"
)

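17.2.4 Controlling Image Detail

GPT-4o lets you control how finely an image is processed:

def analyze_with_detail(image_path, question, detail="auto"):
    """
    Control the image analysis resolution.
    detail options:
    - "low": fixed 512x512 view, 85 tokens, fast and cheap
    - "high": image is fit within 2048x2048 and then tiled; more detail, more tokens
    - "auto": chosen automatically (default)
    """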
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": detail  # "low", "high", "auto"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content

# 快速预览(省钱)
quick_result = analyze_with_detail("doc.jpg", "这是什么文档?", detail="low")

# 详细分析(精确)
detailed_result = analyze_with_detail("doc.jpg", "请提取文档中的所有文字", detail="high")
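The cost difference between the detail levels can be estimated ahead of time. The sketch below follows the rules OpenAI has published for GPT-4o image inputs (85 base tokens, plus 170 tokens per 512px tile in high-detail mode after the image is fit within 2048×2048 and its shortest side is reduced to 768px). Treat it as an approximation: the function name is hypothetical, the constants are assumptions that can change between model versions, and this version only ever downscales.

```python
import math

BASE_TOKENS = 85        # charged for every image, and the full cost in "low" mode
TOKENS_PER_TILE = 170   # charged per 512x512 tile in "high" mode

def gpt4o_image_tokens(width, height, detail="high"):
    """Estimate the input tokens charged for one image."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: fit the image inside a 2048x2048 square, keeping the aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512 px tiles needed to cover the result
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles

square = gpt4o_image_tokens(1024, 1024)             # 768x768 -> 4 tiles
portrait = gpt4o_image_tokens(2048, 4096)           # 768x1536 -> 6 tiles
thumbnail = gpt4o_image_tokens(4096, 8192, "low")   # flat rate
print(square, portrait, thumbnail)
```

A document page that costs 85 tokens in low detail can cost several hundred in high detail, which is why it pays to classify with `detail="low"` first and reserve `"high"` for text extraction.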

17.2.5 Structured Output

import json

def extract_structured_info(image_path, schema_description):
    """提取结构化信息"""
    base64_image = encode_image(image_path)

    prompt = f"""分析这张图片,按照以下格式返回JSON:
{schema_description}

只返回JSON,不要其他内容。"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    }
                ]
            }
        ],
        max_tokens=1000,
        response_format={"type": "json_object"}  # 强制JSON输出
    )

    return json.loads(response.choices[0].message.content)

# 提取商品信息
schema = """
{
    "product_name": "商品名称",
    "brand": "品牌",
    "price": "价格(如果可见)",
    "features": ["特点1", "特点2"],
    "category": "类别"
}
"""
product_info = extract_structured_info("product.jpg", schema)

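17.2.6 Known Limitations

Limitations of GPT-4o vision:

Spatial reasoning: complex positional relationships may be misjudged
Counting: counts of many objects are often inaccurate
Small text: fine print in images may be misread
Medical images: not suitable for medical diagnosis
CAPTCHA: the model refuses to solve CAPTCHAs

17.2.7 Cost Optimization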

class GPT4VisionOptimizer:
    """GPT-4o Vision成本优化器"""

    def __init__(self):
        self.client = OpenAI()

    def preprocess_image(self, image_path, max_size=1024):
        """预处理图像以减少tokens"""
        from PIL import Image

        img = Image.open(image_path)

        # 调整大小
        if max(img.size) > max_size:
            ratio = max_size / max(img.size)
            new_size = (int(img.width * ratio), int(img.height * ratio))
            img = img.resize(new_size, Image.LANCZOS)

        # 转换为JPEG(通常更小)
        import io
        buffer = io.BytesIO()
        img.convert('RGB').save(buffer, format='JPEG', quality=85)
        return base64.b64encode(buffer.getvalue()).decode('utf-8')

    def smart_analyze(self, image_path, question, use_mini=True):
        """智能分析,根据任务选择模型"""
        # 简单任务用mini
        model = "gpt-4o-mini" if use_mini else "gpt-4o"
        base64_image = self.preprocess_image(image_path)

        response = self.client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": question},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}",
                                "detail": "low"  # 先用低精度
                            }
                        }
                    ]
                }
            ],
            max_tokens=300
        )
        return response.choices[0].message.content

# 使用
optimizer = GPT4VisionOptimizer()
result = optimizer.smart_analyze("image.jpg", "图片中有什么?", use_mini=True)

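17.3 Gemini Vision

Gemini is Google's natively multimodal model family, first released in late 2023; video understanding is one of its distinguishing strengths.

17.3.1 Model Family

Model	Notes	Context window	Recommended for
gemini-2.5-flash	Latest fast model	1M tokens	Everyday tasks
gemini-2.5-pro	Strongest reasoning	1M tokens	Complex analysis
gemini-2.0-flash	Balanced	1M tokens	General use
gemini-1.5-pro	Stable	2M tokens	Very long context

17.3.2 Basic Usage

Installation and configuration (the `from google import genai` client used below ships in the google-genai package):

pip install google-genai

Basic image analysis: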

from google import genai
from google.genai import types
import os

# 初始化客户端
client = genai.Client(api_key=os.environ.get("GOOGLE_API_KEY"))

def analyze_image_gemini(image_path, prompt):
    """使用Gemini分析图像"""
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    # 检测MIME类型
    if image_path.lower().endswith('.png'):
        mime_type = 'image/png'
    elif image_path.lower().endswith('.webp'):
        mime_type = 'image/webp'
    else:
        mime_type = 'image/jpeg'

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            prompt
        ]
    )
    return response.text

# 使用
result = analyze_image_gemini("photo.jpg", "描述这张照片中的场景")
print(result)

17.3.3 Using the File API (large files)

def analyze_large_image(image_path, prompt):
    """使用File API处理大文件(推荐)"""
    # 上传文件
    uploaded_file = client.files.upload(file=image_path)

    # 等待处理完成
    import time
    while uploaded_file.state.name == "PROCESSING":
        time.sleep(1)
        uploaded_file = client.files.get(name=uploaded_file.name)

    # 生成内容
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[uploaded_file, prompt]
    )

    # 可选:删除上传的文件
    # client.files.delete(name=uploaded_file.name)

    return response.text

# 使用
result = analyze_large_image("high_res_image.jpg", "详细分析这张图片")

17.3.4 Multi-Image Analysis

def compare_images_gemini(image_paths, prompt):
    """多图像对比分析"""
    contents = []

    for path in image_paths:
        with open(path, 'rb') as f:
            image_bytes = f.read()
        contents.append(types.Part.from_bytes(data=image_bytes, mime_type='image/jpeg'))

    contents.append(prompt)

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=contents
    )
    return response.text

# 使用
result = compare_images_gemini(
    ["img1.jpg", "img2.jpg", "img3.jpg"],
    "比较这三张图片的异同点"
)

17.3.5 Video Understanding (a Gemini strength)

Gemini accepts video input natively, which sets it apart from most other VLM APIs:

def analyze_video(video_path, prompt):
    """分析视频内容"""
    # 上传视频
    video_file = client.files.upload(file=video_path)

    # 等待处理
    import time
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = client.files.get(name=video_file.name)

    if video_file.state.name == "FAILED":
        raise ValueError("视频处理失败")

    # 分析视频
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[video_file, prompt]
    )
    return response.text

# 使用
result = analyze_video("demo.mp4", "总结这个视频的主要内容")
print(result)

Querying video timestamps:

def query_video_timestamp(video_path, question):
    """查询视频特定时间点的内容"""
    video_file = client.files.upload(file=video_path)

    # 等待处理
    import time
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = client.files.get(name=video_file.name)

    prompt = f"""观看这个视频并回答问题。
如果问题涉及特定场景,请指出大概的时间点。

问题: {question}"""

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[video_file, prompt]
    )
    return response.text

# 使用
result = query_video_timestamp("lecture.mp4", "讲师什么时候开始讲解神经网络?")

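17.3.6 Advanced Features: Object Detection and Segmentation

Gemini 2.0+ supports object detection, and Gemini 2.5+ supports segmentation:

def detect_objects_gemini(image_path, objects_to_detect):
    """Run object detection with Gemini."""
    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    # Gemini's documented box convention is [y_min, x_min, y_max, x_max], normalized to 0-1000
    prompt = f"""Detect the following objects in the image: {objects_to_detect}

For each detected object, return its bounding box in the format:
object name: [y_min, x_min, y_max, x_max] (normalized to 0-1000)"""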

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type='image/jpeg'),
            prompt
        ]
    )
    return response.text

# 使用
result = detect_objects_gemini("street.jpg", "人, 车, 红绿灯")
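Boxes normalized to 0-1000 have to be rescaled before they can be drawn or cropped. The sketch below assumes the boxes arrive in Gemini's documented [y_min, x_min, y_max, x_max] order; if your prompt requests a different order, swap the indices accordingly. The function name and the sample values are hypothetical.

```python
def to_pixel_box(norm_box, image_width, image_height):
    """Convert [y_min, x_min, y_max, x_max] on a 0-1000 scale
    to [x1, y1, x2, y2] in pixels, clamped to the image bounds."""
    y_min, x_min, y_max, x_max = norm_box

    def scale(value, size):
        # Map 0-1000 to 0-size and keep the result inside the image
        return min(size, max(0, round(value / 1000 * size)))

    return [
        scale(x_min, image_width),
        scale(y_min, image_height),
        scale(x_max, image_width),
        scale(y_max, image_height),
    ]

# Hypothetical detection on a 1000x500 (width x height) image
pixel_box = to_pixel_box([100, 200, 500, 800], image_width=1000, image_height=500)
print(pixel_box)

# Values outside 0-1000 are clamped rather than passed through
clamped = to_pixel_box([-50, 0, 1200, 1000], image_width=640, image_height=480)
print(clamped)
```

The output order matches the [x1, y1, x2, y2] convention used by the Florence-2 visualization helper earlier in the chapter, so the same drawing code can be reused.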

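17.3.7 Cost and Token Accounting

Image token rules:

Images with both sides ≤384px: 258 tokens
Larger images: split into 768×768 tiles, 258 tokens per tile

Video tokens:

About 263 tokens per second of video (sampled at 1 fps)
A 1-minute video ≈ 15,780 tokens

Price comparison (USD per 1M tokens, subject to change):

Model	Input price	Output price
Gemini 2.5 Flash	$0.075/1M	$0.30/1M
Gemini 2.5 Pro	$1.25/1M	$10/1M
GPT-4o	$2.50/1M	$10/1M
GPT-4o-mini	$0.15/1M	$0.60/1M

Takeaway: going by this table, Gemini input tokens cost about half as much as the comparable GPT-4o tier (2.5 Pro vs GPT-4o, 2.5 Flash vs GPT-4o-mini).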
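The token rules above translate directly into a small estimator. This is a simplified sketch: the function names are hypothetical, the tiling is approximated as a plain grid of 768px tiles (the service's actual cropping and scaling may differ), and the price passed in is whatever rate applies to your model.

```python
import math

TOKENS_PER_TILE = 258          # also the flat cost of a small image
TILE_SIDE = 768
SMALL_IMAGE_SIDE = 384
VIDEO_TOKENS_PER_SECOND = 263  # at 1 fps sampling, per the figures above

def gemini_image_tokens(width, height):
    """Approximate token cost of one image."""
    if width <= SMALL_IMAGE_SIDE and height <= SMALL_IMAGE_SIDE:
        return TOKENS_PER_TILE
    tiles = math.ceil(width / TILE_SIDE) * math.ceil(height / TILE_SIDE)
    return tiles * TOKENS_PER_TILE

def gemini_video_tokens(seconds):
    """Approximate token cost of a video clip."""
    return math.ceil(seconds) * VIDEO_TOKENS_PER_SECOND

def input_cost_usd(tokens, usd_per_million_tokens):
    """Convert a token count into an input cost in US dollars."""
    return tokens * usd_per_million_tokens / 1_000_000

small_image = gemini_image_tokens(320, 240)
wide_image = gemini_image_tokens(1536, 768)
one_minute = gemini_video_tokens(60)
minute_cost = input_cost_usd(one_minute, 0.075)
print(small_image, wide_image, one_minute, f"${minute_cost:.6f}")
```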

17.4 In Practice: Calling VLM APIs

17.4.1 A Unified Interface

from abc import ABC, abstractmethod
from PIL import Image
import base64
import io

class VLMInterface(ABC):
    """VLM统一接口"""

    @abstractmethod
    def analyze(self, image_path: str, prompt: str) -> str:
        pass

    def _encode_image(self, image_path: str) -> str:
        with open(image_path, 'rb') as f:
            return base64.b64encode(f.read()).decode('utf-8')

class GPT4oVLM(VLMInterface):
    def __init__(self, api_key: str):
        from openai import OpenAI
        self.client = OpenAI(api_key=api_key)

    def analyze(self, image_path: str, prompt: str) -> str:
        base64_image = self._encode_image(image_path)
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                ]
            }],
            max_tokens=1000
        )
        return response.choices[0].message.content

class GeminiVLM(VLMInterface):
    def __init__(self, api_key: str):
        from google import genai
        self.client = genai.Client(api_key=api_key)

    def analyze(self, image_path: str, prompt: str) -> str:
        from google.genai import types
        with open(image_path, 'rb') as f:
            image_bytes = f.read()
        response = self.client.models.generate_content(
            model='gemini-2.5-flash',
            contents=[types.Part.from_bytes(data=image_bytes, mime_type='image/jpeg'), prompt]
        )
        return response.text

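class Florence2VLM(VLMInterface):
    """Local Florence-2 backend. Florence-2 expects a task prompt such as
    "<CAPTION>" or "<OCR>" rather than a free-form instruction."""

    def __init__(self, device: str = "cuda"):
        from transformers import AutoProcessor, AutoModelForCausalLM
        import torch
        self.device = device
        self.dtype = torch.float16
        self.model = AutoModelForCausalLM.from_pretrained(
            "microsoft/Florence-2-large-ft",
            torch_dtype=self.dtype,
            trust_remote_code=True
        ).to(device)
        self.processor = AutoProcessor.from_pretrained(
            "microsoft/Florence-2-large-ft",
            trust_remote_code=True
        )

    def analyze(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path).convert("RGB")
        # Cast floating-point inputs to the model dtype; otherwise float32
        # pixel values are fed to float16 weights and generate() fails
        inputs = self.processor(text=prompt, images=image, return_tensors="pt").to(self.device, self.dtype)
        output = self.model.generate(**inputs, max_new_tokens=1024)
        return self.processor.batch_decode(output, skip_special_tokens=True)[0]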

# 使用示例
def analyze_with_fallback(image_path, prompt, vlm_list):
    """带fallback的分析"""
    for vlm in vlm_list:
        try:
            return vlm.analyze(image_path, prompt)
        except Exception as e:
            print(f"{vlm.__class__.__name__} 失败: {e}")
            continue
    raise RuntimeError("所有VLM都失败了")

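17.4.2 Document Understanding#

import json

class DocumentAnalyzer:
    """Document analyzer built on any VLMInterface backend"""

    def __init__(self, vlm: VLMInterface):
        self.vlm = vlm

    @staticmethod
    def _parse_json(text: str):
        """Parse a JSON reply, tolerating a surrounding Markdown code fence"""
        text = text.strip()
        fence = "`" * 3  # models often wrap JSON in a fenced block
        if text.startswith(fence):
            text = text.strip("`").strip()
            if text.lower().startswith("json"):
                text = text[4:]
        return json.loads(text)

    def extract_text(self, image_path: str) -> str:
        """Extract the text in a document image"""
        prompt = "Extract all text in the image, preserving the original formatting and layout"
        return self.vlm.analyze(image_path, prompt)

    def analyze_table(self, image_path: str) -> dict:
        """Extract a table as structured data"""
        prompt = """Analyze the table in the image and return JSON in this format:
{
    "headers": ["column 1", "column 2", ...],
    "rows": [["value 1", "value 2", ...], ...]
}
Return only the JSON, nothing else."""
        return self._parse_json(self.vlm.analyze(image_path, prompt))

    def summarize_document(self, image_path: str) -> dict:
        """Summarize a document"""
        prompt = """Analyze this document and return JSON in this format:
{
    "type": "document type (e.g. invoice/contract/report)",
    "title": "document title",
    "date": "date (if present)",
    "summary": "summary of the main content (under 100 words)",
    "key_info": ["key fact 1", "key fact 2", ...]
}
Return only the JSON, nothing else."""
        return self._parse_json(self.vlm.analyze(image_path, prompt))

# Usage
vlm = GPT4oVLM(api_key="your-key")
analyzer = DocumentAnalyzer(vlm)

# Analyze an invoice
invoice_info = analyzer.summarize_document("invoice.jpg")
print(invoice_info)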

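17.4.3 Video Analysis (Gemini)#

class VideoAnalyzer:
    """Video analyzer (Gemini only)"""

    def __init__(self, api_key: str):
        from google import genai
        self.client = genai.Client(api_key=api_key)

    def _upload_video(self, video_path: str, poll_seconds: float = 2.0):
        """Upload a video and wait until server-side processing finishes"""
        import time
        video_file = self.client.files.upload(file=video_path)
        while video_file.state.name == "PROCESSING":
            time.sleep(poll_seconds)
            video_file = self.client.files.get(name=video_file.name)
        # Stop early instead of sending an unusable file to the model
        if video_file.state.name == "FAILED":
            raise RuntimeError(f"Video processing failed: {video_path}")
        return video_file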

    def generate_summary(self, video_path: str) -> str:
        """生成视频摘要"""
        video_file = self._upload_video(video_path)
        response = self.client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[video_file, "生成这个视频的详细摘要,包括主要内容、关键场景和结论"]
        )
        return response.text

    def extract_key_frames(self, video_path: str) -> list:
        """提取关键帧描述"""
        video_file = self._upload_video(video_path)
        prompt = """分析视频中的关键场景,返回JSON格式:
[
    {"timestamp": "MM:SS", "description": "场景描述"},
    ...
]
只返回JSON列表。"""
        response = self.client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[video_file, prompt]
        )
        import json
        return json.loads(response.text)

    def answer_question(self, video_path: str, question: str) -> str:
        """视频问答"""
        video_file = self._upload_video(video_path)
        response = self.client.models.generate_content(
            model="gemini-2.5-flash",
            contents=[video_file, f"观看视频并回答问题: {question}"]
        )
        return response.text

# 使用
analyzer = VideoAnalyzer(api_key="your-google-key")
summary = analyzer.generate_summary("lecture.mp4")
print(summary)

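17.4.4 Comparing VLM Performance#

import time

def benchmark_vlms(image_path: str, prompt: str, vlms: dict) -> dict:
    """Measure single-call latency and response size for each VLM"""
    results = {}

    for name, vlm in vlms.items():
        try:
            # perf_counter is monotonic and better suited to timing than time.time()
            start = time.perf_counter()
            response = vlm.analyze(image_path, prompt)
            latency = time.perf_counter() - start

            # Append "..." only when the preview was actually truncated
            preview = response[:200] + ("..." if len(response) > 200 else "")
            results[name] = {
                "success": True,
                "latency": round(latency, 2),
                "response_length": len(response),
                "response_preview": preview
            }
        except Exception as e:
            results[name] = {
                "success": False,
                "error": str(e)
            }

    return results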

# 运行基准测试
vlms = {
    "GPT-4o": GPT4oVLM(api_key="..."),
    "Gemini": GeminiVLM(api_key="..."),
    "Florence-2": Florence2VLM()
}

results = benchmark_vlms("test_image.jpg", "描述这张图片", vlms)
for name, result in results.items():
    print(f"{name}: {result}")

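Chapter Summary#

Key Points#

Florence-2: an open-source unified vision model that switches among 10+ tasks via prompts; well suited to self-hosted deployment and customization
GPT-4o: the strongest all-round capability of the three; suited to complex reasoning and high-accuracy needs
Gemini: stands out for native video understanding, the lowest price in the comparison above, and a very long context window

Model Selection Guide#

Need	Recommended model	Reason
Open source and controllable	Florence-2	MIT license, commercial use allowed
Strongest reasoning	GPT-4o	Best all-round capability
Video analysis	Gemini	Native video support
Cost-sensitive	Gemini Flash	Cheapest
Rapid prototyping	GPT-4o-mini	Good price/performance
Local deployment	Florence-2	No API dependency

Best Practices#

Cost optimization: use the mini/Flash tier for simple tasks and the standard models for complex ones
Image preprocessing: downscale or compress images to reduce token consumption
Batch processing: use asynchronous calls to raise throughput
Fallback strategy: switch to a backup model when the primary model fails
Result caching: cache results for identical image + prompt pairs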
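
The result-caching practice can be sketched as follows. This is a minimal in-memory example under stated assumptions: `CachedAnalyzer` and its `analyze_fn` callback are hypothetical names, and the backend is assumed to take raw image bytes plus a prompt and return text. The cache key hashes both, so the same image with a different prompt is treated as a new request.

```python
import hashlib
from typing import Callable, Dict


class CachedAnalyzer:
    """Wrap any (image_bytes, prompt) -> str function with an in-memory cache."""

    def __init__(self, analyze_fn: Callable[[bytes, str], str]):
        self._analyze_fn = analyze_fn
        self._cache: Dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(image_bytes: bytes, prompt: str) -> str:
        digest = hashlib.sha256()
        digest.update(image_bytes)
        digest.update(b"\x00")  # separator so image and prompt bytes cannot run together
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def analyze(self, image_bytes: bytes, prompt: str) -> str:
        key = self._key(image_bytes, prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = self._analyze_fn(image_bytes, prompt)
        self._cache[key] = result
        return result


if __name__ == "__main__":
    # Stand-in backend used only for the demo; a real one would call a VLM API.
    def fake_backend(image_bytes: bytes, prompt: str) -> str:
        return f"{len(image_bytes)} bytes analyzed for: {prompt}"

    cached = CachedAnalyzer(fake_backend)
    cached.analyze(b"fake-image", "Describe this image")
    cached.analyze(b"fake-image", "Describe this image")
    print("cache hits:", cached.hits)
```

For production use, a persistent store with an expiry policy would likely replace the dictionary; the keying scheme stays the same.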

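Chapter 18: 3D Vision and Video Understanding#

Extending vision from 2D to 3D/4D

Chapter Overview#

This chapter explores frontier directions in computer vision:

NeRF: neural radiance fields, an implicit 3D representation
3D Gaussian Splatting: an explicit 3D reconstruction paradigm with real-time rendering
Video Understanding: video classification and understanding
Video VLM: large language models extended to video

Together, these techniques mark the move of computer vision from 2D toward 3D/4D.

18.1 NeRF: Neural Radiance Fields#

NeRF (Neural Radiance Fields), presented at ECCV 2020, represents a 3D scene implicitly with a neural network. It received a Best Paper Honorable Mention at ECCV 2020.

18.1.1 Core Principle#

Basic idea: a neural network learns a mapping from a 5D input (3D position + 2D viewing direction) to color and density.

Input: (x, y, z, θ, φ) - spatial position + viewing direction
       ↓
    MLP network
       ↓
Output: (r, g, b, σ) - color + density

Volume rendering: integrate color and density along each camera ray r(t) = o + t·d, between the near and far bounds t_n and t_f, to produce the final pixel value:

C(r) = ∫[t_n → t_f] T(t) · σ(r(t)) · c(r(t), d) dt,   T(t) = exp(−∫[t_n → t] σ(r(s)) ds)

where:
- T(t): transmittance, the probability that the ray travels from t_n to t without being blocked
- σ: volume density
- c: color
- d: viewing direction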
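
In practice the integral is approximated by sampling points along the ray. A minimal pure-Python sketch of the standard quadrature follows: each sample contributes with weight T_i · (1 − exp(−σ_i δ_i)), where δ_i is the spacing to the next sample. The function name `composite_ray` and the toy sample values are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def composite_ray(
    sigmas: Sequence[float],
    colors: Sequence[Tuple[float, float, float]],
    deltas: Sequence[float],
) -> Tuple[List[float], float]:
    """Discrete volume rendering along one ray.

    Returns the accumulated RGB color and the transmittance left after the last sample.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T_i: fraction of light that has not been blocked so far
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


if __name__ == "__main__":
    # Toy ray: empty space, then a semi-transparent red sample, then a dense green one
    sigmas = [0.0, 2.0, 50.0]
    colors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    deltas = [0.5, 0.5, 0.5]
    rgb, remaining = composite_ray(sigmas, colors, deltas)
    print("pixel:", [round(v, 4) for v in rgb], "remaining transmittance:", remaining)
```

The weights sum to one minus the remaining transmittance, which is why a ray that passes only through empty space renders as the background.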

18.1.2 Network Architecture#

import torch
import torch.nn as nn

class NeRF(nn.Module):
    """简化版NeRF网络"""

    def __init__(self, D=8, W=256, input_ch=63, input_ch_views=27):
        """
        Args:
            D: 网络深度
            W: 隐藏层宽度
            input_ch: 位置编码后的位置维度
            input_ch_views: 位置编码后的方向维度
        """
        super().__init__()
        self.D = D
        self.W = W
        self.input_ch = input_ch
        self.input_ch_views = input_ch_views

        # 位置编码后的位置输入 -> 特征
        self.pts_linears = nn.ModuleList(
            [nn.Linear(input_ch, W)] +
            [nn.Linear(W, W) if i != 4 else nn.Linear(W + input_ch, W)
             for i in range(D - 1)]
        )

        # 方向相关的颜色预测
        self.views_linears = nn.ModuleList([nn.Linear(input_ch_views + W, W // 2)])

        # 输出层
        self.feature_linear = nn.Linear(W, W)
        self.alpha_linear = nn.Linear(W, 1)  # 密度
        self.rgb_linear = nn.Linear(W // 2, 3)  # 颜色

    def forward(self, x):
        # 分离位置和方向
        input_pts, input_views = torch.split(
            x, [self.input_ch, self.input_ch_views], dim=-1
        )

        h = input_pts
        for i, layer in enumerate(self.pts_linears):
            h = layer(h)
            h = torch.relu(h)
            if i == 4:
                h = torch.cat([input_pts, h], -1)

        # 密度输出(与视角无关)
        alpha = self.alpha_linear(h)

        # 颜色输出(与视角相关)
        feature = self.feature_linear(h)
        h = torch.cat([feature, input_views], -1)

        for layer in self.views_linears:
            h = layer(h)
            h = torch.relu(h)

        rgb = torch.sigmoid(self.rgb_linear(h))

        return torch.cat([rgb, alpha], -1)

18.1.3 Positional Encoding#

NeRF uses positional encoding to help the network learn high-frequency detail:

class PositionalEncoding:
    """位置编码:将低维输入映射到高维空间"""

    def __init__(self, L=10):
        """
        Args:
            L: 编码频率数量
        """
        self.L = L
        self.freq_bands = 2.0 ** torch.linspace(0, L - 1, L)

    def encode(self, x):
        """
        Args:
            x: 输入坐标 [..., C]
        Returns:
            编码后的坐标 [..., C * (2L + 1)]
        """
        out = [x]
        for freq in self.freq_bands:
            out.append(torch.sin(freq * x))
            out.append(torch.cos(freq * x))
        return torch.cat(out, dim=-1)

# Usage example
pos_encoder = PositionalEncoding(L=10)  # positions use L=10
dir_encoder = PositionalEncoding(L=4)   # directions use L=4

# 3D position: 3 -> 3*(2*10+1) = 63
# Viewing direction (passed as a 3D unit vector): 3 -> 3*(2*4+1) = 27

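18.1.4 Using Nerfstudio#

Nerfstudio is one of the most popular NeRF toolboxes and provides a unified interface for training and visualization:

Installation:

pip install nerfstudio

# Or install from source (to get the latest features)
git clone https://github.com/nerfstudio-project/nerfstudio.git
cd nerfstudio
pip install -e .

Data preparation (uses COLMAP):

# Extract frames from a video and estimate camera poses
ns-process-data video --data ./input_video.mp4 --output-dir ./data/my_scene

# Or process a folder of images
ns-process-data images --data ./images/ --output-dir ./data/my_scene

Training:

# Train the Nerfacto model (recommended)
ns-train nerfacto --data ./data/my_scene

# Train Instant-NGP (faster)
ns-train instant-ngp --data ./data/my_scene

# Specify the output directory
ns-train nerfacto --data ./data/my_scene --output-dir ./outputs/

Visualization:

# Launch the interactive viewer
# (each run writes its config into a timestamped folder; replace TIMESTAMP with the one from your run)
ns-viewer --load-config outputs/my_scene/nerfacto/TIMESTAMP/config.yml

# Render a video
ns-render camera-path --load-config outputs/my_scene/nerfacto/TIMESTAMP/config.yml \
    --camera-path-filename camera_path.json \
    --output-path renders/output.mp4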

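18.1.5 Instant-NGP: 1000x Speedup#

NVIDIA's Instant-NGP uses a multiresolution hash encoding to cut NeRF training from hours to minutes:

Key innovations:

Multiresolution hash encoding: trainable feature vectors stored in hash tables carry most of the scene detail, so a large MLP is no longer needed and lookups are fast
Small MLP: a shallow MLP of about 2 layers suffices (the original NeRF needs 8)
CUDA optimization: a heavily optimized CUDA implementation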
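
The indexing half of the multiresolution hash encoding can be sketched in pure Python. The sketch assumes the spatial hash from the Instant-NGP paper (XOR of integer grid coordinates multiplied by fixed primes, modulo the table size) and geometrically spaced level resolutions. The parameter defaults (`n_levels=16`, `n_min=16`, `n_max=2048`, table size 2^19) are illustrative, and the feature lookup, trilinear interpolation, and direct indexing of coarse levels are omitted.

```python
import math
from typing import List, Sequence

# Primes used by the spatial hash (the first is 1 so that the x coordinate passes through)
PRIMES = (1, 2654435761, 805459861)


def level_resolution(level: int, n_levels: int = 16, n_min: int = 16, n_max: int = 2048) -> int:
    """Grid resolution at a level; resolutions grow geometrically from n_min toward n_max."""
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    return int(math.floor(n_min * growth ** level))


def hash_index(ix: int, iy: int, iz: int, table_size: int) -> int:
    """Map non-negative integer grid coordinates to a slot in the hash table."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % table_size


def corner_indices(point: Sequence[float], level: int, table_size: int = 2 ** 19) -> List[int]:
    """Hash-table slots of the 8 voxel corners around a point in [0, 1]^3."""
    res = level_resolution(level)
    base = [int(math.floor(c * res)) for c in point]
    return [
        hash_index(base[0] + dx, base[1] + dy, base[2] + dz, table_size)
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    ]


if __name__ == "__main__":
    point = (0.3, 0.7, 0.5)
    for level in (0, 8, 15):
        print(level, level_resolution(level), corner_indices(point, level)[:2])
```

In the full method, the feature vectors stored at these eight slots are interpolated and concatenated across levels before being passed to the small MLP.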

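Usage:

# Download and unpack Instant-NGP
# from https://github.com/NVlabs/instant-ngp/releases

# Launch the GUI
./instant-ngp

# Drag a data folder onto the window to start training
# Or use the command line
./instant-ngp ./data/nerf/fox

Performance comparison (indicative; actual numbers depend on scene and hardware):

Method	Training time	Rendering FPS	Quality (PSNR)
Original NeRF	1-2 days	0.03	31.0
Instant-NGP	5 minutes	60+	33.0
Nerfacto	30 minutes	1-5	32.5

18.1.6 Limitations of NeRF#

Slow training: even Instant-NGP needs several minutes per scene
Slow rendering: volume rendering is compute-intensive (Instant-NGP being the exception)
Static scenes: the original formulation handles only static scenes
Capture requirements: high-quality multi-view images are needed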

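18.2 3D Gaussian Splatting#

3D Gaussian Splatting (3DGS), presented at SIGGRAPH 2023, achieves high-quality 3D reconstruction with real-time rendering.

18.2.1 Core Idea#

Unlike NeRF's implicit representation, 3DGS represents the scene explicitly as a cloud of 3D Gaussians:

Attributes of each Gaussian:

Position: 3D center (x, y, z)
Covariance: a 3×3 matrix defining the Gaussian's shape and orientation
Opacity: the α value
Spherical harmonic coefficients: encode view-dependent color

Scene = {G₁, G₂, ..., Gₙ}
Gᵢ = (μᵢ, Σᵢ, αᵢ, SHᵢ)

where:
- μ: position (3D)
- Σ: covariance matrix (parameterized as scale + rotation)
- α: opacity
- SH: spherical harmonic coefficients (color)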
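
The "scale + rotation" parameterization of Σ can be made concrete with a small sketch. It builds Σ = R·S·Sᵀ·Rᵀ from a per-axis scale vector and a rotation quaternion, which keeps Σ symmetric and positive semi-definite by construction. The function names and the (w, x, y, z) quaternion ordering are assumptions for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def quat_to_rotmat(q: Sequence[float]) -> Matrix:
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix, normalizing it first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale: Sequence[float], quat: Sequence[float]) -> Matrix:
    """Sigma = (R S)(R S)^T, where S = diag(scale)."""
    rot = quat_to_rotmat(quat)
    m = [[rot[i][j] * scale[j] for j in range(3)] for i in range(3)]  # M = R S
    return [[sum(m[i][k] * m[j][k] for k in range(3)) for j in range(3)] for i in range(3)]


if __name__ == "__main__":
    # An ellipsoid stretched along x, then rotated 90 degrees about the z axis
    half = math.sqrt(0.5)
    sigma = covariance(scale=(2.0, 1.0, 0.5), quat=(half, 0.0, 0.0, half))
    for row in sigma:
        print([round(v, 4) for v in row])
```

Optimizing scale and rotation separately, instead of the nine matrix entries directly, avoids gradient steps that would produce an invalid covariance.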

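18.2.2 Rendering Pipeline#

3DGS uses differentiable rasterization instead of volume rendering:

# Pseudocode: 3DGS rendering (project_to_2d, sort_by_depth and footprint are placeholders)
def render_gaussians(gaussians, camera):
    """
    1. Project the 3D Gaussians to 2D
    2. Sort by depth, nearest first
    3. Alpha-blend front to back
    """
    H, W = camera.height, camera.width

    # 1. Projection
    projected_2d = project_to_2d(gaussians, camera)

    # 2. Sort by depth (near to far)
    sorted_gaussians = sort_by_depth(projected_2d)

    # 3. Rasterization: accumulate color weighted by the remaining transmittance
    image = torch.zeros(H, W, 3)
    transmittance = torch.ones(H, W, 1)
    for gaussian in sorted_gaussians:
        # Per-pixel alpha = opacity x the 2D Gaussian's falloff at that pixel, shape [H, W, 1]
        alpha = gaussian.alpha * gaussian.footprint(H, W)
        image = image + transmittance * alpha * gaussian.color
        transmittance = transmittance * (1 - alpha)

    return image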
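
Alpha blending depends on depth order, and it can be evaluated in either direction as long as the formula matches the order. The sketch below, for a single pixel, compares back-to-front "over" compositing with front-to-back accumulation that tracks the remaining transmittance. The `(depth, alpha, rgb)` tuple layout and the function names are assumptions for illustration; the early-exit threshold mirrors the common practice of stopping once a pixel is nearly opaque.

```python
from typing import List, Sequence, Tuple

Splat = Tuple[float, float, Tuple[float, float, float]]  # (depth, alpha, rgb)


def blend_back_to_front(splats: Sequence[Splat]) -> List[float]:
    """Painter's order: start with the farthest splat and draw nearer ones over it."""
    color = [0.0, 0.0, 0.0]
    for _, alpha, rgb in sorted(splats, key=lambda s: s[0], reverse=True):
        color = [c * (1.0 - alpha) + alpha * v for c, v in zip(color, rgb)]
    return color


def blend_front_to_back(splats: Sequence[Splat], min_transmittance: float = 1e-4) -> List[float]:
    """Nearest first: weight each splat by the light still passing through."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for _, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        color = [c + transmittance * alpha * v for c, v in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # the pixel is effectively opaque; farther splats cannot contribute
    return color


if __name__ == "__main__":
    # A far, fairly opaque blue splat behind a near, half-transparent red one
    pixel_splats = [(2.0, 0.8, (0.0, 0.0, 1.0)), (1.0, 0.5, (1.0, 0.0, 0.0))]
    print("back to front:", blend_back_to_front(pixel_splats))
    print("front to back:", blend_front_to_back(pixel_splats))
```

Front-to-back order is what makes early termination possible, which is one reason tile-based rasterizers favor it.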

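18.2.3 Training Pipeline#

# Pseudocode: 3DGS training loop (initialize_from_sfm, l1_loss, ssim and
# densify_and_prune are placeholders)
import random

def train_3dgs(images, cameras, sfm_points, lambda_dssim=0.2):
    """
    Args:
        images: training images, aligned index-by-index with cameras
        cameras: camera parameters
        sfm_points: sparse SfM point cloud (used for initialization)
        lambda_dssim: weight of the D-SSIM term (0.2 in the original paper)
    """
    # 1. Initialize the Gaussians from the SfM points
    gaussians = initialize_from_sfm(sfm_points)

    optimizer = torch.optim.Adam(gaussians.parameters(), lr=0.001)

    for iteration in range(30000):
        # Pick a random training view
        idx = random.randrange(len(cameras))
        camera, gt_image = cameras[idx], images[idx]

        # Forward rendering
        rendered = render_gaussians(gaussians, camera)

        # Loss: weighted sum of L1 and D-SSIM (1 - SSIM)
        loss = (1 - lambda_dssim) * l1_loss(rendered, gt_image) \
            + lambda_dssim * (1 - ssim(rendered, gt_image))

        # Backpropagation (clear stale gradients first)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Adaptive density control (the key step)
        if iteration % 100 == 0:
            densify_and_prune(gaussians)  # split / clone / remove Gaussians

    return gaussians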

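18.2.4 Installation and Usage#

Requirements:

CUDA 11.0+
Python 3.8+
PyTorch 2.0+

Installation:

# Clone the repository (note the recursive clone)
git clone https://github.com/graphdeco-inria/gaussian-splatting --recursive
cd gaussian-splatting

# Create the conda environment (the repository ships environment.yml rather than requirements.txt)
conda env create --file environment.yml
conda activate gaussian_splatting

# If you manage the environment yourself, build the CUDA submodules manually
pip install submodules/diff-gaussian-rasterization
pip install submodules/simple-knn

Data preparation:

# convert.py runs COLMAP for you
# Requirement: the raw images must be placed under ./data/my_scene/input/

python convert.py -s ./data/my_scene

Training:

# Basic training
python train.py -s ./data/my_scene

# Specify the output directory and iteration count
python train.py -s ./data/my_scene -m ./output/my_scene --iterations 30000

# Use the sparse Adam optimizer (2.7x speedup)
python train.py -s ./data/my_scene --optimizer_type sparse_adam

Rendering:

# Render the training views
python render.py -m ./output/my_scene

# Interactive viewer
# Requires the SIBR viewer to be installed
./SIBR_viewers/install/bin/SIBR_gaussianViewer_app -m ./output/my_scene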

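18.2.5 Using the Python API#

# Run from the root of the gaussian-splatting repository so that
# `scene` and `gaussian_renderer` are importable
from types import SimpleNamespace
import torch
from scene import Scene, GaussianModel
from gaussian_renderer import render
from torchvision.utils import save_image

# Load a trained model
gaussians = GaussianModel(3)  # sh_degree=3
gaussians.load_ply("output/my_scene/point_cloud/iteration_30000/point_cloud.ply")

# Rendering settings. The field names mirror the repository's PipelineParams and are
# an assumption here; newer versions may expect additional fields.
pipe = SimpleNamespace(convert_SHs_python=False, compute_cov3D_python=False, debug=False)
background = torch.tensor([0.0, 0.0, 0.0], dtype=torch.float32, device="cuda")

# Set up the camera (create_camera is a placeholder, e.g. reuse a camera loaded by Scene)
viewpoint_camera = create_camera(...)

# Render
with torch.no_grad():
    rendering = render(viewpoint_camera, gaussians, pipe, background)
image = rendering["render"]  # [3, H, W]

# Save the image
save_image(image, "rendered.png")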

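18.2.6 3DGS vs. NeRF#

Property	NeRF	3D Gaussian Splatting
Representation	Implicit (MLP)	Explicit (point cloud of Gaussians)
Rendering method	Volume rendering (ray sampling)	Rasterization (splatting)
Training time	Hours	Minutes (~30 min)
Rendering speed	Slow (~0.1 FPS)	Real time (100+ FPS)
Quality	High	Higher
Editability	Difficult	Easy (point-cloud operations)
Storage size	Small (~5MB)	Large (~100MB+)
Dynamic scenes	Requires extensions	Requires extensions

Recommendations:

Real-time rendering needed → 3DGS
Limited storage → NeRF
Scene editing needed → 3DGS
Research or learning → try both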
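
The storage gap in the table follows from simple arithmetic on the per-Gaussian parameters. The sketch below counts them under the parameterization described in 18.2.1 (position, scale, rotation quaternion, opacity, and RGB spherical harmonics up to a given degree) and assumes uncompressed float32 storage. The Gaussian counts are hypothetical, and real files may store extra fields.

```python
def gaussian_param_count(sh_degree: int = 3) -> int:
    """Number of scalar parameters per Gaussian."""
    sh_coeffs = 3 * (sh_degree + 1) ** 2  # (degree + 1)^2 coefficients per color channel
    position, scale, rotation, opacity = 3, 3, 4, 1
    return position + scale + rotation + opacity + sh_coeffs


def estimate_size_mb(num_gaussians: int, sh_degree: int = 3, bytes_per_value: int = 4) -> float:
    """Approximate uncompressed size in megabytes (1 MB = 1e6 bytes)."""
    return num_gaussians * gaussian_param_count(sh_degree) * bytes_per_value / 1e6


if __name__ == "__main__":
    print("parameters per Gaussian:", gaussian_param_count(3))
    for n in (500_000, 1_000_000, 3_000_000):  # hypothetical scene sizes
        print(f"{n:>9,} Gaussians -> {estimate_size_mb(n):.1f} MB")
```

Most of the footprint comes from the higher-order SH coefficients, so lowering the SH degree or quantizing values is the usual first step when exporting for the web.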

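18.2.7 3DGS Extensions and Applications#

1. Dynamic 3DGS (dynamic scenes):

# Illustrative sketch: each Gaussian gets time-dependent attributes
# (DeformationNetwork is a placeholder for a small network that predicts offsets)
class DynamicGaussian:
    def __init__(self):
        self.position = nn.Parameter(...)  # canonical position
        self.deformation = DeformationNetwork()  # deformation network

    def get_position(self, time):
        # Position at the given time = canonical position + predicted offset
        delta = self.deformation(self.position, time)
        return self.position + delta

2. SuGaR (mesh extraction): extracts an editable mesh from a 3DGS scene:

# Install SuGaR
git clone https://github.com/Anttwo/SuGaR
cd SuGaR

# Train and extract a mesh (flags differ between SuGaR versions; check the repository README)
python train.py -s ./data/my_scene -r "density" --export_obj

3. GaussianEditor (scene editing):

# Pseudocode: the function and method names below are illustrative, not the tool's actual API
# Remove the Gaussians in a selected region
mask = create_mask_from_text("remove the car")
gaussians.prune_by_mask(mask)

# Duplicate Gaussians
new_gaussians = gaussians.clone()
new_gaussians.translate([1, 0, 0])  # move the copy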

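18.3 Video Understanding#

Video understanding extends VLM capabilities to temporal data.

18.3.1 Video Understanding Tasks#

Task	Description	Output
Video classification	Identify the category of a video	Class label
Action recognition	Identify human actions	Action class
Temporal action detection	Locate the start and end time of actions	Time span + class
Video captioning	Generate a description of the video	Text
Video question answering	Answer questions about the video	Text
Video summarization	Extract key segments	Video clips

18.3.2 VideoMAE: Self-Supervised Video Learning#

VideoMAE is the video counterpart of MAE; it learns video representations through masked autoencoding:

Installation:

pip install transformers decord av

Video classification: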

from transformers import VideoMAEForVideoClassification, VideoMAEImageProcessor
import torch
import numpy as np
import av

def read_video_pyav(video_path, num_frames=16):
    """使用PyAV读取视频帧"""
    container = av.open(video_path)
    stream = container.streams.video[0]

    # 计算采样间隔
    total_frames = stream.frames
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    frames = []
    for i, frame in enumerate(container.decode(video=0)):
        if i in indices:
            frames.append(frame.to_ndarray(format="rgb24"))
        if len(frames) >= num_frames:
            break

    return np.stack(frames)  # [T, H, W, C]

# 加载模型
model_name = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(model_name)
model = VideoMAEForVideoClassification.from_pretrained(model_name)

# 读取视频
video = read_video_pyav("video.mp4", num_frames=16)

# 预处理
inputs = processor(list(video), return_tensors="pt")

# 推理
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# 获取预测类别
predicted_class_idx = logits.argmax(-1).item()
print(f"预测类别: {model.config.id2label[predicted_class_idx]}")

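VideoMAE performance (top-1 accuracy):

Model	Kinetics-400	Something-Something V2
videomae-base	81.5%	70.8%
videomae-large	85.2%	75.3%
videomae-huge	86.6%	77.4%

18.3.3 TimeSformer: Space-Time Transformer#

TimeSformer extends ViT to video using divided space-time attention, in which temporal attention and spatial attention are applied as separate steps: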

from transformers import TimesformerModel, AutoImageProcessor
import torch

# 加载模型
processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
model = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

# 准备视频(假设已读取为numpy数组)
# video: [T, H, W, C], T通常为8或16帧
video_frames = read_video_pyav("video.mp4", num_frames=8)

# 预处理
inputs = processor(list(video_frames), return_tensors="pt")

# 特征提取
with torch.no_grad():
    outputs = model(**inputs)
    features = outputs.last_hidden_state  # [B, T*patches+1, D]

# 使用CLS token作为视频表示
video_embedding = features[:, 0]  # [B, D]
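
Why TimeSformer separates temporal and spatial attention can be seen by counting how many keys each query patch is compared against. The sketch below uses the per-patch counts given in the TimeSformer paper: N·F + 1 for joint space-time attention and N + F + 2 for divided attention, where N is the number of patches per frame and F the number of frames. The 224-pixel input and 16-pixel patch size are assumed typical settings, and the function names are hypothetical.

```python
def patches_per_frame(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping patches in one frame."""
    return (image_size // patch_size) ** 2


def joint_comparisons(n_patches: int, n_frames: int) -> int:
    """Keys per query when every patch attends to all patches in all frames (+1 for CLS)."""
    return n_patches * n_frames + 1


def divided_comparisons(n_patches: int, n_frames: int) -> int:
    """Keys per query with temporal attention followed by spatial attention (+1 CLS each)."""
    return n_patches + n_frames + 2


if __name__ == "__main__":
    n = patches_per_frame()
    for frames in (8, 16, 96):
        joint = joint_comparisons(n, frames)
        divided = divided_comparisons(n, frames)
        print(f"F={frames:>2}: joint={joint:>6}  divided={divided:>4}  ratio={joint / divided:.1f}x")
```

The gap widens as clips get longer, which is what makes longer inputs practical for the divided variant.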

18.3.4 Video-LLaVA: A Video-Language Model#

Video-LLaVA is the video version of LLaVA and supports video question answering and captioning:

Installation:

pip install transformers accelerate

Usage example:

from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration
import torch
import numpy as np
import av

def read_video_for_llava(video_path, num_frames=8):
    """为Video-LLaVA读取视频"""
    container = av.open(video_path)
    stream = container.streams.video[0]
    total_frames = stream.frames

    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    frames = []
    container.seek(0)
    for i, frame in enumerate(container.decode(video=0)):
        if i in indices:
            frames.append(frame.to_ndarray(format="rgb24"))
        if len(frames) >= num_frames:
            break

    return np.stack(frames)

# 加载模型
model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 读取视频
video = read_video_for_llava("demo.mp4", num_frames=8)

# 构建对话
prompt = "USER: <video>请详细描述这个视频的内容。 ASSISTANT:"

# 处理输入
inputs = processor(
    text=prompt,
    videos=video,
    return_tensors="pt"
).to(model.device, torch.float16)

# 生成
output = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7
)

# 解码
response = processor.decode(output[0], skip_special_tokens=True)
print(response.split("ASSISTANT:")[-1].strip())

视频问答:

def video_qa(video_path, question):
    """视频问答"""
    video = read_video_for_llava(video_path, num_frames=8)
    prompt = f"USER: <video>{question} ASSISTANT:"

    inputs = processor(
        text=prompt,
        videos=video,
        return_tensors="pt"
    ).to(model.device, torch.float16)

    output = model.generate(**inputs, max_new_tokens=128)
    response = processor.decode(output[0], skip_special_tokens=True)
    return response.split("ASSISTANT:")[-1].strip()

# 使用
answer = video_qa("cooking.mp4", "视频中的人在做什么菜?用了哪些食材?")
print(answer)

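18.3.5 Best Practices for Video Analysis#

1. Frame sampling strategies:

import cv2
import numpy as np

def uniform_sample(video_path, num_frames=16):
    """Uniform sampling: the most common choice, suitable for most tasks.
    Frames are returned in OpenCV's BGR channel order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(np.linspace(0, max(total - 1, 0), num_frames, dtype=int).tolist())
    frames, i = [], 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if i in wanted:
            frames.append(frame)
        i += 1
    cap.release()
    return frames

def keyframe_sample(video_path, num_frames=16, threshold=30.0):
    """Frame-difference sampling: suited to fast-changing video.
    threshold is the mean absolute pixel difference (0-255 scale); tune it per dataset."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    prev_frame = None

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if prev_frame is None:
            frames.append(frame)  # always keep the first frame
        elif cv2.absdiff(frame, prev_frame).mean() > threshold:
            frames.append(frame)  # large change from the previous frame

        prev_frame = frame

    cap.release()
    # Thin the candidates down to at most num_frames, evenly spaced
    if len(frames) > num_frames:
        keep = np.linspace(0, len(frames) - 1, num_frames, dtype=int)
        frames = [frames[k] for k in keep]
    return frames

def scene_based_sample(video_path, num_frames=16):
    """Scene-cut sampling: suited to videos that contain multiple scenes"""
    # Left unimplemented here; a shot-boundary detector would supply the cut points
    raise NotImplementedError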

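2. Long-video processing:

def process_long_video(video_path, chunk_duration=30):
    """Process a long video in fixed-length chunks.
    extract_chunk, analyze_video_chunk and merge_results are placeholders
    to be supplied by the application."""
    import av
    with av.open(video_path) as container:
        if container.duration is None:
            raise ValueError("container does not report a duration")
        duration = container.duration / av.time_base  # container.duration is in microseconds

    results = []
    for start in range(0, int(duration), chunk_duration):
        end = min(start + chunk_duration, duration)
        chunk = extract_chunk(video_path, start, end)

        # Analyze each chunk
        chunk_result = analyze_video_chunk(chunk)
        results.append({
            "start": start,
            "end": end,
            "result": chunk_result
        })

    # Merge the per-chunk results
    return merge_results(results)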

3. Batched inference:

def batch_video_inference(video_paths, batch_size=4):
    """Batched video inference.
    Assumes processor, model and device are already set up as in the sections above;
    read_video and process_outputs are placeholders."""
    results = []

    for i in range(0, len(video_paths), batch_size):
        batch_paths = video_paths[i:i + batch_size]
        batch_videos = [read_video(p) for p in batch_paths]

        # Preprocess the whole batch at once
        inputs = processor(
            videos=batch_videos,
            return_tensors="pt",
            padding=True
        ).to(device)

        with torch.no_grad():
            outputs = model(**inputs)

        results.extend(process_outputs(outputs))

    return results

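18.4 Hands-on: A 3D Reconstruction Project#

18.4.1 End-to-End 3D Reconstruction Pipeline#

Phone capture → image preprocessing → SfM pose estimation → 3DGS/NeRF training → rendering/export

18.4.2 Data Capture Guide#

Shooting tips:

Full coverage: move around the object or scene and cover every angle
High overlap: keep at least 70% overlap between adjacent images
Consistent lighting: avoid hard shadows and specular highlights
Sharp and steady: avoid motion blur
Moderate count: 50-150 images works well

Using a phone:

# Export directly with Record3D (iOS)
# Supports LiDAR depth data

# Or use an app such as Polycam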

18.4.3 Running SfM with COLMAP#

# Install COLMAP
# macOS
brew install colmap

# Ubuntu
sudo apt install colmap

# Or download a prebuilt release

Automatic pipeline:

# Automatic reconstruction
colmap automatic_reconstructor \
    --workspace_path ./workspace \
    --image_path ./images \
    --camera_model OPENCV \
    --single_camera 1

Step-by-step pipeline (more control):

# 1. Feature extraction
colmap feature_extractor \
    --database_path ./database.db \
    --image_path ./images \
    --ImageReader.camera_model OPENCV \
    --ImageReader.single_camera 1

# 2. Feature matching
colmap exhaustive_matcher \
    --database_path ./database.db

# 3. Sparse reconstruction
mkdir -p sparse
colmap mapper \
    --database_path ./database.db \
    --image_path ./images \
    --output_path ./sparse

# 4. Export as text (the output directory must exist before the call)
mkdir -p sparse_txt
colmap model_converter \
    --input_path ./sparse/0 \
    --output_path ./sparse_txt \
    --output_type TXT

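18.4.4 3DGS Training Script#

# train_3dgs.py
# Run from the root of the gaussian-splatting repository
# (the script calls its convert.py, train.py and render.py).
import argparse
import glob
import os
import shutil
import subprocess

def prepare_data(image_folder, data_path):
    """Copy the images to <data_path>/input and run the repository's COLMAP wrapper"""
    input_dir = os.path.join(data_path, "input")
    os.makedirs(input_dir, exist_ok=True)
    for path in glob.glob(os.path.join(image_folder, "*")):
        if os.path.isfile(path):
            shutil.copy(path, input_dir)

    # convert.py runs feature extraction, matching, mapping and undistortion,
    # producing the images/ and sparse/0 layout that train.py expects
    subprocess.run(["python", "convert.py", "-s", data_path], check=True)

def train_gaussians(data_path, output_path, iterations=30000):
    """Train 3D Gaussian Splatting"""
    save_iterations = sorted({7000, 15000, iterations})
    subprocess.run([
        "python", "train.py",
        "-s", data_path,
        "-m", output_path,
        "--iterations", str(iterations),
        "--save_iterations", *[str(it) for it in save_iterations]
    ], check=True)

def render_video(model_path, output_video, iterations=30000, fps=30):
    """Render the training views and assemble them into a video with ffmpeg"""
    # Skip only the test split; skipping both splits would render nothing
    subprocess.run(["python", "render.py", "-m", model_path, "--skip_test"], check=True)

    # Assumed output layout of render.py: <model>/train/ours_<iteration>/renders/00000.png, ...
    frames = os.path.join(model_path, "train", f"ours_{iterations}", "renders", "%05d.png")
    subprocess.run([
        "ffmpeg", "-y", "-framerate", str(fps), "-i", frames,
        "-vf", "pad=ceil(iw/2)*2:ceil(ih/2)*2",  # even dimensions, required by yuv420p
        "-pix_fmt", "yuv420p", output_video
    ], check=True)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--images", required=True, help="path to the image folder")
    parser.add_argument("--output", required=True, help="output path")
    parser.add_argument("--iterations", type=int, default=30000)
    args = parser.parse_args()

    # 1. Prepare the data
    print("Step 1: preparing data...")
    data_path = os.path.join(args.output, "data")
    prepare_data(args.images, data_path)

    # 2. Train
    print("Step 2: training 3DGS...")
    model_path = os.path.join(args.output, "model")
    train_gaussians(data_path, model_path, args.iterations)

    # 3. Render
    print("Step 3: rendering results...")
    render_video(model_path, os.path.join(args.output, "video.mp4"), args.iterations)

    print(f"Done! Results saved to: {args.output}")

if __name__ == "__main__":
    main()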

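18.4.5 Web Visualization#

Displaying a 3DGS result with Three.js (as a plain point cloud, without splat rendering):

<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>3D Gaussian Splatting Viewer</title>
    <!-- r147 is pinned because later three.js releases no longer ship the global-script files under examples/js -->
    <script src="https://cdn.jsdelivr.net/npm/three@0.147.0/build/three.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/three@0.147.0/examples/js/controls/OrbitControls.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/three@0.147.0/examples/js/loaders/PLYLoader.js"></script>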
</head>
<body>
    <canvas id="canvas"></canvas>
    <script>
        // Initialize the scene
        const scene = new THREE.Scene();
        const camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 0.1, 1000);
        const renderer = new THREE.WebGLRenderer({canvas: document.getElementById('canvas')});
        renderer.setSize(window.innerWidth, window.innerHeight);

        // Controls
        const controls = new THREE.OrbitControls(camera, renderer.domElement);

        // Load the PLY as a point cloud (simplified: Gaussian shape and opacity are ignored)
        const loader = new THREE.PLYLoader();
        loader.load('point_cloud.ply', function(geometry) {
            // 3DGS PLY files store color as SH coefficients rather than red/green/blue,
            // so enable vertex colors only when the loader actually found a color attribute
            const material = new THREE.PointsMaterial({
                size: 0.01,
                vertexColors: geometry.hasAttribute('color')
            });
            const points = new THREE.Points(geometry, material);
            scene.add(points);
        });

        camera.position.z = 5;

        // Render loop
        function animate() {
            requestAnimationFrame(animate);
            controls.update();
            renderer.render(scene, camera);
        }
        animate();
    </script>
</body>
</html>

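Dedicated viewers:

SuperSplat - online editor
Luma AI - online viewing
SIBR Viewer - the official viewer

18.4.6 Export and Deployment#

Exporting to a web format:

# Export to the compact .splat layout used by some web viewers
def export_splat(gaussians, output_path):
    """Export to a web-friendly binary format.
    Assumed record layout (32 bytes per Gaussian): position 3 x float32, scale 3 x float32,
    RGBA 4 x uint8, rotation quaternion 4 x uint8. Check it against the viewer you target."""
    import numpy as np

    SH_C0 = 0.28209479177387814  # zeroth-order spherical harmonic constant

    # Extract the Gaussian attributes
    positions = gaussians.get_xyz.detach().cpu().numpy().astype(np.float32)
    features = gaussians.get_features.detach().cpu().numpy()  # [N, num_sh_coeffs, 3]
    opacities = gaussians.get_opacity.detach().cpu().numpy().reshape(-1)
    scales = gaussians.get_scaling.detach().cpu().numpy().astype(np.float32)
    rotations = gaussians.get_rotation.detach().cpu().numpy()

    # Base color from the degree-0 SH coefficient; view-dependent terms are dropped
    rgb = np.clip(0.5 + SH_C0 * features[:, 0, :], 0.0, 1.0)
    rgba = np.concatenate([rgb, opacities[:, None]], axis=1)
    rgba = (np.clip(rgba, 0.0, 1.0) * 255).astype(np.uint8)

    # Map unit-quaternion components from [-1, 1] to uint8
    rotations = rotations / np.linalg.norm(rotations, axis=1, keepdims=True)
    rot_u8 = np.clip(rotations * 128 + 128, 0, 255).astype(np.uint8)

    # Most opaque Gaussians first
    order = np.argsort(-opacities)

    # Write the binary file
    with open(output_path, 'wb') as f:
        for idx in order:
            f.write(positions[idx].tobytes())  # position (float32 x 3)
            f.write(scales[idx].tobytes())     # scale (float32 x 3)
            f.write(rgba[idx].tobytes())       # color + opacity (uint8 x 4)
            f.write(rot_u8[idx].tobytes())     # rotation (uint8 x 4)

    print(f"Export complete: {output_path}")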

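Chapter Summary#

Key Takeaways#

NeRF: implicit 3D representation, volume rendering, Instant-NGP acceleration
3DGS: explicit Gaussian representation, real-time rendering, editable
VideoMAE: self-supervised video learning via masked autoencoding
Video-LLaVA: video-language model for question answering and captioning

Technology Comparison#

Technique	Strengths	Weaknesses	Best suited for
NeRF	High quality, small storage	Slow training, slow rendering	High-quality reconstruction
3DGS	Real-time rendering, editable	Large storage	Real-time applications
VideoMAE	Self-supervised pretraining	Classification only	Video classification
Video-LLaVA	Video understanding	Heavy resource requirements	Video question answering

Practical Advice#

3D reconstruction:
- Beginners should start with Nerfstudio
- Use 3DGS when real-time rendering is needed
- Pay attention to capture quality
Video understanding:
- Use VideoMAE for classification
- Use Video-LLaVA for question answering and captioning
- Process long videos in chunks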

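Part 7 Summary#

What You Have Learned#

After working through this part, you should be familiar with:

Multimodal foundation models: how CLIP, BLIP and LLaVA work and how to use them
Frontier VLMs: calling Florence-2, GPT-4o and Gemini
3D vision: the principles and practice of NeRF and 3DGS
Video understanding: applying VideoMAE and Video-LLaVA

Technology Stack at a Glance#

Image understanding: CLIP → BLIP → LLaVA → GPT-4o/Gemini
3D reconstruction:   COLMAP → NeRF/3DGS → rendering/export
Video understanding: VideoMAE → Video-LLaVA → Gemini Video

Next Steps#

Study VLM applications in specific domains (medical imaging, autonomous driving)
Explore multimodal generation (image and video generation)
Follow developments in embodied AI
Continue with Part 8: Production Practice and Engineering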