Technical discussion of GPT-4o image generation (are autoregressive models making a comeback?)

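A distinctive trait of GPT-4o image generation is that the image sharpens gradually from top to bottom while it is being generated (and this is not just a display trick).

If diffusion were used instead, the process would probably look different: the whole image would sharpen from noise at once rather than row by row.

What is known, though, is that GPT-4o image generation has (apparently) moved to an autoregressive model plus transformer.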
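A minimal sketch of why raster-order autoregressive decoding would give a top-to-bottom reveal. This is a toy under stated assumptions, not GPT-4o's actual method: the function names, the grid size, and the random "model" are all hypothetical stand-ins for a real transformer over image tokens.

```python
import random

def sample_raster_order(height, width, vocab_size, steps, seed=0):
    """Fill a token grid left-to-right, top-to-bottom, stopping after `steps` tokens."""
    rng = random.Random(seed)
    grid = [[None] * width for _ in range(height)]
    prev = 0
    for i in range(min(steps, height * width)):
        row, col = divmod(i, width)
        # Hypothetical stand-in for a transformer's next-token prediction:
        # the new token depends on the previously generated one.
        prev = (prev + rng.randrange(vocab_size)) % vocab_size
        grid[row][col] = prev
    return grid

def filled_rows(grid):
    """Report, per row, whether every token in that row has been generated."""
    return [all(token is not None for token in row) for row in grid]

# Stop halfway through a 4x4 grid: only the top rows exist so far.
partial = sample_raster_order(height=4, width=4, vocab_size=16, steps=8)
print(filled_rows(partial))
```

Halfway through sampling, the upper rows are complete and the lower rows are still empty, which matches the reveal pattern described above if intermediate results are decoded and shown to the user.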

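People outside China have also been speculating about the technology behind GPT-4o, but without reaching a conclusion (most agree that it has moved to an AR model).

Are autoregressive models about to beat diffusion and become useful again in the multimodal field?

Also, the open-source community does not seem to have made much of a move yet. In China, ByteDance has explored AR image generation quite a lot (and has published quite a few papers).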

6 replies • 2025-04-10 21:25:07 +08:00

mxT52CRuqR6o5

3 days ago

I saw an analysis claiming it's purely a front-end effect.

zzq825924

3 days ago

@mxT52CRuqR6o5 It's not a front-end effect. While it refreshes, the image URL keeps changing.

lthero

3 days ago

@mxT52CRuqR6o5 #1 There is a front-end effect, but the image itself also changes (possibly 4 images were sent in total).

kneo

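3 days ago

I don't quite understand this technique. By common sense, the top and bottom of an image have no logical dependency on each other. Whether it goes top to bottom or bottom to top should just be a parameter, shouldn't it?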
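One hedged way to frame the ordering question: the chain rule lets a joint distribution over N image tokens be factorized along any permutation σ of the token positions. The symbols N and σ are notation introduced here for illustration, not something from the thread.

```latex
p(x_1, \dots, x_N) = \prod_{i=1}^{N} p\left(x_{\sigma(i)} \mid x_{\sigma(1)}, \dots, x_{\sigma(i-1)}\right)
```

Every ordering is mathematically valid, so top to bottom is a convention rather than a logical dependency. A model trained on one ordering only learns that ordering's conditionals, though, so reversing the direction would usually mean retraining rather than flipping an inference-time parameter.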

halberd

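3 days ago

There is diffusion; it's a hybrid architecture. ClosedAI no longer even publishes technical reports, but its last shred of conscience shows in this generated sample image: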

https://imgur.com/a/YGzxVIp

The prompt given on the official site:
```
A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a tshirt wiith a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.

The text reads:

(left)
"Transfer between Modalities:

Suppose we directly model
p(text, pixels, sound) [equation]
with one big autoregressive transformer.

Pros:
* image generation augmented with vast world knowledge
* next-level text rendering
* native in-context learning
* unified post-training stack

Cons:
* varying bit-rate across modalities
* compute not adaptive"

(Right)
"Fixes:
* model compressed representations
* compose autoregressive prior with a powerful decoder"

On the bottom right of the board, she draws a diagram:
"tokens -> [transformer] -> [diffusion] -> pixels"
```

A paper has already offered detailed speculation:
https://arxiv.org/abs/2504.02782
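The "tokens -> [transformer] -> [diffusion] -> pixels" diagram from the prompt can be sketched as a two-stage toy pipeline. This is only an illustration of the shape of such a hybrid, assuming one pixel per latent token; both functions and all their parameters are hypothetical, and nothing here reflects OpenAI's real architecture.

```python
import random

def autoregressive_prior(prompt_tokens, num_latents, vocab_size, seed=0):
    """Toy stand-in for the transformer: emit compressed latent tokens one at a time."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    latents = []
    for _ in range(num_latents):
        # Hypothetical rule: the next token depends on the whole context so far.
        nxt = (sum(context) + rng.randrange(vocab_size)) % vocab_size
        latents.append(nxt)
        context.append(nxt)
    return latents

def diffusion_like_decoder(latents, vocab_size, num_steps=10, seed=0):
    """Toy stand-in for the diffusion decoder: start from noise and refine
    every pixel at once, conditioned on the latent tokens."""
    rng = random.Random(seed)
    targets = [t / (vocab_size - 1) for t in latents]  # conditioning signal in [0, 1]
    pixels = [rng.random() for _ in latents]           # pure noise to begin with
    for step in range(num_steps):
        remaining = num_steps - step
        # Close 1/remaining of the gap to the target; the final step closes it fully.
        pixels = [p + (t - p) / remaining for p, t in zip(pixels, targets)]
    return pixels

latents = autoregressive_prior([3, 1, 4], num_latents=6, vocab_size=16)
pixels = diffusion_like_decoder(latents, vocab_size=16)
print(latents)
print([round(p, 3) for p in pixels])
```

The split mirrors the "Fixes" listed on the whiteboard: the sequential prior works on a short sequence of compressed tokens, while the decoder refines all pixels in parallel.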

lthero

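3 days ago

@halberd #5 Thanks, I came across this paper this afternoon too and just finished reading it. "tokens -> [transformer] -> [diffusion] -> pixels" counts as an easter egg left by the developers.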