← Back to context

Comment by whilefalse

14 hours ago

So I’m not an expert, this post was just based on my understanding, but as I understand it: the prompt embedding space and the latent image space are different “spaces”, so there is no single “point” in the latent image space that represents a given prompt. There are regions that are more or less consistent with the prompt, and due to cross-attention between the text embedding vector and the latent image vector, it’s able to guide the diffusion process in a suitable direction.

So different seeds lead to slightly different end points, because you’re just moving closer to the “consistent region” at each step, but approaching from a different angle.