
Comment by Genego

7 days ago

No, I am using my own workflows and software for this. I made nano-banana accept my bounding boxes. Everything is possible with some good prompting: https://edwin.genego.io/blog/lpa-studio < there are some videos of an earlier version there where I am editing a story. Either send the coordinates and describe the location well, or draw the bounding box onto the image and tell it to return the image without the drawn box and with only the requested changes.
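
Roughly, the boxed-region trick looks like this (a simplified sketch, not my production code; the Replicate model slug and input field names are assumptions and may differ for your setup):

```python
# Sketch: draw a red box on the image, then ask the model to apply an edit
# inside it and return the image without the drawn box.
import replicate
from PIL import Image, ImageDraw

def edit_inside_box(path: str, box: tuple[int, int, int, int], change: str):
    # Overlay the bounding box so the model can "see" the region to edit.
    img = Image.open(path).convert("RGB")
    ImageDraw.Draw(img).rectangle(box, outline="red", width=6)
    boxed_path = "boxed.png"
    img.save(boxed_path)

    prompt = (
        f"Inside the red bounding box, {change}. "
        "Return the image without the drawn red box and with no other changes."
    )
    # Model slug and input keys are assumptions; adjust to the model you use.
    output = replicate.run(
        "google/nano-banana",
        input={"prompt": prompt, "image_input": [open(boxed_path, "rb")]},
    )
    return output  # edited image reference, as returned by the model
```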

It also works well if you draw a bounding box on the original image, ask Claude for a meta-prompt that deconstructs the changes into a much more detailed prompt, and then send the original image (without the boxes) along with that prompt. It really depends on the changes you need and how long you're willing to wait.
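
The meta-prompting step is just a vision call to Claude with the boxed image; something like this (a simplified sketch; the model ID and prompt wording are illustrative):

```python
# Sketch: turn a rough, box-annotated request into a detailed edit prompt,
# then send the ORIGINAL (un-boxed) image plus that prompt to the image model.
import base64
import anthropic

def build_meta_prompt(boxed_image_path: str, rough_request: str) -> str:
    with open(boxed_image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # any vision-capable Claude model
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": image_b64}},
                {"type": "text", "text": (
                    "The red box marks a region I want changed: " + rough_request + ". "
                    "Write one highly detailed image-editing prompt that describes the "
                    "location and the change precisely, without mentioning the box."
                )},
            ],
        }],
    )
    return response.content[0].text
```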

- normal image editing response: 12-14s

- image editing response with Claude meta-prompting: 20-25s

- image editing response with Claude meta-prompting, plus image deconstruction and prompt re-construction: 40-60s

(I use Replicate though, so the actual API may be much faster).

This way you can also move into new views of a scene by zooming the image in or out on a canvas with the same aspect ratio and asking it to generatively fill the white borders. So you can go from a tight inside shot to viewing the same scene from outside a house window, or from inside the car to outside the car.
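
The zoom-out part is just compositing the frame onto a white canvas of the same aspect ratio before sending the fill request; roughly (helper name and prompt wording are my own, illustrative choices):

```python
# Sketch: shrink the frame onto a white canvas of the same aspect ratio,
# then ask the model to generatively fill the white border.
from PIL import Image

def zoom_out_canvas(path: str, scale: float = 0.6) -> tuple[str, str]:
    img = Image.open(path).convert("RGB")
    w, h = img.size
    canvas = Image.new("RGB", (w, h), "white")            # same aspect ratio
    small = img.resize((int(w * scale), int(h * scale)))
    canvas.paste(small, ((w - small.width) // 2, (h - small.height) // 2))
    out_path = "zoomed_out.png"
    canvas.save(out_path)

    prompt = (
        "Generatively fill the white border so the camera appears to have pulled "
        "back from the original shot, revealing the surrounding scene. "
        "Keep the existing pixels unchanged."
    )
    return out_path, prompt  # feed these to the image-editing model as before
```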

Thanks, that makes sense. I'll have to give the "red bounding box overlay" a shot when there are many similar objects in the existing image.

I also have a custom pipeline/software that takes a given prompt, rewrites it into multiple variations using an LLM, sends them to multiple GenAI models, and then uses a VLM to evaluate the results for accuracy. It runs in an automated REPL style, so I can be relatively hands-off, though I do have a "max loop limiter" since I'd rather not spend the equivalent of a small country's GDP.
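
Stripped down, the loop looks something like this (a sketch with placeholder callables, not the real pipeline):

```python
from typing import Callable

def refine(
    prompt: str,
    rewrite_prompt: Callable[[str, int], list[str]],          # LLM: prompt -> n variations
    generate_image: Callable[[str], str],                      # GenAI model: prompt -> image ref
    score_with_vlm: Callable[[str, str], tuple[float, str]],   # VLM: (image, intent) -> (score, critique)
    target_score: float = 0.9,
    max_loops: int = 5,                                        # the "max loop limiter"
) -> str | None:
    best_image, best_score = None, 0.0
    critique = ""
    for _ in range(max_loops):
        for variant in rewrite_prompt(prompt, 3):
            image = generate_image(variant)
            score, critique = score_with_vlm(image, prompt)
            if score > best_score:
                best_image, best_score = image, score
        if best_score >= target_score:
            break
        # Fold the critique back into the prompt for the next round.
        prompt = f"{prompt}\n\nAddress this critique from the last round: {critique}"
    return best_image
```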

  • Automated generator-critique loops for evaluation may be really useful for creating your own style libraries, because it's easy for an LLM agent to evaluate how close an image is to a reference style or scene (a rough sketch of such a check is below). You end up with a series of base prompts and can then replicate that style across a whole franchise of stories. Most people still do it with reference images, and that doesn't produce very stable results. If you do need some help with bounding boxes for nano-banana, feel free to send me a message!
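
For the style-library case, that check can be as simple as asking a vision model to score a candidate against a written style reference; a rough sketch (assumed model ID, and the 0-1 scale is just a convention):

```python
# Sketch: ask a vision-capable model how closely a candidate image matches a
# reference style description. Could serve as the score_with_vlm step above.
import base64
import anthropic

def style_similarity(image_path: str, style_reference: str) -> float:
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode()

    client = anthropic.Anthropic()
    reply = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID; any vision model works
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": data}},
                {"type": "text", "text": (
                    "Reference style: " + style_reference + "\n"
                    "On a scale from 0.0 to 1.0, how closely does this image match "
                    "the reference style? Reply with the number only."
                )},
            ],
        }],
    )
    return float(reply.content[0].text.strip())
```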

What framework are you using to generate your documentation? It looks amazing.

  • I am using Django and plain HTML (with AlpineJS & HTMX for the JS). Each page is created from scratch rather than from a CMS or template; I use Claude Code for that (with mem0.ai as an MCP server) and build my entire development workspace and workflow around / into my website.