Comment by Genego
8 days ago
I have been generating a few dozen images per day for storyboarding purposes. The more I try to perfect it, the easier it becomes to control these outputs and even keep the entire visual story as well as its characters consistent over a few dozen different scenes; while even controlling the time of day throughout the story. I am currently working with 7-layer prompts to control for environment, camera, subject, composition, light, colors and overall quality (it might be overkill, but it’s also experimenting).
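A minimal sketch of how such a 7-layer prompt could be assembled: the layer names come from the description above, but the example contents, the base/override split, and the merging logic are purely illustrative assumptions, not the actual setup.

```python
# Sketch of a 7-layer prompt: each layer owns one visual responsibility,
# and only a few layers change between scenes of the same story.
LAYERS = ["environment", "camera", "subject", "composition", "light", "colors", "quality"]

base_profile = {
    "environment": "coastal village at golden hour, light fog rolling in",
    "camera": "35mm lens, eye-level, shallow depth of field",
    "subject": "the same recurring protagonist in a worn green coat",
    "composition": "rule of thirds, subject on the left third",
    "light": "warm low sun, long soft shadows",
    "colors": "muted teal and amber palette",
    "quality": "highly detailed, clean render, no artifacts",
}

def build_prompt(profile: dict, overrides: dict) -> str:
    """Merge per-scene overrides into the base profile and flatten to one prompt."""
    merged = {**profile, **overrides}
    return "\n".join(f"{layer}: {merged[layer]}" for layer in LAYERS)

# Per scene only a few layers change; the untouched base layers keep the story consistent.
scene_2 = build_prompt(base_profile, {"environment": "same village, now at night, rain"})
```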
I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed, either just with a prompt or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API). It’s been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
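A rough sketch of the "let Claude write the fix prompt" step using the Anthropic Python SDK; the model name, file names, and prompt wording are placeholders, not the actual workflow.

```python
# Send the flawed (marked-up) image to Claude and ask it to write the edit prompt.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("scene_04_draft.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any vision-capable model works
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64",
                                         "media_type": "image/png",
                                         "data": image_b64}},
            {"type": "text",
             "text": "The area inside the red box is wrong (the character's hand). "
                     "Write a concise image-editing prompt that fixes only that region."},
        ],
    }],
)

fix_prompt = message.content[0].text  # feed this to the image model as the edit prompt
```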
Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.
We use nano banana extensively to build video storyboards, which we then turn into full motion video with a combination of img2vid models. It sounds like we're doing similar things, trying to keep images/characters/setting/style consistent across ~dozens of images (~minutes of video). You might like the product depending on what you're doing with the outputs! https://hypernatural.ai
The website lets you type in an entire prompt, then tells you to log in, then dumps your prompt and leaves you with nothing. Lame.
Got caught out by this just today. I only wanted to try it out because it appeared as if I could.
I noticed ChatGPT and others do exactly the same once you run out of anonymous usage. Insanely annoying.
5 replies →
Your "Dracula" character is possibly the least vampiric Dracula I've ever seen tbh
If anything, the ubiquity of AI has just revealed how many people have 0 taste. It also highlights the important role that these human-centred jobs were doing to keep these people from contributing to the surface of any artistic endeavour in "culture".
28 replies →
That looks exactly like the photos on a Spirit Halloween costume.
3 replies →
I agree. Bruhcula? Something like that. He's a vampire, but also models and does stunts for Baywatch - too much color and vitality. Joan of Arc is way more pale.
Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.
1 reply →
He looks like Dracula on LinkedIn
Having a Statue of Liberty character available is for some reason so funny to me.
2 replies →
[dead]
Yes we are definitely doing the same! For now I’m just familiarizing myself in this space technically and conceptually. https://edwin.genego.io/blog
I don't get how these tools are considered good when they can't even do something as simple as depicting this scene.
> I want to bring awareness to the dangers of dressing up like a seal while surfboarding (i.e. wearing black wetsuits, arms hanging over the board). Create a scene from the perspective of a shark looking up from the bottom of the ocean into a clear blue sky with silhouettes of a seal and a surfer and a fishing boat with a line dangling in the water, and show how the shark contemplates attacking all these objects because they look so similar.
I haven't found a model yet that can process that description, or any variation of it, into a scene that is usable and makes visual sense to anyone older than a 1st grader. They never place the seal, surfer, shark, or boat in the correct locations; typically everyone is underwater and the sizing of everything is wrong. You tell them the image is wrong, to place the person on top of the water, and they can't. Can someone please link to a model that is capable, or tell me what I am doing wrong? How can you claim to process words into images in a repeatable way when these systems can't deal with multiple constraints at once?
You'll have somewhat better luck if you fix the spelling errors.
https://lmarena.ai/c/019a84ec-db09-7f53-89b1-3b901d4dc6be
https://gemini.google.com/share/da93030f131b
Obviously neither are good but it is better.
I think image models could produce far more editable outputs if, e.g., they output multi-layer PSDs.
> The more I try to perfect it, the easier it becomes

I have the opposite experience: once it goes off track, it's nearly impossible to bring it back on message.
How much have you experimented with it? For some stories I may generate 5 image variations of 10-20 different scenes and then spend time writing down what worked and what did not, and running the generation again (this part is mostly for research). It's certainly advancing my understanding over time and helping me control the output better. But I'm learning that it takes a huge amount of trial and error. So versioning prompts is definitely recommended, especially if you find some nuances that work for you.
> I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed, either just with a prompt or by feeding them to Claude as an image and having it write the prompt to fix the issue for me (as a workflow on the API)
Are you talking about Automatic1111 / ComfyUI inpainting masks? Because Nano doesn't accept bounding boxes as part of its API unless you just stuffed the literal X/Y coordinates into the raw prompt.
You could do something where you draw a bounding box and, when you get the response back from Nano, mask that section back over the original image, using a decent upscaler as necessary in the event that Nano had to reduce the size of the original image down to ~1MP.
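One possible way to do that compositing step with PIL, assuming you kept the box coordinates; the file names and coordinates are illustrative, and a dedicated upscaler would do better than the plain resize used here.

```python
# Paste the edited region back over the untouched original so the rest of the
# image stays pixel-identical.
from PIL import Image

original = Image.open("original.png")
edited = Image.open("nano_output.png")

# If the model returned a smaller image, bring it back to the original size first.
if edited.size != original.size:
    edited = edited.resize(original.size, Image.LANCZOS)

box = (420, 310, 780, 650)  # x0, y0, x1, y1 of the region you asked to change
original.paste(edited.crop(box), box)
original.save("composited.png")
```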
No, I am using my own workflows and software for this. I made nano-banana accept my bounding boxes. Everything is possible with some good prompting: https://edwin.genego.io/blog/lpa-studio (there are some videos there of an earlier version while I am editing a story). Either send the coords and describe the location well, or draw the bounding box on the image and tell it to return the image without the drawn box and with only the requested changes.
It also works well if you draw a bb on the original image, then ask Claude for a meta-prompt to deconstruct the changes into a much more detailed prompt, and then send the original image without the bbs for changes. It really depends on the changes you need, and how long you're willing to wait.
- normal image editing response: 12-14s
- image editing response with Claude meta-prompting: 20-25s
- image editing response with Claude meta-prompting as well as image deconstructing and re-constructing the prompt: 40-60s
(I use Replicate though, so the actual API may be much faster).
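A sketch of what the "draw a red box, send it, ask for the box to be removed" edit call could look like via Replicate's Python client. The model slug and the input field names are assumptions and may differ from the actual nano-banana listing; coordinates and prompt text are illustrative.

```python
import replicate
from PIL import Image, ImageDraw

# Mark the region that needs fixing with a red box.
img = Image.open("scene_07.png")
draw = ImageDraw.Draw(img)
draw.rectangle((420, 310, 780, 650), outline="red", width=6)
img.save("scene_07_marked.png")

output = replicate.run(
    "google/nano-banana",  # assumed model slug
    input={
        "prompt": ("Inside the red box: give the character a normal left hand with "
                   "five fingers. Return the image without the red box and with no "
                   "other changes."),
        "image_input": [open("scene_07_marked.png", "rb")],  # field name is an assumption
    },
)
```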
This way you can also go into new views of a scene by zooming the image in and out on the same aspect-ratio canvas, and asking it to generatively fill the white borders around it. So you can go from a tight inside shot to viewing the same scene from outside of a house window, or from inside the car to outside the car.
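The canvas-padding step that precedes that "fill the white borders" request might look roughly like this; the scale factor and file names are illustrative assumptions.

```python
# Zoom-out trick: place the current frame on a larger white canvas of the same
# aspect ratio, then ask the model to generatively fill the white border
# (e.g. "show the same scene from outside the house window").
from PIL import Image

frame = Image.open("scene_07.png")
w, h = frame.size
scale = 0.6  # shrink the original to 60%, leaving a white border to be filled

canvas = Image.new("RGB", (w, h), "white")
small = frame.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
canvas.paste(small, ((w - small.width) // 2, (h - small.height) // 2))
canvas.save("scene_07_zoom_out.png")  # send this plus a "fill the white area" prompt
```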
Thanks, that makes sense. I'll have to give the "red bounding box overlay" a shot when there are a great deal of similar objects in the existing image.
I also have a custom pipeline/software that takes in a given prompt, rewrites it using an LLM into multiple variations, sends it to multiple GenAI models, and then uses a VLM to evaluate them for accuracy. It runs in an automated REPL style, so I can be relatively hands-off, though I do have a "max loop limiter" since I'd rather not spend the equivalent of a small country's GDP.
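Schematically, that loop could look like the sketch below. The three callables are hypothetical stand-ins for the real LLM, image-model, and VLM calls; the loop limit and target score are illustrative.

```python
from typing import Callable, Optional, Tuple

MAX_LOOPS = 5          # the "max loop limiter" keeps spend bounded
TARGET_SCORE = 0.85

def refine(
    prompt: str,
    models: list[str],
    rewrite_prompt: Callable[[str, int], list[str]],   # LLM: prompt -> n variations
    generate_image: Callable[[str, str], bytes],       # image model: (model, prompt) -> image
    score_image: Callable[[bytes, str], float],        # VLM: (image, prompt) -> accuracy 0..1
) -> Tuple[Optional[bytes], float]:
    best_image, best_score = None, 0.0
    for _ in range(MAX_LOOPS):
        for variant in rewrite_prompt(prompt, 3):      # fan out prompt variations
            for model in models:                       # fan out across GenAI models
                image = generate_image(model, variant)
                score = score_image(image, prompt)     # VLM judges accuracy vs. intent
                if score > best_score:
                    best_image, best_score = image, score
        if best_score >= TARGET_SCORE:                 # good enough, stop early
            break
        prompt = prompt + "\nThe previous attempts missed details; be more literal."
    return best_image, best_score
```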
1 reply →
What framework are you using to generate your documentation? It looks amazing.
1 reply →
You can literally just open the image up in Preview or whatever and add a red box, circle etc and then say "in the area with the red square make change foo" and it will normally get rid of the red box on the generated image. Whether or not it actually makes the change you want to see is another matter though. It's been very hit or miss for me.
Yeah I could see that being useful if there were a lot of similar elements in the same image.
I also had similar mixed results with Nano-banana, especially around asking it to “fix/restore” things (a character’s hand was an anatomical mess, for example).
That sounds intriguing. 7 layers - do you mean it's one prompt composed of 7 parts, like different paragraphs for each aspect? How do you send bounding box info to banana? Does it understand something like that? What does Claude add to that process? Makes your prompt more refined? Thanks
Yes, the prompt is composed of 7 different layers, where I group together coherent visual and temporal responsibilities. Depending on the scene, I usually only change 3-5 layers, but the base layers stay the same, so the scenes all appear within the same story universe and same style. If something feels off, or feels like it needs to be improved, I just adjust one layer after the other to experiment with the results on the entire story, but also at the individual scene level. Over time I have created quite a few 7-layer style profiles that work well and that I can cast onto different story universes. Keep in mind this is heavy experimentation; it may just be that there is a much easier way to do this, but I am seeing success with this. https://edwin.genego.io/blog/lpa-studio - at any point I may throw this all out and start over, depending on how well my understanding of this all develops.
Bounding boxes: I actually send an image with a red box around where the requested change is needed. And 8 out of 10 times it works well. But if it doesn't work, I use Claude to make the prompt more refined. The Claude API call that I make can see the image + the prompt, as well as understanding the layering system. This is one of the 3 ways I edit; there is another one where I just send the prompt to Claude without it looking at the image. Right now this all feels like dial-up, with a minimum of $0.035 per image generation ($0.0001 if I just use a LoRA though) and a minimum of 12-14 seconds wait on each edit/generation.
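A hypothetical dispatcher over the three edit paths described above; edit_call and refine_call stand in for the image-model and Claude API calls, and the latency notes are the rough figures quoted earlier in the thread.

```python
from typing import Callable, Optional

def edit_image(
    image_path: str,
    change: str,
    edit_call: Callable[[str, str], bytes],             # image-model edit call
    refine_call: Callable[[str, Optional[str]], str],   # Claude meta-prompt rewrite
    mode: str = "direct",
) -> bytes:
    if mode == "direct":   # ~12-14 s: raw edit prompt straight to the image model
        return edit_call(image_path, change)
    if mode == "meta":     # ~20-25 s: Claude rewrites the request first (text only)
        return edit_call(image_path, refine_call(change, None))
    # "deconstruct": ~40-60 s: Claude also inspects the image before rebuilding the prompt
    return edit_call(image_path, refine_call(change, image_path))
```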
This is beautiful and inspiring. This is exactly what we need right now: tools that empower artists and builders to leverage these novel technologies. Claude Code is a great example IMHO, and it's the tip of the iceberg - the future consists of a whole new world, a new mental model and set of constraints and capabilities, so different that I can't really imagine it.
Who would have thought that we'd reach this uncharted territory with so many opportunities for pioneering and innovation? Back in 2019 it felt like nothing was new under the sun; today it feels like there is a whole new world under the sun for us to explore!
1 reply →
> Once you do have good storyboards, you can easily do start-to-end GenAI video generation (hopping from scene to scene), bring them to life, and build your own small visual animated universes.
I keep hearing advocates of AI video generation talking at length about how easy the tools are to use and how great the results are, but I've yet to see anyone produce something meaningful that's coherent, consistent, and doesn't look like total slop.
Bots in the Hall. Neural Viz. The Meat Dept video for Igorrr's ADHD. More will come.
You need talented people to make good stuff, but at this time most of them still fear the new tools.
I watched the most popular and most recent videos of each channel to compare, and they were all awful:
> Bots in the Hall
* voices don't match the mouth movements
* mouth movements are poorly animated
* hand/body movements are "fuzzy" with weird artifacts
* characters stare in the wrong direction when talking
* characters never move
* no scenes over 3 seconds in length between cuts
> Neural Viz
* animations and backgrounds are dull
* mouth movements are uncanny
* "dead eyes" when showing any emotions
* text and icons are poorly rendered
> The Meat Dept video for Igorrr's ADHD
This one I can excuse a bit since it's a music video, and for the sake of "artistic interpretation", but:
* continuation issues between shots
* inconsistent visual style across shots
* no shots longer than 4 seconds between cuts
* rendered text is illegible/nonsensical
* movement artifacts
1 reply →
You'll have to wait for actual talented artists to start using these tools.
Almost every talented artist with a public presence who has spoken on AI art has spoken against its generation, the use of AI tools, and the harm it's causing to their communities. The few established artists who are proponents of AI art (Lioba Brueckner comes to mind) have a financial incentive to do so, since they sell tools or courses teaching others with less/no talent to do the same.
4 replies →
I don’t think that is the problem (as someone who has been described as being in that bracket); it’s the tooling and control that is missing. I believe that will be solved over time.
[flagged]
What is the point of bringing these silly comments that say nothing over from cheap news sites to Hacker News?
I'm sorry, but there are already tons of similarly "imported" comments here that disparage AI and AI artists and add no value to the discussion.
My intention was solely to support the parent in the face of prevalent general critique of what he dabbles in.