Comment by Karrot_Kream
1 day ago
What if you could pilot your video editing tool through voice? Have a multimodal LLM convert your instructions into some structured data instruction that gets used by the editor to perform actions.
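To make "structured data instruction" concrete, here is a minimal sketch of what such a command payload could look like; the action names and fields are invented for illustration and are not any real editor's API:

```python
# Hypothetical structured commands a multimodal LLM could emit for an
# editor to execute; the action vocabulary and parameters are invented
# for illustration, not taken from any real editing tool.
from typing import Literal, TypedDict

class EditCommand(TypedDict):
    action: Literal["seek", "zoom", "color_grade", "cut"]
    params: dict

# "zoom in on the top-left corner" might become:
zoom: EditCommand = {
    "action": "zoom",
    "params": {"factor": 2.0, "center": [0.25, 0.25]},
}

# "more blues and magentas, fewer yellows" might become:
grade: EditCommand = {
    "action": "color_grade",
    "params": {"blue": 0.3, "magenta": 0.2, "yellow": -0.3},
}
```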
Compare pinch zoom to the tedious scene in Bladerunner where Deckard is asking the computer to zoom in on a picture.
Zooming is a bad example (because pinch zoom is just so much better than that scene hah.) Instead "go back 5 frames, and change the color grading. Make the mood more pensive and bring out blues and magentas and fewer yellows and oranges." That's a lot faster than fiddling with 2-3 different sliders IMO.
> Zooming is a bad example (because pinch zoom is just so much better than that scene hah.) Instead "go back 5 frames, and change the color grading. Make the mood more pensive and bring out blues and magentas and fewer yellows and oranges." That's a lot faster than fiddling with 2-3 different sliders IMO.
Eh. That's not as good as being skilled enough to know exactly what you want and having the tools to make it happen.
There's something to be said for tools that give you the power to manipulate something efficiently yourself, rather than systems that do the manipulation for you.
Training LLMs to generate some internal command structure for a tool is conceptually similar to what we've done with them already, but the training data for it is essentially non-existent, and would be hard to generate.
My experience has been that generating structured output with zero-, one-, and few-shot prompts works quite well. We've used zero-shot prompting for structured output at $WORK and it's been good enough, and I've done few-shot prompting for some personal projects with solid results. JSON Schema-based enforcement of responses at temperature 0 works well. LLMs do sometimes hallucinate their responses, but keeping the output format tightly constrained (e.g. structured dicts of booleans) reduces hallucinations, and even when they occur, at temperature 0 they seem to stay under 0.1% of responses even with zero-shot prompting (at least with the datasets and prompts I've considered).
(Though yes, keep in mind that a 0.1% hallucination rate means 99.9% correctness, which is really not that high when we're talking about high-reliability use cases. For zero-shot prompting, though, that far exceeded my expectations.)
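As a sketch of the approach described above: the schema is illustrative, and `call_llm` is a hypothetical stand-in for whatever chat-completion client is in use, not a specific vendor's API:

```python
# JSON-Schema-constrained zero-shot prompting at temperature 0, along the
# lines described above. The schema (a small dict of booleans) keeps the
# output surface tightly constrained; validation rejects off-schema output.
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

SCHEMA = {
    "type": "object",
    "properties": {
        "is_spam": {"type": "boolean"},
        "contains_pii": {"type": "boolean"},
    },
    "required": ["is_spam", "contains_pii"],
    "additionalProperties": False,
}

PROMPT_HEADER = (
    "Classify the following message. Respond with JSON only, matching this "
    "JSON Schema exactly:\n" + json.dumps(SCHEMA) + "\n\nMessage: "
)

def classify(text: str, call_llm, max_retries: int = 2) -> dict:
    """Zero-shot classification; call_llm is a hypothetical client hook."""
    for _ in range(max_retries + 1):
        raw = call_llm(PROMPT_HEADER + text, temperature=0)
        try:
            result = json.loads(raw)
            validate(result, SCHEMA)  # raises ValidationError off-schema
            return result
        except (json.JSONDecodeError, ValidationError):
            continue  # malformed or hallucinated shape: retry
    raise RuntimeError("model kept returning off-schema output")
```

Constraining the schema to a few booleans with `additionalProperties: false` is what keeps the hallucination surface small; the validator catches the rare off-schema response instead of letting it flow downstream.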
Deckard. Blade Runner.