Comment by PaulDavisThe1st

15 hours ago

Tell me, how does doing any of the things you've suggested help with the huge range of computer-driven tasks that have nothing to do with language? Video editing, audio editing, music composition, architectural and mechanical design, the list is vast and nearly endless.

LLMs have no role to play in any of that, because their job is text generation. At best, they could generate excerpts from a half-imagined user manual ...

Because some LLMs are now multimodal—they can process and generate not just text, but also sound and visuals. In other words, they’re beginning to handle a broader range of human inputs and outputs, much like we do.

  • Those are not LLMs. They use the same foundational technology (pick what you like, but I'd say transformers) to accomplish tasks that require entirely different training data and architectures.

    I was specifically asking about LLMs because the comment I replied to only talked about LLMs - Large Language Models.

    • At this point in time calling a multimodal LLM an LLM is pretty uncontroversial. Most of the differences lie in the encoders and embedding projections. If anything I'd think MoE models are actually more different from a basic LLM than a multimodal LLM is from a regular LLM.

      Bottom line: when folks talk about LLM applications, multimodal LLMs, MoE LLMs, and even agents all fall under the same general umbrella.

Everything has to do with language! Language is a way of stating intention, of expressing something before it exists, of talking about goals and criteria. Every example you give can be described in language. You are caught up in the mechanisms of these tools, not the underlying intention.

You can describe your intention in any of these tools. And it can be whatever you want... maybe your intention in an audio editor is "I need to finish this before the deadline in the morning but I have no idea what the client wants" and that's valid, that's something an LLM can actually work with.

HOW the LLM is involved is an open question; it hasn't been done very well so far, and may not work well when bolted onto existing applications. But an LLM can make sense of events and images in addition to natural language text. You can give an LLM a timestamped list of UI events and it can infer quite a bit about what the user is actually doing. What does it do with that understanding? We're going to have to figure that out! These are exciting times!
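
As a concrete illustration of the UI-event idea, here is a minimal Python sketch. The event log, prompt, and model name are invented for illustration (no real editor exposes this format); the OpenAI Python SDK is used only as one example of an LLM backend.

```python
# Sketch: infer user intent from a timestamped UI event log.
# The event format and model name are hypothetical.
from openai import OpenAI

client = OpenAI()

# Hypothetical event log captured by an audio editor.
ui_events = [
    ("12:01:03", "open_project", {"name": "client_mix_v3"}),
    ("12:01:41", "select_region", {"track": "vocals", "start": "00:45", "end": "01:10"}),
    ("12:01:55", "apply_effect", {"effect": "de-esser"}),
    ("12:02:10", "undo", {}),
    ("12:02:31", "apply_effect", {"effect": "de-esser", "threshold": -28}),
]

event_text = "\n".join(f"{ts} {name} {args}" for ts, name, args in ui_events)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    temperature=0,
    messages=[
        {"role": "system",
         "content": "You observe a timestamped UI event log from an audio editor. "
                    "Infer what the user is trying to accomplish and where they seem stuck."},
        {"role": "user", "content": event_text},
    ],
)

print(resp.choices[0].message.content)
```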

What if you could pilot your video editing tool through voice? Have a multimodal LLM convert your instructions into structured commands that the editor then uses to perform actions.
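
To make the "structured commands" half of that concrete, here is a small sketch of what the editor-side contract might look like. The action names, fields, and dispatcher are hypothetical, not any real editor's API; the LLM's only job would be to emit JSON that parses into this shape.

```python
# Sketch: a hypothetical command format an editor could accept, plus a tiny dispatcher.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class EditorCommand:
    action: str                 # e.g. "seek_frames", "adjust_color_grade"
    params: Dict[str, object]   # action-specific parameters

def seek_frames(params: Dict[str, object]) -> None:
    print(f"seeking {params['delta']} frames")

def adjust_color_grade(params: Dict[str, object]) -> None:
    print(f"grading: {params}")

# The editor keeps full control over what each action actually does.
DISPATCH: Dict[str, Callable[[Dict[str, object]], None]] = {
    "seek_frames": seek_frames,
    "adjust_color_grade": adjust_color_grade,
}

def apply_command(cmd: EditorCommand) -> None:
    handler = DISPATCH.get(cmd.action)
    if handler is None:
        raise ValueError(f"unknown action: {cmd.action}")  # reject hallucinated actions
    handler(cmd.params)

# e.g. what a spoken "go back 5 frames and bring out the blues" might map to:
apply_command(EditorCommand("seek_frames", {"delta": -5}))
apply_command(EditorCommand("adjust_color_grade", {"blues": 0.2, "magentas": 0.15, "yellows": -0.1}))
```

Keeping the dispatch table on the editor side means a hallucinated or malformed action gets rejected instead of executed.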

  • Compare pinch zoom to the tedious scene in Blade Runner where Deckard asks the computer to zoom in on a picture.

    • Zooming is a bad example (because pinch zoom is just so much better than that scene, hah). Instead: "go back 5 frames, and change the color grading. Make the mood more pensive and bring out blues and magentas and fewer yellows and oranges." That's a lot faster than fiddling with 2-3 different sliders IMO.

  • Training LLMs to generate some internal command structure for a tool is conceptually similar to what we've done with them already, but the training data for it is essentially non-existent, and would be hard to generate.

    • My experience has been that generating structured output with zero-, one-, and few-shot prompts works quite well. We've used it at $WORK for zero-shot stuff and it's been good enough, and I've done few-shot prompting for some personal projects with solid results. JSON Schema based enforcement of responses at temperature 0 works quite well (sketch below). LLMs do sometimes hallucinate their responses, but keeping the output format fairly constrained (e.g. structured dicts of booleans) decreases hallucinations, and even when they do hallucinate, at temperature 0 it seems to stay within < 0.1% of responses, even with zero-shot prompting. (At least with the datasets and prompts I've considered.)

      (Though yes, keep in mind that 0.1% hallucination = 99.9% correctness which is really not that high when we're talking about high reliability things. With zero-shot that far exceeded my expectations though.)
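
      A minimal sketch of the setup described above, assuming a backend that supports schema-constrained responses (the OpenAI Python SDK's JSON Schema response format is used as the example); the schema, field names, and model name are illustrative only.

      ```python
      # Sketch: zero-shot structured output, JSON Schema enforced, temperature 0.
      import json
      from openai import OpenAI

      client = OpenAI()

      # Constraining the output to a small dict of booleans keeps hallucinations rare.
      schema = {
          "type": "object",
          "properties": {
              "mentions_deadline": {"type": "boolean"},
              "requests_revision": {"type": "boolean"},
              "is_approval": {"type": "boolean"},
          },
          "required": ["mentions_deadline", "requests_revision", "is_approval"],
          "additionalProperties": False,
      }

      resp = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder model name
          temperature=0,        # repeatable output; the rare hallucinations stay rare
          messages=[
              {"role": "system", "content": "Classify the client email. Respond only with the JSON object."},
              {"role": "user", "content": "Looks good overall, but can we get the new cut before Friday?"},
          ],
          response_format={
              "type": "json_schema",
              "json_schema": {"name": "email_flags", "schema": schema, "strict": True},
          },
      )

      flags = json.loads(resp.choices[0].message.content)
      print(flags)  # e.g. {'mentions_deadline': True, 'requests_revision': True, 'is_approval': False}
      ```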