
Comment by bigfishrunning

9 hours ago

Why would you want an LLM to fly a drone? Seems like the wrong tool for the job -- it's like saying "my power drill can pound roofing nails". Maybe that's true, but just get a hammer.

There are almost endless reasons why. It's like asking why you would want a self-driving car. A drone that could transport things, or patrol an area, would be amazing. LLMs can help with object identification, reacting to different events, and taking commands from users.

The first thought I had was those security guard robots that are popping up all over the place. If they were drones instead, and an LLM talked to people asking them to do or not do things, that would be an improvement.

Or a waiter drone that takes your order in a restaurant, flies to the kitchen, picks up a sealed and secured food container, flies it back to the table, opens it, and leaves. It would monitor for gestures and voice commands to respond to diners, get their feedback (or abuse), take the food back if it isn't satisfactory, etc.

This is the type of stuff we used to see in futuristic movies. It's almost possible now. Glad to see this kind of tinkering.

  • You could have a program for flying (not LLM-based, though it could be an ANN) and an LLM for overseeing; the LLM could give instructions to the pilot program as (x, y, z) directions -- a sketch follows at the end of this sub-thread. I mean, current autopilots are typically not LLMs, right?

    You describe why it would be useful to have an LLM in a drone to interact with it, but you do not explain why it is the very same LLM that should be doing the flying.

    • I'm not OP, and I don't know which specific roles the LLM should fill, but LLMs are great at object recognition and at using both text (street signs, notices, etc.) and visual cues to predict the correct response. The actual motor control surely needs no LLM, and the decision making could use any number of solutions. I agree that an LLM-only solution sounds bad, but I haven't done the testing and comparison to be confident in that assessment.
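
    A minimal sketch of that split, assuming nothing about the actual stack: `query_llm` is a stand-in for whatever chat-completion API you use, and `autopilot_goto` for a conventional (non-LLM) position controller. The only point is the interface -- the LLM emits waypoints and never touches the control loop.

    ```python
    import json

    def query_llm(prompt: str) -> str:
        # Stand-in for your chat-completion API of choice.
        raise NotImplementedError

    def autopilot_goto(x: float, y: float, z: float) -> None:
        # Stand-in for a conventional (non-LLM) position controller.
        raise NotImplementedError

    SYSTEM = (
        "You are a mission planner. Given the drone's state, reply ONLY with "
        'JSON like {"x": 1.0, "y": 2.0, "z": 3.0} for the next waypoint.'
    )

    def step(state: dict) -> None:
        # The LLM plans; the pilot program flies. No tokens in the control loop.
        reply = query_llm(SYSTEM + "\nState: " + json.dumps(state))
        waypoint = json.loads(reply)
        autopilot_goto(waypoint["x"], waypoint["y"], waypoint["z"])
    ```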

  • The point is that you don't need an LLM to pilot the thing, even if you want to integrate an LLM interface to take a request in natural language.

    • An LLM that can't understand the environment properly can't properly reason about which command to give in response to a user's request. Even if the LLM is a very inefficient way to pilot the thing, being able to pilot means the LLM has the reasoning abilities required to also translate a user's request into commands that make sense for the more efficient, lower-level piloting subsystem.

    • That’s a pretty boring point for what looks like a fun project. Happy to see this project and to know I am not the only one thinking about these kinds of applications.

    • We don't need a lot of things, but new tech should also address what people want, not just what they need. I don't know how to pilot drones, nor do I care to learn, but I want to do things with drones; does that qualify as a need? Tech is there to do the things we're too lazy to do ourselves.


  • Both of those proposed uses are worse than the things they would replace.

Because we’re interested in AGI (emphasis on general) and LLMs are the closest thing to AGI that we have right now.

Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.

Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).

But I don't see how it's the wrong tool given the goal.

  • SOTA typically refers to achieving the best performance, not to using the trendiest thing regardless of performance. There is some subtlety here: at some point an LLM might give the best performance at this task, but that day is not today, so an LLM is not SOTA, just trendy. It's kinda like rewriting something in Rust and calling it SOTA because that's the trend right now. Hope that makes sense.

    • >Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.

      >SOTA typically refers to achieving the best performance

      Multimodal transformers are the best way to turn plain text instructions into embodied world behavior. Nothing to do with being 'trendy'. A Vision-Language-Action model would probably have done much better, but really the only difference between that and the models trialed above is training data. Same technology.

    • I don’t think trendy is really the right word, and maybe it’s not state of the art, but a lot of us in the industry are seeing emerging capabilities that might make it SOTA. Hope that makes sense.


> Why would you want an LLM to fly a drone?

We are on HACKER news. Using tools outside their intended scope is the ethos of a hacker.

It's a great feature to tell my drone to do a task in English, like "a child is lost in the woods around here; fly a search pattern to find her" or "film a cool panorama of this property; be sure to get shots of the water feature by the pool." And while LLMs are bad at flying, the better navigation models likely can't be prompted in natural language yet.

  • What you're describing is still ultimately the "view" layer of a larger autopilot system; that's not what OP is doing. He's getting the text generator itself to drive the drone. An LLM can handle parsing input, but the wayfinding and driving would (in the real world) be delegated to a modern autopilot.
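
    A hedged sketch of that "view" layer, for concreteness: the LLM only translates the natural-language request into a structured mission, and conventional code flies it. The schema, `query_llm`, and `autopilot` are all invented for the example.

    ```python
    import json

    MISSION_SCHEMA = (
        'Reply ONLY with JSON like '
        '{"task": "search_pattern", "area": "woods_ne", "notes": "child, red coat"}. '
        'Valid tasks: search_pattern, orbit_poi, goto.'
    )

    def parse_request(query_llm, request: str) -> dict:
        # LLM as the "view" layer: natural language in, structured mission out.
        reply = query_llm(
            "Translate the user's request into a mission.\n"
            + MISSION_SCHEMA + "\nRequest: " + request
        )
        return json.loads(reply)

    # mission = parse_request(query_llm, "A child is lost in the woods. "
    #                                    "Fly a search pattern to find her.")
    # autopilot.execute(mission)  # hypothetical; wayfinding stays conventional
    ```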

The system prompt for the drone is hilarious to me. These models are horrible at spatial reasoning tasks:

https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...

I've been working on integrating GPT-5.2 into Unity. It's fantastic at scripting but completely worthless at managing transforms for scene objects. Even with elaborate planning phases, it will make a complete jackass of itself in world space every time.

LLMs are also wildly unsuitable for real-time control problems, and they never will be suitable. A PID controller or a dedicated pathfinding tool driven by the LLM will provide a radically superior result.
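
For scale, the deterministic piece being contrasted with the LLM here is tiny. A textbook PID loop (gains are illustrative, not tuned for any real airframe):

```python
class PID:
    """Textbook PID loop: the deterministic piece that should own
    real-time stabilization, with any LLM sitting far above it."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# altitude = PID(kp=1.2, ki=0.05, kd=0.3)  # gains illustrative only
# thrust = altitude.update(setpoint=10.0, measured=9.4, dt=0.01)
```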

  • Agreed. I’ve found the only reliable architecture for this is treating the LLM purely as a high-level planner rather than a controller.

    We use a state machine (LangGraph) to manage the intent and decision tree, but delegate the actual transform math to deterministic code. You really want the model deciding the strategy and a standard solver handling the vectors, otherwise you're just burning tokens to crash into walls.
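
    A plain-Python sketch of that shape (the actual LangGraph wiring is elided, and the state names and solver are invented for illustration): the model picks the next high-level state; deterministic vector math handles the rest.

    ```python
    import math

    STATES = {"SEARCH", "APPROACH", "HOLD", "RETURN"}

    def solve_heading(pos, target):
        # Deterministic "solver": plain vector math, no tokens involved.
        return math.atan2(target[1] - pos[1], target[0] - pos[0])

    def step(query_llm, observation: str, pos, target):
        # The LLM's only job: pick the next high-level state.
        state = query_llm(
            "Choose ONE of: SEARCH, APPROACH, HOLD, RETURN.\n"
            "Observation: " + observation
        ).strip()
        if state not in STATES:
            state = "HOLD"  # fail safe on malformed model output
        heading = solve_heading(pos, target) if state == "APPROACH" else None
        return state, heading
    ```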

What’s the right tool then?

This looks like a pretty fun project, and in my rough estimation a classic hacker project.

  • The right tool would likely be some conventional autopilot software; if you want AI cred, you could train a neural network that maps some kind of path to the drone's control inputs. LLMs are language models -- good for language, but not good for spatial reasoning, navigation, or many of the other things you need to pilot a drone.

    • So you are suggesting building a full-featured package that is nontrivial compared to this fun experiment?

      Vision models do a pretty decent job with spatial reasoning. It's not there yet, but you're dismissing some interesting work that's going on.

Why would you want an LLM to identify plants and animals? Well, they're often better than bespoke image classification models at doing just that. Why would you want a language model to help diagnose a medical condition?

It would not surprise me at all if self-driving models are adopting a lot of the model architecture from LLMs/generative AI, and invoke actual LLMs in moments where they would otherwise have needed human intervention.

Imagine if there's a decision engine at the core of a self-driving model, and it gets a classification result of what to do next. Suddenly it gets three options back with 33.33% weight attached to each and very low confidence about which is the best choice. Maybe that's the kind of scenario that used to make the self-driving system refuse to choose and defer to human intervention. If it could first defer judgement to an LLM which could say "that's just a goat crossing the road, INVOKE: HONK_HORN," you can imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."
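
A hedged sketch of that fallback (the threshold and helpers like `describe_scene` and `query_llm` are made up): a fast classifier drives by default, and only near-uniform, low-confidence results get escalated to the LLM.

```python
def decide(probs: dict, describe_scene, query_llm, threshold: float = 0.8):
    # Fast path: the conventional classifier is confident, so act on it.
    best_action, best_p = max(probs.items(), key=lambda kv: kv[1])
    if best_p >= threshold:
        return best_action
    # Slow path: near-uniform weights (e.g. three options at ~33% each).
    # Ask the LLM to reason about the scene instead of paging a human.
    choice = query_llm(
        "Scene: " + describe_scene() + "\n"
        "Candidate actions: " + ", ".join(probs) + "\n"
        "Reply with exactly one candidate action."
    ).strip()
    return choice if choice in probs else best_action
```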