Comment by ra7

1 day ago

> The insight driving the program, Naga said, is that the limiting factor for AV development is no longer the underlying technology. “The bottleneck is data,” he said. “[Companies like Waymo] need to go around and collect the data, collect different scenarios. You may be able to say: in San Francisco, ‘At this school intersection, I want some data at this time of day so I can train my models.’ The problem for all these companies is access to that data, because they don’t have the capital to deploy the cars and go collect all this information.”

You can’t be the CTO of Uber wanting to do AVs, and get the data collection requirement shockingly wrong.

Waymo’s bottleneck has never been data. When they want data about a school intersection in SF at a certain time of day, they just... synthetically generate it and simulate: https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...

Waymo is able to deploy with less (but targeted and high-quality) data collection by having world-class simulation capabilities. Not that they haven't collected huge amounts of data, which is no doubt important (I've heard their onboard storage is transferred and emptied every few days); it's just not a bottleneck. They have the most efficient operation in the AV industry.

The best example of why data collection isn’t the bottleneck is Tesla. They boast about billions of miles of data, yet they’re struggling to put out fully autonomous vehicles.

> When they want data about a school intersection in SF at a certain time of day, they just... synthetically generate it and simulate

I think it's more about detecting changes to the world. You need boots on the ground, so to speak, to see that new speed limit sign or the new lane paint. The Waymo vehicle can no doubt react to changes in the world when it encounters them, relaying them back to the mothership, but it's better to know about them in advance.

  • Most AVs, definitely Waymo vehicles, are self-mapping. They can detect environment changes and relay them to the entire fleet. That's because they map using the same vehicles as the fleet.

  • >You need boots on the ground, so to speak, to see that new speed limit sign or the new lane paint.

    It'll shock you to know that you can simply get this from governments, some even provide this in API form

    • It probably won't shock you to know that those sources of data can be months to even years delayed from what's actually out in the world.

    • No visual data; you need picture data for that. Companies like NC tech do it for around $1m a city.

    • > or the new lane paint.

      I'd be surprised if this is a thing outside the biggest US (and European, for that matter) cities. Judging from Google Street View, there are lots of streets in US cities and towns with almost no paint lines at all.


  • That’s dumb then. It shows it’s just brute force rather than AI.

    A human doesn’t need to be shown every single road that exists in order to drive.

    • That's true, but the human can do a much better job planning for the journey if they know what to expect along the way.

      One example, from the end of the journey: knowing in advance where the actual entrance to the business is, or the specific curb cut that leads to the residence, makes it easier and far less error prone to decide exactly where the journey should end. Even humans have a hard time figuring out the right access point for a business or residence. This is a job for an offline process, fed by as many data sources as possible.

Yeah I'm not so sure this CTO is on the mark here, but to be fair, I do think some of this IRL long tail/edge case data is important for Waymo. The simulation software is super interesting to me - the real world can be so chaotic, and even if they could generate every possible real life case, there needs to be validation on whether the Waymo driver is responding in the optimal way. They certainly haven't solved this problem, you can see some of their growing pains in all of these articles - floods in Austin, more and more interactions with emergency vehicles that first responders seem to believe are getting worse, etc.

Tesla, on the other hand, has billions of miles of data, yet because there is a limit to camera-only techniques, that data isn't that useful, is it? They have no ground truth data to evaluate their camera system on, which is why you sometimes see those Teslas driving around with lidar rigs mounted on them. Going camera-only is just asking for trouble.

  • I agree real world data is important for Waymo. I didn't mean to say it wasn't, so I've edited my comment to reflect that. It's just that data is not some magic bullet to achieve self driving like Tesla and others suggest.

    Of course, Waymo still has much more room for improvement. But it's much more efficient to supplement a smaller amount of higher-quality IRL data with large amounts of synthetic data than to run a million data collection vehicles 24x7, because most IRL data is boring and useless.

    Waymo said 6 years ago they simulate 20 million miles every single day [1]. Clearly, it's working for them given their scale of deployment right now.

    [1] https://waymo.com/blog/2020/04/off-road-but-not-offline--sim...

    • Although most of the real-world data is probably boring, collecting more of it makes discovering rare edge cases more likely. But since they happen rarely, I imagine that after discovering them, they would then need to figure out how to simulate them.

> The best example of why data collection isn’t the bottleneck is Tesla.

Exactly. Plus, any delivery or dashcam company can provide a bunch of data wherever there is a sizeable population.

About 8 years ago, that data would have been really valuable, but now it's a nice-to-have at best.

The only thing that is valuable is the breadth of different cars, but even then it's not that much of a differentiator.

The biggest difference is that Uber has vehicles around the world, so there's more data from countries with different rules from the US. Signage is definitely different between the US and Europe.

I.. am amused by the confidence on display, but I can't say that I am not concerned that people are confidently stating that real world data is not useful, because it can be just simulated. One would think that, by now at least, we know that simulation is at best an imperfect copy.

And I don't like the idea of even more data being harvested and used.. I just find the dismissal.. odd.

  • “Real world data is not the bottleneck” != “Real world data is not useful”

    No one is suggesting the latter.

    • Parent's post noted that it is not a bottleneck, because it can be readily simulated (and thus not useful). I am not sure if QED is too much in this case, but I stand by my amusement. Or are you arguing that real world data is somehow less useful than simulated data? It is very confusing. I would accuse you of nitpicking, but I just noticed you are the parent :D You can certainly speak for yourself.


> The best example of why data collection isn’t the bottleneck is Tesla. They boast about billions of miles of data, yet they’re struggling to put out fully autonomous vehicles.

Well, TBF, the Tesla data was complete garbage with earlier vehicles. They had cheap and somewhat bad cameras in the earlier vehicles that were only somewhat recently updated. And even then, I don't think Tesla is at the end of their hardware journey. I think they don't think that either, which is why they've gone to a subscription-only model for self-driving vehicles.

Waymo, on the other hand, has gathered less data, but higher-quality data. They do the expensive mapping of a city, which is a big part of why their vehicles have been able to do some pretty impressive feats early on. The drawback is that getting that high-quality data takes a lot of time and resources.

  • > And even then, I don't think Tesla is at the end of their hardware journey.

    I dunno about that. Tesla seems completely adrift, pretending to pivot with random forays into humanoid robotics or whatever, to the point that I wouldn't be surprised if they exited the consumer vehicle space altogether within the next decade. They have no answer for Chinese competitors.

    • I recently watched some videos related to the production of the Cybercab, which has now started public testing. They’ve still done some great engineering, to the point that the car is now assembled like a matchbox car. All the drive components are contained in a single package for a FWD configuration that the body just drops down on. The car now has no controls besides the screen and door pulls. The materials are all lower cost and they even found a way to skip painting the cars. All of this should help them cut costs significantly.

      As far as the self driving, they may be far off still, it’s hard for me to get a read on that and this vehicle is a bet that they will be able to achieve it - right down to the braille in the cabin, so maybe that’s why they still fail. The thing I will say is that despite the PR disaster that the CEO is, which gives us that feeling that the company has lost its mind, it seems they are still quietly doing some advanced engineering.

    • Well, let me rephrase: Tesla's previously stated goals around self-driving cars can't be met with the current hardware.

Didn't they need the data from the 200 million miles or so of actual driving before they could get to the generative model, though? Data isn't everything, as you point out with Tesla (mainly because they decided to forgo lidar, it would seem), but it is pretty fundamental.

"You can’t be the CTO of Uber wanting to do AVs, and get the data collection requirement shockingly wrong."

Problem 1: Cost and privacy constraints limit data collection.

Problem 2: It doesn't make much sense to collect and store data that you already have, yet you don't know at collection time whether it is useful or not.

Problem 3: P2P in urban settings fails at edge cases, which by definition are rarely collected.

All of these problems limit AV scaling.

Waymo might very well be missing specific kinds of data (e.g. more incidents/accidents, near-collisions, etc.)

Also, Uber’s data might be useful for eval, not training (e.g. "here is how Waymo would behave vs. human drivers, therefore it is safer")

  • > Waymo might very well be missing specific kinds of data (e.g. more incidents/accidents, near-collisions, etc.)

    Accidents and near-collisions are exactly the kind of scenarios perfect for simulation. You don't test them out in the real world and risk injuries/deaths. You need to have confidence they're handled before you deploy.

    • Again, how do you know you've handled it correctly without ground truth? Simulation without ground truth is a garbage in garbage out situation.

I find the idea of learning from simulated data so unintuitive. How can you radically improve your model with just your model? I take it people do it, so it must work, but I just don’t understand it at all.

  • Well there's a world simulation model and then the driving model.

    You can imagine improving, e.g., a specialized math model (problem in, theorem out) with a normal LLM that knows lots of problems and theorems generally.

  • I think people are skipping over the fact that Google has had cars driving around taking photos for 20 years. I imagine that was used to build the world model in the first place.

  • They're two different models - you can use the world model to train (or test like Wayve) a different car-driving model.

    The world model is basically intended as a more true-to-life simulator.

Yes, the way to make these things safer is to make up data and simulate on that.

Do you hear yourself?

  • That’s literally how it works right now, so yeah.

    • > Mapping out every intersection, sign, and signal: Before our Waymo Driver begins operating in a new area, we first map the territory with incredible detail, from lane markers to stop signs to curbs and crosswalks. Then, instead of relying solely on external data such as GPS which can lose signal strength, the Waymo Driver uses these highly detailed custom maps, matched with real-time sensor data and artificial intelligence (AI) to determine its exact road location at all times.

      https://waymo.com/waymo-driver/

      That AI part is doing a lot of heavy lifting. They're using real data. We already know synthetic data is dangerous. Explains a lot if you think it's more reliant on that than real data.
