Comment by DannyBee

2 years ago

TL;DR I would not assume they are not using it, or that this is about 3.125% memory/cache usage.

Longer answer:

Google folks were responsible for pushing on hardware MTE in the first place. It originally came from the folks who also did work on ASAN, syzkaller, etc. They are not in Android, but it was done with the help and support of folks in Android. That's the Google side; it was obviously a partnership with ARM and others as well.

I was the director for the teams that created/pushed on it, way back when - this was years ago at this point because of the lead times on a hardware/architecture feature like this.

So I'm very familiar with the tradeoffs.

It is more than just the memory usage or cache. The post is correct that it was designed to be able to be enabled/disabled dynamically, and needed to have expected perf cost ~0, but the main use case at the time was sampling based bug finding.
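For reference, the 3.125% figure is just the tag storage arithmetic. A minimal sketch, assuming the usual MTE layout of one 4-bit tag per 16-byte granule (the function and constant names are made up for illustration):

```python
# Assumption: Arm MTE stores one 4-bit tag per 16-byte memory granule.
TAG_BITS = 4
GRANULE_BYTES = 16


def tag_overhead(tag_bits: int = TAG_BITS, granule_bytes: int = GRANULE_BYTES) -> float:
    """Fraction of extra storage needed to hold the tags."""
    return tag_bits / (granule_bytes * 8)


print(f"{tag_overhead():.3%}")  # 3.125%
```

That number only covers tag storage. It says nothing about the crash-behavior and testing issues discussed here.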

That is, if you turn MTE on (whether servers, phones, whatever) for 1% of the time for your entire fleet, and you have a large enough fleet, you will find basically all bugs very quickly. You can do this during dogfooding, etc.
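A rough back-of-the-envelope sketch of why sampling works at fleet scale. Everything here is an assumption for illustration: the function name, the independence between devices and days, and the example numbers are hypothetical, not figures from any real fleet.

```python
def fleet_detection_probability(
    devices: int,
    sample_rate: float,
    trigger_prob_per_day: float,
    days: int,
) -> float:
    """Chance that at least one device catches a given bug.

    Assumes each device-day is independent, that MTE is on for a
    `sample_rate` fraction of the time, and that the buggy code path runs
    with probability `trigger_prob_per_day` per device per day.
    """
    # A device-day catches the bug only if MTE is on AND the bug triggers.
    per_device_day = sample_rate * trigger_prob_per_day
    # Complement of "no device-day ever catches it".
    return 1.0 - (1.0 - per_device_day) ** (devices * days)


# Hypothetical numbers: 1% sampling, a bug that fires once per 10,000 device-days.
big_fleet = fleet_detection_probability(1_000_000, 0.01, 1e-4, days=7)
small_fleet = fleet_detection_probability(100, 0.01, 1e-4, days=7)
print(f"1M devices: {big_fleet:.4f}, 100 devices: {small_fleet:.4f}")
```

Under these made-up numbers, a million-device fleet is nearly certain to hit the bug within a week, while a 100-device test lab almost never does. That is the whole argument for sampling across a large fleet.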

Put another way - the goal was to make it possible to have the equivalent of ASAN that can be flipped on and off when you want it.

Keeping it on all the time as a security mitigation was a secondary possibility, and has issues besides memory overhead.

For example, you will suddenly cause tons of user-visible crashes. But not even consistently. You will crash on phones with MTE, but not without it (which is most of them).

This is probably not the experience you want for a user.

For a developer, you would now have to force everyone to test on MTE-enabled phones when there are ~1 of them. This is not likely to make developers happy.

Are there security exploits it will mitigate? Yes, they will crash instead of being exploitable. Are there harmless bugs it will catch? Yes.

But keep in mind what I said at the beginning - I would not assume they don't use it.

I would instead assume they don't necessarily keep it on all the time in production.

Anyone with experience in hardware feature bringup of this kind will tell you that the fact they can boot and run the system, and it only crashes under MTE when they do this one thing, is actually a very good sign that they do use it.

Otherwise it would have probably crashed a million times :)

As an aside - It's also not obvious it's the best choice for run-time mitigation.

We didn't propose enabling MTE for apps not opting into it any time soon. We proposed enabling it for the base OS by default. Pixels are already testing with HWASan and MTE so there are few issues found by it in the base OS. Enabling it for the base OS and apps opting into it would be a great start. Requiring working MTE support for ARMv9 in the CDD is entirely doable, and then devs will have devices with it, and it can be made into a default for apps at a new target API level with opt-out instead of opt-in. It can then be made into a mandatory feature at a future target API level. Android makes dramatically more aggressive backwards incompatible changes via target API levels than detecting memory corruption without false positives.

We know they're actively testing HWASan and MTE builds, but not with enough real world usage. If they tested it a lot on actual devices used by Google employees, they'd have fixed this Bluetooth LE audio issue before the release.

Thanks for the insight.

>As an aside - It's also not obvious it's the best choice for run-time mitigation.

What are some of the current contenders/arguments?