Comment by jitl
2 days ago
> This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop.
Another example of Hoare’s “billion-dollar mistake” in multiple multiple Google systems:
- Why is it possible to insert unintended “blank fields” (nulls)? The configuration should have a schema type that doesn’t allow unintended nulls. Unfortunately Spanner itself is SQL-like and so fields must be declared NOT NULL explicitly, the default is nullable fields.
- Even so, the program that manages these policies will have its own type system and possibly an application level schema language for the configuration. This is another opportunity to make invalid states unrepresentable.
- Then in Service Control, there’s an opportunity to prove “schema on read” as you deserialize policies from the data store into application objects, again either a programming language type or application level schema could be used to validate policy rows have the expected shape before they leave the data layer. Perhaps the null pointer error occurred in this layer, but since this issue occurred in a new code path, it sounds more likely the invalid data escaped the data layer into application code.
- Finally, the Service Control application is written in a language that allows for null pointer references.
If I were a maintainer of this system, the minimally invasive chance I would be thinking about, is how to introduce an application level schema to the policy writer and the policy reader that uses a “tagged enum type” or “union type” or “sum type” to represent policies that cannot express null. Ideally each new kind of policy could be expressed as a new variant added to the union type. You can add this in app code without rewriting the whole program to a safe language. Unfortunately it seems proto3, google’s usual schema language, doesn’t have this constraint.
Example of one that does: https://github.com/stepchowfun/typical
No comments yet
Contribute on Hacker News ↗