Comment by NortySpock

18 hours ago

I keep thinking I have a possible use case for property-based testing, and then I am up to my armpits in trying to understand the on-the-ground problem and don't feel like I have time to learn a DSL for describing all possible inputs and outputs when I already had an existing function (the subject-under-test) that I don't understand.

So rather than try to learn two black boxes at the same time, I fall back to "several more unit tests to document more edge cases to defensibly guard against".

Is there some simple way to describe this defensive programming iteration pattern in Hypothesis? Normally we just null-check and return early and have to deal with the early-return case. How do I quickly write property tests to check that my code handles the most obvious edge cases?

In addition to what other people have said:

> [...] time to learn a DSL for describing all possible inputs and outputs when I already had an existing function [...]

You don't have to describe all possible inputs and outputs. Even just being able to describe some classes of inputs can be useful.

As a really simple example: many example-based tests have some values that are arbitrary and the test shouldn't care about them, e.g. employee names when you are populating a database or whatever. Instead of just hard-coding 'foo' and 'bar', you can have Hypothesis create arbitrary values there.
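
To make that concrete, here's a minimal sketch; add_employee() and the in-memory "database" are made up purely for illustration:

```python
from hypothesis import given, strategies as st

def add_employee(db: dict, name: str) -> None:
    """Made-up stand-in for whatever populates your real database."""
    db[name] = {"name": name}

# Example-based version: the name is arbitrary, but hard-coded anyway.
def test_add_employee_hardcoded():
    db = {}
    add_employee(db, "foo")
    assert "foo" in db

# Property-based version: Hypothesis supplies the arbitrary value instead.
@given(st.text(min_size=1))
def test_add_employee_any_name(name):
    db = {}
    add_employee(db, name)
    assert name in db
```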

Just like learning how to write (unit) testable code is a skill that needs to be learned, learning how to write property-testable code is also a skill that needs practice.

What's less obvious: retrofitting property-based tests onto an existing codebase with existing example-based tests is almost a separate skill. It's harder than writing your code with property-based tests in mind.

---

Some common properties to test:

* Your code doesn't crash on random inputs (or only throws a short whitelist of allowed exceptions).

* Applying a specific piece of functionality should be idempotent, i.e. doing that operation multiple times should give the same result as applying it only once.

* Order of input doesn't matter (for some functionality)

* Testing your prod implementation against a simpler implementation that's perhaps too slow for prod or only works on a restricted subset of the real problem. The reference implementation doesn't even have to be simpler: just having a different approach is often enough.
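
A rough sketch of what a few of these look like in Hypothesis, with a throwaway dedupe() standing in for the real subject-under-test:

```python
from hypothesis import given, strategies as st

def dedupe(items):
    """Stand-in for the real subject-under-test."""
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

@given(st.lists(st.integers()))
def test_does_not_crash(xs):
    dedupe(xs)  # property: no unexpected exception on arbitrary input

@given(st.lists(st.integers()))
def test_idempotent(xs):
    once = dedupe(xs)
    assert dedupe(once) == once  # doing it twice == doing it once

@given(st.lists(st.integers()))
def test_input_order_does_not_matter(xs):
    # reordering the input may reorder the output, but the set of
    # surviving elements is the same
    assert set(dedupe(list(reversed(xs)))) == set(dedupe(xs))
```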

  • But let's say employee names fail on apostrophe. Won't you just have a unit test that sometimes fails, but only when the testing tool randomly happens to add an apostrophe in the employee name?

    • You can either use the @example decorator to force Hypothesis to check an edge case you've thought of, or just let Hypothesis uncover the edge cases itself. Hypothesis won't fail a test once and then pass it the next time; it keeps track of which examples failed and will re-run them. The generated inputs aren't uniformly randomly distributed and will tend to check pathological cases (complex symbols, NaNs, etc.) with priority.

      You shouldn't think of Hypothesis as a random input generator but as an abstraction over thinking about the input cases. It's not perfect: you'll often need to .map() to get the distribution to reflect the usage of the interface being tested and that requires some knowledge of the shrinking behaviour. However, I was really surprised how easy it was to use.

    • Hypothesis keeps a database of failures to use locally and you can add a decorator to mark a specific case that failed. So you run it, see the failure, add it as a specific case and then that’s committed to the codebase.

      The randomness can bite a little if that test failure happens on an unrelated branch, but it’s not much different to someone just discovering a bug.

      edit - here's the relevant part of the hypothesis guide https://hypothesis.readthedocs.io/en/latest/tutorial/replayi...

    • As far as I remember, Hypothesis tests smartly, which means that possibly problematic strings are tested first. It then narrows down which exact part of the tested string caused the failure.

      So it might as well just throw the kitchen sink at the function; if it handles that, great. If not, that string will get narrowed down until you arrive at a minimal set of failing inputs.

    • > But let's say employee names fail on apostrophe. Won't you just have a unit test that sometimes fails, but only when the testing tool randomly happens to add an apostrophe in the employee name?

      If you just naively treat it as a string and let Hypothesis generate values, sure. That's still better than traditional explicit unit testing where you haven't explicitly defined apostrophes as a concern.

      If you do have it (or special characters more generally) as a concern, that changes how you specify your test.

    • If you know it will fail on an apostrophe you should have a specific test for that. However, if that detail is buried in some function 3 levels deep that you don't even realize is used, you wouldn't write the test or handle it even though it matters. This should find those issues too.

    • Either your code shouldn’t fail or the apostrophe isn’t a valid case.

      In the former case, Hypothesis and other similar frameworks are deterministic and will replay the failing test on request, or remember the failing tests in a file to rerun in the future to catch regressions.

      In the latter case, you just tell the framework not to generate such values, or at least to skip those test cases (not generating is better for testing performance).
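
      In Hypothesis terms, a rough sketch of both options (save_employee() here is a made-up stand-in that treats apostrophes as invalid):

      ```python
      from hypothesis import assume, given, strategies as st

      def save_employee(name: str) -> str:
          """Hypothetical stand-in that rejects apostrophes."""
          if "'" in name:
              raise ValueError("invalid name")
          return name

      # Option 1: filter the strategy so invalid values never reach the test.
      @given(st.text().filter(lambda s: "'" not in s))
      def test_save_accepts_valid_names(name):
          assert save_employee(name) == name

      # Option 2: generate freely, but skip cases the test considers out of scope.
      @given(st.text())
      def test_save_accepts_valid_names_with_assume(name):
          assume("'" not in name)
          assert save_employee(name) == name
      ```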

    • No, Hypothesis iterates on test failures to isolate the simplest input that triggers it, so that it can report it to you explicitly.

The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same.
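
For example, a minimal JSON roundtrip sketch (restricted to integer leaves so float/NaN equality doesn't get in the way):

```python
import json

from hypothesis import given, strategies as st

# Arbitrary JSON-shaped data: scalars plus nested lists and string-keyed dicts.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=20,
)

@given(json_values)
def test_json_roundtrip(value):
    assert json.loads(json.dumps(value)) == value
```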

A more complex kind of PBT is if you have two implementations of an algorithm or data structure, one that's fast but tricky and the other one slow but easy to verify. (Say, quick sort vs bubble sort.) Generate data or operations randomly and ensure the results are the same.

  • > The simplest practical property-based tests are where you serialize some randomly generated data of a particular shape to JSON, then deserialize it, and ensure that the output is the same.

    Testing that f(g(x)) == x for all x and some f and g that are supposed to be inverses of each other is a good test, but it's probably not the simplest.

    The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly.

    For things like sorting, testing against an oracle is great. But even when you don't have an oracle, there's lots of other possibilities:

    * Test that sorting twice has the same effect as sorting once.

    * Start with a known already in-order input like [1, 2, 3, ..., n]; shuffle it, and then check that your sorting algorithm re-creates the original.

    * Check that the output of your sorting algorithm is in-order.

    * Check that the input and output of your sorting algorithm have the same elements with the same multiplicity. (If you don't already have a data structure / algorithm that does this efficiently, you can probe it with more randomness: create a random input (say a list of numbers), pick a random number X, count how many times X appears in your list via a linear scan; then check that you get the same count after sorting.)

    * Check that permuting your input doesn't make a difference.

    * Etc.
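
    Several of those can be checked in one go; a rough sketch using Python's built-in sorted() as the stand-in implementation:

    ```python
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sort_properties(xs):
        out = sorted(xs)  # swap in your own sorting routine here
        assert sorted(out) == out                         # sorting twice == sorting once
        assert all(a <= b for a, b in zip(out, out[1:]))  # output is in order
        for x in set(xs):
            assert out.count(x) == xs.count(x)            # same multiplicity

    @given(st.lists(st.integers()), st.randoms())
    def test_permuting_input_does_not_matter(xs, rnd):
        shuffled = list(xs)
        rnd.shuffle(shuffled)
        assert sorted(shuffled) == sorted(xs)
    ```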

    • Speaking for myself: those are definitely all simpler cases, but I never found them compelling enough (beyond the "it doesn't crash" property). The simplest case that truly motivated PBT for me was roundtrip serialization. Now I use PBT quite a lot, and most of my tests are either serialization roundtrips or oracle/model-based tests.

    • > The absolute simplest I can think of is just running your functionality on some randomly generated input and seeing that it doesn't crash unexpectedly.

      For this use case, we've found it best to just use a fuzzer, and work off the tracebacks.

      That being said, we have used Hypothesis to test data validation and normalizing code to decent success. We use it on a one-off basis, when starting something new or making a big change. We don't run these tests every day.

      Also, I don't like how Hypothesis integrates much better with pytest than with unittest.

For working with legacy systems I tend to start with Approval Tests (https://approvaltests.com/). Because they don't require me to understand the SUT very well before I can get started with them, and because they help me start to understand it more quickly.

I've only used it once before, not as unit testing, but as stress testing for a new customer-facing API. I wanted to say with confidence "this will never throw an NPE". Also, the logic was so complex (and the deadline so short) that the only reasonable way to test was to generate large amounts of output data and review it manually for anomalies.

Here are some fairly simple examples: testing port parsing https://github.com/meejah/fowl/blob/e8253467d7072cd05f21de7c...

...and https://github.com/magic-wormhole/magic-wormhole/blob/1b4732...

The simplest ones to get started with are "strings", IMO, and they also give you lots of mileage (because it'll definitely test some weird unicode). So, somewhere in your API where you take some user-entered strings -- even something "open ended" like "a name" -- you can make use of Hypothesis to try a few things. This has definitely uncovered unicode bugs for me.

Some more complex things can be made with some custom strategies. The most-Hypothesis-heavy tests I've personally worked with are from Magic Folder strategies: https://github.com/tahoe-lafs/magic-folder/blob/main/src/mag...

The only real downside is that a Hypothesis-heavy test suite like the above can take a while to run (but you can instruct it to only produce one example per test). Obviously, one example per test won't catch everything, but it is way faster when developing, and Hypothesis remembers "bad" examples, so if you occasionally do a longer run it will retry the things that caused errors before.
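
A sketch of how that knob looks; the profile names and example counts are just placeholders:

```python
from hypothesis import given, settings, strategies as st

# Named profiles: a quick one for local development, a thorough one for CI.
settings.register_profile("dev", max_examples=1)
settings.register_profile("ci", max_examples=500)
settings.load_profile("dev")  # with pytest you can also select one via --hypothesis-profile

@given(st.lists(st.integers()))
@settings(max_examples=5)  # or cap an individual slow test directly
def test_something_slow(xs):
    assert sorted(sorted(xs)) == sorted(xs)
```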

Essentially this is a good example of parametrized tests, just supercharged with generated inputs.

So if you already have parametrized tests, you're already halfway there.

I think the easiest way is to start with general properties and general input, and tighten them up as needed. The property might just be "doesn't throw an exception", in some cases.

If you find yourself writing several edge cases manually with a common test logic, I think the @example decorator in Hypothesis is a quick way to do that: https://hypothesis.readthedocs.io/en/latest/reference/api.ht...
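
For instance, a minimal sketch of folding hand-picked edge cases into one generated test; parse_port() here is made up:

```python
from hypothesis import example, given, strategies as st

def parse_port(text: str) -> int:
    """Made-up stand-in: parse a TCP port number from user input."""
    port = int(text.strip())
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    return port

@given(st.integers(min_value=1, max_value=65535).map(str))
@example("8080")        # edge cases you'd otherwise write as separate unit tests
@example("  443  ")
def test_parse_port_accepts_valid_ports(text):
    assert 0 < parse_port(text) < 65536
```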

  • Thanks, the "does not throw an exception" property got my mental gears turning in terms of how to get started on this, and from there I can see how one could add a few more properties as one goes along.

    Appreciate you taking the time to answer.