
Comment by benrutter

14 hours ago

I love the idea of hypothesis! Haven't found a lot of use cases for it yet; I think the quick start example helps explain why.

Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

Obviously it's just a tiny example, but in general I often find writing a test that will cover every hypothetical case isn't realistic.

Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

What you're talking about is using an oracle (a different implementation of what you're testing). It's an option for property (or exhaustive) testing, but it's by no means a requirement. Even for a sort function there are plenty of properties you can check without needing an oracle (sketched in code after the list), e.g.

- that the output sequence is the same length as the input

- that the output sequence is sorted

- that the population counts are the same in the input and the output
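A rough sketch of those three checks with Hypothesis (my_sort here is the hypothetical function under test from the quick-start example; the import is illustrative):

    from collections import Counter

    from hypothesis import given, strategies as st

    from mymodule import my_sort  # hypothetical import; my_sort is the function under test


    @given(st.lists(st.integers()))
    def test_my_sort_properties(xs):
        result = my_sort(xs)

        # output is the same length as the input
        assert len(result) == len(xs)

        # output is sorted: each element is <= its successor
        assert all(a <= b for a, b in zip(result, result[1:]))

        # population counts match (a multiset comparison)
        assert Counter(result) == Counter(xs)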

> In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set, e.g. you might be in a situation where a radix sort is more appropriate.

I've only used PBT a few times, but when it fits it's been extremely useful. A concrete example from my practice of what has been pointed out in this thread, that you want to test properties of your function's output rather than the output itself: I was implementing a fairly involved algorithm (the Zhang-Shasha edit distance algorithm for ordered trees), and PBT was extremely useful in weeding out bugs. What I did was write a function that generated random tree structures in the form I needed for my code, and then test the four properties that all distance functions should have:

1. d(x, x) = 0 for all x

2. d(x, y) >= 0 for all x, y

3. d(x, y) = d(y, x) for all x, y

4. d(x, z) <= d(x, y) + d(y, z) for all x, y, z (the triangle inequality)
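For illustration, those four checks might look something like this with Hypothesis (tree_distance and trees() are hypothetical stand-ins for the edit-distance implementation and the random-tree strategy described above):

    from hypothesis import given

    # hypothetical imports: the distance function under test and the
    # strategy that generates random tree structures
    from mymodule import tree_distance, trees


    @given(trees(), trees(), trees())
    def test_distance_is_a_metric(x, y, z):
        assert tree_distance(x, x) == 0                    # property 1
        assert tree_distance(x, y) >= 0                    # property 2
        assert tree_distance(x, y) == tree_distance(y, x)  # property 3 (symmetry)
        # property 4: the triangle inequality
        assert tree_distance(x, z) <= tree_distance(x, y) + tree_distance(y, z)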

Especially the triangle inequality check weeded out some tricky corner cases I probably wouldn't have figured out on my own. Some will object that you're not guaranteed to find bugs with this kind of random-generation strategy, but if you blast through a few thousand cases every time you run your test suite, and the odd overnight run testing a few million, you quickly get fairly confident that the properties you test actually hold. Of course any counterexamples the PBT finds should get lifted to regression tests in addition, to make sure they're always caught if they crop up again. And as with any testing approach, there are no guarantees, but it adds a nice layer of defense in depth IMO.

> Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

You don't need that; that's just a convenient shortcut when you do have a function that does that (but, I would argue, a not-great example for precisely that reason).

All you actually need is the ability to verify the property (or set of properties) that you are testing in the output. You could replace the test in that sample with something like this, which I think is a better illustration of what is going on (this is somewhat simplified; rather than just the length test, you really want to test that it has the same elements):

  from hypothesis import given, strategies as st

  # my_sort is the function under test; NaN is excluded because NaN
  # comparisons would break both the set check and the sortedness check
  @given(st.lists(st.integers() | st.floats(allow_nan=False)))
  def test_sort_correct(lst):
    result = my_sort(lst)

    # result has the same unique items as the original list
    assert set(result) == set(lst)

    # for each item, result has the same number of copies as the original
    assert all(
      result.count(item) == lst.count(item)
      for item in set(result)
    )

    # result is sorted
    assert all(result[n] <= result[n+1] for n in range(len(result) - 1))

> Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

One example would be when you have a naive implementation of some algorithm and you want to introduce a second one, optimized but with a much more complex implementation. Then the naive one will act as a model for comparisons.
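A minimal sketch of that setup (naive_unique and fast_unique are made-up names; the point is only the shape of the test):

    from hypothesis import given, strategies as st

    # hypothetical pair: a simple, obviously-correct reference implementation
    # and the optimized rewrite you actually want to ship
    from mymodule import naive_unique, fast_unique


    @given(st.lists(st.integers()))
    def test_optimized_matches_naive_model(xs):
        # the naive implementation acts as the model (oracle) for the fast one
        assert fast_unique(xs) == naive_unique(xs)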

Another case that comes to mind is when you have rather simple properties to test (like: does it finish without a crash, within a given time? does it stay within some boundaries on the output?), and want to easily run over a sensible set of varying inputs.

I use hypothesis to test that the happy path is really happy. I use it to define the set of expected inputs for a function, and it looks for the nastiest possible sets of inputs. It’s kind of the opposite of how I write ordinary tests, where I pick challenging but representative examples. Often I’m just checking some basic property, like “function f is the inverse of function g” for round tripping data through a different representation. Sometimes it’s just “function f does not divide by zero.”. Hypothesis seems to have a library of corner cases that it checks first. If you don’t say you can’t handle floats above 2^53, it will try 2^53, for instance. (That’s the threshold where doubles stop being able to represent every integer.) Often the result of running hypothesis is not a code change but a more precise description of what inputs are valid.
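A round-trip check of that kind can be as small as this (using JSON as the other representation; any encode/decode pair works the same way):

    import json

    from hypothesis import given, strategies as st


    @given(st.dictionaries(st.text(), st.integers()))
    def test_json_round_trip(data):
        # decoding should be the inverse of encoding for these inputs
        assert json.loads(json.dumps(data)) == data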

Imho it's just not a great example. A better example would be testing the invariants of the function. "For any list of numbers, my_sort does not throw an exception" is trivial to test. "For any list of numbers, in the list returned by my_sort the item at index n is no larger than the item at index n+1" would be another test. Those two probably capture the full specification of my_sort, and you don't need a sort function to test either. In a real-world example you would more likely be testing just some subset of the specification: those things that are easy to assert.

  • Always returning the empty list meets your spec.

    • Good point. I suppose we should add "number of input elements equals number of output elements" and "every input element is present in the output". Translated into a straightforward test, that still allows my_sort([1,1,2]) to return [1,2,2], but we have to draw the line somewhere.


I found making composite strategies the most helpful part. I had a lot of fiddly details to get right when validating some DSP algorithms and the intermediate data structures they required, and for that it was very helpful: mostly property testing with some kind of composite strategy for generating valid inputs.
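For anyone who hasn't used them, a toy composite strategy looks roughly like this (the signal/window shape is made up, not the actual DSP structures; moving_average is a hypothetical function under test):

    from hypothesis import given, strategies as st

    from mymodule import moving_average  # hypothetical function under test


    @st.composite
    def signal_with_window(draw):
        # generate a signal plus a window length guaranteed to be valid for it
        samples = draw(st.lists(st.floats(allow_nan=False, allow_infinity=False),
                                min_size=1, max_size=1000))
        window = draw(st.integers(min_value=1, max_value=len(samples)))
        return samples, window


    @given(signal_with_window())
    def test_moving_average_handles_any_valid_input(args):
        samples, window = args
        result = moving_average(samples, window)
        # for a full window sliding over the signal, this is the expected length
        assert len(result) == len(samples) - window + 1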

Indeed. For a sorting algorithm it would be more insightful to show it testing the actual property of something that is sorted: every consecutive element is larger than the previous one (or the other way around). You don't need a sorting function to test the "sorted" property.

I definitely had the same issue when I started, I think you've hit on the two main mental blocking points many people hit.

> this means you need a second function that acts the same as the function you wrote.

Actually no, you need to check that outcomes match certain properties. The example there works for a reimplementation (for all inputs, the output of the existing function and new one must be identical) but you can break down a sorting function into other properties.

Here are some for a sorting function that sorts list X into list Y.

- Lists X and Y should be the same length.

- All of the elements in list X should be in list Y.

- Y[n] <= Y[n+m] for all n, m >= 0 with n+m < len(X).

- If you support a reverse parameter, iterating backwards over the sorted list should match the reverse-sorted list.

Sorting functions are a good simple case but we don't write them a lot and so it can be hard to take that idea and implement it elsewhere.

Here's another example:

In the app, focus ELEMENT A. Press tab X times. Press shift-tab X times. Focus should be on ELEMENT A.

In an app with two or more elements, focus ELEMENT A. Press tab. The focus should have changed.

In an app with N elements, focus ELEMENT A. Press tab N times. The focus should be on ELEMENT A.

Those are simple properties that should be true and are easy to test. Maybe you'd have some specific set of tests for certain cases and numbers, but there's a bigger chance you'd then miss that focus gets trapped in some part of your UI.
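A toy model of those tab/shift-tab properties (a bare focus ring standing in for a real UI, not anything from an actual toolkit):

    from hypothesis import given, strategies as st


    def tab(focus, n_elements):
        # move focus forward one element, wrapping around
        return (focus + 1) % n_elements


    def shift_tab(focus, n_elements):
        # move focus backward one element, wrapping around
        return (focus - 1) % n_elements


    @given(st.integers(min_value=1, max_value=50),    # number of elements
           st.integers(min_value=0, max_value=1000))  # number of tab presses
    def test_tab_then_shift_tab_returns_to_start(n_elements, presses):
        focus = 0
        for _ in range(presses):
            focus = tab(focus, n_elements)
        for _ in range(presses):
            focus = shift_tab(focus, n_elements)
        assert focus == 0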

I had a library for making UIs for TVs that could be shaped in any way (so a carousel with 2 vertical elements then one larger one, then two again for example). I had a test that for a UI, if you press a direction and the focus changes, then pressing the other direction you should go back to where you were. Extending that, I had that same test run but combined with "for any sequence of UI construction API calls...". This single test found a lot of bugs in edge cases. But it was a very simple statement about using the navigation.

I had a similar test - no matter what API calls are made to change the UI, either there are no elements on screen, or one and only one has focus. This was defined in the spec, but we also had defined in the spec that if there was focus, removing everything and then adding in items meant nothing should have focus. These parts of the spec were independently tested with unit tests, and nobody spotted the contradiction until we added a single test that checked the first part automatically.

This I find exceptionally powerful when you have APIs. For any way people might use them, some basic things hold true.

I've had cases where we were identifying parts of text, with properties like "given a random string, adding ' thethingtofind ' results in getting at least a match for ' thethingtofind ', with start and end points, and when we extract that part of the string we get ' thethingtofind '". Sounds basic, right? Almost pointless to add when you've seen the output work just fine and you have explicit test cases for this? It failed. At one step we lowercased things for normalisation and we took the positions from this lowercased string. If there was a Turkish I in the string, when lowercased it became two characters:

    >>> len("İ") == len("İ".lower())
    False

So anything where this appeared before the match string would throw off the positions.
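That kind of test looks roughly like this (find_spans is a hypothetical stand-in for the matcher; the real code and strategy were more involved):

    from hypothesis import given, strategies as st

    from mymodule import find_spans  # hypothetical: returns (start, end) spans of the term

    MARKER = " thethingtofind "


    @given(st.text(), st.text())
    def test_inserted_marker_is_found_exactly(prefix, suffix):
        text = prefix + MARKER + suffix
        spans = find_spans(text, MARKER)
        # there is at least one match, and extracting one of the reported
        # spans gives the marker back verbatim
        assert any(text[start:end] == MARKER for start, end in spans)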

Have a look at your tests; I imagine you have some tests like

"Doing X results in Y" and "Doing X twice results in Y happening once". Those two tests are really expressing "Doing X n times, where n >= 1, Y happens once". That describes the actual property you want to test much more explicitly as well.

You probably have some tests where you parameterise a set of inputs but actually check the same output - those could be randomly generated inputs testing a wider variety of cases.

I'd also suggest that if you have a library or application and you cannot give a general statement that always holds true, it's probably quite hard to use.

Worst case, point hypothesis at an HTTP API and have it auto-generate some tests that say "no matter what I send to this, it doesn't return a 500".
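A bare-bones sketch of that, assuming a service running locally at a placeholder URL (a real setup would more likely go through a test client, but the property is the same):

    import requests
    from hypothesis import given, strategies as st

    BASE_URL = "http://localhost:8000/api/things"  # placeholder endpoint


    @given(st.dictionaries(st.text(), st.text() | st.integers() | st.none()))
    def test_api_never_returns_500(payload):
        response = requests.post(BASE_URL, json=payload)
        # malformed input should be rejected cleanly, never with a server error
        assert response.status_code != 500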

In summary (and no this isn't LLM generated):

1. Don't test every case.

2. Don't check the exact output, check something about the output.

3. Ask where you have general things that are true. How would you explain something to a user without a very specific setup? (You can't do X more than once, you always get back Y, it's never more than you put in Z...)

4. See if you have numbers in tests where the specific number isn't relevant.

5. See if you have inputs in your tests where the input is a developer putting random things in (sets of short passwords, whatever)

On more complex logic you can often reason about many invariants. I don't think using sorting and then comparing to the existing sort is a good example. Precisely for the reason you mention.

A sort itself is a good example though. You can do much better than just comparing to an existing sort implementation, because you can easily check if an array is sorted with a linear scan. If your sort is stable, you can do an additional check: for all pairs of adjacent elements that compare as equal, their order is the same as in the input array.
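The stability check can be sketched like this (my_sort and its key parameter are assumptions about the function under test; the trick is tagging inputs with their original positions):

    from hypothesis import given, strategies as st

    from mymodule import my_sort  # hypothetical sort under test, assumed to accept a key= argument


    @given(st.lists(st.integers(min_value=0, max_value=5)))  # small range forces many equal keys
    def test_sort_is_stable(keys):
        # tag each key with its original position, then sort by the key only
        tagged = list(enumerate(keys))
        result = my_sort(tagged, key=lambda pair: pair[1])
        # adjacent elements with equal keys must keep their original relative order
        for (i, a), (j, b) in zip(result, result[1:]):
            if a == b:
                assert i < j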

I don't do the testing using Hypothesis, but I do a great deal of property based testing of Common Lisp implementations, where it has been invaluable.

One large testing loop is testing CL's compile function. This involves creating random valid lambda expressions (unnamed functions), compiling them, and running them on random input. The properties tested are things like "the compiler doesn't crash, the generated code doesn't crash, different versions of the code with different valid declarations compute the same thing".

More specific properties are tested for specific CL functions. For example, CL has a function subtypep which, given two types, says either that the first is or is not a subtype of the second, or that it couldn't figure it out. One can test this on randomly generated types, evaluating properties like: (subtypep T1 T2) should be consistent with (subtypep `(not ,T2) `(not ,T1)).

The point of PBT isn't that you write tests to cover every possible case, it's that PBT will cook up tests for all sorts of cases you will never come up with yourself. I have repeatedly found it kicking out the wackiest test inputs that happen to trigger bugs from strange interactions.