Comment by teiferer

17 hours ago

This approach has two fundamental problems.

1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it. Is your function doing a+b? Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but at the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does. Any logical error that might slip into your SUT implementation has a high risk of also slipping into your test and will therefore be hidden by the complexity, even though it would be obvious from just looking at a few well-thought-through examples.

2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Tests for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. You as the developer know the implementation and have a chance of coming up with the edge cases. You know the dark corners of your code. Random tests are just playing the lottery, replacing thinking hard.

> Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b since the framework provides a and b. You can do a simpler version that's less efficient, but at the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.

Not true. For example, if `f` is `+`, you can assert that f(x,y) == f(y,x). Or that f(x, 0) == x. Or that f(x, f(y, z)) == f(f(x, y), z).
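As a rough Hypothesis sketch of exactly those checks (plain integer addition stands in for `f`, purely for illustration):

```python
from hypothesis import given, strategies as st

def f(x, y):
    return x + y  # stand-in for the implementation under test

@given(st.integers(), st.integers(), st.integers())
def test_addition_properties(x, y, z):
    assert f(x, y) == f(y, x)              # commutativity
    assert f(x, 0) == x                    # identity
    assert f(x, f(y, z)) == f(f(x, y), z)  # associativity
```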

Even a test as simple as "don't crash for any input" is actually extremely useful. This is fuzz testing, and it's standard practice for any safety-critical code, e.g. you can bet the JPEG parser on the device you're reading this on has been fuzz tested.
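A minimal version of that "don't crash" check in Hypothesis might look like the following; `parse_header` is an invented stand-in for whatever parser you want to harden:

```python
from hypothesis import given, strategies as st

def parse_header(data: bytes) -> dict:
    # Invented stand-in parser: rejects short inputs, otherwise reads a length field.
    if len(data) < 4:
        raise ValueError("too short")
    return {"length": int.from_bytes(data[:4], "big")}

@given(st.binary(max_size=64))
def test_parser_never_crashes_unexpectedly(data):
    # The only acceptable failure mode is a clean ValueError; anything else is a bug.
    try:
        parse_header(data)
    except ValueError:
        pass
```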

> You basically just give up and leave edge case finding to chance.

I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is that the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed. Doing this for a JPEG parser actually causes it to produce valid images, which you would never expect to happen by chance. See: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...

> Such a fuzzing run would be normally completely pointless: there is essentially no chance that a "hello" could be ever turned into a valid JPEG by a traditional, format-agnostic fuzzer, since the probability that dozens of random tweaks would align just right is astronomically low.

> Luckily, afl-fuzz can leverage lightweight assembly-level instrumentation to its advantage - and within a millisecond or so, it notices that although setting the first byte to 0xff does not change the externally observable output, it triggers a slightly different internal code path in the tested app. Equipped with this information, it decides to use that test case as a seed for future fuzzing rounds:

  • > I don't know anything about Hypothesis in Python, but I don't think this is true in general. The reason is that the generator can actually inspect your runtime binary and see what branches are being triggered and try to find inputs that will cause all branches to be executed.

    The author of Hypothesis experimented with this feature once, but people usually want their unit tests to run really quickly, regardless of whether they're property-based or example-based. And the AFL-style exploration of the branch space typically takes quite a lot longer than what people have patience for in a unit test that runs eg on every update to every Pull Request.

    • (Hypothesis maintainer here)

      Yup, a standard test suite just doesn't run for long enough for coverage guidance to be worthwhile by default.

      That said, coverage-guided fuzzing can be a really valuable and effective form of testing (see eg https://hypofuzz.com/).

I have not met anyone who says you should only fuzz/property test, but claiming it can’t possibly find bugs or is unlikely to is silly. I’ve caught numerous non-obvious problems, including a non-fatal but undesirable off-by-1 error in math-heavy code, thanks to property testing. It works well when it’s an NP-hard-style problem, where the code is harder than the verification. It does not work well for a+b, but for most problems it’s generally easier to write assertions that have to hold when executing your function. But if it’s not, don’t use it - like all testing, it’s an art to determine when it’s useful and how to write it well.

Hypothesis in particular does something neat where it tries to generate random inputs that are more likely to execute novel paths within the code under test. That’s not replicated in Rust, but it is super helpful for reaching more paths of your code, and that’s simply not something you can do manually if you have a lot of non-obvious boundary conditions.

> 1. It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert

No. That's one valid approach, especially if you have a simpler alternative implementation. But testing against an oracle is far from the only property you can check.

For your example: suppose you have implemented an add function for your fancy new data type (perhaps it's a crazy vector/tensor thing, whatever).

Here are some properties that you might want to check:

a + b == b + a

a + (b + c) = (a + b) + c

a + (-a) == 0

For all a and b and c, and assuming that these properties are actually supposed to hold in your domain, and that you have an additive inverse (-). Eg many of them don't hold for floating point numbers in general, so it's good to note that down explicitly.

Depending on your domain (eg https://en.wikipedia.org/wiki/Tropical_semiring), you might also have idempotence in your operation, so a + b + b = a + b is also a good one to check, where it applies.

You can also have an alternative implementation that only works for some classes of cases. Or sometimes it's easier to prepare a challenge than to solve it: eg you can randomly move around in a graph quite easily, and then check that the A* algorithm you are working on finds a route that's at most as long as the number of random steps you took.
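A rough sketch of that last idea, with a plain breadth-first search on a grid standing in for the A* implementation under test:

```python
from collections import deque
from hypothesis import given, strategies as st

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def path_length(start, goal):
    # Stand-in for the A* under test: BFS on an unbounded 4-connected grid.
    if start == goal:
        return 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        (x, y), dist = frontier.popleft()
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))

@given(st.lists(st.sampled_from(MOVES), max_size=15))
def test_route_no_longer_than_random_walk(steps):
    # The random walk itself is a valid route, so the route the solver finds
    # can never be longer than the number of steps taken.
    x, y = 0, 0
    for dx, dy in steps:
        x, y = x + dx, y + dy
    assert path_length((0, 0), (x, y)) <= len(steps)
```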

> 2. Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of. You basically just give up and leave edge case finding to chance. Tests for 0 or -1 or 1-more-than-list-length are obvious cases which both you the human test writer and some test framework can easily generate, and they are often actual edge cases. But what really constitutes an edge case depends on your implementation. [...]

You'd be surprised how often the generic heuristics for edge cases actually work and how often manual test writers forget that zero is also a number, and how often the lottery does a lot of the rest.

Having said this: Python's Hypothesis is a lot better at its heuristics for these edge cases than eg Haskell's QuickCheck.

  • > a + b == b + a

    > a + (b + c) = (a + b) + c

    > a + (-a) == 0

    Great! Now I have a stupid bug that always returns 0, so these all pass, and since I didn't think about this case (otherwise I'd not have written that stupid bug in the first place), I didn't add a property about a + b only being 0 if a == -b and boom, test is happy, and there is nothing that the framework can do about it.

    Coming up with those properties is hard for real-life code and my main gripe with formal-methods-based approaches too, like model checking or deductive proofs. They move the bugs from the (complicated) code to the list of properties, which ends up just as complicated and error-prone, and is entirely un...tested.

    Contrast that with an explicit dead-simple test. Test code doesn't have tests. It needs to be orders of magnitude simpler than the system it's testing. Its correctness must be obvious. Yes, it is really hard to write a good test. So hard that it should steer how you architect your system under test and how you write code. How can I test this to have confidence that it's correct? That must be a guiding principle from the first line of code. Just doing this as an afterthought by playing the lottery and trying to come up with smart properties after the fact is not going to get you the best outcome.

    • For the example it would be easily caught with PBT by testing the left and right identities for addition.
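      A minimal sketch of that check, with a deliberately broken `add` playing the role of the buggy implementation from the parent comment:

      ```python
      from hypothesis import given, strategies as st

      def add(a, b):
          return 0  # the "always returns 0" bug described above

      @given(st.integers())
      def test_zero_is_left_and_right_identity(a):
          # Fails immediately for any non-zero a, exposing the bug.
          assert add(a, 0) == a
          assert add(0, a) == a
      ```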

      In general though, the advice is: don't be excessively dogmatic. If you can't devise a property to test, use traditional testing with examples. It's not that hard. It's the same way you deal with end-to-end tests or unit tests not being useful for certain things: use the appropriate test style and approach for your problem.

> Then instead of asserting that f(1, 2) == 3 you need to do f(a, b) == a+b

Not really, no, it's right there in the name: you should be testing properties (you can call them "invariants" if you want to sound fancy).

In the example of testing an addition operator, you could test:

1. f(x,y) >= max(x,y) if x and y are non-negative

2. f(x,y) is even iff x and y have the same parity

3. f(x, y) = 0 iff x=-y

etc. etc.
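A rough sketch of those three checks (again with ordinary integer addition standing in for `f`):

```python
from hypothesis import given, strategies as st

def f(x, y):
    return x + y  # stand-in for the addition operator under test

@given(st.integers(), st.integers())
def test_addition_invariants(x, y):
    if x >= 0 and y >= 0:
        assert f(x, y) >= max(x, y)                # property 1
    assert (f(x, y) % 2 == 0) == (x % 2 == y % 2)  # property 2
    assert (f(x, y) == 0) == (x == -y)             # property 3
```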

The great thing is that these tests are very easy and fast to write, precisely because you don't have to re-model the entire domain. (Although it's also a great tool if you have 2 implementations, or are trying to match a reference implementation)

I feel like this talk by John Hughes shows that there is real value in this approach for production systems of varying levels of complexity, with two different examples of using it to find very low-level bugs that you'd never think to test for with traditional approaches.

https://www.youtube.com/watch?v=zi0rHwfiX1Q

> (...) but at the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.

I think you're manifesting some misconceptions and ignorance about property-based testing.

Property-based testing is still automated testing. You still have a SUT and you still exercise it to verify and validate invariants. This does not change.

The core trait of property-based testing is that instead of having to define and maintain hard-coded test data to drive your tests, which are specific realizations of the input state, property-based testing generates sequences of random input data, and in the event of a test failing, it follows up by employing reduction strategies to distil input values that pinpoint a minimum reproducible example.

As a consequence, tests don't focus on which specific value a SUT returns when given a specific input value. Instead, they focus on verifying more general properties of a SUT.

Perhaps the main advantage of property-based testing is that developers don't need to maintain test data anymore, and thus tests are no longer green just because you forgot to update the test data to cover a scenario or to reflect an edge case. Developers instead define test data generators, and the property-based testing framework implements the hard parts, such as the input distillation step.
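A rough illustration of that workflow (the record type and the deliberately buggy `total_quantity` are invented for the example):

```python
from dataclasses import dataclass
from hypothesis import given, strategies as st

@dataclass
class OrderLine:
    sku: str
    quantity: int

# The developer defines the generator; Hypothesis handles generation and shrinking.
order_lines = st.lists(
    st.builds(OrderLine, sku=st.text(min_size=1),
              quantity=st.integers(min_value=0, max_value=1000)))

def total_quantity(lines):
    return sum(line.quantity for line in lines[1:])  # deliberate bug: skips the first line

@given(order_lines)
def test_total_matches_sum(lines):
    # On failure, Hypothesis shrinks the input and typically reports a minimal
    # counterexample such as a single line with quantity=1.
    assert total_quantity(lines) == sum(l.quantity for l in lines)
```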

Property-based testing is no silver bullet though.

> Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of.

Your comment completely misses the point of property-based testing. You still need to exercise your SUT to cover scenarios. Where property-based testing excels is that you no longer have to maintain curated sets of test data, or update them whenever you update a component. Your inputs are already randomly generated following the strategy you specified.

Your comment is downvoted currently, but I think it has value in the discussion (despite being wrong in literally every respect) because it shows the immense misleading power of a single extraordinarily poorly chosen headline example on the project page. Testing a sorting function against the results of the sorted builtin is concise, and technically correct, but (even though there are situations where it would be exactly the right thing to do) it is completely misleading to anyone new to the concept when it comes to indicating what property-based testing is all about.

> It requires you to essentially re-implement the business logic of the SUT (subject-under-test) so that you can assert it.

It does not, and it would be next to worthless if it did. It requires being able to define the properties required of the SUT and to write code that can refute them if they are not present (the name "hypothesis" for this library is, in fact, a reference to that; PBT treats the properties of code as a hypothesis, and attempts to refute it).

> but in the end of the day, you somehow need to derive the expected outputs from input arguments, just like your SUT does.

No, see my reimplementation of the sorting example without resorting to the builtin (or any) sorting function other than the one being tested:

https://news.ycombinator.com/item?id=45825482
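In the same spirit, a rough sketch of sorting properties that never call a reference sort (here `my_sort` is just a placeholder for the implementation under test):

```python
from collections import Counter
from hypothesis import given, strategies as st

def my_sort(xs):
    return sorted(xs)  # placeholder; substitute the implementation under test

@given(st.lists(st.integers()))
def test_sort_properties(xs):
    out = my_sort(xs)
    # Same multiset of elements: nothing added, dropped, or duplicated.
    assert Counter(out) == Counter(xs)
    # Every element is <= its successor, i.e. the output is ordered.
    assert all(a <= b for a, b in zip(out, out[1:]))
```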

> Despite some anecdata in the comments here, the chances are slim that this approach will find edge cases that you couldn't think of.

You may think this based on zero experience, but I have seen no one who has tried Hypothesis even once who has had that experience. It's actually very good at finding edge cases.

> You as the developer know the implementation and have a chance of coming up with the edge cases.

You as the developer have a very good chance of finding the same edge cases when writing tests for your own code that you considered when writing the code. You have much less chance, when writing tests, of finding the edge cases that you missed when writing the code. You can incorporate the knowledge of probable edge cases you have when crafting Hypothesis tests just as with more traditional unit tests—but with traditional unit tests you have zero chance of finding the edge cases you didn't think of. Hypothesis is, actually, quite good at that.

> Random tests are just playing the lottery, replacing thinking hard.

Property-based testing doesn’t replace the thinking that goes into traditional unit testing, it just acts as a force multiplier for it.