An overly aggressive mock can work fine, but break much later

3 months ago (nedbatchelder.com)

The mock discussion still misses the real solution, which is to refactor the code so that you have a function that simply reads the file and returns the JSON; it's essentially a wrapper around open and doesn't need to be tested.

Then have your main function take that JSON as a parameter (or a class wrapping that JSON).

Then your code becomes the ideal code: stateless, with no interaction with the outside world. It's trivial to test, just like any other function that simply translates inputs into outputs (i.e. pure).

Every time you see the need for a mock, your first thought should be "how can I take the 90% or 95% of this function that is pure and pull it out, and separate the impure portion (side effects and/or stateful) that now has almost no logic or complexity left in it and push it to the boundary of my codebase?"

Then the complex pure part you test the heck out of, and the stateful/side effectful impure part becomes barely a wrapper over system APIs.
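
For the settings example from the article, that refactor might look like this (a minimal sketch):

    import json
    from pathlib import Path

    def read_user_settings() -> dict:
        # Impure wrapper: nothing but open-and-parse, so there is little to test here.
        with open(Path("~/settings.json").expanduser()) as f:
            return json.load(f)

    def add_two_settings(settings: dict) -> int:
        # Pure: test it by passing a plain dict, no mocking or file I/O needed.
        return settings["opt1"] + settings["opt2"]

    # Production: add_two_settings(read_user_settings())
    # Test:       assert add_two_settings({"opt1": 1, "opt2": 2}) == 3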

  • Funnily enough, I am preparing a simple presentation at work to speak about exactly that. The idea of separating "logic" from I/O and side effects is an old one and can be found in many architectures (like hexagonal architecture). There is plenty of benefit doing this, but testing is a big one.

    It should be obvious, but this is not something that seems to be taught in school or in most workplaces, and when it is, it's often through the lens of functional programming, which most just treat as a curiosity and not a practical thing to use at work. So I started to teach this simple design principle to all my junior devs, because it is actually quite easy to implement, does not need a complete shift of architecture or a big refactor when working on existing code, and is actually practical and useful.

  • > Then the complex pure part you test the heck out of, and the stateful/side effectful impure part becomes barely a wrapper over system APIs.

    In practice the issues I see with this are that the "side effect" part is usually either extensive enough to still justify mocking when testing it, or intertwined enough with your logic that it's hard to pull all the "pure" logic out. I rarely see functions that are 90-95% pure logic vs side effects.

    E.g. for the first, you could have an action that requires several sequenced side effects, and then your "wrapper over APIs" still needs validation that it calls the right APIs in the right order with the right params, for various scenarios. Enter mocks or fakes. (And sometimes people will get clever and say use pubsub or events for this, but... you're usually just making the full-system-level testing harder, as well as introducing less determinism around your consistency.)

    For the second, something like "do steps I and J. If the API you call in step J fails, unwind the change in I." Now you've got some logic back in there. And it's not uncommon for the branching to get more complex. Were you building everything in the system from first principles, you could try to architect something where I and J can be combined or consolidated in a way to work around this; when I and J are third party dependencies, that gets harder.

    • You're right that it's not always easy to separate pure from effectful code. But you definitely can (model it as a state machine, as sketched below) and I think it's worth it, especially if those steps can take a long time, can fail, might be executed in parallel, etc.

      For instance, I once worked on payment-related code at a large online retailer. The steps I and J from your example would have been calls to the payment gateway's API (payment initiation, actual payment request). There was also a step K (polling for payment confirmation) and even a step K' (a payment confirmation callback the gateway might or might not call before or after we got around to polling for the payment status ourselves). And often there was even user interaction in between (the 3DS/3DS2 credit card payment scheme that's common here in the EU). Every single one of those steps could fail for a myriad of reasons, e.g. time out, be rejected, … and we had to make sure we always failed gracefully and, most importantly, didn't mess up our payment or order records.

      Of course this was an old enterprise Java code base, created by people who had long left the company, and all this had been written just the way you imagine it. It was an absolute mess.

      Every single time I worked on this code base I secretly wished one of the original authors had heard of state machines, pure vs. effectful code, and unit tests.
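
      A minimal sketch of that separation (with hypothetical states and events, not the retailer's real flow): the pure part is a transition function, and a thin driver executes whatever effects it asks for.

        def next_state(state: str, event: str) -> tuple[str, list[str]]:
            # Pure: maps (current state, event) to (new state, effects to run).
            if state == "initiated" and event == "gateway_accepted":
                return "awaiting_confirmation", ["schedule_poll"]
            if state == "awaiting_confirmation" and event == "confirmed":
                return "paid", ["record_payment"]
            if state == "awaiting_confirmation" and event == "timed_out":
                return "failed", ["cancel_order", "notify_user"]
            return state, []  # unknown or out-of-order events change nothing

        # Unit tests can cover every transition without touching the gateway:
        assert next_state("awaiting_confirmation", "timed_out") == ("failed", ["cancel_order", "notify_user"])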

  • The risk to that approach is that you end up writing code that cannot deal with the real world problems of I/O, such as timeouts, failed reads, jitter, and other weird behaviour.

    Separating I/O from logic makes a lot of sense and makes tests much easier to write and code much easier to reason about, but you'll still need to implement some sort of mocking interface if you want to catch I/O problems.

    • > how can I take the 90% or 95% of this function that is pure and pull it out, and separate the impure portion (side effects and/or stateful) that now has almost no logic or complexity left in it

      They addressed this concern already. These are not contradictory approaches.

      1 reply →

  • I've had great success swapping in an in-memory database via dependency injection and just running 100% of the application, end to end.

    In the ideal case my tests start by writing some randomised data using the external API; I then update it (if applicable) using the external API, and finally read it, also using the external API, and compare the actual result with what I expected.

    I use randomised data to avoid collisions with other tests, which might cause flakiness and/or prevent running the tests concurrently. I avoid having seed data in the database if at all possible.

    It's the only approach I've found that can survive a major refactor of the codebase. Anything short of breaking the external API, which is typically a no-no anyway, shouldn't break these tests.

    Doing a refactor and being able to rely on the test suite for finding bugs and inconsistencies is amazing. Of course they won't find 100% of all bugs, but this way at least you know that a failing test means there's a problem in your production code.
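
    In outline it looks something like this (create_app and InMemoryDatabase are hypothetical stand-ins for whatever DI seam and in-memory backend the application provides):

      import uuid

      def test_create_and_read_user():
          app = create_app(db=InMemoryDatabase())   # hypothetical: swap the real DB via DI
          client = app.test_client()                # only the external API is used below

          name = f"user-{uuid.uuid4()}"             # randomised to avoid collisions between tests
          created = client.post("/users", json={"name": name}).get_json()
          fetched = client.get(f"/users/{created['id']}").get_json()

          assert fetched["name"] == name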

    • Thou shalt never mock the database, for thou shalt anger the database when thou mockest it.

      In all seriousness, I have found this to be a useful suggestion, because the purpose of a test is to make sure invariants don't break in real code. When you mock the database, you're excluding large amounts of real code from test.

  • I think the important thing is that the code look pure from a testing perspective.

    Say you've got a function that accesses a key-value store. Ideally, you can factor out the i/o so that you do all your reads up front and all your writes at the end, leaving a pure function in the middle. But if the code is too tangled up in its side effects for that, the next best thing is to create a fake KV store and then wrap the function like this:

      def doTest(input, initialState):
        kv = makeFake(initialState)
        result = doThing(kv, input)
        return result, kv.dumpContents()
    

    doThing isn't a pure function, but doTest is. Now you can write tests like this:

      result, outputState = doTest(input, initialState)
      assert (result, outputState) == expected
    

    I guess you could call that "imperative core, functional shell," lol.
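
    A minimal makeFake can just be a dict wrapped in the store's interface (the method names here are assumptions):

      class FakeKV:
          def __init__(self, initial):
              self._data = dict(initial)

          def get(self, key, default=None):
              return self._data.get(key, default)

          def put(self, key, value):
              self._data[key] = value

          def dumpContents(self):
              return dict(self._data)  # snapshot of the final state, for asserting against

      def makeFake(initialState):
          return FakeKV(initialState)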

  • This sounds great until one of your other functions calls that function.

    You're just describing dependency injection, but if you say that, people won't want to listen because doing that all the time sucks.

    • Dependency injection frameworks can be a pain, but basic dependency injection (e.g. pass a DB handle to anything that needs the DB) is a must. What's the alternative? Having everyone independently create their own DB connections?
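
      For example, with sqlite3 standing in for the real database:

        import sqlite3

        class UserRepo:
            def __init__(self, conn: sqlite3.Connection):
                self._conn = conn  # the handle is passed in, never created here

            def count_users(self) -> int:
                return self._conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

        # Production code passes a real connection; a test passes
        # sqlite3.connect(":memory:") with whatever schema it needs.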

      1 reply →

  • I agree with you; however, convincing an entire team of devs to explicitly separate out the interface of the impure parts of the code is very difficult.

    If you introduce a mocking library to the test portion of the codebase, most developers will start to use it as a way to shortcut any refactoring they don't want to do. I think articles like this that try to explain how to better use mocks in tests are useful, although I wish they weren't necessary.

You're just moving the problem one layer up.

    You still need to write a test for how it all comes together and you should write tests for your error handling. You need a mock to respond with an error.

    • Not necessarily. If your error handling looks like this:

        value, err := externalLibraryFunctionFoo(a, b)
        if err != nil {
          return nil, fmt.Errorf("calling foo: %w", err)
        }
      

      then you probably don't need to test it. All you're doing is bubbling up errors from other libraries' functions.

      2 replies →

This blog post talks as if mocking the `open` function is a good thing that people should be told how to do. If you are mocking anything in the standard library your code is probably structured poorly.

In the example the author walks through, a cleaner way would be to have the second function take the Options as a parameter and decouple those two functions. You can then test both in isolation.

  • > If you are mocking anything in the standard library your code is probably structured poorly.

    I like Hynek Schlawack's 'Don’t Mock What You Don’t Own' [1] phrasing, and while I'm not a fan of adding too many layers of abstraction to an application that hasn't proved that it needs them, the one structure I find consistently useful is to add a very thin layer over the parts that do I/O, converting between types that you own and whatever the actual thing needs.

    These layers should be boring and narrow (for example, never mock past validation you depend upon), doing as little conversion as possible. You can also rephrase general-purpose open()-style usage as application- or purpose-specific operations.

    Then you can either unittest.mock.patch these or provide alternate stub implementations for tests in a different way, with this approach also translating easily to other languages that don't have the (double-edged sword) flexibility of Python's own unittest.mock.
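
    For example, a purpose-specific wrapper over open() that returns a type you own might look like this (names are illustrative):

      import json
      from dataclasses import dataclass
      from pathlib import Path

      @dataclass
      class UserSettings:        # a type the application owns, not a raw dict
          opt1: int
          opt2: int

      def load_user_settings(path: Path) -> UserSettings:
          # Boring and narrow: read, parse, convert, nothing else.
          raw = json.loads(path.read_text())
          return UserSettings(opt1=raw["opt1"], opt2=raw["opt2"])

      # Tests can patch load_user_settings, or pass UserSettings(1, 2) straight in.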

    [1] https://hynek.me/articles/what-to-mock-in-5-mins/

  • > This blog post talks as if mocking the `open` function is a good thing that people should be told how to do. If you are mocking anything in the standard library your code is probably structured poorly.

    Valgrind is a mock of standard library/OS functions and I think its existence is a good thing. Simulating OOM is also only possible by mocking stuff like open.

    • > Valgrind is a mock of standard library/OS functions and I think its existence is a good thing.

      That is mostly wrong.

      Valgrind wraps syscalls. For the most part it just checks the arguments and records any reads or writes to memory. For a small number of syscalls it replaces the syscall rather than wrapping it (for instance calls like getcontext where it needs to get the context from the VEX synthetic CPU rather than the real CPU).

      Depending on the tool it can also wrap or replace libc and libpthread functions. memcheck will replace all allocation functions. DRD and Helgrind wrap all pthread functions.

      3 replies →

  • Details matter, but good test doubles here are important. You want to capture all calls to I/O and do something different. You don't want tests to break because someone has a different filesystem, didn't set up their home directory the way you expect, or, worse, is running two different tests at the same time and one test is changing files the other wants.

    Note that I said test doubles. Mocks are a bit over-specific: they are about verifying that functions are called at the right time with the right arguments, but the easy ability to set return values makes it easy to abuse them for other things (this abuse is good, but it is still abuse of the intent).

    In this case you want a fake: a smart service that, when you are in a test, sets up a temporary directory tree containing all the files that particular test needs in the state it needs them, and destroys the tree when the test is done (with an optional mode to keep it, which is useful for debugging when a test fails). Depending on your situation you may need something similar for network services, time, or other such things. Note that in most cases the filesystem itself is more than fast enough to use in tests, but you need isolation from other tests. There are a number of ways to create this fake; overriding open, or overriding a GetMyProgramDir function that the production code consults, are two that I can think of.
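
    One way to build that kind of fake in Python is a pytest fixture that owns the temporary tree (the layout below is just an example):

      import json
      import shutil
      import tempfile
      from pathlib import Path

      import pytest

      @pytest.fixture
      def program_dir():
          tmp = Path(tempfile.mkdtemp())
          (tmp / "config").mkdir()
          (tmp / "config" / "settings.json").write_text(json.dumps({"opt1": 1, "opt2": 2}))
          yield tmp               # the test reads and writes only inside this tree
          shutil.rmtree(tmp)      # comment this out to keep the tree for debugging a failure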

    • Your tests are either hermetic, or they're flaky.

      That means the test environment needs to be defined and versioned with the code.

    • Even in the case you mention, you really shouldn't be overriding these methods. Your load-settings method should take the path of the settings file as an argument, and then your test can set up all the fake files you want with something like Python's tempfile package.

      1 reply →

  • > This blog post talks as if mocking the `open` function is a good thing that people should be told how to do.

    It does. And this is exactly the problem, here!

    > TFA: The thing we want to avoid is opening a real file

    No! No, no, no! You do not 'want to avoid opening a real file' in a test.

    It's completely fine to open a real file in a test! If your code depends on reading input files, then your test should include real input files in it! There's no reason to mock any of this. All of this stuff is easy to set up in any unit test library worth its salt.

    • > then your test should include real input files in it! There's no reason to mock any of this.

      That's okay for testing some branches of your code. But not all. I don't want to have to actually crash my hard drive to test that I am properly handling hard drive crashes. Mocking[1] is the easiest way to do that.

      [1] For some definition of mock. There is absolutely no agreement found in this space as to what the terms used mean.

I feel like the #1 reason mocks break looks nothing like this and instead looks like: you change the internal behaviors of a function/method and now the mocks interact differently with the underlying code, forcing you to change the mocks. Which highlights how awful mocking as a concept is; it is of truly limited usefulness for anything but the most brittle of tests.

Don't test the wrong things; if you care about some precondition, that should be an input. If you need to measure a side effect, that should be an output. Don't tweak global state to do your testing.

  • Most of the real world is about manipulating the real world. For algorithms it is fine to depend on pure inputs/outputs. However, what we care about is that global state is manipulated correctly, so the integration tests that verify that are what are important. In most cases your algorithm shouldn't be unit tested separately, since it is only used in one place and changes when its caller changes: there is no point in extra tests. If the algorithm is used in many places, comprehensive unit tests are important, but when it is used only once they get in the way; the tests just inhibit changes to the algorithm as requirements change (you have to change the caller, the integration tests, and the redundant unit tests).

    As such I disagree. Global state is what you should be testing - but you need to be smart about it. How you set up and verify global state matters. Don't confuse the global state above with global variables; I mean the external state of the program before and after, which means network, files, time, and other I/O things.

    • I/O and global state are also just inputs that can be part of arrange-act-assert. Instead of mocking your database call to always return "foo" whenever the word "SELECT" appears in the query, insert a real "foo" into a real test database and perform a real query.

      Again I've heard "but what if my database/table changes so rapidly that I need the mock so I don't need to change the query all the time", in which case you ought to take a moment to write down what you're trying to accomplish, rather than using mocks to pave over poor architectural decisions. Eventually, the query fails and the mock succeeds, because they were completely unrelated.

      So far I've only seen mocks fail eventually and mysteriously. With setups and DI you can treat things mostly as a black box from a testing point of view, but when mocks are involved you need surgical precision to hit the right target at the right time.
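
      Concretely, with an in-memory sqlite database as the "real test database":

        import sqlite3

        def test_finds_foo():
            conn = sqlite3.connect(":memory:")                    # a real database, just disposable
            conn.execute("CREATE TABLE items (name TEXT)")
            conn.execute("INSERT INTO items VALUES ('foo')")      # arrange: a real row, not a canned answer

            rows = conn.execute(
                "SELECT name FROM items WHERE name = ?", ("foo",) # act: the real query runs
            ).fetchall()

            assert rows == [("foo",)]                             # assert: query and data both exercised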

      5 replies →

  • > you change the internal behaviors of a function/method and now the mocks interact differently with the underlying code, forcing you to change the mocks

    Rarely should a mock be “interacting with the underlying code”, because it should be a dead end that returns canned data and makes no other calls.

    If your mock is calling back into other code you’ve probably not got a mock but some other kind of “test double”. Maybe a “fake” in Martin Fowler’s terminology.

    If you have test doubles that are involved in a bunch of calls back and forth between different pieces of code then there’s a good chance you have poorly factored code and your doubles are complex because of that.

    Now, I won’t pretend changes don’t regularly break test doubles, but for mocks it’s usually method changes or additions and the fix is mechanical (though annoying). If your mocks are duplicating a bunch of logic, though, then something else is going on.

  • A lot of the time I see people (read: me) being lazy and making a mock that has nowhere near the behavior of the original. It's more like a very partial stub. I will mock an API with 20 possible calls by implementing only the 2 that I use. Unsurprisingly, this mock is closely tied to the current implementation.

    • I think making narrow use-case-specific test doubles is often better than a big shared one. You can even wrap the API with an interface that only exposes the methods you need.
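
      For example, narrowing a hypothetical mail API down to the one call a feature needs:

        from typing import Protocol

        class EmailSender(Protocol):                 # only the method we actually use
            def send(self, to: str, body: str) -> None: ...

        class RecordingSender:                       # narrow, use-case-specific double
            def __init__(self) -> None:
                self.sent: list[tuple[str, str]] = []

            def send(self, to: str, body: str) -> None:
                self.sent.append((to, body))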

> In Why your mock doesn’t work I explained this rule of mocking:

> Mock where the object is used, not where it’s defined.

For anyone looking for generic advice, this is a quirk of python due to how imports work in that language (details in the linked post) and shouldn't be considered universal.

Honestly I don't buy it. Worse, this is one of the reasons I prefer to do "minimal integration tests" instead of unit tests. Take the example snippet of code

    def get_user_settings() -> dict:
        with open(Path("~/settings.json").expanduser()) as f:
            return json.load(f)

    def add_two_settings() -> int:
        settings = get_user_settings()
        return settings["opt1"] + settings["opt2"]

and the very first comment just below

>>> The thing we want to avoid is opening a real file

and then the article goes around patching stdlib stuff etc.

But instead I would suggest the real way to test it is to actually create the damn file, fill it with the "normal" (fixed) content and then run the damn test.

This is because after years of battling against mocks of various sort I find that creating the "real" resource is actually less finicky than monkeypatching stuff around.

Apart from that: yeah, sure, the code should be refactored and the paths/resources moved out of the "pure logical" steps, but 1) this is an example and 2) this is the reality of most actual code, just 10x more complex and 100x more costly to refactor.
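
With pytest that test can stay tiny; something like this (the import path is hypothetical, and pointing $HOME at a temp directory assumes a POSIX system):

    import json

    from myapp.settings import add_two_settings  # hypothetical location of the snippet above

    def test_add_two_settings(tmp_path, monkeypatch):
        monkeypatch.setenv("HOME", str(tmp_path))   # make "~" resolve to a throwaway directory
        (tmp_path / "settings.json").write_text(json.dumps({"opt1": 3, "opt2": 4}))

        assert add_two_settings() == 7              # open(), json.load() and the logic all run for real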

  • That works fine for files, but what if the integration is with a third party service for example?

    You can create an actual mock networked service but it's much more work.

    I think this is an example explaining what seems to me like a good practice for using mocks in Python; the actual code in the post is barely "supporting cast".

    • If it's HTTP you can create the fixtures and serve them with a mock server. I'm a frontend dev, so backend APIs are like 3rd parties to me.

      I use a browser extension for scraping actual backend responses, which downloads them with a filename convention the mock server understands. I mostly use it for development, but also for setting up screenshot tests. For example,

        PATCH /select
        'api/user(locked-out).GET.423.json'

      screenshot the app and pixel diff it

        PATCH /select
        'api/user.GET.200.json'

      screenshot…

      2 replies →

The main reason your mock breaks later is that you refactored the code. You did the one thing tests are supposed to help you do, and the tests broke. If code was never modified, you wouldn't need automated tests. You'd just test it manually one time and never touch it again. The whole point of tests is that you probably will rewrite internals later as you add new features, improve performance or just figure out better ways to write things. Mock-heavy tests are completely pointless in this respect. You end up rewriting the code and the test every time you touch it.

There are really only a few reasons to use mocks at all. Like avoiding network services, nondeterminism, or performance reasons. If you need to do a lot of mocking in your tests this is a red flag and a sign that you could write your code differently. In this case you could just make the config file location an optional argument and set up one in a temp location in the tests. No mocking required and you're testing the real API of the config file module.
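
A sketch of that version (the default path matches the article's snippet):

    import json
    from pathlib import Path

    def get_user_settings(path: Path = Path("~/settings.json")) -> dict:
        with open(path.expanduser()) as f:
            return json.load(f)

    # In a test: write a real file into a temp directory and pass its path in.
    def test_get_user_settings(tmp_path):
        p = tmp_path / "settings.json"
        p.write_text('{"opt1": 1, "opt2": 2}')
        assert get_user_settings(p) == {"opt1": 1, "opt2": 2}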

If you make the function pure it will be easier to test. Pass the moving parts in as function parameters; then you can pass mocks into the actual function when testing. Example:

    f = () => a+b

refactor for easier testing

    f = (a, b) => a+b

in your test you can now mock a and b

> An overly aggressive mock can work fine, but then break much later. Why?

Because you are testing against implementation, not specification.

You’re welcome.

If you’re doing TDD, you could just view this as moving the “open” call to your unit test. As others point out, that encourages pure functions that can pipe in input from other sources than just file paths.

Arguably this is a problem of when the patch is unapplied.

Presumably in the coverage case it’s being called by a trace function, which inevitably runs during test execution — and while we want the trace function to be called during the test function, we really want it without any patches the test function is using. But this arguably requires both an ability for the trace function to opt-out of patches and for the patcher to provide a way to temporarily disable all of them.

Why even mock anything in this example? You need to read the source code to work out what to mock, reaching deep inside the code to name some method to mock.

But what if you just passed in the contents of the file or something?

Edit: oh wait actually this is what the very last line in the blog post says! But I think it should be emphasized more!

It is worth pointing out that you can often use containerized services as an alternative to mocking.
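
For example, with the testcontainers package for Python (a rough sketch; it assumes a local Docker daemon and a hypothetical create_app factory):

    from testcontainers.postgres import PostgresContainer

    def test_against_real_postgres():
        # Start a disposable, real Postgres instead of mocking the database layer.
        with PostgresContainer("postgres:16") as pg:
            url = pg.get_connection_url()      # hand this to the app's normal DB setup
            app = create_app(db_url=url)       # hypothetical factory for the code under test
            ...                                # exercise the app end to end against real SQL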