Comment by sjducb

1 day ago

It’s missing the time taken to instantiate a class.

I remember refactoring some code to improve readability, then observing something that was previously a few microseconds take tens of seconds.

The original code created a large list of lists. Each child list had 4 fields, and each field was a different thing: some were ints and one was a string.

I created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting. Modern Python developers would use a data class for this.

The new code was very slow. I’d love it if the author measured the time taken to instantiate a class.
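A minimal sketch of the measurement being asked for, using the standard-library `timeit` module. The `Record` and `RecordDC` classes and the four-field layout are hypothetical stand-ins, not the original code, and the numbers will vary by machine and Python version.

```python
import timeit
from dataclasses import dataclass


class Record:
    """Hypothetical hand-written class with four fields."""

    def __init__(self, a, b, c, name):
        self.a = a
        self.b = b
        self.c = c
        self.name = name


@dataclass
class RecordDC:
    """Hypothetical dataclass equivalent of Record."""

    a: int
    b: int
    c: int
    name: str


def build_lists(n):
    # The "before" shape: a list of 4-element lists.
    return [[i, i + 1, i + 2, "x"] for i in range(n)]


def build_instances(n):
    # The "after" shape: a list of class instances.
    return [Record(i, i + 1, i + 2, "x") for i in range(n)]


def build_dataclasses(n):
    return [RecordDC(i, i + 1, i + 2, "x") for i in range(n)]


def time_builders(n=100_000, repeat=5):
    """Return the best-of-`repeat` wall time in seconds for each builder."""
    results = {}
    for fn in (build_lists, build_instances, build_dataclasses):
        timings = timeit.repeat(lambda fn=fn: fn(n), number=1, repeat=repeat)
        results[fn.__name__] = min(timings)
    return results


if __name__ == "__main__":
    for name, seconds in time_builders().items():
        print(f"{name}: {seconds:.4f} s")
```

Comparing the three timings side by side should show whether instantiation alone can account for a slowdown of the size described, or whether something else in the refactor is the more likely cause.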

Instantiating classes is in general not a performance issue in Python. Your issue here strongly sounds like you're abusing OO to pass a list of instances into every method and downstream call (not just the usual reference to self, the instance at hand). Don't do that, it shouldn't be necessary. It sounds like you're trying to get a poor-man's imitation of classmethods, without identifying and refactoring whatever it is that methods might need to access from other instances.

Please post your code snippet on StackOverflow ([python] tag) or CodeReview.SE so people can help you fix it.

> created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting.

I went to the doctor and I said “It hurts when I do this”

The doctor said, “don’t do that”.

Edit: so yeah, a rather snarky reply. Sorry. But it’s worth asking why we want to use classes and objects everywhere. Alan Kay is well known for saying that object orientation is about message passing (a point mostly repeated by Erlang people).

A list of lists (where each list is four different types repeated) seems a fine data structure, which can be operated on by external functions and serialised pretty easily. Turning it into classes and objects might not be a useful refactoring; I would certainly want to learn more before giving the go-ahead.
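A small sketch of that approach, assuming a made-up field layout (`ID`, `QUANTITY`, `PRICE_CENTS`, `LABEL`) and made-up helper names; only the standard-library `json` module is used.

```python
import json

# Hypothetical field layout for each child list: [id, quantity, price_cents, label]
ID, QUANTITY, PRICE_CENTS, LABEL = range(4)


def total_cents(rows):
    """External function operating on the raw rows: sum of quantity * price."""
    return sum(row[QUANTITY] * row[PRICE_CENTS] for row in rows)


def labels(rows):
    """Pull the string field out of every row."""
    return [row[LABEL] for row in rows]


rows = [
    [1, 2, 250, "widget"],
    [2, 1, 1000, "gadget"],
]

# Lists of ints and strings round-trip through JSON with no custom encoder.
encoded = json.dumps(rows)
decoded = json.loads(encoded)
```

Naming the positions once at module level keeps the raw structure while avoiding bare magic indexes in the functions.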

  • The main reason why is to keep a handle on complexity.

    When you’re in a project with a few million lines of code and 10 years of history it can get confusing.

    Your data will have been handled by many different functions before it gets to you. If you do this with raw lists then the code gets very confusing. In one data structure customer name might be [4], and another structure might have it in [9]. Worse, someone adds a new field in [5]; then, when two lists get concatenated, name moves to [10] in downstream code which consumes the concatenated lists.

  • I mean it sounds reasonable to me to wrap the data into objects.

    customers[3][4]

    is a lot less readable than

    customers[3].balance
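    One middle ground, sketched here with `typing.NamedTuple`: the rows stay plain tuples underneath but gain field names. The `Customer` fields and the sample data are invented for illustration.

    ```python
    from typing import NamedTuple


    class Customer(NamedTuple):
        """Hypothetical record layout; balance happens to sit at index 4."""

        customer_id: int
        region: int
        tier: int
        name: str
        balance: int


    raw = [
        [1, 10, 2, "Ada", 500],
        [2, 11, 1, "Grace", 750],
    ]
    customers = [Customer(*row) for row in raw]

    # Positional and named access return the same value today.
    same_value = raw[1][4] == customers[1].balance

    # Inserting a new field upstream shifts every later position in a raw list.
    shifted = raw[1][:3] + ["new-field"] + raw[1][3:]
    # shifted[4] is now the name, and the balance has moved to shifted[5].
    ```

    Code that reads `.balance` keeps working if the field order changes in the class definition, whereas code that reads `[4]` silently picks up a different field.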

    • Absolutely

      But hidden in this is the failing of every SQL bridge ever: it’s definitely easier for a programmer to read customers[3].balance, but the trade-off is that I now have to provide class-based semantics for all operations, and that tends to hide things (oh you know, impedance mismatch).

      I would far prefer “store the records as plain as we can” and add on functions to operate over it (think pandas, which stores basically just ints, floats and strings, as it is numpy underneath).

      (Yes you can store pyobjects somehow but the performance drops off a cliff.)

      Anyway - keep the storage and data structure as raw and simple as possible and write functions to run over it. And move to pandas or SQLite pretty quickly :-)
