← Back to context

Comment by vprcic

10 days ago

Each time a discussion about sorting starts, I'm reminded of a "lively debate" I had with my boss/friend about the most optimal sorting approach. He claimed it's O(n) pointing to counting sort as an example. This didn't sit well with me. A sorting algorithm, I insisted, should be defined something like "a function taking an unsorted array of elements and returning a sorted one". But it seems there is no agreed upon definition and you can have something like counting sort where you get an array in O(n) but still need O(m) to get the index of a specific element. I argued that then the best sorting algorithm is the do-nothing algorithm. It returns the initial array in O(1), but needd O(m log m) to get you the index of an element (it uses merge sort to do it).

counting sort is O(nW), where W is largest value

if you don't care about W or it is essentially constant - then it can be dropped

but it is an input parameter that will change execution time

  • It's O(n+W), not O(n*W).

    > if you don't care about W or it is essentially constant - then it can be dropped

    Also, every algorithm that ends in a real computer is bound to a constant time. That's still not a practical thing to do.

    • The simplification is correct but not correctly stated (even though colloquially it's common to state it that way). Technically it's when some component of the algorithm is dwarfed by another, then you can exclude it (i.e. O(n+W) =~ O(n) when W <<< n, O(mn) =~ O(n) when m <<< n etc). It's colloquially shortened to "when it's a constant" because generally it's constant regardless of program execution whereas the variable parts of the formula actually do change and thus the big-O analysis does still give you a correct lower bound on the performance.

      TLDR: You're being unhelpfully pedantic.

I can’t think of a single time I’ve needed a sorted list of only numbers. It’s always numbers and something else, like names or dates. Maybe for median calculations, but I don’t even use those that much either. Especially in telemetry, where mean is easy and median is not.

  • > I can’t think of a single time I’ve needed a sorted list of only numbers.

    Gosh. Let me try to convince you.

    I use permutation arrays all the time: lists of indexes that can be used across multiple vectors.

    This is much faster than the pattern of scanning rows, constructing tuples of (thingToSort . thingIWantInThatOrder) and making a custom sort function, and destructuring those tuples...

    And really, not having to write custom sort functions is really really nice.

    > Especially in telemetry, where mean is easy and median is not.

    Funny. Yes median is obvious with a permutation array, and maybe mean is less so.

    When your data is really big and not very variable, mean of x is roughly the same as the mean of any sufficient sample of x, and that sample can be meaningfully represented as a permutation array!

    You can get such an array with reservoir sampling and some maths, and (depending on what you know of your data and variance) sometimes even simpler tricks.

    That's kindof actually how the "faster than dijkstra" trick referred-to in the article works: Data sets with small variance has this same property that the min of x is roughly the same as the min of a sufficient sample of x (where the size of sufficient has to do with the variance). And so on.

    Another big use-case in my code is trees: Apter trees have a flat memory layout which is convenient for permutation arrays which can simultaneously represent index, rotation, tombstones, and all sorts of other things you might need to do with a tree.

    Give it a dig. There's good stuff in there.

  • To be pedantic, median is cheaper than sorting. O(n) with a quicksort-like algorithm.

    Also, if you're taking an average of floating point numbers, you might want to sort it first and add from smallest to largest, to better preserve precision

  • You have a list of IDs, and want to make them compact for storage or transport - fast and simple way is to sort and delta encode.

    • Hmm. That’s fair, though I’d probably use set operations instead. What you find though is that for most other problems besides diffing, ID order is not chronological order, so you need to sort by a date stamp instead. But I’m typically letting the database do that, so I’m a consumer of sorted numbers, but not an implementor. Because what I sort is nearly always compound sorts. By field A, then field B and field C if those two still don’t cut it.

  • If the primary key is the number, it still works (and dates are just numbers by the way) because you can sort a heterogenous dataset by a single numeric key pretty trivially.

    But sorting by arbitrary strings like names can’t avoid comparison sort.

    • That data structure isn’t an array of numbers, it’s an array of pointers to objects.

"ordering" means arranging things in order by some metric.

"sorting" means assigning things into bins (which are usually ordered).

  • This is news to me. Source?

    • That's because it's not true.

      https://www.merriam-webster.com/dictionary/ordering

      Order - transitive verb - 1. to put in order : arrange - "The books are ordered alphabetically by author."

      noun - 4. b(1) the arrangement, organization, or sequence of objects or of events - "alphabetical/chronological/historical order" "listed the items in order of importance"

      https://www.merriam-webster.com/dictionary/sorting

      Sort - transitive verb - 1. to put in a certain place or rank according to characteristics - "sort the mail" "sorted the winners from the losers" "sorting the data alphabetically"

      noun - 5. an instance of sorting - "a numeric sort of a data file"

      1 reply →

    • https://dictionary.cambridge.org/dictionary/english/sort

      to put a number of things in an order or to separate them into groups: Paper, plastic, and cans are sorted for recycling.

      sort something into something I'm going to sort these old books into those to be kept and those to be thrown away.

      sort something by something You can use the computer to sort the newspaper articles alphabetically, by date, or by subject.

      sort (through) She found the ring while sorting (through) some clothes.

    • "The children were sorted in to two lines by gender then ordered by height"

      You might substitute "sorted by height" but its certainly not a correction. While "ordered into lines" would be an error.