Comment by tom_mellior

4 years ago

Years ago I was working on a compiler frontend that was capable of reading multiple files into one program representation, as opposed to compilers that compile each input file completely separately. At some point we were trying to compile some project made up of many small files, and our frontend got very slow. However, when we concatenated all those files into one big source file, everything went smoothly.

I investigated. It turned out that we were keeping some sort of global symbol table structure. It was updated after parsing each file, something like this (the original was C++; this is a simplified reconstruction):

    std::vector<Symbol> global_symbol_table;

    void update_symbol_table(const std::vector<Symbol>& new_symbols) {
        global_symbol_table.insert(global_symbol_table.end(),
                                   new_symbols.begin(), new_symbols.end());
        // Keep sorted so we can do efficient lookups with binary search.
        std::sort(global_symbol_table.begin(), global_symbol_table.end());
    }

For N files, this meant N calls to sort() on a list that grew linearly in N: sorting after the i-th file costs O(i log i), and summing that over all N files gives something like O(N^2 log N) overall.

This has a few problems:

1. Even if you want to keep a "sorted list" representation, it would suffice to sort only the new symbols (the global table is already sorted, by its invariant) and then merge the two sorted lists, which is linear in their combined length.

2. But really, you want a set data structure (a balanced tree, say) that maintains order on insertion without any manual re-sorting.

3. But also, could we maybe speed up the lookups in some other way? I looked around for the uses of this global symbol table and found... none. We were keeping data around and updating it in the least efficient manner imaginable, without ever using that data.
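To make point 1 concrete, here is a sketch of the sort-and-merge fix, assuming a hypothetical `Symbol` type ordered by name (the real type and comparison are not shown in the original):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical Symbol type standing in for the real one.
struct Symbol {
    std::string name;
    bool operator<(const Symbol& other) const { return name < other.name; }
};

std::vector<Symbol> global_symbol_table;  // invariant: always sorted

void update_symbol_table(std::vector<Symbol> new_symbols) {
    // Sort only the new batch: O(k log k) for k new symbols.
    std::sort(new_symbols.begin(), new_symbols.end());
    // Append the batch; insert() returns an iterator to the first
    // appended element, i.e. the boundary between the two sorted runs.
    auto middle = global_symbol_table.insert(global_symbol_table.end(),
                                             new_symbols.begin(),
                                             new_symbols.end());
    // Merge the two sorted runs in place: O(n + k) instead of
    // re-sorting the whole table.
    std::inplace_merge(global_symbol_table.begin(), middle,
                       global_symbol_table.end());
}
```

Over N files this brings the total cost down to roughly O(N^2) for the merges (plus the small per-batch sorts), instead of O(N^2 log N).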
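And for point 2, the set-based version is even shorter; again assuming the same hypothetical `Symbol` type, a `std::set` keeps its elements ordered automatically, with O(log n) insertion and lookup and no re-sorting step at all:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical Symbol type standing in for the real one.
struct Symbol {
    std::string name;
    bool operator<(const Symbol& other) const { return name < other.name; }
};

// std::set is a balanced tree: always ordered, duplicates collapse,
// and both insert and find are O(log n).
std::set<Symbol> global_symbol_table;

void update_symbol_table(const std::vector<Symbol>& new_symbols) {
    global_symbol_table.insert(new_symbols.begin(), new_symbols.end());
}
```

(If lookups only need exact matches rather than ordered traversal, a hash set would do the same job in expected O(1) per operation.)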

I deleted the above function and the global symbol table, and performance was back to expected levels.