
Comment by n_u

9 days ago

Which part of the index are you putting in the buffer pool here? The postings list, the doc store or the terms dict?

Is it being cached for future queries or are you just talking about putting it in memory to perform the computation for a query?

I'm primarily looking at document lists and possibly the keyword-documents mapping.

Caching will likely be fairly tuned toward the operation itself. Since it's not a general-purpose DBMS, I can fairly accurately predict which pages will be useful to cache, and when read-ahead is likely to be fruitful, based on the operation being performed.

For keyword-document mappings, some LRU cache scheme is likely a good fit. When reading a list of documents, readahead is good (and I can inform the pool of how far to read ahead). When intersecting document lists, I can also generally predict when pages are likely to be re-read or needed in the future based on the position in the tree.

It will definitely need a fair bit of tuning, but overall the problem is greatly simplified by revolving around a few very specific access patterns.
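The hinted-caching idea described above can be sketched roughly like this. This is not the actual implementation being discussed; it's a minimal Python illustration where all names (`BufferPool`, `read_page`, the `readahead` parameter) are made up for the example: LRU eviction for point lookups into the keyword-document mapping, plus a caller-supplied readahead hint for sequential postings-list scans.

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache combining LRU eviction with caller-supplied
    readahead hints; names and structure are illustrative only."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity      # max pages held in memory
        self.read_page = read_page    # backing-store fetch (e.g. a disk read)
        self.pages = OrderedDict()    # page_id -> data, kept in LRU order
        self.misses = 0               # number of backing-store reads

    def _install(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)     # refresh LRU position
        else:
            self.misses += 1
            self.pages[page_id] = self.read_page(page_id)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
        return self.pages[page_id]

    def get(self, page_id, readahead=0):
        """Fetch one page; `readahead` tells the pool how many subsequent
        pages to prefetch, as when scanning a document list sequentially."""
        data = self._install(page_id)
        for i in range(1, readahead + 1):
            self._install(page_id + i)
        return data
```

The point of the sketch is the division of responsibility: the operation knows its access pattern, so it passes the hint, and the pool only has to implement a couple of simple policies rather than general-purpose heuristics.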

  • Ah interesting. Is your keyword-document map (aka term dict) too big to keep in memory permanently? My understanding is that at Google they just keep it in memory on every replica.

    Edit: I should specify they shard the corpus by document so there isn't a replica with the entire term dict on it.

    • Could plausibly fit in RAM; it's only ~100 GB in total. We'll see, will probably keep it mmap'ed at first to see what happens. It isn't the target of very many queries (relatively speaking) at any rate, so either way is probably fine.

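The "keep it mmap'ed and see what happens" approach from the last reply can be sketched as follows. This is a hypothetical illustration, not the real on-disk format: fixed-width `(term hash, postings offset)` records are written sorted, the file is memory-mapped, and lookups binary-search the mapping while the OS page cache decides which parts stay resident.

```python
import mmap
import struct
import tempfile

# Fixed-width record: 64-bit term hash, 64-bit postings-list offset.
# The layout and all names here are illustrative assumptions.
RECORD = struct.Struct("<QQ")

def write_dict(path, entries):
    """Write (term_hash, offset) records; entries must be sorted by hash."""
    with open(path, "wb") as f:
        for term_hash, offset in entries:
            f.write(RECORD.pack(term_hash, offset))

def lookup(mm, term_hash):
    """Binary-search the mapped file for a term's postings offset."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        h, off = RECORD.unpack_from(mm, mid * RECORD.size)
        if h == term_hash:
            return off
        elif h < term_hash:
            lo = mid + 1
        else:
            hi = mid
    return None

# Tiny demo: three terms, then lookups through the mapping.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    write_dict(tmp.name, [(3, 100), (7, 200), (9, 300)])
    with open(tmp.name, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        found = lookup(mm, 7)
        missing = lookup(mm, 4)
        mm.close()
```

The appeal of this design is that there is no explicit cache to tune for the term dict at all: if the ~100 GB structure is queried rarely, the kernel evicts cold pages on its own, and if it turns out to be hot it effectively ends up memory-resident anyway.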