Comment by n_u

4 days ago

Well, bi-word matching still requires that you keep all of the documents stored, so you can verify that the full phrase occurs in the document rather than just its bi-words. So it isn't always better.

For example, the phrase query "United States of America" doesn't occur in the document "The United States is named after states of the North American continent. The capital of America is Washington DC". But the bi-words "United States", "states of" and "of America" all appear in it.
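Here's a quick sketch of that false positive in Python. The tokenizer (lowercase, split on non-alphanumerics) and the helper names `tokens`, `biwords` and `contains_phrase` are just my assumptions for illustration, not how any particular engine does it.

```python
import re

def tokens(text):
    # Assumed tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def biwords(text):
    # Set of adjacent word pairs in the text.
    toks = tokens(text)
    return set(zip(toks, toks[1:]))

def contains_phrase(text, phrase):
    # True only if the phrase tokens occur contiguously and in order.
    doc, q = tokens(text), tokens(phrase)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

doc = ("The United States is named after states of the North American "
       "continent. The capital of America is Washington DC")
query = "United States of America"

# Every bi-word of the query is present somewhere in the document...
biword_match = biwords(query) <= biwords(doc)
# ...but the four words never appear next to each other.
phrase_match = contains_phrase(doc, query)

print(biword_match, phrase_match)
```

The bi-word check passes while the exact phrase check fails, which is why the candidates need a verification step.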

There's a tradeoff, because we still have to fetch the full document text (or some positional structure) for the filtered-down candidate documents containing all of the bi-word pairs. So it requires a second stage of disk I/O. But as I understand it, most practitioners assume you can get away with fewer IOPS than with a positional index, since that info only has to be fetched for a much smaller filtered-down candidate set rather than for the whole posting list.
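Roughly what I mean by the two stages, as a toy in-memory sketch. `BiwordIndex`, its `fetch` counter and the four sample documents are all made up for illustration; the dict stands in for on-disk document storage, and a real system would verify against whatever positional structure it keeps.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

class BiwordIndex:
    def __init__(self, docs):
        self.docs = docs                  # stand-in for document storage on disk
        self.postings = defaultdict(set)  # bi-word -> set of doc ids
        self.fetches = 0                  # counts second-stage document reads
        for doc_id, text in docs.items():
            toks = tokenize(text)
            for pair in zip(toks, toks[1:]):
                self.postings[pair].add(doc_id)

    def fetch(self, doc_id):
        self.fetches += 1
        return self.docs[doc_id]

    def phrase_search(self, phrase):
        q = tokenize(phrase)
        pairs = list(zip(q, q[1:]))
        if not pairs:
            return []  # single-word queries are out of scope for this sketch
        # Stage 1: intersect the posting lists of every bi-word in the query.
        candidates = set.intersection(*(self.postings.get(p, set()) for p in pairs))
        # Stage 2: fetch only the candidates and verify the exact phrase.
        hits = []
        for doc_id in sorted(candidates):
            toks = tokenize(self.fetch(doc_id))
            if any(toks[i:i + len(q)] == q for i in range(len(toks) - len(q) + 1)):
                hits.append(doc_id)
        return hits

docs = {
    1: "The United States is named after states of the North American "
       "continent. The capital of America is Washington DC",
    2: "The United States of America declared independence in 1776",
    3: "Canada is north of the United States",
    4: "Rainfall in the Amazon basin",
}
index = BiwordIndex(docs)
hits = index.phrase_search("United States of America")
print(hits, index.fetches)
```

Only the documents that survive the bi-word intersection get fetched in stage 2, so the extra I/O scales with the candidate set rather than with the posting lists.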

But that's why I was curious about the storage ratio of your positional index.