Comment by stingraycharles
2 days ago
That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.
If you’re in control of all data representations in your entire stack, then yes, of course, but that’s hardly ever the case, and different tradeoffs are made at different times (e.g., storage in UTF-8 for space efficiency, but an in-memory representation in UTF-32 for speed).
That doesn't make sense; the search does on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.
I get why it sounds that way, but it’s not actually true.
StringZilla added full Unicode case folding in an earlier release, and has had a state-of-the-art exact case-sensitive substring search for years. Even so, fully folding the entire haystack and then searching it is significantly slower than the new case-insensitive search path.
The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.
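Roughly, the shape is filter-then-verify. Here is a deliberately simplified scalar sketch of the idea (ASCII-only, hypothetical names; not StringZilla's actual kernels, which are SIMD and Unicode-aware):

```c
#include <stddef.h>
#include <ctype.h>

/* Hypothetical filter-then-verify search: a cheap probe on the first and
 * last bytes of the needle rules out most positions; the fold-and-compare
 * only runs on the rare surviving candidates. */
static const char *find_ci_ascii(const char *hay, size_t hay_len,
                                 const char *needle, size_t n_len) {
    if (n_len == 0 || n_len > hay_len) return NULL;
    char first = (char)tolower((unsigned char)needle[0]);
    char last = (char)tolower((unsigned char)needle[n_len - 1]);
    for (size_t i = 0; i + n_len <= hay_len; ++i) {
        /* Cheap probe: in real SIMD code these two comparisons cover
         * 32-64 positions per instruction. */
        if ((char)tolower((unsigned char)hay[i]) != first ||
            (char)tolower((unsigned char)hay[i + n_len - 1]) != last)
            continue;
        /* Rare path: fold just this needle-sized window,
         * never the whole haystack. */
        size_t j = 1;
        while (j + 1 < n_len &&
               tolower((unsigned char)hay[i + j]) ==
                   tolower((unsigned char)needle[j]))
            ++j;
        if (j + 1 >= n_len) return hay + i;
    }
    return NULL;
}
```

The two probe comparisons discard almost every position, so the inner fold loop rarely runs, and even when it does it only ever folds a needle-sized window.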
I go into the details in the “Ideation & Challenges in Substring Search” section of the article.
> it cannot be faster than normalization alone
Modern processors are generally computing stuff way faster than they can load and store bytes from main memory.
The code that does on-the-fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single-cycle access latency and ridiculously high throughput, on the order of 500 GB/s. Even if you have to store and reload, on-the-fly normalization handles tiny windows that fit in the in-core L1D cache. L1D access costs ~5 cycles of latency, with similarly high throughput, because many modern processors can load two 64-byte vectors and store one vector each and every cycle.
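For a sense of what “a window in registers” looks like, here is a minimal AVX2 sketch that case-folds and compares 32-byte windows without ever writing a normalized copy out to memory (ASCII-only for brevity; real Unicode folding needs far more logic):

```c
#include <immintrin.h>

/* Hypothetical AVX2 sketch: fold a 32-byte window entirely in registers.
 * ASCII-only; detects 'A'..'Z' via the signed-compare bias trick. */
static __m256i fold_ascii_32(__m256i v) {
    /* Unsigned "b - 'A' < 26" as a signed compare after biasing. */
    __m256i biased = _mm256_add_epi8(v, _mm256_set1_epi8(63));  /* 63 == -'A' - 128 (mod 256) */
    __m256i is_upper =
        _mm256_cmpgt_epi8(_mm256_set1_epi8(-102), biased);      /* -102 == 26 - 128 */
    /* Add 0x20 only to uppercase lanes: 'A'..'Z' -> 'a'..'z'. */
    return _mm256_add_epi8(v, _mm256_and_si256(is_upper, _mm256_set1_epi8(0x20)));
}

/* Compare two 32-byte windows case-insensitively, all in registers. */
static int windows_equal_ci(const char *a, const char *b) {
    __m256i va = fold_ascii_32(_mm256_loadu_si256((const __m256i *)a));
    __m256i vb = fold_ascii_32(_mm256_loadu_si256((const __m256i *)b));
    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb)) == -1;
}
```

Nothing here touches memory beyond the two unaligned loads; the fold itself is a handful of register-only instructions per 32 bytes.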
The author published the bandwidth of the algorithm; it's one fifth of typical memory bandwidth. (Obviously it's not possible to go faster than memory for this benchmark, since we're assuming the data is not in cache.)
It can be, because of how CPUs work with registers and hot code paths and all that.
First normalizing everything and then comparing normalized versions isn’t as fast.
And it also enables “stopping early” once a match has been found (or ruled out); you may not actually have to convert everything.
Running more code per unit of data does not make the code hotter or reduce register pressure; quite the opposite...