Comment by ashvardanian

2 months ago

I get why it sounds that way, but it’s not actually true.

StringZilla added full Unicode case folding in an earlier release, and had a state-of-the-art exact case-sensitive substring search for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.

The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.

I go into the details in the “Ideation & Challenges in Substring Search” section of the article

0 comments

ashvardanian

No comments yet

Contribute on Hacker News ↗