Full Unicode Search at 50× ICU Speed with AVX‑512

1 day ago (ashvardanian.com)

While this is definitely really cool, wouldn't people who need this type of speed usually just store the text to be searched already case folded?

I was really confused about the case folding, this page explained the motivation well https://jean.abou-samra.fr/blog/unicode-misconceptions

""" Continuing with the previous example of “ß”, one has lowercase("ss") != lowercase("ß") but uppercase("ss") == uppercase("ß"). Conversely, for legacy reasons (compatibility with encodings predating Unicode), there exists a Kelvin sign “K”, which is distinct from the Latin uppercase letter “K”, but also lowercases to the normal Latin lowercase letter “k”, so that uppercase("K") != uppercase("K") but lowercase("K") == lowercase("K").

The correct way is to use Unicode case folding, a form of normalization designed specifically for case-insensitive comparisons. Both casefold("ß") == casefold("ss") and casefold("K") == casefold("K") are true. Case folding usually yields the same result as lowercasing, but not always (e.g., “ß” lowercases to itself but case-folds to “ss”). """

One question I have is: why have a Kelvin sign that is distinct from Latin K, and other such indistinguishable symbols? To make quantities machine-readable (oh, this is not a 100K license plate or a money amount, but a temperature)? Or to make it easier for specialized software to display it in the correct place/units?

  • > One question I have is why have Kelvin sign that is distinct from Latin K and other indistinguishable symbols?

    Unicode has the goal of being a 1:1 mapping for all other character encodings. Usually weird things like this exist so there can be a 1:1 reversible mapping to some ancient character encoding.

  • They seem to have (if I understand correctly) degree-Celsius and degree-Fahrenheit symbols. So maybe Kelvin is included for consistency, and it just happens to look identical to Latin K?

    IMO the confusing bit is giving it a lower case. It is a symbol that happens to look like an upper case, not an actual letter…

Very cool and impressive performance.

I was worried by the article's use of U+212A (Kelvin symbol) as sample text (I find it confusing when Unicode "shadows" of normal letters exist; they are of course also dangerous in cases where they can be misinterpreted as the letter they look more or less exactly like), so I had to look it up [1].

Anyway, according to Wikipedia the dedicated symbol should not be used:

However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.

That was comforting, to me. :)

[1]: https://en.wikipedia.org/wiki/Kelvin#Orthography

  • > I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like

    Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.
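For the two look-alike characters discussed in this thread, Python's standard `unicodedata` module makes the distinction concrete. The Kelvin sign is *canonically* equivalent to Latin K, while the mathematical bold k mentioned below is only *compatibility*-equivalent to plain k:

```python
import unicodedata as ud

kelvin = "\u212a"  # U+212A KELVIN SIGN
assert kelvin != "K"
# Kelvin sign has a canonical (singleton) decomposition to Latin K,
# so even plain NFC normalization already unifies the two:
assert ud.normalize("NFC", kelvin) == "K"

bold_k = "\U0001d424"  # U+1D424 MATHEMATICAL BOLD SMALL K
# Only compatibility-equivalent: NFC leaves it alone, NFKC folds it.
assert ud.normalize("NFC", bold_k) == bold_k
assert ud.normalize("NFKC", bold_k) == "k"
```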

    • It's why the Unicode Collation Algorithm exists.

      If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:

        004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
        212A  ; [.2514.0020.0008] # KELVIN SIGN
      

      The numbers in the brackets are values on level 1 (base), level 2 (typically used for accents), and level 3 (typically used for case). So they compare identical under the UCA in almost every case, except when you really need a tiebreaker.

      Compare e.g.:

        1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K
      

      which would compare equal to those under a case-insensitive, accent-sensitive collation, but _not_ a case-sensitive one (case-sensitive collations are always accent-sensitive, too).
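      A toy sketch of the multi-level comparison, hardcoding just the three weight triples quoted above from allkeys.txt (real implementations build full per-string sort keys; this only compares single collation elements at a given strength):

```python
# Weight triples from allkeys.txt: (level 1 base, level 2 accent, level 3 case)
WEIGHTS = {
    "K": (0x2514, 0x0020, 0x0008),           # U+004B LATIN CAPITAL LETTER K
    "\u212a": (0x2514, 0x0020, 0x0008),      # U+212A KELVIN SIGN
    "\U0001d424": (0x2514, 0x0020, 0x0005),  # U+1D424 MATH BOLD SMALL K
}

def uca_equal(a, b, strength):
    # strength 1: base letters only; 2: + accents; 3: + case
    return WEIGHTS[a][:strength] == WEIGHTS[b][:strength]

assert uca_equal("K", "\u212a", 3)          # identical on all three levels
assert uca_equal("K", "\U0001d424", 2)      # equal when case is ignored
assert not uca_equal("K", "\U0001d424", 3)  # differ once case counts
```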


This article is about the ugliest — but arguably the most important — piece of open-source software I’ve written this year. The write-up ended up long and dense, so here’s a short TL;DR:

I grouped all Unicode 17 case-folding rules and built ~3K lines of AVX-512 kernels around them to enable fully standards-compliant, case-insensitive substring search across the entire 1M+ Unicode range, operating directly on UTF-8 bytes. In practice, this is often ~50× faster than ICU, and also less wrong than most tools people rely on today—from grep-style utilities to products like Google Docs, Microsoft Excel, and VS Code.

StringZilla v4.5 is available for C99, C++11, Python 3, Rust, Swift, Go, and JavaScript. The article covers the algorithmic tradeoffs, benchmarks across 20+ Wikipedia dumps in different languages, and quick starts for each binding.

Thanks to everyone for feature requests and bug reports. I'll do my best to port this to Arm as well — but first, I'm trying to ship one more thing before year's end.

Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?

  • Yes — fuzzy and phonetic matching across languages is part of the roadmap. That space is still poorly standardized, so I wanted to start with something widely understood and well-defined (ICU-style transforms) before layering on more advanced behavior.

    Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. It’s not a true bicameral script, but some characters are fold targets for older scripts, which currently forces too many fallbacks to the serial path.

In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough, this library doesn't provide grapheme cluster tokenization and/or boundary checking, which is one of the most useful primitives for this.
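A minimal sketch of the normalize-then-compare approach, using the standard library (note it skips the grapheme-boundary check, and any match offset refers to the normalized text, not the original bytes):

```python
import unicodedata

def normalize(s):
    # One reasonable normal form for case-insensitive matching:
    # Unicode full case folding, then canonical normalization.
    return unicodedata.normalize("NFC", s.casefold())

def pre_normalized_find(haystack, needle):
    # If both sides are stored pre-normalized, search reduces to a
    # plain substring scan (the "memcmp" part of the argument above).
    return normalize(haystack).find(normalize(needle))

assert pre_normalized_find("Temperature in \u212a", "k") != -1  # Kelvin sign
assert pre_normalized_find("Maße", "MASSE") != -1               # ß vs ss
```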

  • That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

    If you’re in control of all data representations in your entire stack, then yes, of course, but that’s hardly ever the case, and different tradeoffs are made at different times (e.g., storage in UTF-8 for efficiency, but an in-memory representation in UTF-32 for speed).

    • That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.


  • In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.

Would be cool if that could be integrated into Postgres :)

  • I was just about to ask some friends about it. If I’m not mistaken, Postgres began using ICU for collation, but not string matching yet. Curious if someone here is working in that direction?

Looks neat. What are all the genomic sequence comparisons in there for? Is this a grab bag of interesting string methods or is there a motivation for this?

  • Levenshtein distance calculations are a pretty generic string operation; genomics happens to be one of the domains where they are most used... and a passion of mine :)

> ICU has bindings for Rust that provide case-folding functionality, but not case-insensitive substring search.

> ICU has many bindings. The Rust one doesn’t expose any substring search functionality, but the Python one does:

Python's ICU support is based on ICU4C. Rust's ICU "bindings" are actually a new implementation called ICU4X, by developers who worked on i18n at Mozilla and Google and on ICU4C, with the goal of a cleaner, more performant implementation that is also memory safe. Maybe not relevant (as in substantially altering the benchmarks), but it's at least worth noting that the ICU backends aren't consistent throughout.

From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß; they are different characters. Google likely changed this because it didn't do what users wanted.

  • Ah, let's have a long discussion of this.

    Unicode avoids "different" and "same", https://www.unicode.org/reports/tr15/ uses phrases like compatibility equivalence.

    The whole thing is complicated, because it actually is complicated in the real world. You can spell the name of Gießen "Giessen" and most Germans consider it correct even if not ideal, but spelling Massachusetts "Maßachusetts" is plainly wrong in German text. The relationship between ß and ss isn't symmetric. Unicode captures that complexity, when you get into the fine details.

  • It isn't, until it is: how would you write it when ß isn't available on the keyboard?

    Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.

    Then German language users from de-CH region, consider Mass the correct way.

    Yeah, localization and internationalization are a mess to get right.

    • Case insensitivity is localized like anything else. I and i are equivalent, right? Not if you’re doing Turkish, then it’s I and ı, and İ and i.

      In practice you can do pretty well with a universal approach, but it can’t be 100% correct.
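      The Turkish case is easy to demonstrate: Python's `str.lower()` applies only the default, locale-independent mapping, so the Turkish pairing has to be written out by hand:

```python
# Default (locale-independent) case mapping treats I/i as a pair:
assert "I".lower() == "i"

# Turkish/Azerbaijani mapping, spelled out by hand since str.lower()
# is not locale-aware: I -> dotless ı, dotted İ -> i
TURKISH_LOWER = {ord("I"): "ı", ord("İ"): "i"}
assert "I".translate(TURKISH_LOWER) == "ı"
assert "İ".translate(TURKISH_LOWER) == "i"

# So a universal match that treats "I" == "i" is wrong for Turkish
# text, where "I" pairs with "ı" instead.
```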


  • I never understood why the recommended replacement for ß is ss. It is a ligature of sz (similar to & being a ligature of et) and is even pronounced ess-zet. The only logical replacement would have been sz, and it would have avoided the clash of Masse (mass) and Maße (measurements). Then again, it only affects whether the vowel before it is pronounced short or long, and there are better ways to encode that in written language in the first place.

    • I agree that writing it "sz" might have created less problems.

      However, it is likely that it has never been pronounced "sz", but always "ss" and the habit of writing "sz" for the double consonant may have had the same reason as the writing of "ck" instead of the double "kk".

  • The confusion likely stems from the relatively new introduction of the capitalized ẞ https://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F

    Maß capitalized (used to be) MASS.

    Funnily enough, Mass means one liter beer (think Oktoberfest).

    • Both Maß and Mass are valid spellings for a liter of beer ;) Not to confuse it with Maß, which just means any measurement, of course.

    • It's strange, because I would expect "maß" as the case insensitive search to match "MASS" in the search text, but "mass" should not match "Maß".
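    • That asymmetric behavior isn't expressible with plain Unicode case folding, which is what standards-compliant tools implement: folding defines a single equivalence, so if "maß" matches "MASS", then "mass" matches "Maß" too.

```python
# All four spellings fold to the same string, so under full case
# folding any two of them match each other -- the one-directional
# matching the comment asks for would need a different mechanism.
assert "maß".casefold() == "mass"
assert "Maß".casefold() == "mass"
assert "MASS".casefold() == "mass"
assert "mass".casefold() == "mass"
```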