Comment by simonw

1 month ago

This is a neat brute-force search system - it uses goroutines, one for each of the 1,200 books in the corpus, and has each one do a regex search against the in-memory text for that book.

Here's a neat trick I picked up from the source code:

    indices := fdr.rgx.FindAllStringSubmatchIndex(text, -1)

    for _, pair := range indices {
        start := pair[0]
        end := pair[1]
        leftStart := max(0, start-CONTEXT_LENGTH)
        rightEnd := min(end+CONTEXT_LENGTH, len(text))

        // TODO: this doesn't work with Unicode
        if start > 0 && isLetter(text[start-1]) {
            continue
        }

        if end < len(text) && isLetter(text[end]) {
            continue
        }

An earlier comment explains this:

    // The '\b' word boundary regex pattern is very slow. So we don't use it here and
    // instead filter for word boundaries inside `findConcordance`.
    // TODO: case-insensitive matching - (?i) flag (but it's slow)
    pattern := regexp.QuoteMeta(keyword)

So instead of `\bWORD\b` it does the simplest possible match and then checks to see if the character one index before the match and or one index after the matches are also letters. If they are it skips the match.

1 comment

simonw

never_inline 1 month ago

Spinning 1K goroutines per request doesn't feel right to me for some reason.

Isn't trigram search supposed to be better?

https://swtch.com/~rsc/regexp/regexp4.html