Comment by mrkeen

7 months ago

> There is a semi-famous internal document he wrote where he argued against the other search leads that Google should use less machine-learning, or at least contain it as much as possible, so that ranking stays debuggable and understandable by human search engineers.

There's a lot of ML hate here, and I simply don't see the alternative.

To rank documents, you need to score them. Google uses hundreds of scoring factors (I've seen the number 200 thrown around, but it doesn't really matter whether it's 5 or 1,000). The point is that you need to combine these weighted factors into a single number to decide whether one result should sit above or below another.

So, if:

  - document A is 2 KB long, has 14 misspellings, matches 2 of your keywords exactly, matches a synonym of another of your keywords, and was published 18 months ago, and

  - document B is 3 KB long, has 7 misspellings, matches 1 of your keywords exactly, matches two more keywords by synonym, and was published 5 months ago

Are there any humans out there who want to write a traditional forward-algorithm to tell me which result is better?
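
To make the question concrete, here is a minimal sketch (in Python) of the kind of hand-written forward scorer this would require. Every weight in it is an assumption pulled out of the air, which is exactly the problem:

    # Hand-tuned forward scorer: a weighted sum of ranking factors.
    # Every weight below is invented for illustration; defending any
    # particular value is the hard part.
    WEIGHTS = {
        "length_kb":       -0.1,  # is shorter better? who knows
        "misspellings":    -0.5,
        "exact_matches":    3.0,
        "synonym_matches":  1.0,
        "age_months":      -0.2,  # fresher is better, presumably
    }

    def score(doc):
        return sum(WEIGHTS[factor] * value for factor, value in doc.items())

    doc_a = {"length_kb": 2, "misspellings": 14, "exact_matches": 2,
             "synonym_matches": 1, "age_months": 18}
    doc_b = {"length_kb": 3, "misspellings": 7, "exact_matches": 1,
             "synonym_matches": 2, "age_months": 5}

    print(score(doc_a))  # -3.8
    print(score(doc_b))  #  0.2 -> B wins, but only because of the made-up weights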

You do not need to. Counting how many links point to each document is sufficient if you know how long each link has existed (spammers' link-creation-time distribution is wildly different from natural link-creation times, and there are many other details you can use to filter out spammers).
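
As a sketch of the kind of signal that describes (the statistic, window, and threshold are illustrative assumptions, not anyone's production values): flag pages whose in-links arrived in one burst rather than accruing over time.

    # Spam campaigns tend to create links in a burst; organic links
    # trickle in over years. Flag pages where most in-link ages cluster
    # within a short window around the median.
    from statistics import median

    def looks_bursty(link_ages_days, window=7, threshold=0.8):
        if len(link_ages_days) < 5:
            return False  # too few links to judge
        m = median(link_ages_days)
        clustered = sum(1 for age in link_ages_days if abs(age - m) <= window)
        return clustered / len(link_ages_days) >= threshold

    natural = [3, 40, 95, 180, 260, 400, 700]   # links trickling in over years
    spammy  = [14, 14, 15, 15, 16, 16, 17, 17]  # one weekend of link farming
    print(looks_bursty(natural))  # False
    print(looks_bursty(spammy))   # True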

  • > You do not need to.

    Ranking means deciding which document (A or B) is better to return to the user when queried.

    Not writing a traditional forward-algorithm to rank these documents implies one of the following:

    - You write a "backward" algorithm (ML, regression, statistics, whatever you want to call it); a sketch follows this list.

    - You don't use algorithms to solve it. An army of humans chooses the rankings in real time.

    - You don't rank documents at all.
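
    As a sketch of that first option (the data and the use of scikit-learn are purely illustrative): instead of hand-picking weights, fit them from judged preference pairs.

        # "Backward" version: learn the factor weights from labeled
        # preferences instead of writing them by hand. Two mirrored
        # pairs stand in for the thousands a real system would use.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Each row is features(doc_x) - features(doc_y):
        # [length_kb, misspellings, exact, synonym, age_months]
        X = np.array([
            [ 1.0, -7.0, -1.0,  1.0, -13.0],  # B - A; judge preferred B -> 1
            [-1.0,  7.0,  1.0, -1.0,  13.0],  # A - B; the mirrored pair -> 0
        ])
        y = np.array([1, 0])

        model = LogisticRegression().fit(X, y)
        print(model.coef_)  # the learned weights, one per ranking factor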

    > Counting how many links are pointing to each document is sufficient if you know how long that link existed

    - Link-counting (e.g. PageRank) is query-independent evidence. If that's sufficient for you, you'll always return the same set of documents to each user, regardless of what they typed into the search box.

    At best you've just added two more ranking factors to the mix:

      - document A
        query-independent evidence (qie):
            length: 2 KB
            misspellings: 14
            age: 18 months
          + in-links: 4
          + in-link-spamminess: 2.31E4
        query-dependent evidence (qde):
            matches 2 of your keywords exactly
            matches a synonym of another of your keywords

      - document B
        query-independent evidence (qie):
            length: 3 KB
            misspellings: 7
            age: 5 months
          + in-links: 2
          + in-link-spamminess: 2.54E3
        query-dependent evidence (qde):
            matches 1 of your keywords exactly
            matches 2 keywords by synonym

    So I ask again:

    - Which document matches your query better, A or B?

    - How did you decide that, such that you can not only program a non-ML algorithm to perform the scoring, but are also certain enough of your decision to fix the algorithm when it disagrees with you (keeping ranking "debuggable and understandable by human search engineers")?

    • A few minor nitpicks. PageRank is not just link counting; who is linking to the page matters. Among the linking pages, those that are ranked higher matter more. And how does one figure out their rank? By PageRank. It may sound like a chicken-and-egg problem, but it's fine: the ranking is the fixed point of that self-referential definition.
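
      The fixed point is easy to see in code. A toy power-iteration sketch (the graph, damping factor, and iteration count are illustrative; this toy graph conveniently has no dangling pages):

          # PageRank as a fixed point: keep redistributing rank along
          # links until the scores stop changing.
          def pagerank(links, d=0.85, iters=50):
              pages = list(links)
              rank = {p: 1 / len(pages) for p in pages}
              for _ in range(iters):
                  new = {p: (1 - d) / len(pages) for p in pages}
                  for page, outs in links.items():
                      for target in outs:
                          new[target] += d * rank[page] / len(outs)
                  rank = new
              return rank

          links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
          print(pagerank(links))  # "c" wins: it is linked by high-rank pages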

      PageRank-based ranking will not return the same set of pages. It's true that the ranking is global in the vanilla version of PageRank, but what gets returned, in rank order, is the set of qualifying pages, and that set is very much query-sensitive. PageRank also depends on a seed set of initial pages, which may likewise be chosen in a query-dependent way.

      All this is a little moot now, because PageRank, even defined this way, stopped being useful a long time ago.

    • Statistical methods are debuggable. Is PageRank not debuggable? I am not sure where ML starts and statistics ends.

  • > spammers' link-creation-time distribution is wildly different from natural link-creation times

    Yes, this is a statistical method. Guess what machine learning is, and what it actually excels at?

For a few months last year, every time I searched for information related to software available in Homebrew, the first few results pointed to a site that had clearly just crawled all of the links in Homebrew and templated out a site of links corresponding to each package name. And that's about it. It would have been nice if the generated pages contained any useful information, but alas, they did not.

There's got to be a better way.