Comment by datadeft

7 months ago

You do not need to. Counting how many links are pointing to each document is sufficient if you know how long that link existed (spammers' link creation time distribution is widely different to natural link creation times, and there are many other details that you can use to filter out spammers).

> You do not need to.

Ranking means deciding which document (A or B) is better to return to the user when queried.

Not writing a traditional forward-algorithm to rank these documents implies one of the following:

- You write a "backward" algorithm (ML, regression, statistics, whatever you want to call it).

- You don't use algorithms to solve it. An army of humans chooses the rankings in real time.

- You don't rank documents at all.

> Counting how many links are pointing to each document is sufficient if you know how long that link existed

- Link-counting (e.g. PageRank) is query-independent evidence. If that's sufficient for you, you'll always return the same set of documents to each user, regardless of what they typed into the search box.

At best you've just added two more ranking factors to the mix:

  - document A
    qie:
        length: 2Kb
        misspellings: 14
        age: 18 months
      + in-links: 4
      + in-link-spamminess: 2.31E4
    qde:
        matches 2 of your keywords exactly
        matches a synonym of another of your keywords

  - document B
    qie:
        length: 3Kb
        misspellings: 7
        age: 5 months
      + in-links: 2
      + in-link-spamminess: 2.54E3
    qde:
        matches 1 of your keywords exactly
        matches 2 keywords by synonym

So I ask again:

- Which document matches your query better, A or B?

- How did you decide that, such that not only can you program a non-ML algorithm to perform the scoring, but you're certain enough of your decision that you can fix the algorithm when it disagrees with you (i.e. it stays debuggable and understandable by human search engineers)?
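To make the question concrete, here is a rough sketch of a hand-weighted (non-ML) scorer over the two listings above. The feature values are copied from the listings; the feature names, both weight sets, and the `score`/`rank` helpers are made up for illustration and are not anyone's real ranking function.

```python
# Feature values copied from the two listings above.
DOCS = {
    "A": {"misspellings": 14, "age_months": 18, "in_links": 4,
          "spamminess": 2.31e4, "exact_matches": 2, "synonym_matches": 1},
    "B": {"misspellings": 7, "age_months": 5, "in_links": 2,
          "spamminess": 2.54e3, "exact_matches": 1, "synonym_matches": 2},
}


def score(doc, weights):
    """Weighted sum over the features named in `weights`."""
    return sum(w * doc[name] for name, w in weights.items())


def rank(weights):
    """Document names ordered from highest to lowest score."""
    return sorted(DOCS, key=lambda name: score(DOCS[name], weights), reverse=True)


# Two hypothetical weightings an engineer might pick by hand.
favor_exact = {"exact_matches": 3.0, "synonym_matches": 1.0,
               "in_links": 0.5, "misspellings": -0.1}
favor_clean = {"exact_matches": 1.0, "synonym_matches": 1.0,
               "in_links": 0.5, "misspellings": -0.5}

print(rank(favor_exact))
print(rank(favor_clean))
```

With these invented weights the first set puts A ahead and the second puts B ahead, which is the point of the question: the code is trivially debuggable, but nothing in it tells you which weighting is the right one.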

  • A few minor nitpicks. PageRank is not just link counting; who is linking to the page matters. Among the linking pages, those that are ranked higher matter more. And how does one figure out their rank? By PageRank. It may sound a bit like chicken and egg, but it's fine: it's the fixed point of the self-referential definition.
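For anyone who wants to see the fixed point rather than take it on faith, here is a minimal power-iteration sketch. The damping factor of 0.85, the even redistribution of rank from pages without out-links, and the toy graph are my assumptions for illustration, not a description of any production system.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iterate the self-referential definition until the ranks stop changing.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its current rank to every target.
        for src, targets in links.items():
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Toy graph: "x" and "y" each have exactly one in-link,
# but x's linker ("hub") is itself linked to by three pages.
toy = {"a": ["hub"], "b": ["hub"], "c": ["hub"],
       "hub": ["x"], "lonely": ["y"], "x": [], "y": []}
ranks = pagerank(toy)
print(ranks["x"] > ranks["y"])
```

Plain link counting would tie `x` and `y` at one in-link each; the iteration ranks `x` higher because its single link comes from a better-ranked page.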

    PageRank-based ranking will not return the same set of pages. It's true that the ranking is global in the vanilla version of PageRank, but what gets returned in rank order is the set of qualifying pages. The set of qualifying pages is very much query-sensitive. PageRank also depends on a seed set of initial pages; these may also be set in a query-dependent way.
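A small sketch of that split, assuming "qualifying" simply means "contains every query term" (a simplification; real retrieval is looser). The `search` helper, the toy inverted index, and the global scores are all invented for illustration.

```python
def search(query, index, global_rank):
    """Return documents containing every query term, best global rank first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Query-dependent step: which documents qualify at all.
    qualifying = set.intersection(*(index.get(t, set()) for t in terms))
    # Query-independent step: order the qualifying set by the global score.
    return sorted(qualifying, key=lambda doc: global_rank[doc], reverse=True)


# Hypothetical inverted index (term -> documents) and global scores.
index = {"python": {"d1", "d2"}, "snake": {"d2", "d3"}, "tutorial": {"d1"}}
global_rank = {"d1": 0.5, "d2": 0.3, "d3": 0.2}

print(search("python", index, global_rank))
print(search("snake", index, global_rank))
print(search("python snake", index, global_rank))
```

The global scores never change between calls, yet each query returns a different list, because the qualifying set is what depends on the query.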

    All this is a little moot now, because PageRank, even defined in this way, stopped being useful a long time ago.

  • Statistical methods are debuggable. Is PageRank not debuggable? I am not sure where ML starts and statistics ends.

> spammers' link creation time distribution is widely different to natural link creation times

Yes, this is a statistical method. Guess what machine learning is, and what it actually excels at?
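As a toy version of the link-timing idea, here is one possible burstiness statistic: the largest share of a page's in-links created within any single time window. The statistic, the 7-day window, and both timestamp lists are assumptions made up for illustration; nothing in the thread says real spam filters work this way.

```python
def peak_window_share(timestamps, window):
    """Largest fraction of links created within any span of length `window`."""
    ts = sorted(timestamps)
    if not ts:
        return 0.0
    best = 0
    start = 0
    for end in range(len(ts)):
        # Shrink the window from the left until it spans at most `window`.
        while ts[end] - ts[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best / len(ts)


# Invented link creation times, in days since the page was published.
natural = [0, 30, 75, 140, 200, 290, 365]
bursty = [100, 100, 101, 101, 101, 102, 300]

print(peak_window_share(natural, window=7))
print(peak_window_share(bursty, window=7))
```

Where to put the cutoff between "natural" and "suspicious" is again a threshold someone has to choose, by hand or by fitting it to labeled data.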