Comment by marginalia_nu

1 month ago

Cooking up an NSFW filter for Marginalia Search.

Pipeline so far has gone like this:

* Use the search engine's API to query a bunch of depravity

* Use qwen3.5 to label the search results and generate training data

* Try to use fasttext to create a fast model

* Get good results in theory but awful results in practice because it picks up weird features

* Yolo implement a small neural net using hand-selected input features instead

* Train using fasttext training data

* Do a pretty good job

* for (;;) Apply the model to a real-world link database and relabel positive findings with qwen to provide more training data
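The last step of the list above is essentially a relabeling loop. Here is a minimal sketch of one round of it, under loud assumptions: `relabel_round`, `train`, and `slow_label` are hypothetical names, the fast model is any callable that returns True for a suspected NSFW document, and the slow labeler stands in for the qwen labeling step. This is not the actual Marginalia code.

```python
from typing import Callable, Iterable

# (document text, is_nsfw label); the shape is an assumption for illustration
Example = tuple[str, bool]


def relabel_round(
    training_set: list[Example],
    unlabeled_docs: Iterable[str],
    train: Callable[[list[Example]], Callable[[str], bool]],
    slow_label: Callable[[str], bool],
) -> list[Example]:
    """Run one round: train the fast model, flag documents with it,
    and let the slow labeler confirm or reject each flagged document."""
    fast_model = train(training_set)
    seen = {text for text, _ in training_set}
    new_examples: list[Example] = []
    for doc in unlabeled_docs:
        if doc in seen:
            continue  # already labeled, nothing new to learn
        if fast_model(doc):
            # Only positive findings are sent to the slow, expensive labeler.
            new_examples.append((doc, slow_label(doc)))
            seen.add(doc)
    return training_set + new_examples


# Toy stand-ins so the sketch runs; both are hypothetical.
def toy_train(examples: list[Example]) -> Callable[[str], bool]:
    return lambda doc: "xxx" in doc


def toy_slow_label(doc: str) -> bool:
    return "xxx adult" in doc


initial = [("xxx adult site", True), ("cooking blog", False)]
docs = ["xxx adult video", "xxx roman numerals", "gardening tips", "cooking blog"]
result = relabel_round(initial, docs, toy_train, toy_slow_label)
```

Because only flagged documents are relabeled, each round mostly adds confirmed positives and corrected false positives; false negatives are not found this way.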

Currently this is where I'm at

  Accuracy:   90.90%
  True  Positive: 1021
  False Positive: 154
  True  Negative: 2816
  False Negative: 230
  Precision:  0.8689
  Recall:     0.8161
  F1:         0.8417

There's a lot of vague middle ground and many of the false positives are arguably just mislabeled.
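For anyone who wants to check the numbers, the summary figures above can be re-derived from the four confusion-matrix counts. A small sketch using the standard definitions; the function name `classification_metrics` is just for illustration, and it assumes none of the denominators are zero.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)   # share of flagged items that are truly positive
    recall = tp / (tp + fn)      # share of truly positive items that were flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the comment above.
metrics = classification_metrics(tp=1021, fp=154, tn=2816, fn=230)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With the counts from the post, this matches the reported figures to four decimal places.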

7 comments

SubiculumCode 1 month ago

Just want to say that I love your search engine for my ttrpg side projects to find obscure blogs, etc. thank you.

Bombthecat 1 month ago

Never heard of it. Is it like Google search? And why does it need an NSFW filter?

marginalia_nu 1 month ago

It's like Google search in 1998.
It needs an NSFW filter because some people want one, especially certain API consumers.

sscarduzio 1 month ago

Nice cover up for ... actually hoarding depravity ;)

marginalia_nu 1 month ago

It really is for scientific purposes! ;-)

jll29 1 month ago

For scientific search experiments, you might consider PyTerrier, which makes it easy to compare multiple retrieval model types: (sparse) vector space model; Boolean model; binary probabilistic model; support vector learning-to-rank model; Divergence from Randomness model; (dense) embedding-based ranked retrieval models; etc.
  