Comment by swamp-agr

4 days ago

Anti-spam bot plugin for messengers:

- MVP version for Telegram (since spamming is a part of their business model, it feels natural to start with it)

- More precisely, a data pipeline for weights and measurements of word frequencies. Think of it as small language models.

- More precisely, it is about morphological analysis of words across different languages. Unlike Meta with its regex-based dictionaries[0], I am porting a rule-based morphological analysis library written in Python[1] into the target programming language.

- More precisely, right now it is about understanding the DAWG data structure by porting it from C++[2] to Haskell[3].

- I chose this over introducing FFI because I wanted to become more comfortable with LLMs: I am trying to approach their internals (or my possibly wrong vision of their internals) by building a small language model based on a corpus of thousands of spam messages.
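
For readers unfamiliar with the DAWG mentioned above, here is a minimal sketch of the core idea in Python (the language of pymorphy2). It is not how dawgdic or my port lays things out; all function names are made up for illustration. It builds a plain trie, then merges identical subtrees, which is what turns a trie into a DAWG: words that share suffixes end up sharing states.

```python
def build_trie(words):
    """Build a plain trie as nested dicts; the key "" marks end of word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def contains(trie, word):
    """Exact-match lookup: follow one edge per character."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "" in node


def count_trie_nodes(node):
    """Number of nodes in the unminimized trie."""
    return 1 + sum(
        count_trie_nodes(child) for key, child in node.items() if key != ""
    )


def minimize(node, registry):
    """Assign one id per distinct subtree; equal subtrees get the same id.

    `registry` maps a subtree signature to its id, so its final size is
    the number of states a minimal DAWG for the same word set would need.
    """
    signature = tuple(sorted(
        (key, True if key == "" else minimize(child, registry))
        for key, child in node.items()
    ))
    return registry.setdefault(signature, len(registry))


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
registry = {}
minimize(trie, registry)
print(count_trie_nodes(trie), "trie nodes vs", len(registry), "DAWG states")
```

With these four words the "a" and "o" branches lead to identical subtrees ("p", optionally followed by "s"), so they collapse into one. Real implementations store the result in compact arrays rather than dicts, which is where most of the porting effort goes.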

Links:

[0] duckling: https://hackage.haskell.org/package/duckling

[1] pymorphy2: https://github.com/pymorphy2/pymorphy2

[2] dawgdic/C++: https://code.google.com/archive/p/dawgdic/

[3] dawgdic/Haskell (work-in-progress): https://github.com/swamp-agr/dawgdic