Comment by swamp-agr
4 days ago
Anti-spam bot plugin for messengers:
- MVP for Telegram (since spam is part of their business model, it feels natural to start there)
- More precisely, a data pipeline that computes weights and measurements from word frequencies. Think of it as a small language model.
- More precisely, it is about morphological analysis of words across different languages. Unlike Meta with its regex-based dictionaries[0], I am porting a rule-based morphological analysis library written in Python[1] to the target programming language.
- More precisely, right now it is about understanding the DAWG (directed acyclic word graph) data structure by porting it from C++[2] to Haskell[3].
- Instead of introducing an FFI binding, and because I wanted to become more comfortable with LLMs, I am trying to approach their internals (or my possibly wrong vision of their internals) by building a small language model from a corpus of thousands of spam messages.
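To make the "weights from word frequencies" idea concrete, here is a minimal sketch in Python (chosen only for brevity; the actual project targets Haskell). It assumes a naive Bayes style log-odds score with add-one smoothing, which may not match the real pipeline. The function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); a real pipeline would
    # normalize words with morphological analysis first.
    return text.lower().split()


def train(messages):
    """Count word frequencies over a list of messages."""
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


def spam_score(text, spam_counts, ham_counts):
    """Sum of per-word log-odds with add-one smoothing; > 0 leans spam."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    score = 0.0
    for word in tokenize(text):
        p_spam = (spam_counts[word] + 1) / spam_total
        p_ham = (ham_counts[word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score


# Hypothetical toy corpora, for illustration only.
spam_counts = train(["win free crypto now", "free money click now"])
ham_counts = train(["see you at lunch", "meeting moved to friday"])
```

With these toy counts, `spam_score("free crypto", spam_counts, ham_counts)` comes out positive and `spam_score("lunch friday", spam_counts, ham_counts)` negative; words seen in neither corpus contribute zero.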
Links:
[0] duckling: https://hackage.haskell.org/package/duckling
[1] pymorphy2: https://github.com/pymorphy2/pymorphy2
[2] dawgdic/C++: https://code.google.com/archive/p/dawgdic/
[3] dawgdic/Haskell (work-in-progress): https://github.com/swamp-agr/dawgdic
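For readers unfamiliar with DAWGs, the core idea can be sketched as a trie whose identical subtrees are merged, so shared suffixes are stored once. The Python below is an illustrative toy under that assumption only: it is not how dawgdic builds or lays out its graph, and every name in it is hypothetical.

```python
class Node:
    def __init__(self):
        self.edges = {}      # char -> Node
        self.final = False   # True if a word ends at this node


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up, turning the trie into a DAWG."""
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Children are already canonical (and kept alive by the registry),
    # so their identities are enough to describe this node.
    key = (node.final,
           tuple(sorted((ch, id(child)) for ch, child in node.edges.items())))
    return registry.setdefault(key, node)


def count_nodes(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


def contains(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


words = ["walked", "walking", "talked", "talking"]
trie = build_trie(words)
trie_size = count_nodes(trie)   # measured before merging
dawg = minimize(trie, {})       # mutates the trie in place
dawg_size = count_nodes(dawg)
```

Lookups behave the same before and after merging; only the node count shrinks, because the shared "-alked" and "-alking" tails are stored once. That sharing is what makes the structure attractive for large morphological dictionaries.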