Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

16 hours ago (github.com)

Author here. In my opinion the README is the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.

  • Really practical teaching approach. I clicked in to see how safetensors are loaded and just kept reading. Thanks for sharing.

I feel like I learned twice as much in 10 minutes of reading this as I did from reading LLM for Dummies. Thank you.

The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.

Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.

I am looking for a plain and simple LLM inference implementation in C, and/or x86_64 assembly, and/or AMD GPU RDNA assembly.

Anybody?

It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(