Comment by esquire_900
4 hours ago
Cost-wise it does not seem very effective. 0.5 tokens/sec (the optimized one) is 3600 tokens an hour, which costs about 200-300 watts for an active 3090 + system. Running 3600 tokens on OpenRouter at $0.40 per million tokens for Llama 3.1 (3.3 costs less) is about $0.00144. That money buys you about 2-3 watts (in the Netherlands).
Great achievement for privacy inference nonetheless.
I think we use different units. In my system there are 3600 seconds per hour, and watts measure power.
OP probably means watt-hours.
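A quick sketch of the comparison with the units straightened out. All inputs are assumptions pulled from or implied by the comment: the 0.5 tokens/sec local rate (which gives 1800 tokens per hour, not 3600), a mid-point 250 W system draw, the $0.40 per million output tokens OpenRouter price, and a guessed Dutch electricity tariff:

```python
# Back-of-envelope local-vs-API cost comparison (all figures assumed).
TOKENS_PER_SEC = 0.5        # local 3090 output rate quoted in the comment
SYSTEM_WATTS = 250          # assumed mid-point of the quoted 200-300 W draw
API_PRICE_PER_MTOK = 0.40   # assumed OpenRouter output price, $/million tokens
ELEC_PRICE_PER_KWH = 0.40   # assumed Dutch tariff, $/kWh (varies a lot)

tokens_per_hour = TOKENS_PER_SEC * 3600                  # 1800 tokens
local_cost = SYSTEM_WATTS / 1000 * ELEC_PRICE_PER_KWH    # $ per hour of power
api_cost = tokens_per_hour / 1e6 * API_PRICE_PER_MTOK    # $ for the same tokens

print(f"local electricity for one hour: ${local_cost:.4f}")
print(f"API price for the same tokens:  ${api_cost:.6f}")
print(f"local is roughly {local_cost / api_cost:.0f}x more expensive")
```

The exact multiplier swings with the tariff and the model price, but the two-orders-of-magnitude gap is robust to reasonable values.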
Something to consider is that input tokens have a cost too. They are typically processed much faster than output tokens. If you have long conversations, then input tokens will end up being a significant part of the cost.
It probably won't matter much here though.
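A small sketch of why input tokens come to dominate in long conversations: each turn re-sends the full growing history as input. The per-token prices below are illustrative assumptions, not real OpenRouter rates:

```python
# How input tokens add up over a long chat (prices are assumed, not real rates).
INPUT_PRICE_PER_MTOK = 0.20   # assumed $/million input tokens
OUTPUT_PRICE_PER_MTOK = 0.40  # assumed $/million output tokens

def api_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under the assumed prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1e6

# Early turn: small prompt, modest reply.
short = api_cost(input_tokens=500, output_tokens=500)
# Late turn: the whole conversation history is re-sent as input.
long_chat = api_cost(input_tokens=50_000, output_tokens=500)

print(f"early turn: ${short:.6f}")
print(f"late turn:  ${long_chat:.6f}")
```

Even with input priced at half the output rate, the late turn's cost is dominated by the re-sent history.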