← Back to context

Comment by lhl

15 hours ago

I published this as a comment as well, but it's probably worth nothing that the ChatGPT water/power numbers cited (the one that is most widely cited in these discussions) comes from an April 2023 paper (Li et al, arXiv:2304.03271) that estimates water/power usage based off of GPT-3 (175B dense model) numbers published from OpenAI's original GPT-3 2021 paper. From Section 3.3.2 Inference:

> As a representative usage scenario for an LLM, we consider a conversation task, which typically includes a CPU-intensive prompt phase that processes the user’s input (a.k.a., prompt) and a memory-intensive token phase that produces outputs [37]. More specifically, we consider a medium-sized request, each with approximately ≤800 words of input and 150 – 300 words of output [37]. The official estimate shows that GPT-3 consumes an order of 0.4 kWh electricity to generate 100 pages of content (e.g., roughly 0.004 kWh per page) [18]. Thus, we consider 0.004 kWh as the per-request server energy consumption for our conversation task. The PUE, WUE, and EWIF are the same as those used for estimating the training water consumption.

There is a slightly newer paper (Oct 2023) that directly measured power usage on a Llama 65B (on V100/A100 hardware) that showed a 14X better efficiency. [2] Ethan Mollick linked to it recently and got me curious since I've recently been running my own inference (performance) testing and it'd be easy enough to just calculate power usage. My results [3] on the latest stable vLLM from last week on a standard H100 node w/ Llama 3.3 70B FP8 was almost a 10X better token/joule than the 2023 V100/A100 testing, which seems about right to me. This is without fancy look-ahead, speculative decode, prefix caching taken into account, just raw token generation. This is 120X more efficient than the commonly cited "ChatGPT" numbers and 250X more efficient than the Llama-3-70B numbers cited in the latest version (v4, 2025-01-15) of that same paper.

For those interested in a full analysis/table with all the citations (including my full testing results) see this o1 chat that calculated the relative efficiency differences and made a nice results table for me: https://chatgpt.com/share/678b55bb-336c-8012-97cc-b94f70919d...

(It's worth point out that that used 45s of TTC, which is a point that is not lost on me!)

[1] https://arxiv.org/abs/2304.03271

[2] https://arxiv.org/abs/2310.03003

[3] https://gist.github.com/lhl/bf81a9c7dfc4244c974335e1605dcf22