Comment by jazzpush2
8 hours ago
FWIW Zero-3 refers to a common strategy for sharding model components across GPUs (commonly called FSDP-2, Full Sharded Data Parallel). The "3" is the level of sharding (how much stuff to distribute across GPUs, e.g. just weights, versus optimizer state as well, etc.)
No comments yet
Contribute on Hacker News ↗