← Back to context

Comment by l1n

8 hours ago

Seems similar to Microsoft DeepSpeed.

2 comments

l1n

Reply

bee_rider 7 hours ago

The compare against “DeepSpeed ZeRO-3” apparently.

jazzpush2 6 hours ago

FWIW Zero-3 refers to a common strategy for sharding model components across GPUs (commonly called FSDP-2, Full Sharded Data Parallel). The "3" is the level of sharding (how much stuff to distribute across GPUs, e.g. just weights, versus optimizer state as well, etc.)