rogerrogerr 1 day ago
They’ll never reveal the data, because that would reveal this is all built on stolen work.

simonw 1 day ago
Some of the models DO reveal the data, and it's still built on "stolen work" in that it's unlicensed scrapes of the Web. Here's an example:
https://huggingface.co/allenai/OLMo-2-0325-32B
Here's one of their training mixes: https://huggingface.co/datasets/allenai/dolma3_pool, which includes 8 trillion tokens from Common Crawl.