Comment by jimmaswell
16 hours ago
My partner has been trying various models on our server, but we haven't gotten anything to run at a usable speed. It's a Q30H engineering sample (Xeon 8570) build with two CPUs, 56 cores per CPU, and 768GB of DDR5 RAM running at 5600 MT/s. It has two old 3090s in it at the moment, connected with an NVLink, and we could put our third in there. We built this server before the prices skyrocketed because we happened across some Tyan boards on Woot that were absurdly cheap for what they are (the motherboards should be $1000+ but we got them for a few hundred).
This thing sounds like it should be a monster, but we keep running into issues with the old GPU architecture, lack of support for AMX, or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2, which fits in memory, but the custom kernels we had to contrive were too slow. I feel like this system should have tons of potential, especially if something were designed to let the AMX and huge system memory shine.
Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.
We have a similar system on a single-CPU Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.
Yes, this should be a monster machine. Ampere is an older generation, so I expect that's where some of your issues have come from.
Start with a quant: you can run the Qwen 27B model at 4-bit on one 3090, and presumably at 6- or 8-bit across 2x3090.
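For a rough sanity check on those quant sizes, here is a back-of-the-envelope sketch in Python. It assumes exactly 27 billion parameters and a flat bits-per-weight figure, and it counts weights only. Real quant formats carry some extra overhead per block, and the KV cache and activations need VRAM on top, so treat the numbers as a lower bound. The helper name `weight_gib` is made up for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Ignores quantization block overhead, KV cache, and activations.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


GPU_GIB = 24  # nominal VRAM of one RTX 3090

for bits in (4, 6, 8, 16):
    size = weight_gib(27, bits)  # assumption: 27B parameters
    if size < GPU_GIB:
        where = "fits on one 3090 (before KV cache)"
    elif size < 2 * GPU_GIB:
        where = "needs both 3090s"
    else:
        where = "does not fit in 2x3090 VRAM"
    print(f"{bits:>2}-bit: ~{size:.1f} GiB of weights, {where}")
```

By this estimate 4-bit comes to roughly 12.6 GiB, which leaves room for context on a single card. 6-bit is about 18.9 GiB, so the weights squeeze onto one card but with little left for the KV cache, and 8-bit at about 25.1 GiB has to be split across both.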