Comment by CSMastermind
5 months ago
I hire software engineers off Upwork. Part of our process is a 1-hour take-home screening question that we ask people to solve. We always have a main one and an alternate for each role. I've tested all of ours on each of the main models, and none have been able to solve any of the screening questions yet.
> I've tested all of ours on each of the main models
Could you list them? I've noticed even quite techy people seem to be critically behind on what has happened in the last few months.
Sure, as of today, I test on:
GPT: 4o, o1 pro mode, o3-mini-high
Gemini: 2.0 Flash, 2.0 Pro Experimental
Claude 3.5 Sonnet
Grok 3
DeepSeek-V3
Mistral: codestral 25.01, mistral-large 24.11
Qwen2.5-Max
---
If there are others I should try definitely open to suggestions.
And ruin the benchmark? Come on, bro.
Can you provide a rough description of the class of the task? No details, obviously, but enough to understand what the models are struggling with.
Yes! Would be curious to learn more about this.
For mobile (React Native) our two questions are making an app that matches the design from a figma file or writing a bridge to a native library we provide.
For front-end we ask them to either match a mock from a figma file or write a small library that handles async data fetching efficiently.
For data we ask for either writing a simple scraper for a web page we host or we ask them to write a SQL script that does a tricky data transformation.
For back-end we either ask them to write a simple API that has some specified features on its routes like multi-sort or we ask them to come up with a SQL schema for a tricky use case.
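To give a flavor of the multi-sort idea (this is purely illustrative, not the actual screening question): an API route might accept something like `?sort=-age,name`, meaning sort primarily by age descending, then by name ascending. A minimal sketch in Python, with all names hypothetical:

```python
# Illustrative sketch of "multi-sort": order records by several fields,
# each with its own direction. A leading '-' marks descending.
from operator import itemgetter

def multi_sort(records, spec):
    """Sort a list of dicts in place by a spec like '-age,name'."""
    # Apply keys in reverse order; Python's sort is stable, so the
    # first field in the spec ends up as the primary sort key.
    for field in reversed(spec.split(",")):
        descending = field.startswith("-")
        key = field.lstrip("-")
        records.sort(key=itemgetter(key), reverse=descending)
    return records

users = [
    {"name": "ann", "age": 30},
    {"name": "bob", "age": 25},
    {"name": "cal", "age": 30},
]
print(multi_sort(users, "-age,name"))
# ann (30), cal (30), bob (25)
```

The stable-sort trick (sorting by the least significant key first) is the part candidates tend to miss when they try to build one composite comparison function instead.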
For 3D visualization we provide some data and ask a question about it; I'll share an example below.
For computer vision we ask about plane detection or locating an object in space given a segmented video and 3D model.
For AI we either ask them to find the right threshold for a similarity search in a vector database or we ask them to write a script to score the results of an AI process given a golden set of results.
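The golden-set scoring idea, in its simplest form, is just comparing each model output to the expected answer and reporting a match rate. A rough, hypothetical sketch (the normalization rule and names are my own assumptions, not the actual question):

```python
# Illustrative sketch of scoring AI outputs against a golden set
# using loose exact-match accuracy.

def normalize(text):
    """Loose comparison: case- and whitespace-insensitive."""
    return " ".join(text.lower().split())

def score(outputs, golden):
    """Return the fraction of outputs matching the golden answers."""
    assert len(outputs) == len(golden), "result sets must align"
    hits = sum(normalize(o) == normalize(g) for o, g in zip(outputs, golden))
    return hits / len(golden)

golden = ["Paris", "42", "blue whale"]
outputs = ["paris", "41", "Blue  Whale"]
print(score(outputs, golden))  # 2 of 3 match
```

Real versions usually swap the exact-match check for a fuzzy or embedding-based similarity, but the harness shape stays the same.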
For platform we ask them to write a script to do some simple static analysis or specify how they would implement authorization in our system.
We also have a few one-off questions for roles like search, infra, and native mobile. I also have some general data structures and algorithms questions.
Here's an example of one of the 3D Viz screens: https://docs.google.com/document/d/1yWLXvbGValKDsglaO5IUVgRS...
At least you are providing them with valuable training data, then. Maybe in a future model!
Is it really valuable data? The task is probably very niche, which is why all the models struggle with it, and it's unlikely to be solvable by a future model without specific training.
We send the candidates the screening questions in the form of a message that links to a Google Doc so I doubt they ended up in their training data.
Also I don't think our problems are particularly niche, it's completely reasonable that an LLM could solve them (and hopefully will in the future).