Comment by thaumasiotes
17 hours ago
> I gave the feedback at one Google interview that they should send Google employees through to see how many get hired. Good to see they basically tried that.
They did, but not with the intention of doing anything about the problem.
This is a question of reliability, the conceptual 'correlation' of a measurement instrument with itself when measuring the same thing.
Reliability is one of two major concepts in psychometrics, the other being validity, the conceptual correlation between a measurement instrument and that part of reality that you're hoping to measure.
The question behind validity is "I want to know X; if I measure Y, how helpful will that be?" And the question behind reliability is "if I measure Z, how accurate will that measurement be?"
https://en.wikipedia.org/wiki/Reliability_(statistics)
https://en.wikipedia.org/wiki/Construct_validity
Yegge calls out both concepts explicitly, though not by name, in this essay:
>> The outcomes from interviewing are statistically terrible. Google did wave upon wave of analysis over the years, and all the results were incredibly depressing.
>> [reliability] To name just a few off the top of my head: interviewers barely agreed with each other. Put the same candidate in front of two of our sharpest people and you’d routinely get a confident “strong hire” from one and a flat “no” from the other.
>> [validity, though the 'problem' here is strongly confounded by a restriction of range issue] And once people were actually on the job, their interview scores told you next to nothing about how they’d do
>> [reliability] Hell, some of our star performers failed their Google interviews four or five times, finally got in after 2+ years...
>> [validity] ...and then outshone everyone else.
The discussion of how interviewing outcomes are statistically terrible would benefit from naming the ways in which they're statistically terrible. Knowing the problem you have is an important step toward solving it.
(And as a side note, the last I heard from Google, you're not allowed to interview more often than once a year. Interviewing five times in two years would seem to violate that policy.)
It is a basic theorem that the validity of any instrument is bounded above by the square root of the reliability. It isn't possible for an unreliable instrument to be tightly correlated to reality, because it is, by definition, not tightly correlated with anything. That's what it means to be unreliable.
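To make the bound concrete, here is a minimal simulation sketch. It assumes the classical test theory model, in which an observed score is a true score plus independent noise. The function names, noise levels, and sample size are illustrative choices, not anything taken from the essay. Reliability is estimated as the correlation between two administrations of the same noisy instrument. Validity is estimated in the best case, where the thing you care about is the true score itself.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_bound(noise_sd, n=50_000, seed=0):
    """Return (reliability, validity) estimates for a noisy instrument.

    Assumption: observed = true_score + independent Gaussian noise.
    """
    rng = random.Random(seed)
    true_score = [rng.gauss(0, 1) for _ in range(n)]
    # Two independent administrations of the same instrument.
    first = [t + rng.gauss(0, noise_sd) for t in true_score]
    second = [t + rng.gauss(0, noise_sd) for t in true_score]
    reliability = pearson(first, second)
    # Best case for validity: the criterion is the true score itself.
    validity = pearson(first, true_score)
    return reliability, validity


for sd in (0.5, 1.0, 2.0):
    r, v = simulate_bound(sd)
    print(f"noise_sd={sd}: reliability={r:.3f} "
          f"validity={v:.3f} sqrt(reliability)={math.sqrt(r):.3f}")
```

Under this model the best-case validity sits at the square root of the reliability, up to sampling error. Any real criterion that is less than perfectly correlated with the true score can only come in lower.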
Thus, any company that wanted its hiring process to be good would necessarily be extremely concerned with making that process accurate; you need to come to the same decision when you assess the same person. This is something that interviews cannot achieve except at extreme cost. You'd need far more than five interviews to get a reliable assessment from them, despite the claim in this essay that "any more than four interviews and you're just playin' with your food". Of course, the Google interviews aren't supposed to be reliable anyway, so in that sense the claim is probably accurate.
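For a rough sense of "far more than five", the standard tool is the Spearman-Brown prediction formula, which gives the reliability of the average of k parallel measurements. The sketch below assumes interviews behave like parallel measurements, which is generous. The single-interview reliabilities and the 0.85 target are hypothetical numbers picked for illustration, not figures from Google or the essay.

```python
def spearman_brown(single_reliability, k):
    """Reliability of the average of k parallel measurements."""
    r = single_reliability
    return k * r / (1 + (k - 1) * r)


def interviews_needed(single_reliability, target):
    """Smallest k whose pooled reliability reaches the target."""
    if not (0 < single_reliability < 1 and 0 < target < 1):
        raise ValueError("both arguments must be strictly between 0 and 1")
    k = 1
    while spearman_brown(single_reliability, k) < target:
        k += 1
    return k


# Hypothetical single-interview reliabilities, for illustration only.
for r in (0.2, 0.3, 0.5):
    pooled = spearman_brown(r, 5)
    needed = interviews_needed(r, 0.85)
    print(f"single={r}: five interviews -> {pooled:.2f}, "
          f"interviews needed for 0.85 -> {needed}")
```

The lower the agreement between individual interviewers, the faster the required number of interviews grows, which is where the extreme cost comes from.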
The prescription Yegge offers is valid. Multi-month work assessments will give you a strong, reliable, and valid signal. They're also very expensive.
Another thing the essay completely glosses over is that this problem has been recognized for a long time, and we already know how to do assessments that are reliable, valid, and cheap to perform. They're called standardized tests.
At least historically, Google prioritized not hiring bad candidates over hiring good candidates. So it was neither a priority for interviews to be consistent (for good candidates) nor for employees to be able to consistently pass interviews.
That certainly makes sense as a goal, given the cost of hiring someone bad and then not being able to get rid of them.
The problem is that companies like Google that have evaluated their own hiring process, by comparing candidates' "hiring score" with subsequent on-the-job performance, have found that there is little correlation. So, while the goal (be more concerned about false positives than false negatives) makes sense, their process for achieving it is broken.
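One caveat on that finding is the restriction of range issue flagged in the top comment: on-the-job performance is only observed for people who were hired, and they all sit at the top of the score distribution. The simulation sketch below shows how much that alone can shrink the observed correlation. The true correlation of 0.5 and the 10% hire rate are made-up parameters, and it assumes hiring is a strict cutoff on the score, which real hiring is not.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_range_restriction(true_r=0.5, hire_fraction=0.1,
                               n=50_000, seed=0):
    """Return (correlation among all applicants, correlation among hires)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        score = rng.gauss(0, 1)
        # Performance correlates with score at true_r in the full pool.
        perf = true_r * score + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
        pairs.append((score, perf))
    # Hire only the top-scoring fraction (assumed strict cutoff).
    pairs.sort(key=lambda p: p[0], reverse=True)
    hired = pairs[: int(n * hire_fraction)]
    r_all = pearson([s for s, _ in pairs], [p for _, p in pairs])
    r_hired = pearson([s for s, _ in hired], [p for _, p in hired])
    return r_all, r_hired


r_all, r_hired = simulate_range_restriction()
print(f"all applicants: r={r_all:.2f}, hired only: r={r_hired:.2f}")
```

So a weak correlation among employees is what you would see even if the score were moderately valid across the whole applicant pool. It does not rescue the reliability problem, but it does muddy the validity evidence.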
Big companies already have standardized tests: rotating question banks with grading rubrics. Examiners (employees) will ask their favorite questions over and over to calibrate where a given candidate stands.
Serious question: what do you think of using IQ tests to hire SWEs? Should we just do that instead?
Yes, I do think it has merit. I think some kind of specialized IQ test that measures aptitude can be used for screening. Of course it's not a be-all and end-all, but it should significantly reduce the 'grinding-leetcode' situation.
Why not do that for all jobs? Forget resumes and work experience/accomplishments, and just hire based on a test score?
Perhaps we could administer this IQ test at age 12, so that the low-scoring individuals can go straight into the fast-food industry, and the rest can pick between the doctor/lawyer/SWE offers that will be showered upon them?
He said standardized test, not standard general cognitive test.
The tests that carry the particular branding "IQ" (Wechsler / Raven's / etc.) suffer from some problems in this regard - not very many questions exist and there are very large coaching effects. (Also, psychologists will tell you that getting an accurate result means you need the test to be administered by a trained psychologist. This is mostly nonsense, but to the extent you believe them, it's cost-prohibitive.)
Hiring from a test that measures IQ is a very good idea (and there is a test that's commonly used for hiring purposes, the Wonderlic†); hiring from "an IQ test" is a bit less good. Anyone who wants to subvert the Raven's test will be able to do that. High-stakes tests need more security.
The concept of "IQ" can be toxic in contemporary American politics, so there are many more tests that "happen" to test IQ than there are tests that advertise themselves as testing IQ.
† https://psycnet.apa.org/record/1982-00123-001 : "correlations between Wonderlic IQs and WAIS Full Scale IQs were [0.93] for the main group and [0.91] for the cross-validation group". Note that this test involves only basic math and takes 12 minutes of the candidate's time.
The decisions made by individual interviewers are extremely accurate if you realize they are just saying 'YES - I want this person hired' or 'NO - I don't want this person hired'. It's entirely subjective but likely very repeatable.
It's not repeatable; that's the whole point of describing how the same person gets wildly different results when they interview on multiple occasions.
Candidates are getting different outcomes on separate paths through the process because different people are interviewing them.