← Back to context

Comment by toast0

10 hours ago

> If you see SQL where someone wrote a SELECT and is then using a cursor to loop through those results and do other queries, you've found the person who is still thinking imperatively.

To a first approximation, yes. But 'client-side joins' can be a valuable tool when the database engine won't cooperate. For some queries and some engines, you can do a select with a join to get everything you need in one query, but a select to get a list of ids followed by a union of selects (not an IN query) to get details for each id will have the results at the client sooner, with less load on the database, at some potential loss of consistency. Your client needs to be within a reasonable round trip of the server or the two queries approach won't get the answer faster.

I'm not sure I follow. Fetching the data in one query should almost always be faster and more efficient - the same work is being done regardless, except now you have the additional overhead involved with a second query (network and connection overhead, parsing etc.)

  • Depends on how the database engine sets up the join. I've seen queries where the join ends up spilling into a temp table and it takes a lot longer to do it that way. IIRC, this can happen especially if you've asked the database engine to sort things.

    Same with UNION vs IN. If you union 10 queries for one row each by id, it'll hit the index every time; but if you do it with IN, maybe it decides to do an index scan which takes longer.

    You could say well the database engine is broken if client-side join works better than database-side join, and sure it probably is, but given the choice of fix the database engine or do a client-side join, I know which one is feasible in the short term.

    • In that case, I'd suggest using EXPLAIN on your query and CREATE INDEX on whatever requires full table scans.

      Worst case, you could CREATE MATERIALIZED VIEW whatever would've happened in that temp table.

      At the end of the day, your single query will be faster.

    • > database engine sets up the join

      Every major database uses nested loop, hash, or sort-merge joins. If the data is small enough to fetch in a second query, I don't see any scenario where it would make a query spill to disk when it otherwise wouldn't have (except those pesky OR's).

      Most JOIN issues can be resolved by composing using sub-queries/CTE's to defer a join to a smaller intermediary result, and the only time this generally can't be done is if you need a predicate on the joined table.

      > you've asked the database engine to sort things

      I've only found this to be true when there's an OR condition where the single query ends up doing a bitmap against every row, in which case a UNION will be faster since it's just appending two already-index-ordered streams.

      > Same with UNION vs IN

      A union is almost always going to be worse - most query planners cannot optimize across queries in a UNION, so each query is going to have separate ops vs. a single op with the IN. The only case I've seen this be true is for multi-column predicates with an OR clause against a composite index. I just tested the former in both Postgres and MSSQL and the UNION query cost was 2x the IN clause.

      > maybe it decides to do an index scan which takes longer

      This has not been my experience unless the table statistics are bad.