
Comment by raw_anon_1111

24 days ago

I’ve told you my background. I’m telling you that you are using the wrong tool for the job. It’s not an issue with the database. Even if you did need an OLAP database like Redshift, you are still treating it like an OLTP database as far as your ETL job goes. You really need to do some additional research.

I do not need JOINs. I do not need single row lookups or updates. I need a compute engine and efficient storage.

I need fast consumers, I need good materialized views.

I am not treating anything like an OLTP database; my opinion of OLTP is even harsher. They can’t even handle data from S3 without an insane amount of work.

I do not even think in terms of OLTP, OLAP, or whatever. I think in terms of what queries I want to run over what data, and how to do that with the feature set available.

If necessary, I will align all PostgreSQL tables on a timeline of discrete timestamps instead of storing things as intervals, to allow faster sequential processing.
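A minimal sketch of what that alignment could look like (the function name and the 1-minute grid are my own assumptions, not the commenter's actual schema): each half-open [start, end) interval is expanded into one row per tick on a fixed timestamp grid, so downstream processing becomes a pure sequential scan instead of interval/range logic.

```python
from datetime import datetime, timedelta

def explode_interval(start, end, step=timedelta(minutes=1)):
    """Expand a half-open [start, end) interval into discrete grid
    timestamps, so data can be stored one row per tick and scanned
    sequentially instead of queried with range predicates."""
    out = []
    ts = start
    while ts < end:
        out.append(ts)
        ts += step
    return out

rows = explode_interval(datetime(2024, 1, 1, 0, 0),
                        datetime(2024, 1, 1, 0, 5))
# five discrete ticks: 00:00 through 00:04
```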

I am saying that these systems as a whole are incapable of many of the things I’ve tried to do with them. I have managed to do many more valuable things with other systems, because those systems are actually capable.

It is laughable that the task of loading data from S3 into whatever schema you want is better done by tech outside the AWS universe.

I can paste this whole conversation into an LLM unprompted and I don’t really see anything I am missing.

The only part I am surely missing are nontechnical considerations, which I do not care about at all outside of business context.

I know things are nuanced and there are companies with PBs of data doing something with Redshift, but people do random stuff with Oracle as well.

  • And you honestly still haven’t addressed the main point - you are literally using the wrong tool for the job and didn’t do your research for the right tool. Even a cursory overview of Redshift (or Snowflake) tells you that it should be used for bulk inserts, aggregation queries, etc.

    Did you research how you should structure your tables for optimum performance in an OLAP database? Did you research the pros and cons of a column-based storage engine like Redshift versus the standard row-based storage engine in a traditional RDBMS? Not to mention that, depending on your use case, you might need Elasticsearch.

    This is completely a you problem for not doing your research and using the worst possible tool for your use case. Seriously, reach out to an SA at AWS and they can give you some free advice; you are literally doing everything wrong.

    That sounds harsh. But it’s true.

      • ClickHouse is column-based storage too. I can also apply delta compression, where gapless timestamp columns have essentially zero storage cost, or Gorilla, which gets nice compression out of irregular columns. I am aware of Redshift's AZ64 encoding, and it is a letdown.
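The delta-compression claim is easy to illustrate (a toy sketch, not ClickHouse's actual codec): delta-encoding a gapless timestamp column turns it into a constant stream after the first value, which run-length or entropy coding then stores at near-zero cost.

```python
def delta_encode(values):
    """Keep the first value as-is; store every later value as the
    difference from its predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# a gapless 1-second timestamp column (epoch seconds)
ts = list(range(1_700_000_000, 1_700_000_000 + 10))
deltas = delta_encode(ts)
# deltas[1:] is all 1s -- a single repeated symbol, which any
# run-length or entropy coder stores at almost no cost
```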

      I can change the sort order, just as Redshift does with its sort keys, to improve both compression and compute. Redshift does not really exploit this sort-key configuration as much as it could.
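The sort-order point is the same mechanism from a different angle (again a toy illustration, not any engine's internals): sorting a low-cardinality column lengthens value runs, which is exactly what RLE-style codecs exploit.

```python
import itertools

def run_lengths(col):
    """Collapse a column into (value, run_length) pairs -- a rough
    proxy for how well run-length-style compression will do."""
    return [(v, len(list(g))) for v, g in itertools.groupby(col)]

unsorted_col = ["a", "b", "a", "b", "a", "b"]
sorted_col = sorted(unsorted_col)
# sorting collapses six runs into two, so the identical data
# compresses far better under the right sort key
```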

      My own assessment is that I'm extremely skilled at making any kind of DB system yield to my will and pushing it to its limits.

      I have never used Redshift, ClickHouse, or Snowflake with one-by-one inserts. What I mentioned were S3 consumers: a library or a service optimized to work with S3's autoscaling, respecting SlowDown (something Redshift itself is incapable of respecting) and achieving enormous download rates -- some of the consumers I've used completely saturate the 200 Gbps network limits of certain EC2 instances. These consumers cannot be used in a one-by-one setting; the whole point is an insanely fast pipelining system with batched processing that interleaves network downloads with CPU compute. In the end, any data repackaging and compression is negligible compared to the download itself, so you can predict how long ingestion will take just from your peak download speed, because the compute side is fully optimized and pipelined.
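The consumer loop described above can be sketched in a few lines (the `SlowDown` exception and function names are my own stand-ins, not any specific library's API): back off exponentially whenever the store signals throttling, and overlap downloads with compute via a thread pool so the CPU work hides behind network time.

```python
import concurrent.futures
import time

class SlowDown(Exception):
    """Stand-in for S3's 503 SlowDown throttling response."""

def fetch_with_backoff(fetch, key, max_tries=5, base_delay=0.01):
    """Call fetch(key), sleeping with exponential backoff whenever
    the store answers SlowDown -- respecting the throttle instead
    of hammering it."""
    for attempt in range(max_tries):
        try:
            return fetch(key)
        except SlowDown:
            time.sleep(base_delay * 2 ** attempt)
    raise SlowDown(f"gave up on {key} after {max_tries} tries")

def ingest(keys, fetch, process, workers=8):
    """Pipeline: downloads run concurrently in a thread pool while
    the main thread does the CPU-side repackaging, so compute is
    interleaved with network transfer."""
    with concurrent.futures.ThreadPoolExecutor(workers) as pool:
        futures = [pool.submit(fetch_with_backoff, fetch, k)
                   for k in keys]
        return [process(f.result()) for f in futures]
```

A real consumer would swap `fetch` for an actual S3 `GetObject` call and detect the throttling error code from the response, but the shape of the loop is the same.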

      Now, it might just be Redshift has bugs and I should report them, but I did not have the experience of AWS reacting quickly to any of the reports I've made.

      I disagree; it's not a me problem. I am a bit surprised, after all I've written, that you're still implying I want OLTP and am using the wrong tool for the job. There are just some tools I would never pick because they don't work as advertised, and Redshift is one of them. There are much better in-memory compute engines that work directly with S3, and you can build any kind of trashy low-value pipeline with them; if you hit the memory limits of your compute system, there are much better compute-engine-plus-storage combos than Redshift. My belief is that Redshift is purely a nontechnical choice.

      Now, to steelman you, if you're saying:

      * data warehouse as managed service,

      * cost efficiency via guardrails,

      * scale by policy, not by expertise,

      * optimize for nontechnical teams,

      * hide the machinery,

      * use AWS-native bloated, slow or expensive glue (Glue, Athena, Kinesis, DMS),

      * predictable monthly bill,

      * preventing S3 abuse,

      * preventing runaway parallelism,

      * avoiding noisy-neighbor incidents (either by protecting me or protecting AWS infra),

      * intentionally constrained to satisfy all of the above,

      then yes, I agree: I am definitely using the wrong tool. But as I said, if the value proposition is nontechnical, I do not really care about that.
