Comment by raw_anon_1111

19 days ago

You realize “the pipeline” you have to build is literally just an Athena CTAS statement, “CREATE TABLE … AS SELECT * FROM …”. Yes, you can run this directly against S3 and it will create one big file

https://docs.aws.amazon.com/athena/latest/ug/ctas.html
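
For example, something like this (a sketch, not production DDL; the table, bucket, and column names are placeholders, and bucket_count = 1 is the usual trick to force a single output file):

    -- Athena CTAS: materialize a query result as Parquet in S3
    CREATE TABLE reporting_copy
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-bucket/reporting_copy/',
        bucketed_by = ARRAY['user_id'],
        bucket_count = 1
    ) AS
    SELECT * FROM source_table;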

I have a sneaking suspicion that you are trying to use Redshift as a traditional OLTP database. Are you also normalizing your tables like an OLTP database instead of like an OLAP one?

https://fiveonefour.com/blog/OLAP-on-Tap-The-Art-of-Letting-...

And if you are using any OLAP database for OLTP, you’re doing it wrong. It’s also a simple “process” to move data back and forth between Aurora (MySQL or Postgres) and your analytics side: either federate your OLTP database with Athena (handwavy because I haven’t done it), or the way I have done it, use one SELECT statement to export to S3 and another to import into your OLTP database.
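
On Aurora MySQL that pair of statements looks roughly like this (a sketch; the bucket, table names, and delimiters are placeholders, and the cluster needs an IAM role that grants S3 access):

    -- export: write the result set straight to S3
    SELECT * FROM denormalized_report
    INTO OUTFILE S3 's3://my-bucket/extracts/denormalized_report'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';

    -- import: load an exported part file back into an OLTP table
    LOAD DATA FROM S3 's3://my-bucket/extracts/denormalized_report.part_00000'
    INTO TABLE report_copy
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n';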

And before you say you shouldn’t have to do this: you have always needed some process to take your normalized data into denormalized form for reporting and analytics.

Source: doing boring enterprise stuff, including databases, since 1996, plus eight years working with AWS services both outside AWS (startups and consulting companies) and inside AWS (Professional Services, no longer there)

Why are you doing this manually? There is a built-in way to stream Kinesis Data Streams into Redshift

https://docs.aws.amazon.com/streams/latest/dev/using-other-s...
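
Streaming ingestion maps the stream into an external schema and materializes it with a view, roughly like this (a sketch; the role ARN, stream name, and view name are placeholders):

    -- map Kinesis into Redshift as an external schema
    CREATE EXTERNAL SCHEMA kds
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

    -- materialized view that Redshift keeps refreshing from the stream
    CREATE MATERIALIZED VIEW events_stream AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kds."my-stream"
    WHERE CAN_JSON_PARSE(kinesis_data);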

Also, while you can, through the Glue Catalog, point Redshift directly at S3, by default it definitely doesn’t use S3 as its storage.

These things cost money, and Redshift handling live ingestion from Kinesis is tricky.

There is no need for Athena; Redshift ingestion is a simple query that reads from S3. I don’t want to copy 10TB of data just to have it in one file. And yes, the default storage is a bit better than S3, but for an OLAP database there seems to be no proper column compression, and the data footprint is too big, resulting in slow reads if one is not careful.
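
For reference, the ingestion query in question is just a COPY (a sketch; the table, bucket, and role ARN are placeholders):

    -- Redshift bulk load straight from Parquet files in S3
    COPY events
    FROM 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;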

I mentioned ClickHouse; the data is obviously not in an OLTP schema.

I don’t have normalized data. As I mentioned, the ClickHouse consumer goes through 10TB of blobs and ends up with 15GB of postprocessed data in like 5-10 minutes; the slowest part is downloading from S3.

I am not willing to pay $10k+ a month for something that absolutely sucks compared to a proper OLAP db.

Redshift is just made for some very specific, bloated, throw-as-many-software-pipelines-at-it-as-you-can, pay-as-much-money-as-you-can workflows that I just don’t find valuable. Its compute engine and data representation are just laughably slow. Yes, it can be as fast as you want by throwing parallel units at it, but it’s a complete waste of money.

  • It seems like you want a time-series database, not an OLAP one. Every problem you described, you would also have with Snowflake or any other OLAP database

    • Thanks for having this discussion with me. I believe I don't want a time-series database. I want to be able to invent new queries and throw them at a schema, or create materialized views to make those queries faster, etc. I just don't find Snowflake or Redshift anywhere close to what they're selling.

      I think these systems are optimized for something else: probably organizational scale, predictable low-value workloads, large teams that just throw their shit at it and it works on a daily basis, and of course, it costs a lot.

      My experience of renting a $1k EC2 instance and slurping all of S3 onto it in a few hours, while Redshift was unable to do the same, made me stop considering these systems reliable for anything other than ritualistic, performative, low-value work.
