
Comment by vjerancrnjak

5 hours ago

Redshift does not fit into the AWS ecosystem. If you use Kinesis, you get up to 500 partitions with a bunch of tiny files. Now I have to build a pipeline after Kinesis that puts all of it into one S3 file, only to then import it into Redshift, which might again put it on S3-backed storage for its own file shenanigans.

ClickHouse, even chdb's in-memory magic, has a better S3 consumer than Redshift. It sucks up those Kinesis files like nothing.
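For context, reading those piles of small files from ClickHouse is a one-liner with the s3 table function (bucket path, glob, and the "value" column here are made up):

    SELECT count(), avg(value)
    FROM s3('https://my-bucket.s3.amazonaws.com/kinesis/2024/*/*.json.gz',
            'JSONEachRow')

One statement, globbing over however many tiny objects Kinesis dropped.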

It's a mess.

Not to mention none of its column optimizations work, and the data footprint of gapless timestamp columns is not basically zero as it is in any serious OLAP database; it is massive. So the way to improve performance is to just align everything on the same timeline so its computation engine does not need to figure out how to join stuff that is actually time-aligned.
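By "basically zero" I mean what delta codecs do to a fixed-step timestamp column. A minimal ClickHouse sketch (table and column names made up):

    CREATE TABLE ticks (
        -- gapless 1-second steps double-delta-encode to near nothing
        ts DateTime CODEC(DoubleDelta, LZ4),
        value Float64
    )
    ENGINE = MergeTree
    ORDER BY ts;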

I really can’t figure out how anyone can do seriously big computations with Redshift. Maybe people like waiting hours for their SQL to execute and think software is just that slow.

You realize "the pipeline" you have to build is literally just an Athena "CREATE TABLE ... AS SELECT * FROM ..." statement. Yes, you can run this directly against S3 and it will create one big file.

https://docs.aws.amazon.com/athena/latest/ug/ctas.html
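Something like this, assuming the small Kinesis files are already registered as a Glue table (all the names here are placeholders):

    CREATE TABLE compacted_events
    WITH (
        external_location = 's3://my-bucket/compacted/',
        format = 'PARQUET'
    ) AS
    SELECT * FROM kinesis_raw;

That one statement reads the tiny files and writes the compacted output back to S3.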

I have a sneaking suspicion that you are trying to use Redshift as a traditional OLTP database. Are you also normalizing your tables like an OLTP database instead of like an OLAP one?

https://fiveonefour.com/blog/OLAP-on-Tap-The-Art-of-Letting-...

And if you are using any OLAP database for OLTP, you're doing it wrong. It's also a simple "process" to move data back and forth between Aurora MySQL or Postgres and Redshift: federate your OLTP database with Athena (handwavy because I haven't done it), or, the way I have done it, use one SELECT statement to export to S3 and another to import into your OLTP database.
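Roughly like this; the IAM role, bucket, and table names are made up, and the Aurora side assumes the cluster has an S3 role attached:

    -- Redshift side: export query results to S3
    UNLOAD ('SELECT * FROM daily_report')
    TO 's3://my-bucket/exports/daily_report_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    FORMAT AS CSV HEADER;

    -- Aurora MySQL side: import the exported files
    LOAD DATA FROM S3 PREFIX 's3://my-bucket/exports/daily_report_'
    INTO TABLE daily_report
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES;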

And before you say you shouldn't have to do this: you have always needed some process to take data from its normalized form to a denormalized form for reporting and analytics.

Source: doing boring enterprise stuff including databases since 1996, and working for 8 years with AWS services outside AWS (startups and consulting companies) and inside AWS (Professional Services, no longer there).

Why are you doing this manually? There is a built-in way of streaming Kinesis Data Streams into Redshift.

https://docs.aws.amazon.com/streams/latest/dev/using-other-s...
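The gist of it (stream and role names are placeholders): map the stream as an external schema, then point an auto-refreshing materialized view at it.

    CREATE EXTERNAL SCHEMA kinesis_src
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming';

    CREATE MATERIALIZED VIEW events AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           json_parse(from_varbyte(kinesis_data, 'utf-8')) AS payload
    FROM kinesis_src."my-stream";

No intermediate S3 files, no pipeline to babysit.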

Also, while you can have S3 directly as a destination for Redshift through the Glue Catalog, by default it definitely doesn't use S3.