
Comment by vjerancrnjak

5 hours ago

Redshift does not fit into the AWS ecosystem. If you use Kinesis, you get up to 500 partitions with a bunch of tiny files. Now I have to build a pipeline after Kinesis that puts all of it into one S3 file, only to then import it into Redshift, which might again put it on S3-backed storage for its own file shenanigans.

ClickHouse, even chdb's in-memory magic, has a better S3 consumer than Redshift. It sucks up those Kinesis files like nothing.
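For context, reading those piles of small files from ClickHouse is a one-liner with the s3 table function (bucket path, glob, and the "value" column here are made up):

    SELECT count(), avg(value)
    FROM s3('https://my-bucket.s3.amazonaws.com/kinesis/2024/*/*.json.gz',
            'JSONEachRow')

One statement, globbing over however many tiny objects Kinesis dropped.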

It's a mess.

Not to mention none of its column optimizations work, and the data footprint of gapless timestamp columns is not basically zero as it is in any serious OLAP database; it is massive. So the way to improve performance is to just align everything on the same timeline so its computation engine does not need to figure out how to join stuff that is actually time-aligned.
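By "basically zero" I mean what delta codecs do to a fixed-step timestamp column. A minimal ClickHouse sketch (table and column names made up):

    CREATE TABLE ticks (
        -- gapless 1-second steps double-delta-encode to near nothing
        ts DateTime CODEC(DoubleDelta, LZ4),
        value Float64
    )
    ENGINE = MergeTree
    ORDER BY ts;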

I really can’t figure out how anyone can do seriously big computations with Redshift. Maybe people like waiting hours for their SQL to execute and think software is just that slow.

You realize "the pipeline" you have to build is literally just an Athena "CREATE TABLE ... AS SELECT * FROM ..." statement. Yes, you can run this directly against S3 and it will create one big file.

https://docs.aws.amazon.com/athena/latest/ug/ctas.html
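Something like this, assuming the small Kinesis files are already registered as a Glue table (all the names here are placeholders):

    CREATE TABLE compacted_events
    WITH (
        external_location = 's3://my-bucket/compacted/',
        format = 'PARQUET'
    ) AS
    SELECT * FROM kinesis_raw;

That one statement reads the tiny files and writes the compacted output back to S3.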

I have a sneaking suspicion that you are trying to use Redshift as a traditional OLTP database. Are you also normalizing your tables like an OLTP database instead of like an OLAP one?

https://fiveonefour.com/blog/OLAP-on-Tap-The-Art-of-Letting-...

And if you are using any OLAP database for OLTP, you're doing it wrong. It's also a simple "process" to move data back and forth between Aurora MySQL or Postgres and Redshift: federate your OLTP database with Athena (handwavy because I haven't done it), or, the way I have done it, use one SELECT statement to export to S3 and another to import into your OLTP database.
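Roughly like this; the IAM role, bucket, and table names are made up, and the Aurora side assumes the cluster has an S3 role attached:

    -- Redshift side: export query results to S3
    UNLOAD ('SELECT * FROM daily_report')
    TO 's3://my-bucket/exports/daily_report_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
    FORMAT AS CSV HEADER;

    -- Aurora MySQL side: import the exported files
    LOAD DATA FROM S3 PREFIX 's3://my-bucket/exports/daily_report_'
    INTO TABLE daily_report
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES;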

And before you say you shouldn't have to do this: you have always needed some process to take data from its normalized form to a denormalized form for reporting and analytics.

Source: doing boring enterprise stuff including databases since 1996, and working for 8 years with AWS services outside AWS (startups and consulting companies) and inside AWS (Professional Services, no longer there).

Why are you doing this manually? There is a built-in way of streaming Kinesis Data Streams into Redshift.

https://docs.aws.amazon.com/streams/latest/dev/using-other-s...
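The gist of it (stream and role names are placeholders): map the stream as an external schema, then point an auto-refreshing materialized view at it.

    CREATE EXTERNAL SCHEMA kinesis_src
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming';

    CREATE MATERIALIZED VIEW events AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           json_parse(from_varbyte(kinesis_data, 'utf-8')) AS payload
    FROM kinesis_src."my-stream";

No intermediate S3 files, no pipeline to babysit.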

Also, while you can have S3 directly as a destination for Redshift through the Glue Catalog, by default it definitely doesn't use S3.