Comment by ahmed_ds

1 year ago

This is why I like tools like DataStation and hex.tech. You write the initial query in SQL, then process the results as a dataframe using Python/pandas. Sure, mixing pandas and SQL like that isn't great for production data pipelines, but for exploration and analytics I have found this approach enjoyable.
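Here's roughly what that workflow looks like with DuckDB's Python API, just as a sketch (the database file, table, and column names are made up for illustration):

```python
import duckdb
import pandas as pd

# Hypothetical database file, table, and columns, purely for illustration.
con = duckdb.connect("analytics.db")
df = con.sql(
    """
    SELECT customer_id, order_date, amount
    FROM orders
    WHERE order_date >= DATE '2024-01-01'
    """
).df()  # materialize the SQL result set as a pandas DataFrame

# From here on it's ordinary pandas.
monthly = (
    df.assign(month=pd.to_datetime(df["order_date"]).dt.to_period("M"))
      .groupby("month")["amount"]
      .sum()
)
print(monthly)
```

The SQL engine does the heavy filtering and the dataframe layer does the interactive reshaping, which is exactly the split that makes exploration pleasant.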

Yes, it's very convenient to be able to use SQL with your massively parallel commercial database (Oracle, Snowflake, etc.) and then again with the result sets (pandas, etc.). Interestingly, it's a concept that was implemented 35 years ago in SAS (link below) but is only now gaining traction in today's "modern" software (e.g., via DuckDB).

USING THE NEW SQL PROCEDURE IN SAS PROGRAMS (1989): https://support.sas.com/resources/papers/proceedings-archive...

"The SQL procedure uses SQL to create, modify, and retrieve data from SAS data sets and views derived from those data sets. You can also use the SQL procedure to join data sets and views with those from other database management systems through the SAS/ACCESS software interfaces."
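For comparison, a rough sketch of the modern DuckDB equivalent: plain SQL joining in-memory pandas DataFrames, with DuckDB picking the frames up from the local Python scope much like PROC SQL joins SAS data sets (the data here is invented):

```python
import duckdb
import pandas as pd

# Two in-memory "data sets", invented for illustration.
customers = pd.DataFrame({"id": [1, 2, 3], "region": ["EU", "US", "EU"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [10.0, 25.0, 40.0]})

# DuckDB resolves the DataFrame names from the surrounding Python scope,
# so ordinary SQL can join and aggregate them directly.
result = duckdb.sql(
    """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    GROUP BY c.region
    """
).df()
print(result)
```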

  • Wow, that is really cool. One of my theses is that DuckDB will be bought by GCP (BigQuery), and Polars will be bought by Databricks (or AWS). The thesis is prompted by Snowflake buying the Modin platform. The movement in DE seems to be toward data warehouse platforms streaming data (views/result sets) down to dataframe engines (Modin, Polars, DuckDB), which then stream down to their BI platforms. Because these databases are designed as OLAP platforms, this approach makes sense.
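As a sketch of that warehouse-to-dataframe streaming, the Snowflake Python connector can already hand result batches down as pandas DataFrames (connection parameters and the table are placeholders, and the downstream step is a stand-in):

```python
import snowflake.connector  # pip install "snowflake-connector-python[pandas]"

# Placeholder connection parameters; real values come from your environment.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()
cur.execute("SELECT * FROM events WHERE event_date = CURRENT_DATE")  # made-up table

# Results arrive as Arrow batches and land as pandas DataFrames chunk by
# chunk, i.e., the OLAP engine streams down to the dataframe layer.
for batch_df in cur.fetch_pandas_batches():
    print(len(batch_df))  # stand-in for the real downstream step (BI feed, etc.)
```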