
Comment by RobinL

4 days ago

I think a lot of this comes down to the question: Why aren't tables first class citizens in programming languages?

If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.

R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

The root cause seems to be that we still haven't figured out the best language for manipulating tabular data (i.e. the best way of expressing these operations). It feels like there's been some convergence on some common ideas. Polars is kind of similar to dplyr. But no standard, except perhaps SQL.

FWIW, I agree that Python is not great, but I think it's also true R is not great. I don't agree with the specific comparisons in the piece.

There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another. Graphs, and relatedly, state machines are tools that are grossly underused because of bad language-level support. Finally, not a structure per se, but I think most languages that are batteries-included enough to include a regex engine should have a full-fledged PEG parsing engine. Most, if not all, regex horror stories derive from a simple "regex is built in".

What tools are easily available in a language, by default, shape the pretty path, and by extension, the entire feel of the language. An example that we've largely come around on is key-value stores. Today, they're table stakes for a standard library. Go back to the '90s, and the most popular languages at best treated them as second-class citizens, more like imported objects than something fundamental like arrays. Sure, you can implement a hash map in any language, or import someone else's implementation, but oftentimes you'll instead end up with nightmarish, hopefully-synchronized arrays, because those are built in and ready at hand.
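That last point is easy to sketch. A minimal Python illustration of the parallel-arrays failure mode versus a built-in key-value store (all names here are made up):

```python
# The "nightmarish, hopefully-synchronized arrays" pattern: two lists
# that must be kept in lockstep by hand.
names = ["alice", "bob"]
scores = [10, 20]

def score_for(name):
    # A linear scan, and a silent desync if one list is edited alone.
    return scores[names.index(name)]

# With a built-in key-value store, the association is a single value
# that cannot fall out of sync.
score_by_name = {"alice": 10, "bob": 20}

assert score_for("bob") == score_by_name["bob"] == 20
```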

  • When there is no clear canonical way of implementing something, adding it to a programming language (or a standard library) is risky. All too often, you realize too late that you made a wrong choice, and then you add a second version. And a third. And so on. And then you end up with a confusing language full of newbie traps.

    Graphs are a good example, as they are a large family of related structures. For example, are the edges undirected, directed, or something more exotic? Do the nodes/edges have identifiers and/or labels? Are all nodes/edges of the same type, or are there multiple types? Can you have duplicate edges between the same nodes? Does that depend on the types of the nodes/edges, or on the labels?

    • Even the raw storage for graphs doesn't have just one answer: you could store edge lists, or you could store adjacency matrices. Some algorithms work better with one, some with the other. You probably don't want to store both, because that means extra memory overhead as well as a locking problem if you need to update both atomically. You probably don't want to automatically flip back and forth between representations, because that could cause garbage-collector churn (not to mention long breadth- or depth-first searches), and you may not want to encourage manual conversions between data structures either, to avoid handing your users a performance footgun. So you probably want the edge-list Graph type and the adjacency-matrix Graph type to look very different, even though they are trivially convertible (if, as mentioned, expensively so); and that's just the under-the-hood storage mechanism. From there you get a possible exponential explosion as you reach the higher-level distinctions between types of graphs: DAGs versus trees versus cyclic structures, all the variations on what a node can be, whether edges can be weighted or labeled, and so on.

  • > I think most languages that are batteries-included enough to include a regex engine should have a full-fledged PEG parsing engine

    Then there would be more PEG horror stories. In addition, strings and indices in regex processing are universal, while a parser is necessarily more framework-like, far more complex, and doomed to be mismatched for many applications.
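To make the complexity gap concrete: a flat pattern is a one-liner with `re`, while even a toy hand-rolled parser for nested parentheses (something a classic regex cannot match) immediately needs explicit recursion and failure handling. A stdlib-only sketch:

```python
import re

# Flat pattern: regex handles it in one line.
assert re.fullmatch(r"\d+", "2024")

# Nested structure: a classic regex cannot count parentheses,
# so even a toy parser needs state and error paths.
def balanced(s, i=0):
    """Return the index just past one balanced (...) group, or -1 on failure."""
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = balanced(s, i)  # recurse into the nested group
            if i == -1:
                return -1
        else:
            i += 1
    return i + 1 if i < len(s) else -1

assert balanced("((a)(b))") == len("((a)(b))")  # fully balanced
assert balanced("((a)") == -1                   # unclosed group
```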

  • Would love to see a language in which hierarchical state machines, math/linear algebra, I/O to sensors and actuators, and time/timing were first class citizens.

    Mainly for programming control systems for robotics and aerospace applications

  • > There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another.

    I disagree. Most programmers will go their entire career and never need a matrix data structure. Sure, they will use libraries that use matrices, but never use them directly themselves. It seems fine that matrices are not a separate data type in most modern programming languages.

    • Unless you think "most programmers" === "shitty webapp developers", I strongly disagree. Matrices are first class, important components in statistics, data analysis, graphics, video games, scientific computing, simulation, artificial intelligence and so, so much more.

      And all of those programmers are either using specialized languages (suffering problems when they want to turn their program into a shitty web app, for example), or committing crimes against syntax like

      rotation_matrix.matmul(vectorized_cat)
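For what it's worth, Python reserved the `@` operator (PEP 465) for exactly this complaint, though only libraries implement it. A toy sketch of what the infix syntax buys (`Mat` is made up for illustration, not a real linear algebra type):

```python
# Python reserved @ (PEP 465) so matrix code can read like math;
# a minimal stdlib-only matrix to show the syntax only.
class Mat:
    def __init__(self, rows):
        self.rows = rows

    def __matmul__(self, vec):
        # Matrix-vector product, row by row.
        return [sum(a * x for a, x in zip(row, vec)) for row in self.rows]

# 90-degree rotation applied to the unit x-vector.
rotate90 = Mat([[0, -1],
                [1,  0]])
assert rotate90 @ [1, 0] == [0, 1]
```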


There are a number of dynamic languages to choose from where tables/dataframes are truly first-class datatypes: perhaps most notably Q[0]. There are also emerging languages like Rye[1] or my own Lil[2].

I suspect that in the fullness of time, mainstream languages will eventually fully incorporate tabular programming in much the same way they have slowly absorbed a variety of idioms traditionally seen as part of functional programming, like map/filter/reduce on collections.

[0] https://en.wikipedia.org/wiki/Q_(programming_language_from_K...

[1] https://ryelang.org/blog/posts/comparing_tables_to_python/

[2] http://beyondloom.com/tools/trylil.html

  • Interesting links - thanks. Apropos the optimism of "eventually": I think of language support for, say, key-value pair collections and namespaces as still quite impoverished, with each language supporting only a small subset of the concision, APIs, and data structures found useful in some other. This some three decades after they became mainstream and core to multiple mainstream languages. Diminishing returns, silos, segregation of application domains, divergence of paradigm/orientation/idioms, assorted dysfunctions as a field, etc... "eventually" can be decades. Maybe LLMs can quicken that... or perhaps call an end to this era, permitting a "no, we collectively just never got around to creating any one language which supported all of {X}".

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

Everyone in R uses data.frame because tibble (and data.table) inherits from data.frame. This means that "first class" (base R) functions work directly on tibble/data.table. It also makes it trivial to convert between tibble, data.table, and data.frames.

> R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.

You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes...,

which is amazing. Tibbles only wins because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.

  • Yeah data.table is just about the best-in-class tool/package for true high-throughput "live" data analysis. Dplyr is great if you are learning the ropes, or want to write something that your colleagues with less experience can easily spot check. But in my experience if you chat with people working in the trenches of banks, lenders, insurance companies, who are running hundreds of hand-spun crosstabs/correlational analyses daily, you will find a lot of data.table users.

    Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, duckdb, dask etc, are fine perhaps for production data pipelines but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio cloud instance + data.table hosted on a VM with 256GB+ of RAM.

  • And readability. data.table is very capable, but the incantations to use it are far less obvious (both for reading and writing) than dplyr.

    But you can have the best of both worlds with https://dtplyr.tidyverse.org/, using data.table's performance improvements with dplyr syntax.

It makes sense from a historical perspective. Tables are a thing in many languages, just not the ones that mainstream devs use. In fact, if you rank programming languages by usage outside of devs, the top languages all have a table-ish metaphor (SQL, Excel, R, Matlab).

The languages devs use are largely Algol derived. Algol is a language that was used to express algorithms, which were largely abstractions over Turing machines, which are based around an infinite 1D tape of memory. This model of 1D memory was built into early computers, and early operating systems and early languages. We call it "mechanical sympathy".

Meanwhile, other languages at the same time were invented that weren't tied so closely to the machine, but were more for the purpose of doing science and math. They didn't care as much about this 1D view of the world. Early languages like Fortran and Matlab had notions of 2D data matrices because math and science had notions of 2D data matrices. Languages like C were happy to support these things by using an array of pointers because that mapped nicely to their data model.

The same thing can be said for 1-based and 0-based indexing -- languages like Matlab, R, and Excel are 1-based because that's how people index tables; whereas languages like C and Java are 0-based because that's how people index memory.

  • As a slight refinement of your point, C does have storage map based N-D arrays/tensors like Fortran, just with the old column-major/row-major difference and a clunky "multiple [][]" syntax. There was just a restriction early on to need compile-time known dimensions to the arrays (up to the final dimension, anyway) because it was a somewhat half-done/half-supported thing - and because that also fit the linear data model well. So, it is also common to see char *argv[] like arrays of pointers or in numerics sometimes libraries which do their own storage map equations from passed dimensions.

    Also, the linear memory model itself is not really only because of Algol/Turing machines/theoretical CS/"early" hardware and mechanical sympathy. DRAM has rows & columns internally, but byte addressability leads to hiding that from HW client systems (unless someone is doing a rowhammer attack or something). More random access than tape rewind/fast-forward is indeed a huge deal, but I think the actual popularity of linearity just comes from its simplicity as an interface more than anything else. For example, segmented x86 memory with near/far pointers was considered ugly relative to a big 32-bit address space, and disk files and other allocation arenas internally have large linear address/seek spaces. People just want to defer using >1 number until they really need to. People learn univariate X before they learn multivariate X, where X could be calculus, statistics, etc.

Every copy of Microsoft Excel includes Power Query, which is written in the M language and has tables as a type. Programs are essentially transformations of table columns and rows. Not sure if it's mainstream, but it is widely available. The M language is also included in other tools like Power BI and Power Automate.

This is an interesting observation. One possible explanation for a lack of robust first class table manipulation support in mainstream languages could be due to the large variance in real-world table sizes and the mutually exclusive subproblems that come with each respective jump in order-of-magnitude row size.

The problems that one might encounter in dealing with a 1m row table are quite different to a 1b row table, and a 1b row table is a rounding error compared to the problems that a 1t row table presents. A standard library needs to support these massive variations at least somewhat gracefully and that's not a trivial API surface to design.

I don't think this is the real problem. In R and Julia tables are great, and they are libraries. The key is that these languages are very expressive and malleable.

Simplifying a lot, R is heavily inspired by Scheme, with some lazy evaluation added on top. Julia is another take at the design space first explored by Dylan.

> Why aren't tables first class citizens in programming languages?

They are in q/kdb and it's glorious. SQL expressions are also first class citizens, and it makes it very pleasant to write code.

People use data.table in R too (my favorite among those but it’s been a few years). data.table compared to dplyr is quite a contrast in terms of language to manipulate tabular data.

SQL is not just about a table but multiple tables and their relationships. If it was just about running queries against a single table then basic ordering, filtering, aggregation, and annotation would be easy to achieve in almost any language.

As soon as you start doing things like joins, it gets complicated, but in theory you could do something like the API of an ORM to do most things. Using just operators, you quickly run into the fact that you have to overload (abuse) operators or write a new language with different operator semantics:

  orders * customers | (customers.id == orders.customer_id) | (orders.amount > Decimal('10.00'))

Where * means cross product/outer join and | means filter. Once you add an ordering operator, a group by, etc. you basically get SQL with extra steps.

But it would be nice to have it built in so talking to a database would be a bit more native.
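A minimal Python sketch of that operator-overloading idea, with lambdas standing in for the column expressions (a real DSL would need expression-tree machinery to support something like `customers.id == orders.customer_id` directly); all names are hypothetical:

```python
class Table:
    """Toy relation: a list of dict rows. * is cross product, | is filter."""
    def __init__(self, rows):
        self.rows = rows

    def __mul__(self, other):
        # Cross product: merge every pair of rows into one combined row.
        return Table([{**a, **b} for a in self.rows for b in other.rows])

    def __or__(self, pred):
        # Filter with a predicate; a real DSL would accept column
        # expressions here instead of plain lambdas.
        return Table([r for r in self.rows if pred(r)])

orders = Table([{"customer_id": 1, "amount": 15},
                {"customer_id": 2, "amount": 5}])
customers = Table([{"id": 1, "name": "ada"},
                   {"id": 2, "name": "bob"}])

joined = (orders * customers
          | (lambda r: r["id"] == r["customer_id"])
          | (lambda r: r["amount"] > 10))
assert [r["name"] for r in joined.rows] == ["ada"]
```

As the comment says: add ordering and grouping operators and you basically get SQL with extra steps.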

  • Every time I see stuff like this (Google’s new SQL-ish language with pipes comes to mind), I am baffled. SQL to me is eminently readable, and flows beautifully.

    For reference, I think the same is true of Python, so it’s not like I’m a Perl wizard or something.

    • Oh I agree. The problem is that they are two different languages. Inside a Python file, SQL is just a string. No syntax highlighting, no compile time checking, etc. A Kwisatz Haderach of languages that incorporates both its own language and SQL as first class concepts would be very nice but the problem is that SQL is just too different.

      For one thing, SQL is not really meant to be dynamically constructed in SQL. But we often need to dynamically construct a query (for example customer applied several filters to the product listing). The SQL way to handle that would be to have a general purpose query with a thousand if/elses or stored procedures which I think takes it from “flows beautifully” to “oh god who wrote this?” Or you could just do string concatenation in a language that handles that well, like Python. Then wrap the whole thing in functions and objects and you get an ORM.

      I still have not seen a language that incorporates anything like SQL into it that would allow for even basic ORM-like functionality.
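The dynamic-construction pattern described above can be sketched with Python's built-in sqlite3, using placeholders so the generated SQL stays parameterized (the table and filter names are made up):

```python
import sqlite3

def product_query(filters):
    """Build one SELECT from whichever filters the customer applied."""
    sql, params, clauses = "SELECT name FROM products", [], []
    if "max_price" in filters:
        clauses.append("price <= ?")
        params.append(filters["max_price"])
    if "category" in filters:
        clauses.append("category = ?")
        params.append(filters["category"])
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price REAL, category TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?, ?)",
               [("mug", 8.0, "kitchen"), ("lamp", 40.0, "home")])

sql, params = product_query({"max_price": 10, "category": "kitchen"})
assert [r[0] for r in db.execute(sql, params)] == ["mug"]
```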


PyTorch was first only Torch, and in Lua. I didn't follow it too closely at the time, but apparently due to popular demand it got redone in Python, and voilà: PyTorch.

R’s the best, because it’s been a statistical analysis language from the beginning in 1974 (and was built and developed for the purpose of analysis/modeling). Also, the tidyverse is marvelous. It provides major productivity in organizing and augmenting the data. Then there’s ggplot, the undisputed best graphical visualization system, plus built-ins like barplot() or plot().

But ultimately data analysis is going beyond Python and R into the realm of Stan and PyMC3, probabilistic programming languages. It’s because we want to do nested integrals and those software ecosystems provide the best way to do it (among other probabilistic programming languages). They allow us to understand complex situations and make good / valuable decisions.

I know the primary data structure in Lua is called a table, but I’m not very familiar with them, or with whether they map to what’s expected from tables in data science.

  • Lua's tables are associative arrays, at least fundamentally. There's more to it than that, but it's not the same as the tables/data frames people are using with pandas and similar systems. You could build that kind of framework on top of Lua's tables, though.

    https://www.lua.org/pil/2.5.html

this is my biggest complaint about SAS--everything is either a table or text.

most procs use tables as both input and output, and you better hope the tables have the correct columns.

you want a loop? you either get an implicit loop over rows in a table, write something using syscalls on each row in a table, or you're writing macros (all text).

Fortran gives you that and more: it has first-class multidimensional arrays, including matrix operations.

The 3rd edition of Dartmouth BASIC, back in the 1960's, had a MAT command for dealing with matrices.

Because there's no obvious universal optimal data structure for heterogeneous N-dimensional data with varying distributions? You can definitely do that, but it requires an order of magnitude more resource use as a baseline.

What is a table other than an array of structs?

  • It’s not that you can’t model data that way (or indeed with structs of arrays), it’s just that the user experience starts to suck. You might want a dataset bigger than RAM, or that you can transparently back by the filesystem, RAM or VRAM. You might want to efficiently index and query the data. You might want to dynamically join and project the data with other arrays of structs. You might want to know when you’re multiplying data of the wrong shapes together. You might want really excellent reflection support. All of this is obviously possible in current languages because that’s where it happens, but it could definitely be easier and feel more of a first class citizen.

  • Well it could be a struct of arrays.

    Nitpicking aside, a nice library for doing “table stuff” without “the whole ass big table framework” would be nice.

    It’s not hard to roll this stuff by hand, but again, a nicer way wouldn’t be bad.

  • The difference is semantics.

    What is a paragraph but an array of sentences? What is a sentence but an array of words? What's a word but an array of letters? You can do this all the way down. Eventually you need to assign meaning to things, and when you do, it helps to know what the thing actually is, specifically, because an array of structs can be many things that aren't a table.

  • I would argue that's about how the data is stored. What I'm trying to express is the idea of the programming language itself supporting high level tabular abstractions/transformations such as grouping, aggregation, joins and so on.
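For contrast, here is roughly what one of those operations (group-and-aggregate) costs today with only the Python standard library; a built-in table type would presumably make this a single expression and hide sharp edges like groupby's pre-sorting requirement:

```python
from itertools import groupby
from operator import itemgetter

sales = [{"city": "oslo", "amount": 3},
         {"city": "lima", "amount": 5},
         {"city": "oslo", "amount": 4}]

# itertools.groupby only groups adjacent rows, so the input must be
# pre-sorted by the grouping key -- an easy mistake to make.
sales.sort(key=itemgetter("city"))
totals = {city: sum(r["amount"] for r in rows)
          for city, rows in groupby(sales, key=itemgetter("city"))}
assert totals == {"lima": 5, "oslo": 7}
```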

    • Implementing all of those things is an order of magnitude more complex than any other first-class primitive datatype in most languages, and there's no obvious "one right way" to do it that would fit everyone's use cases - seems like libraries and standalone databases are the way to do it, and that's what we do now.

    • Map/filter/reduce are idiomatic Java/Kotlin/Scala.

      SELECT thing1, thing2 FROM things WHERE thing2 != 2;

      val thingMap = things.filter { it.thing2 != 2 }.map { it.thing1 to it.thing2 }

      Then you've got distinct(), sorting methods, take/drop for limits, count/sumOf/average/minOf/maxOf.

      There are set operations, so you can do unions and differences, check for presence, etc.

      Joins are the hard part, but map() and some lambda work can pull it off.

    • Yeah, that's LINQ+EF. People have hated ORMs for so long (with some justification) that perhaps they've forgotten what the use case is.

      (and yes there's special language support for LINQ so it counts as "part of the language" rather than "a library")

I'd say there are converging standards like Parquet for long-term on-disk storage, Arrow for in-memory cross-language use, and increasingly DuckDB for just standard SQL on that in-memory or on-disk representation. If I had to guess, most of the data table things vanish long term, because everyone can just use SQL now for all the stuff they did with quirky hacked-up APIs and patchy performance caused by those hacked-up APIs.

> Why aren't tables first class citizens in programming languages?

Because they were created before the need for them arose, and maybe before their invention.

Manipulating numeric arrays and matrices in Python is a bit clunky because it was not designed as a scientific computing language, so they were added as libraries. It's much more integrated and natural in scientific computing languages such as Matlab. However, the reverse is also true: because Matlab wasn't designed to do what Python does, it's a bit clunkier to use outside scientific computing.

  • Tables were definitely around before programming languages.

    There are clay tablets from ancient Sumeria that represent information using tables.

APL is great

  • Perfect solution for doing analysis on tables. Wes McKinney (inventor of pandas) is rumored to have been inspired by it too.

    My problem with APL is 1.) the syntax is less amazing at other more mundane stuff, and 2.) the only production worthy versions are all commercial. I'm not creating something that requires me to pay for a development license as well as distribution royalties.

  • Agreed. I once used it for data preparation for a data science project (GNU APL). After a steep learning curve, it felt very much like writing math formulas — it was fun and concise, and I liked it very much. However, it has zero adoption in today's data science landscape. Sharing your work is basically impossible. If you're doing something just for yourself, though, I would probably give it a chance again.

Mathematica recently added the Tabular command, for what it’s worth. I haven’t used it much yet, but it seems to be quite capable.

  • Yes, Wolfram Language (WL) -- aka Mathematica -- introduced `Tabular` in 2025. It is a new data structure with a constellation of related functions (like `ToTabular`, `PivotToColumns`, etc.). Using it is 10-100 times faster than using WL's older `Dataset` structure. (In my experience, with both didactic and real-life data of 1,000-100,000 rows and 10-100 columns.)

This. I really really want some kind of data frame which has actual compile time typing my LSP/IDE can understand. Kusto query language (Azure Data Explorer) has it and the auto completion and error checking is extremely useful. But kusto query language is really just limited to one cloud product.

>Why aren't tables first class citizens in programming languages?

Matlab has them, in fact it has multiple competing concepts of it.

Well, you nailed it: the language you're looking for is SQL. There's a reason why DuckDB got such traction over the last few years. I think data scientists overlook SQL and Excel-like tooling.

  • Out of the current options, I strongly agree - I even wrote a blog post! https://www.robinlinacre.com/recommend_sql/

    But on the other hand, that doesn't mean SQL is ideal - far from it. When using DuckDB with Python, to make things more succinct, reusable, and maintainable, I often fall into the pattern of writing Python functions that generate SQL strings.

    But that hints at the drawbacks of SQL: it's mostly not composable as a language (compared to general purpose languages with first-class abstractions). DuckDB syntax does improve on this a little, but I think it's mostly fundamental to SQL. All I'm saying is that it feels like something better is possible.
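That "functions that generate SQL strings" pattern, sketched minimally (run against Python's built-in sqlite3 here for self-containment, though the point is the same with DuckDB; all names are made up):

```python
import sqlite3

def count_by(table, col):
    """Generate a reusable group-by query; the composition happens in
    Python because SQL itself has no function abstraction to wrap it in."""
    return f"SELECT {col}, COUNT(*) AS n FROM {table} GROUP BY {col} ORDER BY {col}"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (kind TEXT)")
db.executemany("INSERT INTO events VALUES (?)",
               [("click",), ("view",), ("click",)])

assert db.execute(count_by("events", "kind")).fetchall() == [("click", 2), ("view", 1)]
```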

There are a number of data-focussed no-code/visual/drag-and-drop tools where data tables/frames are very much a first class citizen (e.g. Easy Data Transform, Alteryx, Knime).

Dplyr is quite happy with data.frame. R is built around tabular data. Other statistical languages are too, such as Stata.

Saying that SQL is the standard for manipulating tabular data is like saying that COBOL is the standard for financial transactions. It may be true based on current usage, but nobody thinks it's a good idea long term. They're both based on the outdated idea that a programming language should look like pidgin English rather than math.