Comment by zelphirkalt

14 hours ago

Decision trees are great. They are my favorite classical machine learning algorithm, or rather family of algorithms, since there are many slight variations of decision trees. I wrote a purely functional (kind of naive) parallelized implementation in GNU Guile: https://codeberg.org/ZelphirKaltstahl/guile-ml/src/commit/25...

Why "naive"? Because there is no such thing as NumPy or data frames in the Guile ecosystem to my knowledge, and the data representation is therefore probably quite inefficient.

6 comments


srean 13 hours ago

What benefit do NumPy or data frames bring to decision tree logic over what is already available in Guile? Honest question.

Guile-like languages are very well suited to decision trees, because manipulating and operating on trees is their mother tongue. The only thing that would be a bit more work is compiling the decision tree into machine code. Then one doesn't have to traverse a runtime structure, and the compiled form is more efficient.
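To illustrate the "compile the tree" idea, here is a minimal sketch in Python rather than Guile. It only generates Python source and hands it to `exec`, so it produces bytecode, not machine code. The nested-tuple tree format and the names `tree_to_source` and `compile_tree` are made up for this example.

```python
# Hypothetical tree format (an assumption for this sketch):
#   ("leaf", label)
#   ("split", feature_index, threshold, left_subtree, right_subtree)

def tree_to_source(node, indent="    "):
    """Turn a tree into nested if/else source text."""
    if node[0] == "leaf":
        return f"{indent}return {node[1]!r}\n"
    _, feature, threshold, left, right = node
    return (
        f"{indent}if row[{feature}] <= {threshold!r}:\n"
        + tree_to_source(left, indent + "    ")
        + f"{indent}else:\n"
        + tree_to_source(right, indent + "    ")
    )

def compile_tree(node):
    """Build a predict(row) function whose body is the tree itself."""
    source = "def predict(row):\n" + tree_to_source(node)
    namespace = {}
    exec(source, namespace)  # no tree structure is walked at prediction time
    return namespace["predict"]

tree = ("split", 0, 2.5,
        ("leaf", "a"),
        ("split", 1, 1.0, ("leaf", "b"), ("leaf", "c")))
predict = compile_tree(tree)
```

After this, `predict(row)` is just a chain of comparisons on `row`; in a Lisp the same trick would probably be a macro or a generated lambda passed to the compiler.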

BTW take a look at Lush, you might like it.

https://wedesoft.github.io/aiscm/

zelphirkalt 12 hours ago
I think data frames are quite memory efficient and can store non-uniform data types (as can vectors in Guile). Generally, a ton of work has gone into making operations on data frames fast. I don't think a normal vector or multi-dimensional array can easily compete. Data frames are probably also compiled to some quite efficient machine code. Not sure whether Guile's native data structures can match that. Maybe they can.
Also I think I did not optimize for memory usage, and my implementation might keep copies of subsets of data points for each branch. I was mostly focused on the algorithm, not that much on data representation.
Another point, not really related to efficiency, is that data frames come with lots of functionality for handling non-numeric data. If I recall correctly, this includes things like one-hot encoding. My implementation simply assumes that all you have is numbers.
There might also be efficiency left on the table in my implementation, because I use the native number types of Guile, which allow for arbitrarily large integers (which one might not need in many cases) and I might even have used fractions, instead of inexact floats.
I guess, though, that with suitable data structures and some rework, one could get a production-ready implementation out of my naive one. It would still be trivially parallelized and keep the linear speedup that the purely functional design enables (probably only within some bounds, because decision trees usually shouldn't be too deep, to avoid overfitting).
Thanks for the links!
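A rough sketch of the exact-versus-inexact number point, with Python's `fractions.Fraction` standing in for Guile's exact rationals (an assumption; the Guile implementation is not shown here). The function name `gini` and its `exact` flag are invented for this example.

```python
from fractions import Fraction

def gini(counts, exact):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    if exact:
        # Exact rational arithmetic: no rounding, but numerators and
        # denominators are arbitrary-precision integers.
        return 1 - sum(Fraction(c, total) ** 2 for c in counts)
    # Inexact floats: fixed-size values with small rounding error.
    return 1.0 - sum((c / total) ** 2 for c in counts)

exact_value = gini([3, 5], exact=True)    # a Fraction
float_value = gini([3, 5], exact=False)   # a float
```

Both variants describe the same split quality; the exact one just pays for arbitrary-precision integers on every operation, which a split criterion rarely needs.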
- srean 12 hours ago
  
  For linear algebraic transformations applied to several rows at once, I wholeheartedly agree.
  I'm not so convinced about decision trees, though, since they process one row at a time.
  Yeah, unless you had to deal with arbitrarily large integer features, Guile integers would come with a big efficiency hit.
  
boccaff 13 hours ago

The tree algorithms in sklearn use parallel arrays to represent the tree structure.
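For anyone unfamiliar with the parallel-array layout, here is a hand-built sketch using plain Python lists. The array names are loosely modeled on the attributes of sklearn's fitted `tree_` object, but the tree, the `LEAF` sentinel as used here, and the `predict` function are assumptions for illustration, not sklearn's actual code.

```python
LEAF = -1  # sentinel meaning "this node has no child / no split"

# Node i is described by position i in every array.
#                 node: 0     1     2     3     4
children_left  = [1,    LEAF, 3,    LEAF, LEAF]
children_right = [2,    LEAF, 4,    LEAF, LEAF]
feature        = [0,    LEAF, 1,    LEAF, LEAF]
threshold      = [2.5,  0.0,  1.0,  0.0,  0.0]
value          = [None, "a",  None, "b",  "c"]

def predict(row):
    """Walk from the root (node 0) to a leaf by following array indices."""
    node = 0
    while children_left[node] != LEAF:
        if row[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    return value[node]
```

There are no node objects or pointers here, only integer indices into flat arrays, which keeps the tree compact and friendly to the cache.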