
Comment by usgroup

2 years ago

It wasn't obvious to me why the authors chose kNN for the classifier. Since they produce a distance matrix, they could have used multidimensional scaling to convert the matrix into factors and then applied a tree algorithm such as xgboost, which would likely exploit more of the information than kNN and produce significantly better results. They could also have used a PAQ compression algorithm, which is much stronger than the LZ compressors. Either change could have significantly improved the results and possibly delivered on their original conclusions.
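A minimal sketch of what I mean, assuming you already have the paper's pairwise compression-distance matrix for the training set (sklearn's MDS and xgboost's XGBClassifier are just illustrative stand-ins, and embedding unseen test samples would need something like landmark MDS, which I've left out):

```python
import numpy as np
from sklearn.manifold import MDS
from xgboost import XGBClassifier

def fit_on_distance_matrix(dist: np.ndarray, labels: np.ndarray, n_factors: int = 16):
    # dist: precomputed n x n NCD-style distance matrix between training samples.
    # Embed the distances into n_factors continuous coordinates with MDS...
    factors = MDS(n_components=n_factors, dissimilarity="precomputed",
                  random_state=0).fit_transform(dist)
    # ...then let a boosted-tree model use those coordinates jointly,
    # rather than only looking at the nearest neighbours as kNN does.
    model = XGBClassifier(n_estimators=200, max_depth=4)
    model.fit(factors, labels)  # labels: integer-encoded class ids
    return model, factors
```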

What I liked about the paper is that the compression algorithm is abstracted away, and it led me to consider what else one could do with compression via the relationship p(x) ~ K^(-|x|), where K is the alphabet size and |x| is the length of the string x under an optimal coding.

For example, it occurred to me that one could do traditional classification by packing the training samples for each response class into separate documents and then proceeding as the paper does: classify a new sample by which document compresses it best in order to determine its class, a sort of supervised classification using compression algorithms (sketched below). The closer the compressor is to an optimal code for the dataset, the better it will work.
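A toy version of that idea, with gzip standing in for the compressor and made-up class documents:

```python
import gzip

def compressed_size(text: str) -> int:
    # gzip stands in for the abstract compressor C; anything closer to an
    # optimal code for the data should work better.
    return len(gzip.compress(text.encode("utf-8")))

def classify(sample: str, class_docs: dict[str, str]) -> str:
    # Pick the class whose packed document compresses the sample best,
    # using C(doc + sample) - C(doc) as a proxy for -log p(sample | class).
    def extra_cost(doc: str) -> int:
        return compressed_size(doc + " " + sample) - compressed_size(doc)
    return min(class_docs, key=lambda label: extra_cost(class_docs[label]))

# Hypothetical toy data: one concatenated "document" per response class.
class_docs = {
    "weather": "rain sun cloud storm wind snow drizzle forecast humidity",
    "finance": "stock bond yield equity market dividend inflation interest rate",
}
print(classify("heavy snow and storm winds in the forecast", class_docs))  # expected: weather
print(classify("bond yields rose as equity markets fell", class_docs))     # expected: finance
```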

Schemes for sequence prediction are equally straightforward to implement.
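For instance, the same trick ranks candidate continuations of a sequence (again a toy with gzip; a compressor closer to an optimal code for the data would sharpen the ranking):

```python
import gzip

def rank_continuations(history: str, candidates: list[str]) -> list[str]:
    # Rank candidates by C(history + candidate): the cheaper a continuation
    # is to encode, the higher the probability the compressor implicitly
    # assigns it via p(x) ~ K^(-|x|).
    return sorted(candidates,
                  key=lambda c: len(gzip.compress((history + c).encode("utf-8"))))

history = "the cat sat on the mat. the dog sat on the mat. the cat sat on the "
print(rank_continuations(history, ["mat.", "xylophone."]))  # likely ["mat.", "xylophone."]
```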

I found it to be a pleasant surprise.