Comment by itissid

2 years ago

Compression algorithms are an economization/compression of space (bits and bytes). ML models, especially generative models, are an economization/compression of human expression and thought. Text classification is a type of compression over human expression. Is there perhaps some fundamental property of human language and data that can explain which of the two does better at a given ML task?
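
To make the "classification as compression" point concrete, here is a minimal sketch of one way a plain compressor can act as a classifier: nearest neighbours under normalized compression distance (NCD), computed with gzip. The helper names, toy texts, and labels below are my own illustration, not code from any particular paper.

```python
import gzip

def clen(s: str) -> int:
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(query: str, labeled: list[tuple[str, str]], k: int = 3) -> str:
    """k-NN over NCD: majority label among the k nearest training texts."""
    ranked = sorted(labeled, key=lambda pair: ncd(query, pair[0]))
    top = [label for _, label in ranked[:k]]
    return max(set(top), key=top.count)

# Toy usage (texts and labels invented for illustration):
train = [
    ("the gradient update diverged after the learning rate change", "ml"),
    ("the striker scored twice in the second half", "sports"),
    ("backpropagation through the transformer layers", "ml"),
    ("the home team won the championship final", "sports"),
]
print(classify("attention layers and gradient descent", train, k=3))
```

The only "learning" here is whatever regularities gzip happens to exploit, which is exactly what makes the comparison with trained models interesting.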

There may come a day when such a theory takes shape, and it may not be surprising that the two turn out to be related (somehow), in some space where the encodings of compressed bits and bytes and of compressed human expression sit close together. Indeed, such a theory (entropy based? physics based?) might help people choose a compression algorithm over an ML model for certain types of compression over human expression.
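
If it helps sharpen the entropy-based angle, the standard information-theoretic link (a well-known bound, not a new theory) is that prediction and compression are two views of the same objective: no prefix-free code can beat the source entropy on average, and arithmetic coding driven by a model $Q$ spends about $-\log_2 Q(x)$ bits on a text $x$.

$$
\mathbb{E}_{x \sim P}\big[\,|C(x)|\,\big] \;\ge\; H(P) \;=\; \mathbb{E}_{x \sim P}\big[-\log_2 P(x)\big],
\qquad
|C_Q(x)| \;\approx\; -\log_2 Q(x).
$$

Here $P$ is the source distribution, $C$ is any prefix-free code, and $C_Q$ is an arithmetic coder using $Q$ (symbols introduced just for this sketch). So a model with lower cross-entropy on a kind of text is, almost by definition, a better compressor of that kind of text, and vice versa.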

Looking at the problem from a data-driven point of view, what are the hard negatives that cause these algorithms to behave poorly? Maybe, for now, that theory can only be approximated in terms of data on the varied kinds of human text available. One example of this kind of text problem (among many): predicting topic mixtures with statistical topic models works well on academic text but has a hard time with internet text.

Is there someone out there working on such theories (besides the Wolfram physics project, which I know of)?