
Comment by tdullien

4 years ago

I have a very naive question: What are the downsides of estimating mutual information instead?

I have a math (but not statistics) background, and am sometimes bewildered by the many correlation coefficients that float around, when MI describes almost exactly what one wants: "how much does knowledge of one variable tell you about the other?"

So ... what am I not understanding?

Different coefficients help you look at different kinds of relationships. For example, Pearson's R tells you about linear relationships between variables -- it's closely tied to the covariance: "how useful is it to draw a line through these data points, and how accurate is interpolating likely to be?".

Spearman's correlation helps you understand monotonic/rank-order relationships between variables: "Is there a trend where increasing X tends to also increase Y?" (This way we can be just as good at measuring the existence of linear, logarithmic, or exponential relationships, although we can't tell them apart.)

Mutual information helps you understand how much two variables depend on each other, in the sort of unstructured way that's useful in building decision trees. You could have high mutual information without any linear or monotonic relationship at all. It's more general, while at the same time not telling you anything that would help you build, for instance, a predictive multivariate linear model.

TL;DR: More specific coefficients leverage assumptions about the structure of the data (e.g. linearity), which can help you construct optimal versions of models under those assumptions. Mutual information doesn't make any assumptions about the structure of the data, so it won't feed into such a model, but it still has lots of applications!
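To make the difference concrete, here's a quick sketch (with made-up data, nothing from the article) showing Spearman rewarding a perfectly monotonic but nonlinear relationship that Pearson penalizes:

```python
# Illustrative sketch (made-up data): Spearman sees a perfect monotonic
# relationship where Pearson is penalized by the nonlinearity.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = np.exp(x)  # strictly monotonic in x, but far from linear

r, _ = pearsonr(x, y)     # noticeably below 1: a line fits poorly
rho, _ = spearmanr(x, y)  # essentially 1: the rank order is preserved exactly
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Per the comment above, this is exactly the sense in which Spearman detects the existence of an exponential relationship without being able to distinguish it from a linear one.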

MI is quite useful and widely used, but it typically requires binning the data when the distributions are unknown and must be estimated empirically. The coefficient in the article, by contrast, is a rank-based score, closer in spirit to Spearman correlation than Pearson. This allows for nonlinear relationships between the two variables.
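For reference, the usual plug-in (histogram) MI estimate looks something like the sketch below. The bin count and data are illustrative choices, and the plug-in estimator is biased upward for small samples, so values near zero should be read loosely:

```python
# Illustrative sketch of a plug-in (histogram) MI estimate, in nats.
# Bin count and data are arbitrary choices; the plug-in estimator is
# biased upward for small samples.
import numpy as np

def binned_mi(x, y, bins=16):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y (row vector)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
mi_dep = binned_mi(a, a + 0.1 * rng.normal(size=5000))  # strongly dependent
mi_ind = binned_mi(a, rng.normal(size=5000))            # independent: near 0
print(mi_dep, mi_ind)
```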

A slightly critical review of the work can be found here: https://academic.oup.com/biomet/advance-article/doi/10.1093/.... They argue that the older forms of rank correlation, namely D, R, and tau*, are superior. Nonetheless, it seems like a nice contribution to the stats literature, though I doubt the widespread use of correlation is going anywhere.

In the article it is explained that the purpose of this coefficient is to estimate how much X is a function of Y (or how noisy that association is) [1]; in particular, the coefficient is 1 iff X is a function of Y.

With MI (the article claims that) you can have a coefficient of 1 without X being a function of Y.

[1] This means the coefficient is intentionally not symmetric.
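Assuming the article is the one introducing Chatterjee's xi (which matches this description), the no-ties form of the statistic is simple enough to sketch. Treat this as an illustration of the rank-based construction, not a reference implementation, and the data as made up:

```python
# Hedged sketch of the rank-based statistic (Chatterjee's xi, no-ties case),
# assuming that's the coefficient the article introduces.
import numpy as np

def xi_coefficient(x, y):
    """High when y is (close to) a noiseless function of x; intentionally
    asymmetric, so xi(x, y) and xi(y, x) can differ a lot."""
    n = len(x)
    order = np.argsort(x)                         # sort the pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1  # ranks of y in that order
    return 1 - 3 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=300)
y = x**2
v_xy = xi_coefficient(x, y)  # near 1: y is an exact function of x
v_yx = xi_coefficient(y, x)  # much lower: x is not a function of y = x**2
print(v_xy, v_yx)
```

The asymmetry in the last two lines is the footnote's point: y = x² determines y from x, but not x from y.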

I'm also interested in this, having tried and semi-successfully used mutual information for finding associations between multinomial variables. As an even more naive question, I find the actual selection of estimators bewildering. How do I know which estimator to use for mutual information? How do I know if my chosen estimator has converged or is doing a bad job on my data? Bringing it back to the topic at hand, does the estimator presented in the paper provide good estimates for a wider variety of cases than the mutual information plug-in estimator? If so I can see it might be nice for simplicity reasons alone. Can we have different estimators for this new correlation coefficient? Any ideas what that would look like?

Mutual information is not trivial, or even possible, to estimate in many practical situations, as far as I know. Example applications in robotics and computer vision where mutual information would be useful: segmentation and denoising of unordered 3D point data.

  • Yes, as someone mentioned above, the problem is getting the underlying distribution of the data, so you can measure -SUM p_i log(p_i); this usually involves some binning, which can be tricky (and yes, I know the formula I gave is entropy, not MI).

    I try to remind myself that "it is just a model," as a corollary to "all models are wrong, some are useful." You are never dealing with the real world itself, and you are usually trying to estimate some future, as-yet-unobserved signal from existing data. In other words, if your bins are reasonable and reasonably accurate, you can build a working, if not perfect, system.

    Don't try to optimize testing error performance to a value lower than the irreducible error in the system.
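A minimal sketch of the binned plug-in entropy the comment above alludes to, with made-up data. Note how much the estimate moves with the bin count alone, which is the "binning can be tricky" point:

```python
# Illustrative sketch of the binned plug-in entropy, in nats. Same data,
# two bin counts: the estimate depends strongly on the binning choice.
import numpy as np

def binned_entropy(samples, bins):
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                    # drop empty bins before taking the log
    return -(p * np.log(p)).sum()   # H = -sum p_i log(p_i)

rng = np.random.default_rng(0)
data = rng.uniform(size=10_000)
e_coarse = binned_entropy(data, bins=8)
e_fine = binned_entropy(data, bins=128)
print(e_coarse, e_fine)  # same data, very different numbers
```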

    • Even with binning, the problem is one of accurate sampling from an unknown probability distribution.

      Biased samples produce biased results, and the OP's correlation coefficient might be sensitive to this issue.

      In one of our projects (speech processing) we assumed a gamma distribution, and sampling that is notoriously hard. Binned MI produced serious errors there, as opposed to a minimum-MSE estimator; even maximum likelihood did better (if noisily).

    • I'm not sure I understand how binning applies in e.g. segmentation of point clouds into distinct objects. The data would likely contain a mix of unknown distributions, partially observed (due to occlusions) and not easily parametrized (chair, table, toaster, etc.)... Locally you can find planar patches though, so correlation can still be useful.