Yes, the author has shared the link to the R package here:
https://cran.r-project.org/web/packages/XICOR/index.html
Edit: R code from Dr. Chatterjee's Stanford page is here - https://souravchatterjee.su.domains//xi.R
If you have never worked with R, the code may seem clunky, so I suggest checking out the Python implementation on GitHub here:
https://github.com/czbiohub/xicor
The Python library is not from the original author though. But it's easy to read the code and it works with pandas as well.
If anyone is interested, I've also published a Go implementation [1] of the code for float64 slices.
Results seem to match the R and Python implementations exactly, so there will be a second pass focusing on performance, stability and support for categorical variables.
[1] https://github.com/tpaschalis/xicor-go
The current version of the Python lib seems to be extremely badly written. Or is the algorithm itself that slow? It takes something like 21 s to compute the correlation for just 10k samples.
This issue contains simple code that is claimed to be >300x faster: https://github.com/czbiohub/xicor/issues/17
Thanks, the Python code is very clear and simple and makes it super easy to understand the idea without having to digest the paper.
The equation is on the second page, and if you know enough to know what correlation is, you know enough to implement it from the equation given. A straightforward implementation runs in O(N log N) time, though, because you have to sort your data.
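For anyone who wants to see how little code that is, here is a minimal sketch in plain Python. It assumes there are no ties in x or y, in which case the coefficient reduces to 1 - 3 * sum(|r[i+1] - r[i]|) / (n^2 - 1), where r[i] is the rank of the y value paired with the i-th smallest x. The function name `xi_correlation` is my own and is not taken from any of the packages linked above, and tie handling is left out entirely.

```python
def xi_correlation(x, y):
    """Sketch of Chatterjee's xi coefficient, assuming no ties in x or y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    # Indices of the observations ordered by x: O(n log n).
    x_order = sorted(range(n), key=lambda i: x[i])

    # Rank of each y value, 1 = smallest: a second O(n log n) sort.
    y_order = sorted(range(n), key=lambda i: y[i])
    rank = [0] * n
    for r, i in enumerate(y_order, start=1):
        rank[i] = r

    # Sum of absolute rank differences between neighbours in x order: O(n).
    total = sum(
        abs(rank[x_order[k + 1]] - rank[x_order[k]]) for k in range(n - 1)
    )
    return 1.0 - 3.0 * total / (n * n - 1)
```

The two sorts dominate the cost, which is where the O(N log N) comes from. Note that for a perfectly monotone relationship the sum is n - 1, so this version returns 1 - 3/(n + 1) rather than exactly 1; it only approaches 1 as n grows.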