Comment by IgorPartola

13 years ago

Better than a forgetting factor, add a Kalman filter (http://en.wikipedia.org/wiki/Kalman_filter). This way you can trust your "new" data more than really "old" data, etc. The beauty of it is that it only adds three attributes to each data sample.

Could you expound on this a bit? What attributes would you have to add? How would you calculate scores?

  • You would add a variance (P), estimate of the value and the timestamp of the last measurement. Using the last timestamp you can calculate Q. Generally, the older the last measurement, the higher Q.

    The calculation is straightforward in the scalar case, where the state-transition and observation models are both taken to be the identity:

      P_pred = P0 + Q
      K = P_pred / (P_pred + R)
      x1 = x0 + K * (z - x0)
      P1 = (1 - K) * P_pred
    

    Now you have the new score for your data (x1) and a new variance to store (P1). Other values are:

    x0, P0 - previous score, previous variance.
    Q - Roughly related to the age of the last measurement. Goes up with age.
    R - Measurement error. Set it close to 0 if you are sure your measurements are always error-free.
    z - the most recent measured value.

    Let's say you measure the number of clicks per 1000 impressions. After the first 1000 you can estimate the expected value (x1) for the next 1000. After the second 1000, re-estimate.
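For what it's worth, here is a minimal Python sketch of the scalar update described above. The `Estimate` class, the `kalman_update` function, and the rule that Q grows linearly with the age of the last measurement (`q_per_second`) are assumptions made for illustration; the comment only says that Q goes up with age, not how.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    x: float  # current score estimate
    p: float  # variance of that estimate
    t: float  # timestamp of the last measurement, in seconds


def kalman_update(est, z, now, r, q_per_second):
    """Fold one new measurement z (taken at time `now`) into the estimate.

    r is the measurement variance. q_per_second is a hypothetical tuning
    knob: process noise Q is assumed to grow linearly with elapsed time.
    """
    q = q_per_second * max(now - est.t, 0.0)  # older estimate -> larger Q
    p_pred = est.p + q                        # predicted variance
    k = p_pred / (p_pred + r)                 # gain; needs p_pred + r > 0
    x1 = est.x + k * (z - est.x)              # new score
    p1 = (1.0 - k) * p_pred                   # new variance to store
    return Estimate(x=x1, p=p1, t=now)


# Clicks per 1000 impressions: start from a rough prior, then re-estimate
# after each batch of 1000.
est = Estimate(x=10.0, p=4.0, t=0.0)
est = kalman_update(est, z=20.0, now=10.0, r=4.0, q_per_second=0.4)
```

The three stored fields are the three extra attributes per data sample: the estimate, its variance, and the last timestamp.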

How does a decay factor not "trust your 'new' data more than really 'old' data"?

  • The Kalman filter is much more sophisticated. A typical decay-factor re-estimation is:

      x1 = x0 + alpha * (z - x0)
    

    where alpha is static. The Kalman filter will make it dynamic, taking into account how you obtained the measurements, how old the last re-estimation was, how noisy the process is, etc. Want to do multi-variate analysis? Make alpha a matrix transform.
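To make the static-versus-dynamic point concrete, here is a small Python sketch comparing the two. The function names and the sample numbers are made up for illustration, and the Q values simply stand in for "time since the last measurement".

```python
def ewma_step(x0, z, alpha):
    """Decay-factor update: alpha is the same for every measurement."""
    return x0 + alpha * (z - x0)


def gain_sequence(p0, r, qs):
    """Return the Kalman gain used at each step of a scalar filter.

    p0 is the initial variance, r the measurement variance, and qs the
    process noise added before each measurement (larger after a long gap).
    """
    gains = []
    p = p0
    for q in qs:
        p_pred = p + q              # uncertainty grows while waiting
        k = p_pred / (p_pred + r)   # this plays the role of alpha
        gains.append(k)
        p = (1.0 - k) * p_pred      # uncertainty shrinks after measuring
    return gains


# Three closely spaced measurements, then one after a long gap.
gains = gain_sequence(p0=100.0, r=1.0, qs=[0.01, 0.01, 0.01, 5.0])
```

With these inputs the gain starts near 1 (the initial estimate is barely trusted), drops as measurements accumulate, and rises again after the long gap. If Q and R never change, the gain settles to a constant and the filter behaves like a fixed alpha.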