
Comment by srean

3 days ago

Fun memories.

We have successfully replaced thousands of complicated deep-net time-series anomaly detectors at a FANG with statistical (nonparametric, semiparametric) process-control ones.

They use 3 to 4 orders of magnitude fewer trained parameters and have just enough complexity that a team of three or four can handle several thousand such streams.

The amount of babysitting the deep-net models needed was astronomical, and debugging or understanding what had happened was quite opaque.

For small teams with limited resources, I would still strongly recommend stats-based models for time series anomaly detection.

It may not be your best career move right now, for political reasons. Those making massive bets do not like to confront the possibility that some of those bets were not well placed, and they may try to keep contrary evidence from becoming too visible.

Super cool, thanks for sharing!

This is one of the reasons I am so skeptical of the current AI hype cycle. There are boring, well-behaved classical solutions for many of the use-cases where fancy ML is pushed today.

You'd think that rational businesses would take the low-risk snooze-fest high-margin option any day instead of unintelligible and unreliable options that demand a lot of resources, and yet...

  • >This is one of the reasons I am so skeptical of the current AI hype cycle. There are boring, well-behaved classical solutions for many of the use-cases where fancy ML is pushed today.

    In 2013 my statistics professor warned that once we are in the real world, "people will come up to you trying to sell fancy machine learning models for big money, though the simple truth is that many problems can be solved better by applying straightforward statistical methods".

    There has always been the ML hype, but the last couple years are a whole different level.

  • It does not work that way in the short term.

    Say you have bet billions as a CEO, CTO, CFO. The decision has already been made. Such a steep price had to come at the cost of many groups and teams and projects in the company.

    Now is not the time to water plants that offer alternatives. You will have a smoother ride choosing tools that justify that billion-dollar bet.

    • Decision-making in organizations is definitely a hard problem.

      I think an uncomfortable reality is that a lot of decisions (technology, strategy, etc.) are not optimal or even rational, but more just an outcome of personal preferences.

      Even data-driven approaches aren't immune since they depend on the analysis and interpretation of the data (which is subjective).


  • > There are boring, well-behaved classical solutions for many of the use-cases where fancy ML is pushed today.

    I know some examples but not too many. Care to share more examples?

    • Some off the top of my head...

      - Instead of trying to get LLMs to answer user questions, write better FAQs informed by reviewing tickets submitted by customers

      - Instead of RAG for anything involving business data, have some DBA write a bunch of reports that answer specific business questions

      - Instead of putting some copilot chat into tools and telling users to ask it to e.g. "explain recent sales trends", make task-focused wizards and visualizations so users can answer these with hard numbers

      - Instead of generating code with LLMs, write more expressive frameworks and libraries that don't require so much plumbing and boilerplate

      Of course, maybe there is something I am missing, but these are just my personal observations!


    • In my domain, I see lots of people reaching immediately for "AI" techniques to solve sensor fusion and state estimation problems where a traditional Kalman filter type solution would be faster and much more interpretable.
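      For flavor, here is a minimal 1D constant-velocity Kalman filter (numpy only; all noise parameters are illustrative guesses, not values from any real system):

```python
import numpy as np

def kalman_1d(zs, dt=1.0, q=1e-3, r=0.25):
    """Minimal constant-velocity Kalman filter for a scalar position sensor.

    State is (position, velocity); only position is measured.  The noise
    parameters q and r are illustrative, not tuned for any real sensor.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])                 # state transition
    H = np.array([[1.0, 0.0]])                            # measurement model
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])                   # process noise
    R = np.array([[r]])                                   # measurement noise
    x = np.zeros((2, 1))                                  # initial state
    P = np.eye(2)                                         # initial covariance
    estimates = []
    for z in zs:
        x = F @ x                                         # predict
        P = F @ P @ F.T + Q
        y = np.array([[z]]) - H @ x                       # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)                    # Kalman gain
        x = x + K @ y                                     # update
        P = (np.eye(2) - K @ H) @ P
        estimates.append(float(x[0, 0]))
    return estimates

# Usage: smooth a noisy linear ramp.
rng = np.random.default_rng(0)
truth = np.arange(50, dtype=float)
zs = truth + rng.normal(0.0, 0.5, size=truth.size)
est = kalman_1d(zs)
```

      Because the constant-velocity model matches the ramp exactly, the filtered estimates track the truth much more closely than the raw measurements do, and every quantity in the filter has a physical interpretation.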


  • >unintelligible and unreliable options that demand a lot of resources

    Some options have more persuasive salesmen than others.

My first big career move was similar. My employer had all these complicated ML models that played great at conferences, but the operators who had to use them said they were inaccurate and indecipherable. I asked the operators what they used to sort real alarms from false ones, and the answers were very simple mathematical relationships. We ripped out the ML, coded up the operators' versions with some slightly tuned-up statistics, and got dramatically better results.

Thank you for your perspective. As a machine vision engineer in the semiconductor industry, I have seen a lot of hype around deep learning and AI for vision applications. From my experience, deep learning works well for OCR but less so for classification tasks.

I often achieve better results by focusing on good lighting and using classical computer vision techniques.

I agree with your point about the politics of technology adoption. To protect my career, I usually promote hybrid approaches that combine deep learning and traditional computer vision methods. In reality, many deep learning solutions still rely heavily on classical techniques. Your comments on political challenges and decision-making in technology are very relevant to my experience.

What confuses me about deep nets is that there's rarely enough signal to meaningfully train a large number of parameters. Surely 99% of those parameters are either (a) incredibly unstable or (b) perfectly correlated with other parameters?

  • They do. There are enormous redundancies. There's a manifold over which the parameters can vary wildly yet do zilch to the output. The nonlinear analogue of a null space.

    Parameter instability does not worry a machine learner as much as it worries a statistician. ML folks worry about output instabilities.

    The current understanding is that this overparameterization makes reaching good configurations easier while keeping the search algorithm as simple as stochastic gradient descent.

    • Huh, I didn't know that! Are there efforts to automatically reduce the number of parameters once the model is trained? Or do the relationships between parameters end up too complicated to do that? I would assume such a reduction would be useful for explainability.

      (Asking specifically about time series models and such.)


> We have successfully replaced thousands of complicated deep net time series based anomaly detectors at a FANG with statistical (nonparametric, semiparametric) process control ones.

Interesting.

Were you using things like Matrix Profile too? And if so, have those been replaced as well?

  • Fwiw, I have a masters in operations research as a focus area within an industrial engineering degree, and spent 15 years working in manufacturing systems with a focus on test automation & quality. Traditional SPC/SQC analysis is, and will remain, king -- at least for some time. That can potentially evolve on high-vol/low-mix scenarios that lend themselves more easily to training models on anomaly detection, but especially for complex product manufacturing in high-mix factories that's not the case. It's far better to let your test/quality engineers do their jobs and figure out statistical controls on their own.

    Among other reasons, this is largely true because acceptable ranges for different anomaly & defect types can vary significantly for different revs of a single product, or even sub-revs (things that are tied to an ECO but don't result in incrementing the product rev), or -- more crucially -- the line the product is manufactured on. One thing that's notoriously tricky to troubleshoot without being physically onsite is whether a defect is because of a machine, because of a person, or because of faulty piece parts/material.

    Understanding and knowing how to apply traditional statistical analysis to these problems -- and also designing useful data structures to store all the data you're collecting -- is far more valuable right now than trying to shoehorn in an AI model to do this work.

Can you be more specific about what SPC algorithm you moved to? Did you trade off prediction quality for complexity, increasing the number of false alarms?

    We generally targeted specific statistics of derived/processed streams. For some streams we cared whether the mean changed; for others, whether the spread changed in a way that was unusual for the time of day; for yet others, whether some percentile did. Sometimes it would be more than one such statistic.

    Then we would track an online estimator of that statistic with an SPC chart. The thresholds were set based on our appetite for false alarms. We did not fit or use properties of the parametric distributions that standard SPC charts assume, so no 3-sigma business; in our case, convergence to Gaussian was often not fast enough for such techniques to be useful.

    Also, the original streams were far from IID; temporal dependencies were strong. So we had to construct derived streams that no longer showed temporal dependencies, at least not as strongly. This was the most important bit.

    The next key aspect was keeping the alerting thresholds as unaffected as possible by the outliers that would inevitably occur. Getting this to work without additional human supervisory labels was the next most important part.

    Make this part too robust to outliers and the system would not automatically adapt to a new normal; make it too sensitive and we would be overwhelmed by false positives.
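    To make the shape of this concrete, here is a toy nonparametric SPC-style detector (my own illustrative sketch, not the system described above): it tracks a rolling-median statistic and alerts when the statistic leaves an empirical-quantile band computed from a buffer of recent non-alerted values, so no Gaussian assumptions and flagged outliers don't tarnish the thresholds.

```python
import collections
import random
import statistics

def npspc_alerts(stream, baseline=200, window=20, fp_rate=0.005):
    """Toy nonparametric SPC-style detector (illustrative only).

    Tracks the rolling median of the last `window` points and alerts when
    it leaves an empirical-quantile band computed from a buffer of recent
    non-alerted statistics.  No parametric/3-sigma assumptions.  Note: a
    real system also needs a re-baselining rule so that a sustained shift
    can become the new normal (omitted here).
    """
    buf = collections.deque(maxlen=baseline)   # history kept free of flagged points
    win = collections.deque(maxlen=window)     # recent raw values
    alerts = []
    for i, x in enumerate(stream):
        win.append(x)
        stat = statistics.median(win)          # the tracked statistic
        if len(buf) == baseline:
            s = sorted(buf)
            k = max(1, int(fp_rate * baseline))
            lo, hi = s[k - 1], s[-k]           # empirical-quantile band
            if stat < lo or stat > hi:
                alerts.append(i)
                continue                       # keep outliers out of the baseline
        buf.append(stat)
    return alerts

# Usage: a level shift at t=300 in an otherwise stationary stream.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(300)] + \
     [random.gauss(5, 1) for _ in range(100)]
alarm_times = npspc_alerts(xs)
```

    The `fp_rate` knob is the "appetite for false alarms": it directly sets how extreme a statistic must be, relative to the empirical baseline, before an alert fires.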

    • > So we had to derive from them derived streams that didn't show temporal dependencies any longer, at least not as strongly

      Could you expand on that (or at least point me towards some keywords to read up on)?

      I think I understand conceptually what this means (network latency increases at noon CDT each day because YouTube load increases during the lunch hour, as an example) but I'm wondering how you normalize data streams for temporal dependencies with unknown frequencies (nominal change #1 happens each Sunday, while nominal change #2 happens each day at noon).


This sounds fascinating. Can you say anything about the application?

Autoscaling? Data center cooling and power use?

  • Would rather not, both to stay in the legally compliant zone and to remain somewhat anonymous. Sincerely sorry to disappoint, but let me assure you it was nothing exotic.

    • Fair. Not sure what it’s like getting tech talks approved through comms these days, but this would be fascinating to hear about at a SF or SouthBay Systems meetup.

Could you give a brief overview of:

- what libs you were using
- what kinds of algos/models were most useful for what kinds of data?

I have an IoT use-case, I wanted to look both at NNs and more classical stats models to see if it has value

  • Can't, for obvious reasons. But no specialized libraries were used: just the usual Python stack that comes packaged with any respectable OS distribution these days, mixed with other close-to-the-metal languages for performance or API-compatibility reasons.

    Look up nonparametric statistical process control and you will find useful papers. The algorithms are actually quite simple to implement; if an algorithm is not simple, it is probably not worth your time. The analysis in a paper might be complicated, but don't worry about that: look for simplicity in the algorithms.

  • Did similar work at a similar scale to srean's.

    Assume you have signal from one IoT device, say a sensor reading. Anomalies are sudden changes in the value of the signal. Define sudden (using the time delta between observations and your other domain knowledge); let's say the sensor reports 1x/second and sudden means 1-3 minutes.

    Simple option: compare a rolling mean of the last 3 values against a rolling mean of the last 60 values. If the deviation exceeds a threshold, alert.

    Say the readings are normally distributed, or can be detrended/made normal via a simple one- or two-stage AR/MA model. Then apply the Western Electric rules (https://en.wikipedia.org/wiki/Western_Electric_rules) to detect anomalies.

    A more complex but still simple option: say you have IoT sensors over a larger area, and an anomaly is one sensor reading higher than the others. Run roughly the same analysis as above, but on the correlation matrix of all the sensors, looking for rapidly changing correlations.

    Example: temperature sensors in each room of your house, and your kid opens the front door to go play in the snow. The entry hall cools down while the rest of the house stays roughly stable. You can picture what that does to the correlation matrix.
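    The short-vs-long rolling mean option above can be sketched as follows (window sizes from this comment; the 3-standard-deviation threshold is my own illustrative choice, not a recommendation):

```python
import numpy as np

def rolling_mean_alerts(x, short=3, long=60, thresh=3.0):
    """Alert when the short rolling mean deviates from the long rolling
    mean by more than `thresh` long-window standard deviations.

    Window sizes and threshold are illustrative; tune them to your
    sensor's reporting rate and your definition of "sudden".
    """
    x = np.asarray(x, dtype=float)
    alerts = []
    for i in range(long, len(x)):
        hist = x[i - long:i]                       # trailing long window (excludes i)
        recent = x[i - short + 1:i + 1].mean()     # short rolling mean ending at i
        sd = hist.std()
        if sd > 0 and abs(recent - hist.mean()) > thresh * sd:
            alerts.append(i)
    return alerts

# Usage: a sensor reporting ~20.0 that suddenly jumps to ~25.0.
rng = np.random.default_rng(42)
readings = np.concatenate([20 + 0.1 * rng.standard_normal(200),
                           25 + 0.1 * rng.standard_normal(20)])
alarms = rolling_mean_alerts(readings)
```

    The same loop structure extends to the correlation-matrix idea: replace the scalar statistic with pairwise correlations over a rolling window and alert on rapid changes in those.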

    • Bang on.

      It was a little more complicated to remove temporal dependencies from the original streams, and we could not rely on Gaussian behaviour. Other than that, it's pretty much the same, barring the effort to keep the alerting thresholds unaffected by recent anomalies.