It's a good time to make some money baffling people with bullshit in the cybersecurity space.
Of course if you let an ML-powered "anomaly detection" engine run rampant on your logs, it's going to find anomalies...just like if you hire a ghost hunter, you'll be informed that your house is haunted. In the end, ghost chasing is all this anomaly nonsense turns out to be-- the justifications for conclusions by ML practitioners and ghost hunters alike tend to be equally mumbly and hand-wavy.
Me working from home is technically an anomaly, and one these systems are all too eager to flag. We get random logins from overseas VPSes-- it's an anomaly! Oh, wait, no, we onboarded a client application. Oh, look, a random login from China for a US-based employee with no history of foreign logins! Yeah, that guy just started in a new position with travel requirements. Hey, this IP just tried to log into 5000 user accounts! Congratulations, you just alerted me to the existence of carrier NAT.
None of this saves any time and usually wastes it, since it stirs up paranoia where none was otherwise warranted. It's a fun toy that gives the appearance of being productive when all it's actually doing is generating literally endless busywork. Good for justifying your SOC budget I suppose.
But in the end nobody wants to pay a quarter-million dollars for a black box that just sits there quietly-- if it's not constantly drawing attention to itself and all the badness it's pretending to find, you're not going to have any reason to renew the license.
"Renew it? Why? This thing didn't find anything at all last year."
Oh I know, I spent a decade in Security and work in ML now, and I can see how badly people want to put the two together, but it's basically 90% bullshit and 10% same old shit of varying effectiveness.
So what does your organisation use for intrusion detection? Humans eyeballing logs doesn't scale. Rule-based approaches?
Mostly user complaints...
At a previous tech support position, I collaborated with a data scientist to create a predictive alert system based on system notification data. It would monitor the quantity of noise from each interface on the network and alert on anomalies in Slack. The only issue was that it didn't work - we saw only false positive noise, and it sat quiet during actual incidents. It would be interesting to see another team's attempts, and what different design choices they make.
Another problem with anomaly detection systems is that they do not provide any (domain-specific) explanation for why the system thinks something is an anomaly. The system also does not say what to do in this situation, which means that such anomalies are not actionable findings. Therefore I think anomaly detection should be used as a pre-processing step which generates input for other components of the system.
Are they doing supervised training for anomalies?
I am skeptical of anomaly detection since in my experience anomalies are common and diverse and don't actually matter, so I expect these systems to basically inundate people with false positives.
Their offline training accuracy is garbage: 16% precision, so all of the real work is basically being done in the online training portion, which gets it to a respectable 82%+ precision.
But they don't tell you how many alerts they had to label to get those numbers. Maybe over the long run you get those numbers, but you really want to know if it takes 10 or 10,000 examples to get there.
Also, their dataset distribution is very different to reality: they have 7% of their dataset annotated as real anomalies; I don't think anyone in the real world wants 5% of their log entries to get flagged as anomalies. So I expect their precision numbers to be far worse on more realistically distributed logs.
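To put rough numbers on the base-rate point: a back-of-the-envelope sketch, assuming the detector's true and false positive rates stay fixed when the share of real anomalies changes. The 90% recall and the 0.1% "realistic" anomaly rate are my own assumptions for illustration, not figures from the paper; only the 82% precision and the 7% anomaly share come from the discussion above.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision = TP / (TP + FP) for a detector with fixed TPR and FPR."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)


def fpr_from_precision(precision: float, tpr: float, prevalence: float) -> float:
    """Back out the false positive rate implied by a reported precision."""
    tp = tpr * prevalence
    fp = tp * (1.0 - precision) / precision
    return fp / (1.0 - prevalence)


ASSUMED_TPR = 0.90           # hypothetical recall, not from the paper
REPORTED_PRECISION = 0.82    # figure quoted in the thread
DATASET_PREVALENCE = 0.07    # share of anomalies in their dataset
REALISTIC_PREVALENCE = 0.001 # hypothetical: 1 in 1000 entries is a real anomaly

implied_fpr = fpr_from_precision(REPORTED_PRECISION, ASSUMED_TPR, DATASET_PREVALENCE)
realistic_precision = precision_at_prevalence(ASSUMED_TPR, implied_fpr, REALISTIC_PREVALENCE)

print(f"implied FPR: {implied_fpr:.4f}")
print(f"precision at {REALISTIC_PREVALENCE:.1%} prevalence: {realistic_precision:.1%}")
```

Under those assumptions the same detector ends up with single-digit precision once real anomalies are rare, which is the inundation scenario people are worried about.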
I do, with others, a lot of ML anomaly detection in the cyber security context. DeepLog has interesting ideas, especially encoding the logs via an LSTM. The work was presented at a workshop at NIPS 2017.
One of the interesting facts we've been able to measure empirically over the past few years is that the magnitude of statistical anomaly scores, measured as reconstruction error, is uncorrelated with the criticality of the anomaly in terms of security / threat.
This means that in practice SOC operators need to label on top of the anomaly detection and a supervised model can do the reranking after a while.
This is an interesting paper, but it sort of sidesteps one of the harder problems in generalized machine learning for log analysis:
> As shown by several prior work [9, 22, 39, 42, 45], an effective methodology is to extract a “log key” (also known as “message type”) from each log entry. The log key of a log entry e refers to the string constant k from the print statement in the source code which printed e during the execution of that code.
So if you're looking for a way to apply this to log data that varies wildly, like site access logs, you still have the difficult problem of converting the URIs to the numeric vectors needed by ML algorithms without losing the significant parts of the input.
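For what it's worth, one crude way to attack the URI problem is to collapse each URI into a template and then hash the template's tokens into a fixed-size count vector. This is only a sketch of that idea, not anything from the paper: `uri_template`, `hashed_vector`, the placeholder names, and the vector size of 32 are all made up for illustration, and it deliberately throws away query values, which may well be the "significant parts" for some attacks.

```python
import re
import zlib
from urllib.parse import parse_qsl, urlsplit


def uri_template(uri: str) -> str:
    """Replace variable-looking path segments with placeholders; keep only query keys."""
    parts = urlsplit(uri)
    segments = []
    for seg in parts.path.split("/"):
        if re.fullmatch(r"\d+", seg):
            segments.append("<num>")
        elif re.fullmatch(r"[0-9a-fA-F-]{16,}", seg):
            segments.append("<hex>")  # long hex/UUID-like identifiers
        else:
            segments.append(seg.lower())
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    template = "/".join(segments)
    return template + ("?" + "&".join(keys) if keys else "")


def hashed_vector(uri: str, dim: int = 32) -> list[int]:
    """Feature hashing: count template tokens into a fixed-length vector."""
    template = uri_template(uri)
    path, _, query = template.partition("?")
    tokens = [t for t in path.split("/") if t] + [k for k in query.split("&") if k]
    vec = [0] * dim
    for token in tokens:
        # crc32 is stable across runs, unlike Python's salted built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1
    return vec


print(uri_template("/users/123/orders?id=5&sort=asc"))
print(hashed_vector("/users/123/orders?id=5&sort=asc"))
```

Two requests that differ only in IDs or query values land on the same vector, which is what you want for counting and exactly what you don't want if the payload in the value is the attack.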
Here is another generic approach to anomaly detection from event data which has been used for analyzing logs received from automatic lawn mowers:
It allows for using different algorithms like one-class SVM or MDS (including custom algorithms). It also allows for defining custom domain-specific features as an integral part of its analysis engine. In particular, for log analysis, frequencies of various event types have been generated.
Would you mind explaining what is the primary use of Hobbes?
Once you discard type structure, it's a fruitless task to try to reconstruct it.
It's much easier to make sense of logs when we don't discard that type structure.
Here are some slides
The authors are using their own Spell tool to parse syslog files into patterns that represent the fixed part of a printf-like log statement. Is the source of that available? At the heart of this is a tree-based construction that is not well explained.
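To be clear, the following is not Spell and has nothing to do with its tree-based construction. It is just the naive regex-masking version of "recover the fixed part of the log statement", to show what the output of such a parser roughly looks like. The masks and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Order matters: mask IPs before bare numbers, or the octets get masked one by one.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def log_key(line: str) -> str:
    """Mask obviously variable fields so lines from the same print statement collapse."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line


sample = [
    "Connection from 10.0.0.5 port 51234",
    "Connection from 10.0.0.9 port 40022",
    "Disk usage at 91 percent",
]
key_counts = Counter(log_key(line) for line in sample)
for key, count in key_counts.items():
    print(count, key)
```

Anything that isn't numeric (usernames, hostnames, file paths) sails straight through this, which is presumably why the authors needed something smarter.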
Would be interested to see the results on a benchmark dataset for online anomaly detection, comparing to those approaches used in practice: https://github.com/numenta/NAB#the-numenta-anomaly-benchmark...
Applying K-Means clustering across different features of online traffic always shows some weird and often malicious stuff:
Care to share more about what kinds of features you cluster on?
Has anyone running real production systems benefited from anomaly detection on logs? I have usually converted some of the important events in logs to metrics and alerted users based on simple moving averages / spikes etc. I have usually started with alerts from system-level metrics and then checked the logs. Applying anomaly detection to logs directly hasn't worked for me yet.
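The moving average / spike alerting I mean is roughly the following. It's a minimal sketch: the window size, the threshold of 3 standard deviations, the floor of 1.0 on the deviation, and the per-minute error counts are all arbitrary choices for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def spike_alerts(counts, window=5, threshold=3.0):
    """Return indices where a value exceeds the trailing mean by `threshold` deviations."""
    history = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(counts):
        if len(history) == window:
            mu = mean(history)
            # Floor the deviation so a perfectly flat window doesn't alert on +1
            sigma = max(pstdev(history), 1.0)
            if value > mu + threshold * sigma:
                alerts.append(i)
        history.append(value)
    return alerts


# Hypothetical errors-per-minute metric derived from log events
errors_per_minute = [10, 12, 11, 9, 10, 11, 80, 10, 12]
print(spike_alerts(errors_per_minute))
```

Note that the spike itself enters the window afterwards and inflates the baseline for the next few points, so a second spike right after the first can be missed.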
Yes, tons of that.
See this - using K-Means clustering for anomaly detection in web traffic:
Using DBSCAN clustering for anomaly detection in healthcare claims data (detecting doctors who anomalously prescribe opioids), using the public CMS data set from 2015.
4 of the top 8 anomalies (doctors) were later actually convicted of crimes or ran into all sorts of trouble with the DOJ:
(Splunk Enterprise + free apps was used to ingest data and build all this logic and dashboards)
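For anyone who wants the general shape of the clustering approach without Splunk, here is a toy sketch in plain Python: run k-means, then treat members of very small clusters as the "weird stuff". This is not the logic from the dashboards above. The two features, the data points, `k=3`, the `min_size` cutoff, and the deterministic farthest-point initialisation are all assumptions picked to keep the example small and repeatable.

```python
import math


def kmeans_labels(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def small_cluster_outliers(points, k=3, min_size=2):
    """Flag points whose cluster has fewer than `min_size` members."""
    labels = kmeans_labels(points, k)
    sizes = {j: labels.count(j) for j in set(labels)}
    return [p for p, lab in zip(points, labels) if sizes[lab] < min_size]


# Hypothetical per-client features: (requests per second, share of error responses)
clients = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.0, 5.1),
    (9.0, 0.5),
]
print(small_cluster_outliers(clients))
```

In practice you would scale the features first and pick `k` with some care; with real traffic the small clusters are a mix of genuinely malicious clients and perfectly boring oddballs, so somebody still has to look at them.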
Thank you so much, it really was helpful.
I was wondering, has anyone here applied cluster analysis techniques for anomaly detection?
I read a paper that used it for insurance fraud detection, but I don't know what other fields are using clustering to detect fraud and abnormalities.
I'd be grateful if someone can help.
I had contacted the first author in March and the answer was that "our source code is currently not available because of a pending patent application".
Is there a github for DeepLog?
I don't but I was researching the space and https://www.anodot.com/ has the most feature-rich product - though they only discover anomalies in numeric time-series.
Check out www.loomsystems.com for a spot on AI log analysis
Elastic.co X-Pack has machine learning for log anomalies and people buy and use that stuff. Does anybody have direct experience with that?
This breaks the HN guidelines, which ask you not to post shallow dismissals. Better options would be either to factually explain what the problems are, so that people can learn something, or not to post.