By Igor Baikalov, Chief Scientist, Securonix
In his January 31 Dark Reading article, “5 Questions to Ask about Machine Learning,” Anup Ghosh proposes five questions consumers should ask in order to separate marketing hyperbole from technical reality.
The questions Ghosh proposes are all good points that you should understand in order to evaluate a security solution properly. The way the questions are phrased, however, is better suited to a malware detection use case: they are mostly applicable to supervised learning with binary classification (e.g., is this malware: yes or no?).
The Securonix solution handles more complex, multi-dimensional problems, where the outcome of a case (e.g., yes or no) is rarely known and supervised learning has limited applicability. When supervised learning is applicable, it usually involves a highly imbalanced dataset (i.e., the number of negatives far exceeds the number of positives).
Given this, here is Securonix’s response to Ghosh’s five questions:
That detection rate you quote in your marketing materials is impressive, but what’s the corresponding false-positive rate?
While marketing materials rarely reflect the true state of data science in a company, this is still a very valid point: there’s always a trade-off between sensitivity and specificity, or between false positive and false negative rates, and you need to consider both when evaluating a solution’s performance. The area under the receiver operating characteristic (ROC) curve that Anup recommends is one way to look at the diagnostic ability of a binary classifier, but it’s not the only one, and it’s definitely not the best metric for imbalanced data, because the false positive rate stays misleadingly low when the number of negatives is very large. We prefer to publish the full confusion matrix to let people make their own choices, alongside the Matthews Correlation Coefficient (MCC), which we find to be a better metric for our use cases.
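To make the point concrete, here is a minimal sketch (not Securonix code) of why ROC AUC can look flattering on heavily imbalanced data while the confusion matrix and MCC give a fuller picture. The simulated dataset and classifier are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

# Simulate a heavily imbalanced problem: roughly 1% positives
# (e.g., rare malicious events in a sea of benign activity).
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]
preds = clf.predict(X_test)

print("ROC AUC:", roc_auc_score(y_test, scores))       # can look impressive
print("Confusion matrix:\n", confusion_matrix(y_test, preds))
print("MCC:", matthews_corrcoef(y_test, preds))         # a stricter single-number summary
```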
How often does your model need updating, and how much does your model’s accuracy drop off between updates?
As we work with many models at Securonix, there’s no single answer. It depends on the rate of change in the underlying dataset: some, like domain generation algorithm (DGA) malware feeds or phishing subjects, don’t change that often, and monthly updates of the model might be sufficient. Other models that depend more on a constantly changing environment might need to be updated daily to capture changes and to allow for timely incorporation of analyst feedback.
Does your machine learning algorithm make decisions in real time?
We have both unsupervised and supervised learning algorithms that can respond in real time. Moreover, we employ online learning algorithms, such as Mondrian Forests, where the model can be updated incrementally, in real time, without retraining it on the entire dataset.
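For illustration, here is a minimal sketch of the incremental-update pattern. Mondrian Forests are not part of scikit-learn, so this example uses SGDClassifier’s partial_fit purely to show how a model can be updated batch by batch without retraining on the full history; the synthetic “event stream” is an assumption for the demo.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])              # all classes must be declared for partial_fit
model = SGDClassifier(random_state=0)

def stream_of_batches(n_batches=10, batch_size=512, n_features=20, seed=0):
    """Stand-in for events arriving in (near) real time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + 0.1 * rng.normal(size=batch_size) > 0).astype(int)
        yield X, y

for X_batch, y_batch in stream_of_batches():
    # Incremental update: only the new batch is seen, not the entire history.
    model.partial_fit(X_batch, y_batch, classes=classes)
    # model.predict(...) is usable immediately after each update.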
What is your training set?
Again, Securonix deals with many, many different datasets, and most of the time we don’t have the luxury of picking a broad, robust, and diverse one – we have to work with what we’ve got. Therefore, feature engineering becomes extremely important in building a stable model with good predictive power.
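As a hedged illustration of what feature engineering can look like in this space, the sketch below derives a few simple features from raw domain names (relevant to the DGA detection mentioned earlier). The specific features are illustrative only, not a description of what Securonix uses in production.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Character-level entropy; algorithmically generated names tend to score higher."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def domain_features(domain: str) -> dict:
    label = domain.split(".")[0]                     # drop the TLD for simplicity
    digits = sum(ch.isdigit() for ch in label)
    vowels = sum(ch in "aeiou" for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / max(len(label), 1),
        "vowel_ratio": vowels / max(len(label), 1),
    }

print(domain_features("google.com"))         # short, low entropy, vowel-rich
print(domain_features("xk3j9qzv7w2m.com"))   # longer, high entropy, digit-heavy
```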
How well does your machine learning system scale?
Securonix makes extensive use of Apache Spark’s distributed, highly scalable machine learning library. The size of the dataset hasn’t been a problem; the issue is more often the diversity of the dataset, which at some point stops increasing, so adding more data to the training set doesn’t add much informational content to the model.
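For readers unfamiliar with Spark’s ML library, here is a minimal sketch of training a classifier with pyspark.ml. The toy data and pipeline are assumptions for illustration; the point is that the same code runs unchanged whether the DataFrame holds thousands or billions of rows, since Spark distributes the work across the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("scaling-sketch").getOrCreate()

# Toy data standing in for a (potentially massive) distributed dataset.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.7, 0), (1.5, 0.3, 2.1, 1), (0.2, 0.9, 0.4, 0), (2.0, 0.1, 1.8, 1)],
    ["f1", "f2", "f3", "label"],
)

# Assemble raw columns into the feature vector expected by pyspark.ml estimators.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label").fit(
    assembler.transform(df)
)
print(model.coefficients)
```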