Email Security

The Challenges of Applying Machine Learning to Cybersecurity

Machine learning tools keep organizations at the forefront of combating cyberthreats, but defining what a threat is and keeping models updated are not always easy.

by Jose Lopez

Feb. 02, 2023

Key Points

Machine learning recognizes patterns in data and detects other data that fits those patterns, which allows it to recognize cyberthreats.
A human must first show a machine learning model what malicious data looks like, which is a challenge because humans must first decide which data is malicious.
Malicious data may be only a small fraction of all the data being analyzed, but large amounts of malicious data are needed to properly train an ML model.
Developers must weigh many other challenges when building an ML model designed to help fight cyberattacks.

Defining the Problem

Most cybersecurity professionals would likely find it difficult to remember a time before machine learning (ML) tools were deployed to help fight cyberattacks. ML’s main contribution to cybersecurity lies in its ability to recognize patterns in data and detect other data that fits those patterns. Once trained on what malicious data looks like, a model can recognize similar data and stop it from entering systems or, worse, executing code that launches an attack.

But for ML to accurately detect malicious data, a human must first show it what malicious data looks like. Defining malicious activity can be difficult even for a human, yet we can't automate the detection of malicious data, and thus malicious activity, until we can give a very specific definition of what "malicious activity" looks like within the context of the problem we are trying to solve.

Labeling the Data

We can't label data as malicious unless we are completely sure what data is malicious. In addition, we need a large amount of data to be able to instruct the ML model, but malicious data is usually much less common than good data. This means that in order to get a large amount of malicious data, we must first sift through and set aside a vast amount of good data. This can be a very time-consuming initial task when creating or updating an ML model with malicious data.

An Imbalanced Dataset

Identifying what is malicious among the vast amounts of URLs, email, and files that an organization receives every day is a huge challenge, even with the help of ML. Sifting through the vast amount of data organizations receive to find just the right data that contains attacks, malware, phishing URLs and other cyberthreats can seem impossible. An ML model deployed in production needs to properly handle this heavily imbalanced proportion of malicious to benign data.
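To see why this imbalance matters, consider a small Python sketch. The 990-to-10 split, the label names, and the helper balanced_class_weights are illustrative assumptions, not figures or code from any production system. The sketch shows that a "model" that labels everything as benign reaches 99% accuracy while catching no threats, and then computes one common reweighting heuristic (weights inversely proportional to class frequency) that can make the rare malicious samples count for more during training.

```python
from collections import Counter

# Hypothetical dataset: 990 benign samples and only 10 malicious ones.
labels = ["benign"] * 990 + ["malicious"] * 10

# A useless "model" that labels every sample as benign.
predictions = ["benign"] * len(labels)

# Accuracy looks excellent even though no threat is ever caught.
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
caught = sum(p == "malicious" and t == "malicious"
             for p, t in zip(predictions, labels))
recall = caught / labels.count("malicious")


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


weights = balanced_class_weights(labels)

print(f"accuracy={accuracy:.2f}, recall on malicious={recall:.2f}")
print(weights)
```

With these assumed numbers, each malicious sample receives a weight of 50.0 and each benign sample a weight of roughly 0.5, so mistakes on the rare class are no longer drowned out. Reweighting is only one option; resampling or metrics such as precision and recall address the same problem from other angles.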

The Cost of Bad Predictions and Model Robustness

When applying ML to cybersecurity, one of the first things we notice is that we will have false positives, which occur when the model labels benign data as malicious. We also quickly notice that we have false negatives, which are the opposite: malicious data labeled as benign.

Our first step in correcting this issue with the model is to conduct an analysis of the cost of each false positive and false negative. We need to consider what each false negative or positive will cost our team in time and money to correct. Creating this metric assists in properly training the model moving forward.
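The cost analysis described above can be sketched in Python as follows. Everything here is an assumption made for illustration: the toy labels and scores, the candidate thresholds, the per-error costs, and the helper names confusion_counts, total_cost, and cheapest_threshold. The idea is to assign a cost to each false positive and false negative and then pick the decision threshold that minimizes the total.

```python
def confusion_counts(y_true, scores, threshold):
    """Count (tp, fp, tn, fn) when flagging scores >= threshold as malicious."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and truth:
            tp += 1
        elif flagged and not truth:
            fp += 1
        elif not flagged and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def total_cost(y_true, scores, threshold, fp_cost, fn_cost):
    """Total cost of the errors made at a given threshold."""
    _, fp, _, fn = confusion_counts(y_true, scores, threshold)
    return fp * fp_cost + fn * fn_cost


def cheapest_threshold(y_true, scores, candidates, fp_cost, fn_cost):
    """Return the candidate threshold with the lowest total error cost."""
    return min(candidates,
               key=lambda t: total_cost(y_true, scores, t, fp_cost, fn_cost))


# Toy data: 1 = malicious, 0 = benign; scores are hypothetical model outputs.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.3, 0.6, 0.2, 0.7, 0.4]
candidates = [0.25, 0.5, 0.65]

# When a missed attack costs ten times more than a false alarm,
# the lowest threshold wins; when false alarms dominate, the highest does.
fn_heavy = cheapest_threshold(y_true, scores, candidates, fp_cost=1, fn_cost=10)
fp_heavy = cheapest_threshold(y_true, scores, candidates, fp_cost=10, fn_cost=1)
print(fn_heavy, fp_heavy)
```

The same model therefore ends up with different operating points depending on how the team prices each kind of error, which is why the cost estimate needs to come before tuning.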

We also need to include an adversarial robustness comparison metric for every new classifier that is created. Classifiers can draw on features such as file size, file name, whether a file is read-only, whether it is designated as a system file, or whether the file is executable.

Although adversarial robustness alone is a poor measure of how secure an ML model is, we still need to test models and be able to compare them from a security perspective. This is not easy, however, because a metric that represents how vulnerable a model is to adversarial examples must be considered and discussed at length among development teams before a new ML model is developed and deployed.
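One possible comparison metric, shown here only as a sketch, is an evasion rate: of the malicious samples a model detects, the fraction that a fixed set of perturbations can flip to benign. The two toy models, the feature names, the pad_file perturbation, and the evasion_rate helper are all hypothetical and stand in for whatever models and attack simulations a team actually uses.

```python
def evasion_rate(model, malicious_samples, perturbations):
    """Fraction of detected malicious samples that any perturbation evades."""
    detected = [s for s in malicious_samples if model(s)]
    if not detected:
        return 0.0
    evaded = sum(
        1 for s in detected
        if any(not model(perturb(s)) for perturb in perturbations)
    )
    return evaded / len(detected)


def pad_file(sample):
    """Hypothetical attack: pad the file so its size grows by 2000 bytes."""
    return {**sample, "size": sample["size"] + 2000}


def model_a(sample):
    """Toy model that leans on file size, a feature an attacker can change."""
    return sample["executable"] and sample["size"] < 1000


def model_b(sample):
    """Toy model that ignores file size entirely."""
    return sample["executable"]


malicious_samples = [
    {"size": 500, "executable": True},
    {"size": 800, "executable": True},
    {"size": 5000, "executable": True},
]

rate_a = evasion_rate(model_a, malicious_samples, [pad_file])
rate_b = evasion_rate(model_b, malicious_samples, [pad_file])
print(rate_a, rate_b)
```

In this toy setup the size-dependent model is fully evaded by padding while the other is not, which gives teams a concrete number to discuss. A real comparison would need a far richer and agreed-upon set of perturbations, and a low evasion rate against one set says nothing about attacks outside it.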

Model Changes

Once an ML model is developed and deployed, in order for it to remain effective, it must be continually updated with new threats because threat actors will constantly apply new techniques that make existing ML models obsolete. When developing an ML model initially, it is extremely important to consider this continuous need to update the model. How we will label new data to retrain our model is also something to strongly consider. This doesn't mean that ML models need to be updated daily like a list of virus signatures, but ML model developers definitely need to create a strategy and regular cadence for deploying updated versions of their model.

Other Considerations

Data Access – Development teams should plan for the potential of privacy concerns with the data they need to analyze and should build in extra development time for this obstacle.

Retraining on Clean Data – It is sometimes difficult to retrain existing models using data that was filtered by a previous model. Retraining on data already filtered by the model becomes more and more difficult as the model ages.

Model Regression – Training models using data that is different from the initial data set can lead to regressions in the models because the nature of the new data is different. This can lead to a newer model being less effective than a previous model, which will cause great concern for your customers.
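A simple guard against model regression, sketched below in Python under stated assumptions, is to keep a fixed benchmark set and list every sample the previous model classified correctly that the new model now gets wrong. The labels, the predictions, and the helper names find_regressions and find_fixes are invented for illustration.

```python
def find_regressions(y_true, old_preds, new_preds):
    """Indices the old model got right and the new model gets wrong."""
    return [
        i for i, (truth, old, new) in enumerate(zip(y_true, old_preds, new_preds))
        if old == truth and new != truth
    ]


def find_fixes(y_true, old_preds, new_preds):
    """Indices the old model got wrong and the new model gets right."""
    return [
        i for i, (truth, old, new) in enumerate(zip(y_true, old_preds, new_preds))
        if old != truth and new == truth
    ]


# Fixed benchmark set: 1 = malicious, 0 = benign (hypothetical values).
y_true = [1, 0, 1, 0, 1]
old_preds = [1, 0, 0, 0, 1]  # previous model's predictions
new_preds = [1, 1, 1, 0, 0]  # retrained model's predictions

regressions = find_regressions(y_true, old_preds, new_preds)
fixes = find_fixes(y_true, old_preds, new_preds)
print("regressions:", regressions)  # regressions: [1, 4]
print("fixes:", fixes)              # fixes: [2]
```

Here the retrained model fixes one sample but breaks two, so it would be a net step backward on this benchmark even though it was trained on newer data. Reviewing the regressed samples before release is one way to catch this before customers do.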

The Bottom Line

ML has been an important part of cybersecurity for some time now and will continue to be just as imperative moving forward. There are many challenges to consider when developing an ML model, but the payoff is definitely worth the time and effort needed to overcome those challenges.

ML is a big part of how Mimecast keeps its customers safe from email-borne attacks. Learn more or start a free trial at Mimecast.com.