Improving network security using big data and machine learning


Published August 2015

Microsoft IT is improving network security using big data and machine learning to detect usage patterns that might otherwise escape notice. These tools address two particularly significant security problems in large organizations: the challenging landscape created by the sheer volume of machines on the network (exacerbated by the “bring your own device” workplace culture), and the ease with which collaboration software lets users overshare sensitive information.






Detecting malware and protecting intellectual property are not new concerns for IT, but new aspects of the computing environment at large organizations give these old concerns new urgency.

Microsoft IT is working with a new big data platform and applying machine learning to help detect and analyze patterns that can indicate malware or the oversharing of sensitive information. The benefits include:

  • The ability to collect and analyze security-relevant data on a large scale.
  • The ability to detect patterns in the data that would be invisible to human analysts.
  • Improved security with regard to malware on the company network and protection of intellectual property.

Products & Technologies

  • Microsoft Azure Storage
  • Microsoft Azure HDInsight
  • Microsoft Azure Machine Learning

Intrusion detection: the first step in fighting malware is to see it

Malware, by nature, tries to be inconspicuous. At an organization as large as Microsoft, with half a million machines and devices connecting to the corporate network, staying inconspicuous is naturally a little easier.

Microsoft IT needed a better way to stop hackers and the malware they could introduce into the corporate network. Within Microsoft IT, the Information Security and Risk Management (ISRM) team began to develop a long-term intrusion detection program. ISRM saw the value in deploying big data and machine learning in symbiotic combination to advance their intrusion detection capabilities.

With a big data platform, ISRM could create a massive collection of events from hundreds of thousands of machines across the network. On top of that data, ISRM could build a series of algorithms to look for anomalous or suspicious behavior, and fine-tune the machine learning to improve the performance of the model.

For example, consider the algorithm ISRM and its partners developed to isolate rare processes. Every Windows device raises an event when a process starts; across half a million machines, that's about three billion events per day. ISRM runs an algorithm that flags any directory or process that appears on only one or two machines in the entire company. The output of this algorithm is a rare directory, a rare executable file, or both. Typically, this produces about 10,000 records per day, a much more manageable number (as opposed to three billion) that ISRM then feeds into its machine-learning tool. With far fewer false positives to analyze, the machine learning improves because it can focus on true positives.
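The aggregation at the heart of this rare-process algorithm can be sketched in a few lines. The event fields and the two-machine threshold below are illustrative assumptions, not ISRM's actual schema:

```python
from collections import defaultdict

def find_rare_processes(events, max_machines=2):
    """Return (directory, executable) pairs seen on at most max_machines machines."""
    machines_seen = defaultdict(set)
    for machine, directory, executable in events:
        machines_seen[(directory, executable)].add(machine)
    return {key for key, machines in machines_seen.items()
            if len(machines) <= max_machines}

# Toy process-start events: (machine, directory, executable)
events = [
    ("host1", r"C:\Windows\System32", "svchost.exe"),
    ("host2", r"C:\Windows\System32", "svchost.exe"),
    ("host3", r"C:\Windows\System32", "svchost.exe"),
    ("host7", r"C:\Users\x\AppData\Temp", "dropper.exe"),  # appears on one machine
]

rare = find_rare_processes(events)
```

At production scale this counting would run as a distributed job (for example, as Hive queries on HDInsight) rather than in one Python process, but the logic is the same: collapse billions of events into the handful of combinations that are genuinely rare.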

Why a big data platform?

ISRM partnered with scientists in the Data and Decision Sciences Group (DDSG) to build this big data platform. Microsoft decided to build a new big data platform in addition to using a security information and event management (SIEM) solution because it wanted to be able to predict potential problems.

SIEMs provide real-time detection and are a valuable part of the overall security solution. But they can be difficult to scale to the size of the detection program Microsoft was looking for. Moreover, as principal IT service engineering manager Jenn LeMond noted, while SIEM is good for some types of detection (for example, when you already know what you are looking for), it is not very good at predictive detection (when you don’t know exactly what you are looking for).

"That's when you need to move to a big data platform, that's what machine learning is supposed to get you," LeMond said. "I want the machine to tell me what the data looks like from a normal perspective, and then tell me what the outliers are."

So the ISRM and DDSG teams agreed that the components of this new solution would include:

  • Microsoft Azure Storage. Provides the ability to store and process hundreds of terabytes of data and handles millions of requests per second, on average.

  • Azure HDInsight. Provides big data analysis and accommodates custom algorithms developed by ISRM and DDSG.

  • Azure Machine Learning (Azure ML). Provides predictive analytics and includes some generic “plug-and-play” algorithms to apply to big data.

Data collection: the most time-consuming part of the solution

Over the two years that this program has been running, ISRM has found that the most laborious part of the process has been collecting the data and aggregating it into a usable form. It is a complex, technical process that includes transforming the format of some data, possibly even transforming the values of some data so that it is standardized with other data, interacting with real-time systems, and accepting streaming data.
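This kind of feed-standardization work can be illustrated with a small sketch. The two input formats and all field names below are invented for illustration; the point is that heterogeneous feeds are mapped onto one common schema before analysis:

```python
from datetime import datetime, timezone

# Illustrative only: map two hypothetical feed formats onto a single
# common schema so downstream analytics can query one table.

def normalize_syslog(record):
    # e.g. {"ts": "2015-08-01T12:00:00Z", "host": "WEB01", "msg": "..."}
    return {
        "time_utc": datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ")
                            .replace(tzinfo=timezone.utc),
        "machine": record["host"].lower(),   # standardize machine names
        "event": record["msg"],
        "source": "syslog",
    }

def normalize_windows(record):
    # e.g. {"TimeCreated": 1438430400, "Computer": "WEB01", "Message": "..."}
    return {
        "time_utc": datetime.fromtimestamp(record["TimeCreated"], tz=timezone.utc),
        "machine": record["Computer"].lower(),
        "event": record["Message"],
        "source": "windows",
    }

row = normalize_windows(
    {"TimeCreated": 1438430400, "Computer": "WEB01", "Message": "process started"})
```

Timestamps are converted to a single time zone and machine names to a single case, so that events from different feeds about the same machine can later be cross-correlated, one of the key requirements listed below.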

The first feed of data took about six months. The second feed took about half that time. “As we learn about how to build tables more effectively, with the correct schema, the amount of time to input new feeds is beginning to reduce,” LeMond said.

Other key requirements recognized during the data collection phase of the project included:

  • Accommodating thousands of transactions per second.

  • Thinking about the design and schema to support future cross-correlation.

  • Designing the system to perform well, not just for the analytics but also for ad hoc data exploration.

Data analysis: a vital, collaborative effort to make the output meaningful

ISRM found that one of the biggest challenges in the big data solution was getting rid of the noise of false positives in the output. In other words, it was one thing to collect data, and quite another to create a meaningful and effective tool for looking ahead, predictively. Data scientist Serge Berger calls this “actionable foresight.” To get actionable foresight, ISRM found that they needed a great deal of data analysis.

LeMond cites the work her team did to solve the problem of identifying rare paths and separating the meaningful results from the noise: "When we were looking for rare paths, we discovered that, for example, Windows 8 installs executables into three randomly named directories. That meant the way our algorithm worked, every single directory was flagged as rare, until we calculated the directory path segments into a mathematical formula.

"There was a game called 'Unicorn.exe.' We had 20,000 of those, all coming up as rare because the directory paths were random. When we calculated the segment paths together, all of that rolled up into a single 20,000 unicorn.exe, which was no longer rare."

It was necessary to understand the data in order to improve the results of the original algorithm. ISRM was thus able to tune the algorithm and significantly reduce the false positives in the output.
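The article does not spell out the exact formula LeMond's team used, but one plausible sketch of the idea is to replace path segments that look machine-generated with a wildcard before counting, so that randomly named install directories roll up into one bucket:

```python
import re
from collections import Counter

# Hypothetical approximation of the roll-up described above: a path
# segment that looks machine-generated (a long hex/GUID-like string) is
# replaced with "*" before counting, so the 20,000 randomly named
# unicorn.exe directories collapse into a single, no-longer-rare bucket.

RANDOM_SEGMENT = re.compile(r"^[0-9a-f]{8,}$|^\{?[0-9a-f-]{32,}\}?$", re.IGNORECASE)

def canonical_path(path):
    segments = path.split("\\")
    return "\\".join("*" if RANDOM_SEGMENT.match(s) else s for s in segments)

paths = [
    r"C:\Program Files\WindowsApps\a1b2c3d4e5f6a7b8\unicorn.exe",
    r"C:\Program Files\WindowsApps\9f8e7d6c5b4a3f21\unicorn.exe",
]
counts = Counter(canonical_path(p) for p in paths)
```

After canonicalization, both paths count toward the same key, so the rare-path algorithm sees one common path instead of two rare ones.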

At present, none of the cloud analytics platforms on the market can handle the amount of raw data the ISRM project generates, which underscores the need to run analytics and Hive database queries first. Once the Hive queries have run, ISRM uses Azure ML to apply secondary sets of logic to their output.

ISRM considers the data analysis phase to be so important that:

  • They recommend pairing a data scientist with an IT security professional for optimal results. The more the two of them can communicate effectively and iterate on the algorithms the data scientist creates, the more effective the output will be for the data analysts working downstream.

  • They recommend conducting data analysis both before and after running the machine learning algorithms (discussed below). The first round of data analysis is necessary to optimize the big data platform for machine learning. The second round of data analysis is necessary to train the algorithms over time. The process is iterative.

  • Depending on the size and complexity of the environment, they recommend accelerating development by adding more data analysts to work with the data scientists. They have found that this is where most time is spent.

Running machine learning to make anomalies visible

A quick look at the statistics of this program illustrates exactly why machine-learning technology is an integral part of the solution. Seven to eight terabytes of data are generated per day across the company, and the process, network, and account analytics that the ISRM team employs produce tens of thousands of events per day. As LeMond succinctly puts it, "This is too much for a human being to look at."

Machine learning illuminates patterns in the data and makes anomalies visible, giving the security experts a better chance of detecting malware.

As Berger explains this portion of the solution, "We are looking for particular patterns associated with events. We're actually not interested in individual events, as these are not very useful in uncovering malware. But a certain pattern that exists within a period of time (for example five to seven days), may allow you to detect malware with a high degree of accuracy."

So DDSG aggregates the big data into time-series data built from events. Specifically, system log (syslog) and Windows event log entries are analyzed over time and turned into temporal data.
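A heavily simplified sketch of this event-to-time-series step (not DDSG's actual pipeline) is to bucket raw log events into per-machine, per-day counts:

```python
from collections import Counter

# Simplified sketch: turn raw (machine, timestamp) log events into a
# time series of daily event counts per machine.

def to_time_series(events):
    """events: iterable of (machine, iso_timestamp) pairs."""
    series = Counter()
    for machine, timestamp in events:
        day = timestamp[:10]          # "2015-08-01T09:15:00" -> "2015-08-01"
        series[(machine, day)] += 1
    return series

events = [
    ("host1", "2015-08-01T09:15:00"),
    ("host1", "2015-08-01T17:42:00"),
    ("host1", "2015-08-02T08:03:00"),
]
series = to_time_series(events)
```

A pattern that persists over a window of days, like the five-to-seven-day window Berger mentions, is then a question about this series rather than about individual events.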

To understand how it works, it’s helpful to recognize the different types of machine learning: deterministic and probabilistic (which itself has unsupervised, semi-supervised, and fully supervised aspects).

Deterministic machine learning

ISRM recommends using deterministic machine learning as a starting point. Start with small data sets, identify known bad behavior, and provide this information to the data scientists to build the initial models.

Whatever behavior has been identified, the deterministic portion of machine learning looks for deviations from a baseline. The baseline is considered normal, and a deviation that exceeds a certain level of tolerance is flagged as anomalous behavior.

To take a simplified, real-world example of deterministic machine learning, if your data set consists of the annual incomes of a group of people dining at a restaurant, the baseline would be the mean annual income of those people. If a billionaire were to walk into the restaurant, deterministic machine learning could then identify that individual as an extreme outlier from the norm, with a huge deviation from the established baseline.
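The restaurant example can be written down directly. The incomes and the three-standard-deviation tolerance below are invented for illustration:

```python
from statistics import mean, stdev

# Invented numbers: the baseline is the mean income of the ordinary
# diners, and anything deviating from it by more than three standard
# deviations is flagged as anomalous.

incomes = [52_000, 61_000, 48_000, 75_000, 58_000, 67_000]
baseline = mean(incomes)
tolerance = 3 * stdev(incomes)

def is_anomalous(value):
    return abs(value - baseline) > tolerance
```

A billionaire's income sits enormously far above the tolerance band, so `is_anomalous(1_000_000_000)` is true, while another ordinary diner's income is not flagged.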

Probabilistic machine learning

ISRM then uses probabilistic machine learning to find patterns in the data that may have been undetected by the more blunt deterministic technique. Probabilistic learning uses the principle of clustering like things together to predict patterns and outcomes.

"Malware is much rarer than not-malware,” Berger explains. “The clusters with malware tend to be much less populous than those with just the normal software. As a result, we solve many problems, including the problem of multiple false positives."

This phase of probabilistic machine learning is unsupervised, meaning that the algorithm runs without human intervention and produces a map of sorts. The map shows populous clusters, less populous clusters, and perhaps a few outliers or very sparsely populated clusters.
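A toy version of this unsupervised step can be shown with one-dimensional behavioral scores. The scores and the distance threshold are invented; real clustering would use many features and a proper algorithm, but the shape of the output is the same: a few populous clusters and some sparse outliers:

```python
# Toy sketch of the unsupervised phase: single-linkage clustering of 1-D
# scores. Points closer than a threshold join the same cluster; the
# sparsest clusters become the candidates handed to a human expert.

def cluster_1d(values, threshold=1.0):
    clusters, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > threshold:
            clusters.append(current)
            current = []
        current.append(v)
    if current:
        clusters.append(current)
    return clusters

scores = [0.1, 0.2, 0.3, 0.25, 5.0, 5.1, 4.9, 40.0]  # 40.0 is a lone outlier
clusters = cluster_1d(scores)
sparse = [c for c in clusters if len(c) <= 1]
```

Here the algorithm produces two populous clusters and one single-member cluster; in the semi-supervised phase described next, an expert would inspect that sparse cluster and label it normal or malware.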

At this point, a human expert is called in to look at the clusters and decide if a cluster is a normal process or malware. This is called the semi-supervised phase of machine learning.

Ultimately, Berger reports, ISRM will use a fully supervised phase of machine learning, in which it is possible to predict by behavior whether a particular piece of software is malware. Transactions and communications are scored for the probability of being anomalous. This phase can be deployed in real time, and if the machine-learning model is well tuned, it can achieve 90 percent accuracy, according to Berger.

Closing the loop: reporting findings of malware

At the end of this process, if ISRM finds malware within the Microsoft corporate network, they submit the sample of the malware to OneProtect, the anti-malware product at Microsoft. Once the signature shows up in OneProtect, the anti-malware product automatically starts cleaning this malware across the world. And most importantly, ISRM is able to learn the pattern of behavior of this malware so that it can detect similar malware in the future.

Oversharing detection: new steps to protect intellectual property

Another group in ISRM is using machine learning on behalf of corporate security efforts—but in this case, the goal isn't to detect malware, it's to protect the intellectual property of the company.

Olav Opedal, Principal IT Enterprise Architect, conducted an analysis of attitudes and behaviors across Microsoft, specifically in the context of using a cloud-based tool such as SharePoint Online. His study found that 23 percent of employees are overly altruistic and will share highly sensitive information with too many people. Oversharing is seldom malicious in nature; usually, the sharer does not understand the impact of their sharing decision. Some of the reasons for this over-sharing are:

  • The sharer might enjoy the feeling of importance associated with having sought-after, "insider" information.

  • The sharer may not understand the sensitive nature of the information or the privacy-level of the forum in which the sharing takes place.

  • In some rare cases, the sharer may be divulging information with malicious intent.

Whatever the motivation of the sharers, Opedal and his team realized they needed a way to address the potential for the leak of intellectual property that could occur through this social phenomenon of oversharing.

How the process works

The process they conceived starts with keywords. For example, during the development of Cortana, it was useful to know which documents on SharePoint contained the keyword “Cortana.”

They found that so many documents contained this keyword that it would have been a Herculean task for one human, or even several humans, to sift through all the documents, weeding out the sensitive references from the harmless.

So instead, Opedal worked with the tools team in ISRM, as well as Berger, to build a process that scans the FAST search index of both SharePoint Online and on-premises SharePoint and then downloads the documents that have a probability of being sensitive. Opedal described this as essentially turning a big data problem into a small data one.

Then individuals would rate the documents in a random sample as sensitive or not, and machine-learning models could be trained on that sample.

“We combine behavioral sciences with methods and ideas from mathematics to create this into a classification problem,” Opedal explained. “It is a classification problem in the sense that the documents fall into one of two categories: either overshared intellectual property or not. Then using a random sample of documents classified as falling into one or the other of these categories, we build a model that can score this with a level of probability.”
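A minimal sketch of this two-class setup is a Naive Bayes-style scorer over bags of words. The training texts below are invented, and the real system's features and model are not described in the article; this only illustrates what "score with a level of probability" can mean:

```python
from collections import Counter
from math import log

# Invented toy data: train word counts per class, then score a new
# document by the log-odds that it is overshared intellectual property
# (add-one smoothing keeps unseen words from zeroing the score).

def train(labeled_docs):
    counts = {"sensitive": Counter(), "ok": Counter()}
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
    return counts

def score_sensitive(text, counts):
    """Log-odds that the document is sensitive; positive means more likely."""
    vocab = set(counts["sensitive"]) | set(counts["ok"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    log_odds = 0.0
    for word in text.lower().split():
        p_s = (counts["sensitive"][word] + 1) / (totals["sensitive"] + len(vocab) + 1)
        p_o = (counts["ok"][word] + 1) / (totals["ok"] + len(vocab) + 1)
        log_odds += log(p_s / p_o)
    return log_odds

model = train([
    ("confidential cortana roadmap unreleased features", "sensitive"),
    ("internal only cortana release schedule", "sensitive"),
    ("team lunch schedule this friday", "ok"),
    ("public blog post announcement", "ok"),
])
```

Documents scoring above a chosen threshold would be the ones flagged for review; the threshold trades off missed leaks against false alarms.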

The challenge of changing the behavior of oversharing

Opedal believes that the key to reducing this risky behavior of oversharing isn’t simply to shut off the ability of employees to share information externally. It’s more effective to create a signal to users when inappropriate sharing occurs, thereby training them to avoid it.

For example, with OneDrive for Business, many users would select the “share with everyone” folder for highly sensitive information.

When Microsoft attempted to solve the problem by denying external sharing on OneDrive for Business and SharePoint Online, they found that in many cases, users simply switched over to Dropbox. The number of Dropbox users grew to 12,000 during this period.

A new, streamlined process was implemented, which promoted IT-supported solutions that met required policies and standards. When external sharing was again permitted on OneDrive and SharePoint, the risk of users using consumer solutions (such as Dropbox) decreased.

"After monitoring the situation and studying the phenomenon," Opedal said, "we decided to deploy an internally built tool focused on machine learning scanning, and a tool called AutoSites, which signals users when inappropriate sharing occurs. We have found that this reduces oversharing."

AutoSites was built by the Discovery and Collaboration team in IT, based on the requirements Opedal provided to them.

With the success of the oversharing detection program overall, it is now being upgraded to include the capability to identify documents and files containing passwords stored on SharePoint Online.

For more information

For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Order Centre at (800) 933-4750. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information via the web, go to:


© 2015 Microsoft Corporation. All rights reserved. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.