IT Threat Detection using Neural Search
In this tutorial, we will create a “deep-learning-powered” cybersecurity dashboard that simulates monitoring network traffic for malicious events in real time.
“If you spend more on coffee than IT security, you will be hacked,” warned U.S. Cybersecurity Czar Richard Clarke, speaking to a standing-room-only crowd at RSA Conference. “And moreover, you deserve to be hacked.” If it weren’t for network attacks, this quote would make a great bumper sticker.
According to research by IBM, it takes 280 days to find and contain the average cyberattack, while the average attack costs $3.86 million. But what are network attacks, and how can we leverage a next-gen search tool like Jina to mitigate our exposure to the threat?
Network attacks are a broad category of cybersecurity threats in which a malicious actor attempts to disrupt, steal, or corrupt an organization’s data by gaining unauthorized access to its systems. The proverbial “needle in a haystack”, network attacks are an inherently difficult problem because they require finding rare events in extremely large datasets.
When a dataset contains hundreds or even thousands of dimensions, it can pose tricky challenges (e.g., the curse of dimensionality). Similarity search is an approach to understanding high-dimensional data that works by finding objects in a collection that are similar according to some definition of sameness. You can think of it as a k-Nearest Neighbor (k-NN) problem in which the similarity of objects is measured by their distance.
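To make the idea concrete, here is a minimal, self-contained sketch (plain NumPy, not part of the project) of nearest-neighbor classification by cosine similarity:

```python
import numpy as np

def nearest_neighbor_label(query, index_vectors, index_labels):
    """Classify `query` by the label of its most cosine-similar indexed vector."""
    # Normalize so that a dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    idx = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    sims = idx @ q
    return index_labels[int(np.argmax(sims))]

# Toy 2-D example: two "benign" vectors and one "malicious" one
vectors = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
labels = ["benign", "benign", "malicious"]
print(nearest_neighbor_label(np.array([0.1, 0.9]), vectors, labels))  # → malicious
```

The same idea scales up to our real setting, where each vector is a 128-D embedding of one row of network traffic.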
In this series of blogs, we will build a Jina application that leverages similarity search to classify network traffic flows as either benign or malicious. Our goal is to develop a reliable, scalable, and speedy intrusion detection system that predicts in real time whether an attack is happening.
To pull this off, we will perform “network surgery” on a pre-trained neural network, removing the classification layer, and instead repurposing the network as a feature extractor. In other words, our network will output features, as opposed to labels.
Then we will take the 128-D embeddings generated by our feature extractor and make them searchable by indexing them with a Jina Flow. By indexing thousands of these 128-D vectors along with their labels (benign/malicious), we can capitalize on the relationship between distance and similarity in vector space: nearby embeddings tend to represent similar traffic.
This will allow us to take unseen network traffic data from a different day, extract its features, find its nearest neighbor in the index, and classify it as benign or malicious according to that neighbor’s class.
To recap, we are going to make a slight tweak to a pre-trained neural network and turn a classification problem into a similarity search problem so that we can simulate detecting malicious network traffic in real-time.
Here are the steps involved:
- Generate vector representations of our network traffic by using our network as a feature extractor
- Find a similarity measure that makes representations of similar things close together
- Find the nearest neighbors of search queries and return the labels they represent (benign/malicious) to identify malicious traffic
This project won’t build itself! Let’s get started and check out our dataset.
In data science, it is often the case that collecting and preprocessing your data is the most difficult and time-consuming step in building your application. This is particularly true in the cybersecurity domain, where datasets are notoriously difficult to find.
Since we can’t cover everything in one article, let’s just imagine we stubbed our toe on this super clean CSV dataset. It has seventy-nine columns of numerical features (e.g., port, protocol, fwd Packets, etc.) and one label (0 meaning benign, 1 meaning attack), describing 15,000 rows of network traffic on a particular day.
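Splitting such a CSV into features and labels is a one-liner with pandas. The snippet below builds a tiny stand-in frame with the same shape of columns (the `feature_*` and `Label` column names are assumptions, not the dataset's actual headers):

```python
import pandas as pd

# Stand-in for the real 15,000-row dataset: 79 numeric feature columns
# plus a 0/1 "Label" column (column names assumed for illustration)
demo = pd.DataFrame(
    {f"feature_{i}": [0.0, 1.0, 2.0] for i in range(79)} | {"Label": [0, 1, 0]}
)
demo.to_csv("network_traffic_demo.csv", index=False)

df = pd.read_csv("network_traffic_demo.csv")
X = df.drop(columns=["Label"]).to_numpy(dtype="float32")  # feature matrix
y = df["Label"].to_numpy()                                # 0 = benign, 1 = attack
print(X.shape, y.shape)  # → (3, 79) (3,)
```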
Imagine you used Keras to train a simple sequential feed-forward neural network on this super clean dataset to detect the malicious events, but were disappointed when the metrics you used to evaluate the model, precision and recall, gave you some less-than-desirable results (0.77 precision, 0.52 recall).
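Such a baseline might be sketched as follows; the layer sizes, the `embedding` layer name, and the random stand-in data are illustrative assumptions, not the project's exact architecture:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Simple sequential feed-forward classifier over the 79 traffic features.
# The penultimate 128-unit layer is the one we will repurpose later.
model = keras.Sequential([
    layers.Input(shape=(79,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu", name="embedding"),
    layers.Dense(1, activation="sigmoid"),  # P(attack)
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.Precision(), keras.metrics.Recall()],
)

# Tiny random stand-in for the real 15,000-row dataset
X = np.random.rand(32, 79).astype("float32")
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X, verbose=0).shape)  # → (32, 1)
```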
Improving Results — Jina to the Rescue!
Determined to improve our results, we are now going to make a slight tweak to our network and turn our classification problem into a similarity search problem by performing “network surgery” and repurposing our model as a feature extractor.
We want to see if we can get better results by indexing our encoded features into a DocumentArray and classifying network traffic as benign/malicious depending on its nearest neighbor in vector space. Let’s now perform some “network surgery”.
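In Keras terms, the surgery can be sketched like this; the architecture below is an assumed stand-in for the trained classifier, with the penultimate 128-unit layer (named `embedding` here for illustration) kept as the new output:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Assumed stand-in for the trained classifier described in the article
clf = keras.Sequential([
    layers.Input(shape=(79,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu", name="embedding"),
    layers.Dense(1, activation="sigmoid"),
])

# "Network surgery": drop the classification head and expose the 128-D layer
extractor = keras.Model(
    inputs=clf.input,
    outputs=clf.get_layer("embedding").output,
)

X = np.random.rand(4, 79).astype("float32")  # four fake rows of traffic
embeddings = extractor.predict(X, verbose=0)
print(embeddings.shape)  # → (4, 128)
```

The extractor shares its weights with the original network, so in the real application it would emit the representations the classifier learned during training.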
As you can see, instead of taking our 79-D input features and running them through our network, outputting a label of benign or attack, our network will now take these 79-D features and output a 128-D feature vector containing a rich, mathematical representation of each row of network traffic data.
Armed with an understanding of how embeddings are generated, we can move on to the indexing phase. Indexing is where we store our embeddings and their associated labels (benign/malicious) in a way that makes them searchable.
Now that we have generated rich mathematical representations of our network traffic data in the form of vector embeddings by passing it through our feature extractor, it’s time to index our vectors so that we can search them by (cosine) similarity. The idea is that you can infer the class of an unknown data point from the data points that immediately surround it.
To do this, we will need to define our index flow 👇
In a previous post, I characterized Flows as the “grand puppet masters” of the Jina ecosystem. A Flow defines how Executors are connected and how data flows through them. Everything going into or out of a Flow has to be a Document.
By default, a Flow assumes that every Executor needs the previously added Executor. But because Flows are modeled internally as graphs, they are not restricted to sequential execution.
Flows can represent any complex, non-cyclic topology to index or query Documents. A typical use case for such a Flow is a topology with a standard preprocessing part but different indexers separating embeddings and data. It can also be used to build switch-like nodes, where some Documents pass through one parallel branch of the Flow, while other Documents pass through a different branch.
Our application will also feature a complex Flow topology to index our network traffic embeddings, albeit with a twist.
As you can see in the figure below, rather than visiting each Executor defined in our Flow “in order”, our Documents will originate from an Executor called ITPrepper. The Documents will then be sent to the DocArrayIndexer and WeaviateIndexer in parallel, with our Flow ensuring that each Document originating from ITPrepper goes to each indexer only once. The last Executor, DummyExecutor, will receive both DocumentArrays and merge them automatically.
While it may not be immediately apparent why you’d want to use two different indexers like this, one could imagine a scenario in which different indexers utilizing different search algorithms could each be queried independently. By getting the “vote” of all the indexers, we can create an “ensemble” of predictions at query time.
Now let’s shift our focus away from Flow topology and onto the performance of our model when we use it as a feature extractor. We need to determine whether we get any improvement in precision/recall using similarity search as opposed to the classification output of the original neural network.
Results with Similarity Search
With the indexing process complete and our indexed Documents begging to be searched, we are ready to determine whether we get better precision/recall results using our network as a feature extractor as opposed to the original output labels of the neural network.
If you recall from earlier in the indexing section, we now have two separate indexes (DocArrayIndexer/WeaviateIndexer) containing our network traffic embeddings and their associated labels in DocumentArrays.
We will now load each index separately, match it to itself to get the nearest neighbors, and loop through the Documents, appending their known and predicted labels to pre-initialized Python lists. We will use cosine distance as our definition of sameness and calculate our accuracy, precision, recall, and F1 scores with Scikit-learn.
In other words, we will classify each Document by comparing its embedding with every other embedding in the index, taking the closest one, and assigning the Document the “known_label” of that nearest neighbor.
Below are our evaluation metrics when we use similarity search as a basis for classification rather than the actual neural network output. As you can see, we significantly improved our precision and recall scores. We get the same results for both indexers, meaning they “agree” for all of our data points.
In the future, it would be interesting to use this model as a feature extractor for broader classes of malicious traffic that have not been seen during training. This is called transfer learning, where we use a pre-trained neural network as a starting point to learn patterns from data it was not originally trained on.
You can find the source code here 👇
GitHub - k-zehnder/cybersecurity-jina: 🔎 Network Intrusion Dashboard built with Jina and Streamlit
In this article, we used Jina to demonstrate how similarity search can be used to solve a common business challenge like network security. We accomplished this by developing a feature extractor from a pre-trained neural network that allowed us to predict whether network traffic is malicious based on its similarity with other known training samples, rather than the output label of the network.
In a future article, we will explore how we can “wrap” our Jina search pipeline in a Streamlit front-end, making it easy for us to simulate network traffic monitoring in real time through the browser!