Hi all, I’ve been building a machine learning course focused on security called Security Kiwi (https://security.kiwi) for a few months and it’s almost ready. The course material is based on my learning and tinkering during a BSc and MSc. However, you don’t need a degree to understand ML. The course is designed for anyone who is interested - you don’t need to be “good at math” or an excellent programmer to learn. The main language is Python; if you’re a bit rusty or new, there will be a short Python refresher.
I’ve brought in interesting insights from the world of academic research, which is usually behind journal paywalls. Eventually, once we have learned the foundations of ML, we’ll discuss research and create interesting projects, some based on research. Active project ideas include: email malware collection systems, facial recognition for OSINT and cyber attack prediction. Potential areas of discussion & projects include: threat detection (insider, external), malware detection, threat prediction, threat projection (e.g. determining what stage of an attack a threat actor is likely currently at), breach detection and more.
The first part of the course launches on the 7th of January, and new content will be added regularly. I don’t like the idea of putting a hefty financial barrier in front of knowledge, so the majority of the content is freely accessible. Video walkthroughs, downloadable workbooks, Q&A and other materials designed to make learning easier and more successful are part of Patreon support tiers.
In the meantime, I’ll walk you through a dataset designed for testing machine learning-based intrusion detection systems which we use in the course. We’ll build a few useful tools and gain an understanding of simple dataset manipulation needed when we are first shopping around for a dataset.
Why do we need a dataset?
Machine learning isn’t only focused on algorithms and state-of-the-art techniques like convolutional neural networks. We need an understanding of datasets and data science techniques. In our fictional scenario here, we are planning to create an intrusion detection system. We want to scan live network data and decide if the traffic is potentially malicious.
Gaining a Simple Overview of a Dataset
Once we have our idea (creating an intrusion detection system), we want to look for datasets which suit our needs. First, we want to stop and consider our requirements, so we understand what we actually need to do with our dataset. That process is a little dry for a first introduction to machine learning, so you can read about it on Security Kiwi on January 7th. Here we’ll focus on practical skills, using the Python libraries Pandas and matplotlib to gain an overview of what’s inside a dataset.
You can follow along with the code in a Google live code environment (Google Colab).
We’ll be looking at the first 5,000 rows of the UGR’16 dataset created by Maciá-Fernández et al. We can’t look at the whole thing, as it’s 52GB of data. UGR’16 contains real and synthetic netflow v9 data captured from sensors within a tier-3 ISP over four months. Typically, datasets only cover days or weeks. The ISP is a cloud service provider, providing virtualized services such as WordPress, Joomla, email, FTP etc. Victim machines were colocated alongside real clients and attacker machines were placed outside the network. Synthetic attacks are generated at fixed and random times, allowing anomaly detectors to be assessed. Real botnet traffic captured from the malware Neris was inserted into the network data during capture.
Get the Data
Below we download a file I created containing the first 5,000 rows of the UGR’16 dataset. In the course we will set up and use Jupyter Notebooks; here, however, we work solely with Google Colab.
import requests

DOWNLOAD_REPO = "https://raw.githubusercontent.com/krisbolton/machine-learning-for-security/master/"
DOWNLOAD_FILENAME = DOWNLOAD_REPO + "ugr16-july-week5-first5k.csv"
DATASET_FILENAME = "ugr16-july-week5-first5k.csv"

# Fetch the file and stop with an error if the request failed.
response = requests.get(DOWNLOAD_FILENAME)
response.raise_for_status()

# Save the downloaded bytes to a local file.
with open(DATASET_FILENAME, "wb") as f:
    f.write(response.content)

print("Download complete.")
DOWNLOAD_REPO is the URL of a repository containing datasets, and DOWNLOAD_FILENAME combines that URL with the name of the file we want to download. DATASET_FILENAME is the name the file is given when it is created locally. We then use the requests library to fetch the dataset, check for errors (.raise_for_status()), open a local file using open(), write the content of the response to it using write(), and finally print a message so we know when it’s done.
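As a quick sanity check after a download like this, we can count the lines in the saved file. The following is only a minimal sketch: count_lines is a hypothetical helper (not part of the course code), and the file it runs on is a small stand-in with made-up values, created here so the snippet works without network access.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a text file without loading it all into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Stand-in file with three made-up rows, so the sketch runs offline.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("a,1\nb,2\nc,3\n")

line_count = count_lines(demo_path)
print(line_count)
```

With the real file, count_lines(DATASET_FILENAME) should report roughly 5,000 lines if the download completed.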
Explore the Data
Initially, we want to know what type of data is contained within the dataset, how many rows we have and other simple data points such as this. Below we use the Python library Pandas to read the CSV file, convert it into a pandas dataframe and print information about the dataframe.
If you’re unfamiliar, a dataframe is a data structure with rows and columns, similar to a table, which makes working with data much easier.
import pandas as pd

df = pd.read_csv("ugr16-july-week5-first5k.csv")
df.info()
We import pandas under the alias pd (a convention), read the contents of the CSV file into the variable df (short for dataframe, another convention) and call the info() method on df, which provides a basic summary of the dataframe.
The output of df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 13 columns):
2016-07-27 13:43:21    4999 non-null object
48.380                 4999 non-null float64
220.127.116.11         4999 non-null object
18.104.22.168          4999 non-null object
53                     4999 non-null int64
53.1                   4999 non-null int64
UDP                    4999 non-null object
.A....                 4999 non-null object
0                      4999 non-null int64
0.1                    4999 non-null int64
2                      4999 non-null int64
209                    4999 non-null int64
background             4999 non-null object
dtypes: float64(1), int64(6), object(6)
memory usage: 507.8+ KB
info() shows us information about each column in the dataset as rows in this output. General information is provided: the number of entries, the number of columns (13), memory usage and details of those 13 columns. The index counts from 0, so 0 to 4,998 means there are 4,999 entries; the file holds 5,000 rows, but pandas has read the first one as the column headings. In the column details, the first field is the heading of the dataset column (the dataset creators didn’t use headings, which is why values from the first row appear here), the second is the number of non-null values in that column, and the last is the data type.
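The effect of a missing header row can be reproduced on a small scale. This is a minimal sketch using a three-row stand-in CSV with made-up values rather than the real file: by default read_csv takes the first record as column headings, while passing header=None keeps every record as data and numbers the columns from 0.

```python
import io

import pandas as pd

# Three made-up records in a headerless CSV (stand-in for the real file).
raw = (
    "2016-07-27 13:43:21,0.5,UDP\n"
    "2016-07-27 13:43:22,1.2,TCP\n"
    "2016-07-27 13:43:23,0.0,UDP\n"
)

# Default behaviour: the first record is consumed as column headings.
inferred = pd.read_csv(io.StringIO(raw))

# header=None: every record is kept as data; columns are numbered from 0.
headerless = pd.read_csv(io.StringIO(raw), header=None)

print(len(inferred), list(inferred.columns))
print(len(headerless), list(headerless.columns))
```

Whether to read the tutorial file this way is a design choice; the walkthrough here keeps the default call and names the columns afterwards.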
Visual Overview of Numerical Data
Below we visualise the different numerical data within the dataset using matplotlib and its histogram feature. This can be useful to gain an understanding at-a-glance of the distribution or shape of numerical data within a dataset. Here we don’t gain much insight.
import matplotlib.pyplot as plt

df.hist(bins=50, figsize=(30,15))
plt.show()
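When histograms are hard to read, a numerical summary can complement them. The sketch below uses the dataframe method describe(), which reports the count, mean, standard deviation, minimum, quartiles and maximum of each numerical column. The small dataframe and its column names are made up for illustration; on the tutorial's data you would call df.describe() instead.

```python
import pandas as pd

# Made-up flow records; column names are illustrative only.
sample = pd.DataFrame(
    {
        "packets": [1, 2, 2, 3, 100],
        "bytes": [60, 120, 130, 200, 150000],
    }
)

summary = sample.describe()
print(summary)

# A mean far above the median (the 50% row) hints at a skewed column,
# which is one reason a plain histogram can be hard to read.
print(summary.loc["mean", "packets"], summary.loc["50%", "packets"])
```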
View the Raw Data
Let’s actually view some of the data using the Pandas dataframe we made earlier.
The head() method prints the first n rows of a dataset. The default is 5, but you can pass a value within the parentheses (e.g. head(50) for the first 50). The tail() method shows entries from the end of a dataset. Viewing snippets like this allows us to see the actual values within our dataset without viewing the whole thing - with datasets in the order of gigabytes, opening such large files can be a task in itself.
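For files too large to open comfortably, read_csv can also stop after a set number of rows via its nrows parameter. The following is a minimal sketch using a made-up ten-row CSV held in memory; with a real file you would pass its path instead of the StringIO object.

```python
import io

import pandas as pd

# Ten made-up rows standing in for a much larger file.
raw = "".join(f"{i},{i * 10}\n" for i in range(10))

# Read only the first four rows; the remaining rows are not loaded
# into the dataframe.
preview = pd.read_csv(io.StringIO(raw), header=None, nrows=4)

print(preview.head(2))  # first two rows
print(preview.tail(2))  # last two of the four rows we read
```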
You’ll notice the column headings aren’t particularly helpful, especially without reading the paper which launched the dataset and finding the methodology for how the researchers captured this data. The researchers used NFDUMP to capture the netflow v9 data, so we can look up that format and infer what each column contains. Then we can use a pandas dataframe attribute to add the correct headings to each column. This is only for us to understand what we’re looking at; we wouldn’t do this for our algorithm. Machine learning algorithms only work well with numerical data. We discuss this, and how to transform data into various types of numerical data, in sections coming January 7th.
Below we add column headings by assigning a list to our dataframe’s columns attribute (df.columns).
# One heading per column, in the order the columns appear in the file.
df.columns = [
    'Date time', 'Duration', 'Source IP', 'Destination IP',
    'Source Port', 'Destination Port', 'Protocol', 'Flag',
    'Forwarding status', 'ToS', 'Packets', 'Bytes', 'Label',
]
df.head()
Much better, we can understand what we’re looking at now.
Understanding the Data
So what have we just looked at and how can it be used? This dataset contains labelled data; you may have noticed the Label column stating whether the traffic is background traffic or attack traffic. Labelled data has a known state, which you can use to teach machine learning algorithms the difference between normal background traffic and malicious traffic. Patterns in the timing or size of payloads, for example, may be determined to be indications of an attack by a machine learning algorithm. This is referred to as Supervised Machine Learning, where the algorithm is supervised by showing it, with labels, how something should be. There are other types of machine learning, including Unsupervised, which works on unstructured and unlabelled data to derive insight.
We will go through these in more detail, and in order, to ease you into machine learning and on to intermediate and advanced topics, as well as interesting and useful projects.
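To make the idea of labelled data concrete, the sketch below counts how many rows carry each label and separates the label from the other columns, which is a common first step before supervised training. The rows are made up; only the column names Packets, Bytes and Label and the label value background come from this tutorial, and the value attack is a placeholder rather than an actual label string from UGR’16.

```python
import pandas as pd

# Made-up rows; "attack" is a placeholder label, not a real UGR'16 value.
flows = pd.DataFrame(
    {
        "Packets": [2, 1, 40, 3],
        "Bytes": [209, 60, 5200, 310],
        "Label": ["background", "background", "attack", "background"],
    }
)

# How many rows carry each label?
label_counts = flows["Label"].value_counts()
print(label_counts)

# Separate the inputs (features) from the known answers (labels).
features = flows.drop(columns=["Label"])
labels = flows["Label"]
print(features.shape, labels.shape)
```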
I hope this little tutorial sparked an interest in machine learning and how it can be used for cyber security purposes. Check out the references below for more information on some of the things we discussed. You can use the contact form on https://security.kiwi to get in touch.
- Maciá-Fernández, G., Camacho, J., Magán-Carrión, R., García-Teodoro, P., and Therón, R. (2018) UGR’16: A new dataset for the evaluation of cyclostationarity-based network IDSs. Elsevier. https://www.sciencedirect.com/science/article/pii/S0167404817302353
- University of Granada (2016) UGR’16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs. https://nesg.ugr.es/nesg-ugr16/
- Pandas (2020a). Pandas - Python Data Analysis Library. https://pandas.pydata.org
- matplotlib (2020). Matplotlib: Python plotting. https://matplotlib.org
- Pandas (2020b). pandas.DataFrame. Pandas Documentation. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
- NFDUMP (2014) NFDUMP Overview. http://nfdump.sourceforge.net
- McKinney, W. (2017) Python for Data Analysis. O’Reilly Media. https://www.amazon.com/Python-Data-Analysis-Wes-Mckinney/dp/1491957662