Honours Thesis · Dalhousie University · 2018
Mirai Botnet Traffic Analysis
An air-gapped virtual lab. Six DDoS attack types captured from live malware. Decision trees that classified flows with near-perfect accuracy, while Snort's community ruleset raised no alerts at all.
Overview
The Problem with IoT Botnets
Mirai is malware that infects IoT devices (surveillance cameras, home routers, DVRs, baby monitors) and conscripts them into a botnet used to launch Distributed Denial of Service attacks. In 2016 it launched some of the largest DDoS attacks recorded up to that point: a 620 Gbps attack against KrebsOnSecurity, an attack of more than 1 Tbps against OVH, and an attack on Dyn's DNS service that disrupted access to Netflix, Twitter, and Amazon at the same time. Over 100,000 devices participated.
The scale was enabled by one thing: insecure IoT devices shipped with factory-default credentials, no update mechanism, and no way to recall them. Mirai spreads by scanning the internet for devices with Telnet exposed and trying a hard-coded list of 63 default username/password combinations. On devices still running factory defaults, that is all it takes.
Research Goal
No Public Dataset. Build One.
At the time of writing, no public labelled dataset of Mirai attack traffic existed. Existing work had either used honeypots (hard to label confidently) or performed high-level static analysis without measuring detection effectiveness.
The goal of this thesis: build a controlled environment, generate genuine Mirai traffic across all attack types, test it against current IDS solutions, and then apply machine learning to see if traffic could be accurately classified by attack type — and distinguished from normal traffic and other botnet activity.
Methodology
Air-Gapped Virtual Lab
Mirai's source code was leaked in 2016. Using it to generate traffic without infecting real devices required a fully isolated environment. The lab was built in VirtualBox with all machines on an internal-only network, and the host machine was taken completely offline before any malware was executed.
The network consisted of a router (DNS + DHCP), a C&C server running Mirai's command and control infrastructure, and two client VMs acting as IoT devices. One was manually infected; the other was the scan target. Mirai's DNS-based C2 addressing was preserved — the C&C hostname was encrypted with Mirai's own enc utility and hard-coded into the bot binary.
intnet-1 · VirtualBox internal network · air-gapped
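As a rough illustration of the isolation step, the sketch below builds (but does not run) the `VBoxManage modifyvm` commands that attach each VM's first adapter to a VirtualBox internal network. The VM names are hypothetical placeholders; only the network name `intnet-1` comes from the lab diagram. The thesis does not state whether the lab was configured from the command line or through the GUI.

```python
# Sketch: build (but do not execute) VBoxManage commands that put each VM's
# first network adapter on a VirtualBox internal network.
# VM names below are hypothetical; "intnet-1" is the name shown in the diagram.

LAB_VMS = ["router", "cnc-server", "iot-infected", "iot-target"]  # assumed names
INTERNAL_NETWORK = "intnet-1"


def internal_network_commands(vm_names, network=INTERNAL_NETWORK):
    """Return one VBoxManage argument list per VM."""
    return [
        ["VBoxManage", "modifyvm", vm, "--nic1", "intnet", "--intnet1", network]
        for vm in vm_names
    ]


commands = internal_network_commands(LAB_VMS)
for cmd in commands:
    print(" ".join(cmd))
```

An internal network is reachable only by the VMs attached to it, which is why it suits malware work. Taking the host offline, as described above, is an extra precaution on top of that.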
Data Collection
Six Attack Types, ~1M Flows
Attacks were launched from the C&C Telnet interface and captured on the infected host using Dumpcap. Raw pcap files were then processed through two flow exporters, Argus and Tranalyzer, producing two feature-set variants of the same traffic. Four attack types (UDP Plain, HTTP, DNS, STOMP) failed to generate traffic because of network configuration or debug-mode limitations. The other six produced substantial captures.
| Attack Type | Capture Size | Flows (Argus) |
|---|---|---|
| ACK Flood | 111 MB | 193,785 |
| SYN Flood | 20 MB | 193,039 |
| UDP Flood | 96 MB | 170,406 |
| GRE IP | 116 MB | 198,393 |
| GRE ETH | 136 MB | 227,524 |
| VSE | 23 MB | 63,663 |
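The "~1M flows" figure can be checked directly from the table. The short calculation below sums the Argus flow counts and prints each attack's share of the total. It uses only the numbers listed above.

```python
# Flow counts copied from the table above (Argus exporter).
ARGUS_FLOWS = {
    "ACK Flood": 193_785,
    "SYN Flood": 193_039,
    "UDP Flood": 170_406,
    "GRE IP": 198_393,
    "GRE ETH": 227_524,
    "VSE": 63_663,
}

total_flows = sum(ARGUS_FLOWS.values())
shares = {name: count / total_flows for name, count in ARGUS_FLOWS.items()}

print(f"total flows: {total_flows:,}")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {share:6.1%}")
```

The total comes to 1,046,810 flows. VSE is the smallest class at about 6% of the data, which is worth keeping in mind when reading per-class scores.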
Finding #1
Snort Missed Everything
Two Snort rulesets were tested against the captured traffic. The snort3-community-rules — the free, publicly available community ruleset — raised zero alerts for every single attack type.
A custom rule from prior academic work, designed specifically to catch Mirai's SYN flood, raised 204,388 alerts for that attack — but nothing for the other five.
This confirmed the hypothesis: apart from one purpose-built SYN flood rule, the Snort signatures tested had no coverage for Mirai attack traffic. The need for a data-driven approach was clear.
Finding #2
Decision Trees Nailed It
Two decision tree algorithms were evaluated: C4.5 (WEKA's J48) and CART (scikit-learn). Both were trained on flows exported by Argus and Tranalyzer. The analysis ran in two phases.
Phase 1: classify which of the 6 attack types a given flow belongs to. Both algorithms achieved F-scores of 1.0 across all classes, both under 10-fold cross-validation and on a 66/34 train/test split. The resulting trees were remarkably small: 11 nodes for Argus, 19 for Tranalyzer.
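For readers unfamiliar with the metric, here is a minimal sketch of a per-class F-score, using F1 = 2·TP / (2·TP + FP + FN). The function name and the toy labels are invented for illustration. The thesis results came from WEKA and scikit-learn, not from this code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, invented for illustration only (not thesis data).
y_true = ["SYN", "SYN", "ACK", "ACK", "VSE", "VSE"]
perfect = per_class_f1(y_true, y_true)
one_error = per_class_f1(y_true, ["SYN", "ACK", "ACK", "ACK", "VSE", "VSE"])

print(perfect)
print(one_error)
```

A score of 1.0 for a class means no false positives and no false negatives for that class. In the toy example, a single misclassified flow lowers the score of both classes involved.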
Phase 2: mixed dataset. The 6-attack Argus data was merged with CTU-13 Scenario 5, which contains labelled normal traffic and other (non-Mirai) botnet traffic. Two partitioning strategies were used: balanced (800 samples per class) and unbalanced (straight 66/34 split).
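The two partitioning strategies can be sketched with the standard library alone. The helper names, the seed, and the synthetic rows are assumptions made for illustration. The thesis does not describe its sampling code, and the class sizes here are invented.

```python
import random
from collections import defaultdict


def balanced_sample(rows, per_class, seed=0):
    """Draw up to `per_class` rows for each label, without replacement."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append((features, label))
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


def percentage_split(rows, train_fraction=0.66, seed=0):
    """Shuffle once, then cut into train and test portions."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Synthetic placeholder rows: (feature_vector, label). Class sizes are invented.
sizes = {"Normal": 5000, "Botnet": 1200, "SYN Flood": 3000}
rows = [((i,), label) for label, n in sizes.items() for i in range(n)]

balanced = balanced_sample(rows, per_class=800)
train, test = percentage_split(rows)

print(len(balanced), len(train), len(test))
```

The balanced variant gives every class equal weight during training. The straight split keeps the natural class proportions, so the larger classes dominate.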
Both models performed almost perfectly. The unbalanced model achieved F-scores of 0.999 for Normal and 0.995 for Botnet classes, with 1.0 across all 6 Mirai attack types. The trees stayed small — 15 and 19 nodes respectively — making them directly portable to human-readable IDS rules.
Key Insight
The decision trees that distinguished six attack types with near-perfect accuracy had only 11–19 nodes. Small enough to hand-write as Snort rules. The same system that failed completely with signatures could likely be retrofitted with ML-derived rules.
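To show why a small tree is easy to turn into rules, the sketch below walks a hypothetical 7-node tree and emits one IF/THEN rule per leaf. The tree structure, feature names (`ip_proto`, `tcp_ack`), and thresholds are invented for illustration. They are not the trees learned in the thesis, and the output is generic rule text, not Snort syntax.

```python
# Hypothetical tree: structure, feature names, and thresholds are invented.
# Internal nodes test `feature <= threshold`; leaves carry a class label.
TREE = {
    "feature": "ip_proto", "threshold": 6,
    "left": {
        "feature": "tcp_ack", "threshold": 0,
        "left": {"label": "SYN Flood"},
        "right": {"label": "ACK Flood"},
    },
    "right": {
        "feature": "ip_proto", "threshold": 17,
        "left": {"label": "UDP Flood"},
        "right": {"label": "GRE IP"},
    },
}


def tree_to_rules(node, path=()):
    """Return one 'IF ... THEN label' string per leaf, left to right."""
    if "label" in node:
        condition = " AND ".join(path) if path else "TRUE"
        return [f"IF {condition} THEN {node['label']}"]
    passed = f"{node['feature']} <= {node['threshold']}"
    failed = f"{node['feature']} > {node['threshold']}"
    return (tree_to_rules(node["left"], path + (passed,))
            + tree_to_rules(node["right"], path + (failed,)))


def count_nodes(node):
    """Count internal nodes plus leaves."""
    if "label" in node:
        return 1
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])


rules = tree_to_rules(TREE)
for rule in rules:
    print(rule)
print("nodes:", count_nodes(TREE))
```

Each leaf becomes a single conjunction of threshold tests. With 11 to 19 nodes, the full rule set stays short enough to review by hand.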
Future Work
What's Left
- Mirai's scanning and infection behavior wasn't captured due to a Telnet compatibility issue with the virtual network. This is a separate research thread worth pursuing.
- Only 6 of 10 attack types generated traffic. The other four (HTTP, DNS, STOMP, UDP Plain) need further investigation.
- The ML-derived trees could be automatically converted to Snort rules and tested against the full CTU-13 dataset for false-positive rates.
- Mirai has dozens of known variants. Rules robust enough to generalize across variants, rather than just the original source code, remain an open problem.