Honours Thesis · Dalhousie University · 2018

Mirai Botnet
Traffic Analysis

An air-gapped virtual lab. Six DDoS attack types captured from live malware. Decision trees that classified the resulting flows with near-perfect accuracy, while Snort's community ruleset raised no alerts on any of them.

Mirai · DDoS · Machine Learning · IoT Security · Decision Trees · Python · WEKA · Wireshark
April 2018

Overview

The Problem with IoT Botnets

Mirai is malware that infects IoT devices (surveillance cameras, home routers, DVRs, baby monitors) and conscripts them into a botnet used to launch Distributed Denial of Service (DDoS) attacks. In 2016 it drove some of the largest attacks recorded up to that point: a 620 Gbps attack against KrebsOnSecurity, a 1+ Tbps attack against OVH, and an attack on the DNS provider Dyn that left Netflix, Twitter, and Amazon unreachable for many users at the same time. Over 100,000 devices participated.

The scale was enabled by one thing: insecure IoT devices shipped with factory-default credentials, no update mechanism, and no way to recall them. Mirai spreads by scanning the internet for Telnet access and trying 63 default username/password combinations. Most of the time, it gets in.

Research Goal

No Public Dataset. Build One.

At the time of writing, no public labelled dataset of Mirai attack traffic existed. Existing work had either used honeypots (hard to label confidently) or performed high-level static analysis without measuring detection effectiveness.

The goal of this thesis: build a controlled environment, generate genuine Mirai traffic across all attack types, test it against current IDS solutions, and then apply machine learning to see if traffic could be accurately classified by attack type — and distinguished from normal traffic and other botnet activity.

Methodology

Air-Gapped Virtual Lab

Mirai's source code was leaked in 2016. Using it to generate traffic without infecting real devices required a fully isolated environment. The lab was built in VirtualBox with all machines on an internal-only network, and the host machine was taken completely offline before any malware was executed.

The network consisted of a router (DNS + DHCP), a C&C server running Mirai's command and control infrastructure, and two client VMs acting as IoT devices. One was manually infected; the other was the scan target. Mirai's DNS-based C2 addressing was preserved — the C&C hostname was encrypted with Mirai's own enc utility and hard-coded into the bot binary.

[Router: DNS + DHCP] ───── [C&C Server: MySQL + Go] ───── [Client-2: infected bot] ───── [Client-Clean: scan target]

intnet-1 · VirtualBox internal network · air-gapped

Data Collection

Six Attack Types, ~1M Flows

Attacks were launched from the C&C Telnet interface and captured on the infected host using Dumpcap. Raw pcap files were then processed through two flow exporters, Argus and Tranalyzer, producing two feature-set variants of the same traffic. Of Mirai's ten attack types, four (UDP Plain, HTTP, DNS, STOMP) failed to generate traffic because of network configuration or debug-mode limitations. The other six produced substantial captures.
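Because each capture holds a single attack type, labelling the exported flows can be as simple as tagging every row with the attack that produced it. The sketch below illustrates that idea under stated assumptions: the column names (`proto`, `pkts`, `bytes`) and the two tiny CSV snippets are hypothetical stand-ins, not real Argus or Tranalyzer output, and `label_flows` is an illustrative helper rather than the thesis code.

```python
import csv
import io


def label_flows(csv_text, attack_label):
    """Parse one flow-export CSV and tag every row with its attack type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{**row, "label": attack_label} for row in reader]


# Hypothetical exports; real flow exporters emit many more columns.
syn_csv = "proto,pkts,bytes\ntcp,1,74\ntcp,1,74\n"
udp_csv = "proto,pkts,bytes\nudp,3,1536\n"

# One labelled dataset built from per-attack captures.
dataset = label_flows(syn_csv, "SYN Flood") + label_flows(udp_csv, "UDP Flood")
for row in dataset:
    print(row)
```

The same pattern would apply to either exporter's output, since only the feature columns differ between the two variants.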

Attack Type    Capture Size    Flows (Argus)
ACK Flood      111 MB          193,785
SYN Flood      20 MB           193,039
UDP Flood      96 MB           170,406
GRE IP         116 MB          198,393
GRE ETH        136 MB          227,524
VSE            23 MB           63,663
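As a quick check on the "~1M flows" figure, the Argus flow counts listed above can be summed and expressed as class shares. This is plain arithmetic on the published numbers; nothing else is assumed.

```python
# Argus flow counts per attack type, as listed in the table.
flows = {
    "ACK Flood": 193_785,
    "SYN Flood": 193_039,
    "UDP Flood": 170_406,
    "GRE IP": 198_393,
    "GRE ETH": 227_524,
    "VSE": 63_663,
}

total = sum(flows.values())
print(f"total flows: {total:,}")  # total flows: 1,046,810

# Share of the dataset contributed by each attack type.
for attack, count in flows.items():
    print(f"{attack:<10} {count / total:6.1%}")
```

VSE is the smallest class by a wide margin, which is worth keeping in mind when reading per-class scores.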

Finding #1

Snort Missed Everything

Two Snort rulesets were tested against the captured traffic. The snort3-community-rules — the free, publicly available community ruleset — raised zero alerts for every single attack type.

A custom rule from prior academic work, designed specifically to catch Mirai's SYN flood, raised 204,388 alerts for that attack — but nothing for the other five.

This confirmed the hypothesis: the signature-based IDS rules available at the time had essentially no coverage for Mirai traffic, and the one rule that did fire covered a single attack type. The need for a data-driven approach was clear.

Finding #2

Decision Trees Nailed It

Two decision tree algorithms were evaluated: C4.5 (WEKA's J48 implementation) and CART (scikit-learn's decision tree classifier). Both were trained on flows exported by Argus and Tranalyzer. The analysis ran in two phases.

Phase 1 — classify which of the 6 attack types a given flow belongs to. Both algorithms achieved F-scores of 1.0 across all classes using 10-fold cross-validation and a 66/33 train/test split. The resulting trees were remarkably small: 11 nodes for Argus, 19 for Tranalyzer.
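A minimal sketch of what a Phase 1 style evaluation could look like with scikit-learn's CART implementation. The data is synthetic and deliberately easy to separate: the two features (a mean packet size and an IP protocol number) and the cluster centres are invented for illustration and are not the thesis features or results. A test size of 0.34 is assumed as an approximation of the split quoted above.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# SYNTHETIC stand-in for flow features; not the thesis data.
rng = np.random.default_rng(0)
labels = ["ack", "syn", "udp", "greip", "greeth", "vse"]
# Hypothetical (mean packet size, IP protocol number) centre per class.
centers = np.array(
    [[60, 6], [74, 6], [512, 17], [540, 47], [554, 47], [67, 17]], dtype=float
)
X = np.vstack([c + rng.normal(0, 0.1, size=(200, 2)) for c in centers])
y = np.repeat(labels, 200)

# 10-fold cross-validation, scored by macro-averaged F1.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=10, scoring="f1_macro"
)
print("10-fold macro F1:", cv_scores.mean())

# Hold-out evaluation with a stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, stratify=y, random_state=0
)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
per_class_f1 = f1_score(y_test, clf.predict(X_test), average=None, labels=labels)
print("per-class F1:", dict(zip(labels, per_class_f1.round(3))))
print("tree size (nodes):", clf.tree_.node_count)
```

On data this cleanly separated a small tree is expected; the interesting result in the thesis is that real Mirai flows behaved similarly.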

Phase 2 — mixed dataset. The 6-attack Argus data was merged with CTU-13 Scenario 5, which contains labelled normal traffic and other (non-Mirai) botnet traffic. Two partitioning strategies were used: balanced (800 samples per class) and unbalanced (straight 66/33 split).
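The balanced strategy amounts to drawing the same number of flows from every class before training. A minimal sketch, assuming rows are dictionaries with a `label` key; the helper name `balanced_sample` and the toy class sizes are hypothetical, and only the figure of 800 samples per class comes from the text above.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, n_per_class, seed=0):
    """Draw n_per_class rows at random from each label, without replacement."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    rng = random.Random(seed)
    sample = []
    for label, group in by_label.items():
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        sample.extend(rng.sample(group, n_per_class))
    return sample


# Toy, deliberately unbalanced input (hypothetical class sizes).
rows = [{"label": "Normal", "id": i} for i in range(5000)]
rows += [{"label": "Botnet", "id": i} for i in range(1200)]
rows += [{"label": "SYN Flood", "id": i} for i in range(900)]

subset = balanced_sample(rows, n_per_class=800)
counts = Counter(r["label"] for r in subset)
print(counts)
```

Balancing keeps a large class such as normal traffic from dominating the splits, at the cost of discarding most of the available rows.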

Both models performed almost perfectly. The unbalanced model achieved F-scores of 0.999 for Normal and 0.995 for Botnet classes, with 1.0 across all 6 Mirai attack types. The trees stayed small — 15 and 19 nodes respectively — making them directly portable to human-readable IDS rules.

Key Insight

The decision trees that distinguished six attack types with near-perfect accuracy had only 11–19 nodes. Small enough to hand-write as Snort rules. The same system that failed completely with signatures could likely be retrofitted with ML-derived rules.
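To show why a small tree maps naturally onto hand-written rules, the sketch below walks every root-to-leaf path of a tree and prints it as an IF/THEN rule. The tree itself is hypothetical: the feature names and thresholds are invented for illustration and are not the splits learned in the thesis, and the output is generic rule text rather than Snort syntax.

```python
# HYPOTHETICAL tree: (feature, threshold, left_subtree, right_subtree),
# where a plain string is a leaf holding the predicted class.
tree = (
    "mean_pkt_size", 100.0,
    ("proto_is_tcp", 0.5, "VSE", "SYN Flood"),
    ("proto_is_gre", 0.5, "UDP Flood", "GRE IP"),
)


def tree_to_rules(node, conditions=()):
    """Yield one (conditions, label) pair per leaf by walking each path."""
    if isinstance(node, str):  # leaf: a class label
        yield list(conditions), node
        return
    feature, threshold, left, right = node
    yield from tree_to_rules(left, conditions + (f"{feature} <= {threshold}",))
    yield from tree_to_rules(right, conditions + (f"{feature} > {threshold}",))


rules = list(tree_to_rules(tree))
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN {label}")
```

A tree with a handful of leaves yields a handful of rules, each a short conjunction of threshold tests, which is what makes manual translation into IDS rules plausible.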

Future Work

What's Left

  • Mirai's scanning and infection behaviour wasn't captured due to a Telnet compatibility issue with the virtual network. It remains a separate research thread worth pursuing.
  • Only 6 of 10 attack types generated traffic. The other four (HTTP, DNS, STOMP, UDP Plain) need further investigation.
  • The ML-derived trees could be automatically converted to Snort rules and tested against the full CTU-13 dataset for false-positive rates.
  • Mirai has dozens of known variants. Rules robust enough to generalize across variants — rather than just the original source code — remain an open problem.
