Xanthus: Automated Reproducible Data Generation for Evaluating Intrusion Detection Systems
I add simple Windows support. As I finished this work when studying in THU, I name this version
thanthus
.
Fairly evaluating and comparing the efficacy of different intrusion detection systems (IDS) requires that experimental data be generated in a similar mechanism and/or shared across these systems. The reality, unfortunately, is that there exist few public repositories (e.g., DARPA 1998/1999/2000, KDD Cup99, DARPA TC Engagement 3) containing experimental data captured solely for the purpose of security analysis. Among those public data repositories, most are outdated because a tremendous amount of manual labor is almost always necessary to capture the data (e.g., DARPA TC program involves a number of teams from across the academia and the industry and it spans over many a year). Consequently, some newly-developed systems, in order to be able to compare against older systems, are evaluated using the data that is a decade or two older than the systems themselves (and usually and unsurprisingly exhibit good results). Given that there is a perpetual arms race between the defenders and the offenders in the realm of cyber security and that new cyber-threats are manufactured every day, a successful defence against a decade-old exploit is hardly an achievement.
Many existing systems, acknowledging this fact and ready to showcase their detection capability, design their own experiments and produce their own dataset as a result. Although the experiments are sometimes carefully described in their associated publications (e.g., in academic projects), such dataset suffers from the following drawbacks:
In the cases where the dataset is made public, later systems can but consume only a subset of the dataset for analysis. Therefore, if they require e.g., additional features from the dataset in the analysis, they must rerun the experiments to capture the data themselves again, instead of simply re-using the available dataset. Moreover, some systems publish only pre-processed dataset, which usually eliminates information from the original, raw dataset that is not relevant to their analysis, even though such information may be relevant for other systems.
When raw dataset is made public, it provides later systems with richer information content. However, the underlying systems that capture the raw dataset (e.g., audit systems) are also constantly evolving, generating finer-grained, more accurate information or offering a completely different perspective through which one understands system behavior (e.g., provenance systems). Security systems that take advantage of such advancement in the underlying systems may very well find even the raw data provided by previous systems insufficient.
If later systems must resort to reproducing dataset themselves as a result of the reasons listed above, they need to rely on descriptions provided by previous systems to ensure high-fidelity experiment replay. Even if we assume that previous systems provide sufficiently detailed descriptions to understand the experiment (which certainly is not always the case), there still exist a number of challenges.
- The experiment must be conducted using the exact software involved with matching versions. In many cases, security experts have since identified and patched vulnerabilities in the exploitable software used in security-related experiments, and thus the software itself usually has been updated to a newer version. Downgrading the target software and its dependencies is therefore necessary to reproduce the experiment. This sometimes cannot be automatically configured through existing package management systems and requires significant manual configuration.
- Some vulnerability may affect only a particular version of the operating system. This requirement no doubt further complicates the experimental setup and demands additional engineering effort.
- Other controllable factors may be omitted in the description that may or may not affect the final results of the experiment. For example, background activities may have been included in the dataset but was not discussed in detail.
Before we go into any detail about using Xanthus for automated, reproducible data generation for security analysis, we describe a pipeline in which we create dataset for a specific attack in a push-button fashion. Xanthus is a higher-level abstracted framework that generates such a pipeline for any attack that existing or future IDS intend to evaluate.
Primer to Xanthus: A Specific Pipeline
We introduce a specific pipeline that automates data capture for a particular attack. In this pipeline, we deploy virtual machines (VM), set up a virtual environment that recreates the attack scenario, and run the attack, while capturing data from a whole-system provenance capture system. Code is publicly available online at GitHub. Please refer to the code while finishing off the rest of this section.
Prerequisites
We assume that you understand the following terms and concepts. If not, click on the item that you do not understand to read more about it:
You may want to understand the following terms and concepts if you want to fully understand the attack that we will describe in the next section:
A Brief Attack Description
You could better understand the pipeline with the knowledge of the attack that we would like to reproduce automatically.
The attacker aims to invade a victim machine through a vulnerable (or exploitable) wget
.
The attacker sets up a malicious (or compromised) HTTP
server that redirects any requests to a malicious FTP
server
that contains a Debian
package with a Trojan backdoor.
The package appears to be the same as its legitimate version and may even work the same way,
but the moment the package is installed on the victim machine, it will initiate a reverse TCP connection to the attacker
who is listening for connections and create a reverse shell that allows the attacker to infiltrate into the victim machine.
When the victim machine attempts to download the benign package from the HTTP
server using wget
,
wget
allows arbitrary remote file upload to the host system.
Meaning that, instead of fetching the intended benign package, it allows redirection of the HTTP
server and downloads
the malicious one.
The user is unaware of such behavior and install the package through the package manager dpkg
.
The installed Trojan software establishes a connection to the attacker and the attack succeeds.
Software Involved
wget
v1.17 or older- Any
Debian
package with a Trojan backdoor. TheDebian
package must be installable (both benign and malicious version). - Functioning
HTTP
andFTP
server dpkg
package managerCamFlow
whole-system provenance capture system
Execution Platform
As expected, Debian
package can only run on any Debian
-based operating systems. This particular pipeline is run on
Ubuntu 18.04
(both the client and the server).
The Pipeline
Installation
To run this pipeline, you need to install at least the following items:
Vagrant
- Oracle
VirtualBox
Usage
If you git clone
the entire repository from GitHub, cd
into wget
directory.
We assume this directory would be your working directory.
We write a Makefile
to run our attack scenario for many times. If you want to run it once only,
modify this line: [ $${cnt} -lt 25 ]
to [ $${cnt} -lt 1 ]
in the Makefile
.
(In Xanthus
, we would be able to configure this easily without actually modifying the code.)
If you are running on Mac
:
make test_mac
On Linux
, you would run:
make test_linux
We do not support Windows
operating system for now.
You would locate the output data file in data/
directory.
Behind the Scenes
This pipeline seems to be very user-friendly. So, one might ask, why do we bother to design and implement Xanthus
?
The truth is, we have done a lot of heavy-lifting for you behind the scenes. Let's take a closer look.
The Makefile
you run starts the vagrant
process, which would boot up two virtual machines, one server
and one
client
(now, take a look into Vagrantfile
).
The server
machine is provisioned by provision/server.sh
script.
It configures an FTP
and an HTTP
server and puts the malicious Debian
package in the FTP
server.
Of course, the user must provide the pipeline with the package.
We build the package ourselves in Kali Linux
with TheFatRat. You are free to use any tools at your disposal.
We also put the benign one in the HTTP
server to trick the user to download it.
The client
machine involves more operations.
First, unlike the server
machine that simply uses a Ubuntu 18.04
base operating system
(as seen in server.vm.box = "bento/ubuntu-18.04"
),
the client
machine uses our customized VirtualBox
box called michaelh/ubuncam
.
This box is built with the following specifications:
- It is built upon the original
Ubuntu 18.04
base box fromVagrant
. - It is installed with
CamFlow
as its provenance-capture system. - It downgrades
wget
to its desired version (v1.17
) that contains the vulnerability. - It can install
Debian
packages in the experiment.
Note that it is always desirable to package such a box and upload it to the VagrantCloud
so that we can
configure once and reuse many times.
One can always use a base box and configure the above specifications on-the-fly,
but it is not guaranteed that the configuration would work in the distant future.
For example, the link to download an older version of wget
may expire without notice.
Xanthus
allows users to either provide a customized virtual box or configure a base box through provisioning.
If an online configuration is provided, Xanthus
would automatically generate a customized box for the user
to prevent future re-configuration or possible failure in future configuration.
The client
machine runs the script in provision/attack
.
The user must provide such a script.
In our case, we automatically generate attack scripts using wget-attack-script-gen.py
.
Xanthus
allows users to provide logic to generate scripts or simply provide scripts to run during the experiment.
Installation
Add this line to your application's Gemfile:
gem 'xanthus'
And then execute:
$ bundle
Or install it yourself as:
$ gem install xanthus
Usage
xanthus version | return Xanthus version number.
xanthus dependencies | installation instructions for system dependencies.
xanthus init <project name> | initialize a new project.
xanthus run | run .xanthus file in the current folder.
Development
To add more features in Xanthus
,
clone this repository
git clone https://github.com/tfjmp/xanthus
cd xanthus
and build the gem by running
gem build xanthus
To install this gem locally on your machine, you can also run
gem install xanthus
After you add a new feature (and test it yourself), you can release a new version of Xanthus
.
First, please update the version number in lib/xanthus/version.rb
, tag the repository git tag -a x.x.x -m 'x.x.x'
, and push the tag git push --tags
.
Then you can run
gem push xanthus-x.x.x.gem
This last step publishes the gem at https://rubygems.org/gems/xanthus.
Contribution
We welcome bug reports and pull requests on GitHub at https://github.com/[USERNAME]/xanthus.
License
This gem is available as an open source project under the MIT License.
Issues and Solutions with VirtualBox
VirtualBox Guest Additions is not as well designed as we may hope. If you encountered the following error:
Vagrant was unable to mount VirtualBox shared folders. This is usually
because the filesystem "vboxsf" is not available. This filesystem is
made available via the VirtualBox Guest Additions and kernel module.
Please verify that these guest additions are properly installed in the
guest. This is not a bug in Vagrant and is usually caused by a faulty
Vagrant box. For context, the command attempted was:
mount -t vboxsf -o uid=900,gid=900 vagrant /vagrant
The error output from the command was:
/sbin/mount.vboxsf: mounting failed with the error: No such device
It is most likely the fault of incompatible GA between the VM and the host. Even though the script might have stop, the VM is still booted. You can vagrant ssh
into the VM and manually input the following two commands:
sudo apt-get -y install dkms build-essential linux-headers-$(uname -r) virtualbox-guest-additions-iso
sudo /opt/VBoxGuestAdditions*/init/vboxadd setup
After this, you may encounter this error:
...
==> default: Machine booted and ready!
[default] GuestAdditions seems to be installed (6.0.20) correctly, but not running.
bash: line 4: setup: command not found
==> default: Checking for guest additions in VM...
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
setup
Stdout from the command:
Stderr from the command:
bash: line 4: setup: command not found
Please add the following into the Vagrant script:
if Vagrant.has_plugin?("vagrant-vbguest")
config.vbguest.auto_update = false
end