MTR Monitor
In December 2017, Hetzner, our hosting provider for the Build Platform, had a major network incident that lasted for almost a whole week. Our users were rightly frustrated.
You can find more information about the incident in our public Post Mortem.
To prevent and monitor such situations in the future, we have set up a transatlantic monitoring system based on MTR reports and on curl-ing vendors that are important to our platform, such as GitHub and DockerHub. This system should report any issues in the network between Germany (Hetzner) and the US (GitHub, DockerHub).
This project is part of the effort to have readily available MTR reports before, during, and after incidents that we can send to Hetzner.
The project consists of two parts: an MTR monitor that continuously tests the quality of the network by running mtr from both sides of the Atlantic, and a CURL monitor that continuously tries to establish an HTTPS connection to the other side of the Atlantic.
MTR reports are generated every 5 minutes and uploaded to an S3 bucket. Results of CURL tests are displayed on the Platform — Network Grafana dashboard and are connected to PagerDuty based alerts.
Currently, we have the following routes covered:
- Germany(Hetzner) -> AWS US East 1 (part of Job Runner)
- Germany(Hetzner) -> AWS US West 1 (part of Job Runner)
- Germany(Hetzner) -> AWS US West 2 (part of Job Runner)
- AWS US East 1 -> Builder sb1 in Hetzner (standalone AWS instance with Docker container)
- AWS US West 1 -> Builder sb1 in Hetzner (standalone AWS instance with Docker container)
- AWS US West 2 -> Builder sb1 in Hetzner (standalone AWS instance with Docker container)
The tests from Germany are executed from every Builder machine, where this project is injected as a gem.
The DNS records of the US based MTR monitors are the following:
mtr-monitor.us-east-1.semaphoreci.com
mtr-monitor.us-west-1.semaphoreci.com
mtr-monitor.us-west-2.semaphoreci.com
These records point to the Load Balancer. If you want to SSH into the machines, use the following commands:
ssh [email protected]
ssh [email protected]
ssh [email protected]
To create a new MTR monitor follow this guide.
Location of the generated MTR reports
The MTR monitor generates MTR reports, stores them on the local machine, and uploads them to S3.
Local reports on the machine are located in the /var/log/mtr directory and are named using the following structure:
/var/log/mtr/<name>__<YYYY-MM-DD-HH-MM>__<host-ip-address>_to_<target-ip-address>.txt
For example, if you call your report hetzner-to-us-east-1 and run it at 2017-12-18 12:33:06, the log will be generated in:
/var/log/mtr/hetzner-to-us-east-1__2017-12-18-12-33__142-21-43-11_to_138-21-32-191.txt
On S3, the path will follow the same convention, but will use a nested directory structure:
s3://<bucket-name>/<name>/<YYYY-MM-DD-HH-MM>/<host-ip-address>_to_<target-ip-address>.txt
s3://<bucket-name>/hetzner-to-us-east-1/2017-12-18-12-33/142-21-43-11_to_138-21-32-191.txt
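The two path conventions above can be sketched in Ruby as follows. This is only an illustration: the helper names (ip_slug, timestamp_slug, local_report_path, s3_report_path) are hypothetical and are not taken from the gem, and the sketch assumes that the dots in IP addresses are replaced with dashes, as the examples suggest.

```ruby
# Hypothetical helpers, not part of the mtr_monitor gem's documented API.

# "142.21.43.11" -> "142-21-43-11" (assumed from the example paths)
def ip_slug(ip)
  ip.tr(".", "-")
end

# Year-month-day-hour-minute, matching the 2017-12-18-12-33 example
def timestamp_slug(time)
  time.strftime("%Y-%m-%d-%H-%M")
end

# Flat file name used on the local machine
def local_report_path(name, time, host_ip, target_ip, dir: "/var/log/mtr")
  "#{dir}/#{name}__#{timestamp_slug(time)}__#{ip_slug(host_ip)}_to_#{ip_slug(target_ip)}.txt"
end

# Nested layout used on S3
def s3_report_path(bucket, name, time, host_ip, target_ip)
  "s3://#{bucket}/#{name}/#{timestamp_slug(time)}/#{ip_slug(host_ip)}_to_#{ip_slug(target_ip)}.txt"
end

run_at = Time.new(2017, 12, 18, 12, 33, 6)
puts local_report_path("hetzner-to-us-east-1", run_at, "142.21.43.11", "138.21.32.191")
puts s3_report_path("my-private-bucket-name", "hetzner-to-us-east-1", run_at, "142.21.43.11", "138.21.32.191")
```

Seconds are dropped from the timestamp, so two runs within the same minute would map to the same path under this sketch.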
Report Name
The name of the report is used to group reports with the same purpose on S3 and on the local file system.
We use the following naming convention:
<from>-to-<destination>
Examples:
hetzner-to-github
us-east-1-to-hetzner-sb1
hetzner-to-us-west-2
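As a small illustration of the naming convention, the sketch below builds a report name and splits one back into its parts. Both helpers (report_name, split_report_name) are hypothetical and exist only for this example. The split assumes that the first "-to-" in the name separates source and destination.

```ruby
# Hypothetical helpers illustrating the <from>-to-<destination> convention.

def report_name(from, destination)
  "#{from}-to-#{destination}"
end

def split_report_name(name)
  # partition splits on the first occurrence of "-to-"
  from, separator, destination = name.partition("-to-")
  if separator.empty? || from.empty? || destination.empty?
    raise ArgumentError, "not a <from>-to-<destination> name: #{name}"
  end
  { from: from, destination: destination }
end

puts report_name("hetzner", "github")
p split_report_name("us-east-1-to-hetzner-sb1")
```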
Using MTR Monitor as a gem
The MTR monitor can be used as a gem and injected into existing Ruby applications. Currently, we inject the MTR monitor into Job Runner.
First, add the mtr_monitor gem to your Gemfile:
gem 'mtr_monitor'
Secondly, use the report class to generate a report:
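name = "google"
domain = "google.com"
mtr_target = domain # assumption: the host that mtr traces is the domain above
s3_bucket = "my-private-bucket-name" # change this
aws_access_key_id = "<KEY>"
aws_secret_access_key = "<KEY>"
# mtr_options, dig_ip_address, logdna_ingestion_key and logger are not
# defined in this snippet; set them in your application before this point.
options = {
  :name => name,
  :mtr_target => mtr_target,
  :s3_bucket => s3_bucket,
  :mtr_options => mtr_options,
  :aws_access_key_id => aws_access_key_id,
  :aws_secret_access_key => aws_secret_access_key,
  :dig_ip_address => dig_ip_address,
  :logdna_ingestion_key => logdna_ingestion_key,
  :logger => logger
}
MtrMonitor::Report.new(options).generate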
The above snippet will:
- generate an MTR report on your local system under the /var/log/mtr directory
- upload the report to the provided S3 bucket
- submit metrics via Watchman and generate a metric "pulse"
If you want to generate reports continuously, create a CRON task that will call the above code. To monitor if the CRON task is running as expected, you should set up an alert on Grafana based on the "pulse" metric.
The pulse metric has the format network.mtr.pulse and is tagged with the hostname of the server where the MTR monitor is running and with the name of the metric.
MTR hops are also submitted to Grafana. Based on these metrics you can observe the packet loss and the avg, best, and worst latency on the network. For more information, read the code in lib/mtr_monitor/metrics.rb.
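To give an idea of where the per-hop numbers come from, here is a minimal sketch that parses the text output of mtr --report into per-hop values. It is not the code from lib/mtr_monitor/metrics.rb. The Hop struct, the HOP_LINE pattern, the parse_mtr_report helper, and the sample report are all made up for illustration, and the sketch assumes the default report columns (Loss%, Snt, Last, Avg, Best, Wrst, StDev).

```ruby
# Illustrative only: field names and the sample report are assumptions.
Hop = Struct.new(:index, :host, :loss, :sent, :last, :avg, :best, :worst, :stdev,
                 keyword_init: true)

HOP_LINE = /
  ^\s*(\d+)\.\|--\s+   # hop number
  (\S+)\s+             # host name or IP address
  ([\d.]+)%?\s+        # packet loss in percent
  (\d+)\s+             # packets sent
  ([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)
/x

# Returns one Hop per hop line; header lines are skipped.
def parse_mtr_report(text)
  text.each_line.map do |line|
    m = HOP_LINE.match(line)
    next unless m
    Hop.new(index: m[1].to_i, host: m[2], loss: m[3].to_f, sent: m[4].to_i,
            last: m[5].to_f, avg: m[6].to_f, best: m[7].to_f,
            worst: m[8].to_f, stdev: m[9].to_f)
  end.compact
end

SAMPLE_REPORT = <<~REPORT
  Start: 2017-12-18T12:33:06+0000
  HOST: builder                     Loss%   Snt   Last   Avg  Best  Wrst StDev
    1.|-- 10.0.0.1                   0.0%    10    0.4   0.5   0.3   0.9   0.2
    2.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
    3.|-- 192.0.2.10                10.0%    10   90.1  91.2  89.7  95.3   1.6
REPORT

parse_mtr_report(SAMPLE_REPORT).each do |hop|
  puts "hop #{hop.index} #{hop.host}: loss=#{hop.loss}% avg=#{hop.avg}ms worst=#{hop.worst}ms"
end
```

Hops that never answer show up as ??? with 100% loss. That alone does not necessarily indicate a problem, because many routers simply do not reply to probes.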
Bump gem version
- Change version in lib/mtr_monitor/version.rb
- Run bundle
- Push gem to RubyGems manually or let Semaphore do it for you automatically
Update MTR monitor in Job Runner
MTR monitor is run in the mtr_report cron task within Job Runner.
To update the version used in Job Runner:
- Run bundle update mtr_monitor.
- Make sure the code using the gem corresponds to the new version.
- Try to run the task locally.
- Deploy to staging sec1 and check if it works properly.
- Finally, deploy to production build servers.
Using MTR Monitor as a standalone Docker container
The MTR monitor can be used as a standalone Docker container. This is our current approach for monitors that are hitting Germany from the United States.
By default, the containers running on us-east-1, us-west-1, and us-west-2 are automatically deployed on every merge into master in this repository.
The container on the EC2 machines triggers an MTR report generation every 5 minutes. Every time a report is generated, the following is executed:
- a new MTR report is generated on your local system under the /var/log/mtr directory
- the report is uploaded to the provided S3 bucket
- metrics are submitted via Watchman and a pulse is generated
- the MTR cleaner is started and removes all reports older than 2 weeks from the local system
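The cleanup step can be pictured with the following sketch. It is not the gem's actual cleaner: the clean_old_reports helper is hypothetical, and it assumes that a report's age is judged by the file's modification time rather than by the timestamp in its name.

```ruby
TWO_WEEKS = 14 * 24 * 60 * 60 # seconds

# Hypothetical cleaner: deletes *.txt reports in dir whose mtime is older
# than max_age seconds and returns the list of deleted paths.
def clean_old_reports(dir = "/var/log/mtr", max_age: TWO_WEEKS, now: Time.now)
  stale = Dir.glob(File.join(dir, "*.txt")).select do |path|
    File.file?(path) && (now - File.mtime(path)) > max_age
  end
  stale.each { |path| File.delete(path) }
end
```

Passing now as a parameter keeps the function easy to exercise against a temporary directory without touching /var/log/mtr.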
To monitor if the CRON task is running as expected, you should set up an alert on Grafana based on the "pulse" metric.
The pulse metric has the format network.mtr.pulse and is tagged with the hostname of the server where the MTR monitor is running and with the name of the metric.
MTR hops are also submitted to Grafana. Based on these metrics you can observe the packet loss and the avg, best, and worst latency on the network. For more information, read the code in lib/mtr_monitor/metrics.rb.
This is deployed as a docker-compose group of Docker images. One image generates the MTR reports, while the other one exposes an nginx server that responds yes to incoming requests.
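A probe against the nginx endpoint could look roughly like the sketch below. It is only an illustration of the idea: the reachable? helper is hypothetical, and the exact success criteria (a 2xx status and a body equal to yes, ignoring surrounding whitespace) and the 5 second timeout are assumptions rather than the behaviour of the real CURL monitor.

```ruby
require "net/http"
require "uri"

# Hypothetical probe: true when the endpoint answers with a 2xx status and
# the body "yes"; false on any connection, TLS, or timeout error.
def reachable?(url, timeout: 5)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess) && response.body.to_s.strip == "yes"
rescue StandardError
  false
end
```

For example, reachable?("https://mtr-monitor.us-east-1.semaphoreci.com/") would return true or false depending on whether the connection can be established from the calling machine. The path / is an assumption.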