Grid'5000 tutorial
This tutorial aims at showing how Ruby-Cute can be used to help the scripting of an experiment in the context of the Grid'5000 testbed. The programming language used, as you would expect, is Ruby. We will use a powerful console debugger called Pry which offers several functionalities that can be used for the step-by-step scripting of complex experiments.
In this tutorial the first two sections are dedicated to the installation and basic use of Ruby-Cute. You can skip those sections if you are already acquainted with Ruby-Cute. The other sections show how to use Ruby-cute for scripting complex experiments. This is shown through three examples:
- Infiniband performance test: the experiment illustrates how to perform a reservation and how to execute commands on a reserved node, and it explains several pry commands that will help you write your experiment.
- Running NAS benchmarks in Grid'5000: you will get acquainted with parallel execution using simple SSH or TakTuk.
- Performing network measures within a reserved VLAN: you will learn how to reserve a routed VLAN and to query the G5K metrology API. For this particular experiment we will query the Kwapi service.
The aforementioned experiments are independent, so you can perform them in any order. However, some concepts are explained only in specific sections.
Installing and preparing Ruby-Cute
The installation procedure is shown in Ruby-Cute install.
After this step you will normally have the ruby-cute and pry gems installed.
Before using Ruby-Cute you have to create the following file:
$ cat > ~/.grid5000_api.yml << EOF
uri: https://api.grid5000.fr/
version: 3.0
EOF
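As a quick sanity check, the configuration can be parsed with Ruby's standard YAML library. The snippet below is only a sketch: it parses an inline copy of the two settings rather than the real file, and it makes no claim about how Ruby-Cute itself reads the configuration.

```ruby
require 'yaml'

# Inline copy of the two settings stored in ~/.grid5000_api.yml.
config_text = <<~CONF
  uri: https://api.grid5000.fr/
  version: 3.0
CONF

config = YAML.safe_load(config_text)

# Both keys should be present; note that YAML reads 3.0 as a Float.
puts "uri:     #{config['uri']}"
puts "version: #{config['version']} (#{config['version'].class})"
```

To check the real file, replace config_text with File.read(File.expand_path('~/.grid5000_api.yml')).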
You will also need 2 gems to do this tutorial:
- net-ssh
- net-scp
If you want to use TakTuk, you will need to have the taktuk executable on the machine where you are running the Ruby script.
Getting acquainted with the pry console
After installing the ruby-cute and pry gems, you can launch a pry console with ruby-cute loaded by typing:
$ cute
This will open a pry console:
[1] pry(main)>
In this console, we can evaluate Ruby code, execute shell commands, consult documentation, explore classes and more. The variable $g5k is available which can be used to access the Grid'5000 API through the G5K Module. For example, let's request the name of the sites available in Grid'5000.
[2] pry(main)> $g5k.site_uids()
=> ["grenoble", "lille", "luxembourg", "lyon", "nancy", "nantes", "rennes", "sophia"]
We can consult the name of the clusters available in a specific site.
[3] pry(main)> $g5k.cluster_uids("grenoble")
=> ["edel", "genepi"]
It is possible to execute shell commands, however all commands have to be prefixed with a dot ".". For example we could generate a pair of SSH keys using:
[7] pry(main)> .ssh-keygen -b 4096 -N "" -t rsa -f ~/my_ssh_jobkey
Another advantage is the possibility of exploring the loaded Ruby modules. Let's explore the Cute module.
[8] pry(main)> cd Cute
[9] pry(Cute):1> ls
constants: Bash Execute G5K TakTuk VERSION
locals: _ __ _dir_ _ex_ _file_ _in_ _out_ _pry_
We can see that the Cute module is composed of other helpful modules such as: G5K, TakTuk, etc. To quit the Cute namespace type:
[10] pry(main)> cd
Let's explore the methods defined in the G5K Module, so you can see which methods can be used with the $g5k variable.
[11] pry(main)> ls Cute::G5K::API
Object.methods: yaml_tag
Cute::G5K::API#methods:
check_deployment deploy environments get_job get_subnets get_vlan_nodes nodes_status reserve site_status wait_for_deploy
cluster_uids deploy_status g5k_user get_jobs get_switch logger release rest site_uids wait_for_job
clusters environment_uids get_deployments get_my_jobs get_switches logger= release_all site sites
We can access the respective YARD documentation of a given method by typing:
[12] pry(main)> show-doc Cute::G5K::API#deploy
In the following section we will see how pry can be used to set up an experiment step by step using Ruby-Cute.
Pry can be customized by creating the file .pryrc. We will create this file with the following content in order to choose our preferred editor:
$ cat > ~/.pryrc << EOF
Pry.config.editor = "emacs"
EOF
First experiment: Infiniband performance test
Here, we will use Ruby-cute to carry out an experiment. In this experiment, we will ask for two nodes equipped with infiniband and then, we will perform some performance tests using a network benchmark called NETPIPE. NETPIPE performs simple ping-pong tests, bouncing messages of increasing size between two processes. Message sizes are chosen at regular intervals, and with slight perturbations, to provide a complete test of the communication system. For this particular experiment we have the following requirements:
- A pair of SSH keys
- Use of standard environment (no deploy)
- Two nodes connected with infiniband
- MPI benchmark NETPIPE
- An MPI runtime (OpenMPI or MPICH)
We will do it interactively using pry.
Let's create a directory for keeping all the scripts that we will write throughout the tutorial.
$ mkdir ruby-cute-tutorial
Then, we launch the pry console from this directory:
$ cd ruby-cute-tutorial
$ cute
First, let's find the sites that offer Infiniband interconnection.
For that, we will write a small script from the pry console using the edit command.
[13] pry(main)> edit -n find_infiniband.rb
This will open a new file with your preferred editor. Here we will put the following Ruby script:
sites = $g5k.site_uids
sites_infiniband = []
sites.each do |site|
sites_infiniband.push(site) unless $g5k.get_switches(site).select{ |t| t["model"] == "Infiniband" }.empty?
end
Then, we execute find_infiniband.rb using the play command, which runs the script line by line in the context of the Pry session.
[21] pry(main)> play find_infiniband.rb
We can observe that the variable sites_infiniband is now defined, telling us that the Grenoble and Nancy sites offer Infiniband interconnection.
[22] pry(main)> sites_infiniband
=> ["grenoble", "nancy"]
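The filtering logic of find_infiniband.rb can be tried without access to the testbed. The following is a minimal sketch in which a local Hash (an assumption made for illustration, with invented switch uids) stands in for the values returned by $g5k.get_switches; it uses any? instead of select followed by empty?, which expresses the same test.

```ruby
# Hypothetical stand-in for what $g5k.get_switches(site) returns:
# an array of switch descriptions, each with a "model" key.
switches_by_site = {
  "grenoble" => [{ "uid" => "sw-eth", "model" => "Ethernet" },
                 { "uid" => "sw-ib",  "model" => "Infiniband" }],
  "lyon"     => [{ "uid" => "sw-eth", "model" => "Ethernet" }],
  "nancy"    => [{ "uid" => "sw-ib",  "model" => "Infiniband" }]
}

# Keep the sites where at least one switch has the requested model.
def sites_with_model(switches_by_site, model)
  switches_by_site.select do |_site, switches|
    switches.any? { |switch| switch["model"] == model }
  end.keys
end

sites_infiniband = sites_with_model(switches_by_site, "Infiniband")
p sites_infiniband
```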
Then, create a pair of SSH keys (Necessary for OARSSH):
[23] pry(main)> .ssh-keygen -b 4096 -N "" -t rsa -f ~/my_ssh_jobkey
We send the generated keys to the chosen site (the SSH configuration has to be set up for the following command to work, see SSH Configuration for more information):
[24] pry(main)> .scp ~/my_ssh* nancy:~/
Now that we have found the sites, let's submit a job. You can choose between the Grenoble and Nancy sites. If you take a look at Monika, you will see that in Nancy we can use the OAR property 'ib_rate=20' and in Grenoble 'ib_rate=10'. More simply, you can use the property ib_count=1, which will give you nodes with Infiniband whatever the rate.
Given that the MPI benchmark runs just one process on each machine, we will in reality need just one core per machine. We will use the OAR syntax to ask for one core on each of two different nodes with ib_rate=20 in Nancy.
[25] pry(main)> job = $g5k.reserve(:site => "nancy", :resources => "{ib_rate=20}/nodes=2/core=1",:walltime => '01:00:00', :keys => "~/my_ssh_jobkey" )
2015-12-04 14:07:31.370 => Reserving resources: {ib_rate=20}/nodes=2/core=1,walltime=01:00 (type: ) (in nancy)
2015-12-04 14:07:41.358 => Waiting for reservation 692665
2015-12-04 14:07:41.444 => Reservation 692665 should be available at 2015-12-04 14:07:34 +0100 (0 s)
2015-12-04 14:07:41.444 => Reservation 692665 ready
A hash is returned containing all the information about the job that we have just submitted.
[58] pry(main)> job
=> {"uid"=>692665,
"user_uid"=>"cruizsanabria",
"user"=>"cruizsanabria",
"walltime"=>3600,
"queue"=>"default",
"state"=>"running",
"project"=>"default",
"name"=>"rubyCute job",
"types"=>[],
"mode"=>"PASSIVE",
"command"=>"sleep 3600",
"submitted_at"=>1449234452,
"scheduled_at"=>1449234454,
"started_at"=>1449234454,
"message"=>"FIFO scheduling OK",
"properties"=>"(maintenance = 'NO') AND production = 'NO'",
"directory"=>"/home/cruizsanabria",
"events"=>[],
"links"=>
[{"rel"=>"self", "href"=>"/3.0/sites/nancy/jobs/692665", "type"=>"application/vnd.grid5000.item+json"},
{"rel"=>"parent", "href"=>"/3.0/sites/nancy", "type"=>"application/vnd.grid5000.item+json"}],
"resources_by_type"=>{"cores"=>["graphene-67.nancy.grid5000.fr", "graphene-45.nancy.grid5000.fr"]},
"assigned_nodes"=>["graphene-67.nancy.grid5000.fr", "graphene-45.nancy.grid5000.fr"]}
An important piece of information is the list of nodes that have been assigned. Let's put it in another variable:
[60] pry(main)> nodes = job["assigned_nodes"]
=> ["graphene-67.nancy.grid5000.fr", "graphene-45.nancy.grid5000.fr"]
Then, we create a file with the name of the reserved machines:
[62] pry(main)> machine_file = Tempfile.open('machine_file')
=> #<File:/tmp/machine_file20151204-28888-1ll3brs>
[64] pry(main)> nodes.each{ |node| machine_file.puts node }
=> ["graphene-67.nancy.grid5000.fr", "graphene-45.nancy.grid5000.fr"]
[66] pry(main)> machine_file.close
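Outside the cute console, the same steps need an explicit require of the tempfile library. The sketch below uses hypothetical node names and reads the file back to confirm that it contains one host per line, which is the layout mpirun expects for a machine file.

```ruby
require 'tempfile'

# Hypothetical node names; in the tutorial they come from job["assigned_nodes"].
nodes = ["node-1.example.org", "node-2.example.org"]

machine_file = Tempfile.new('machine_file')
nodes.each { |node| machine_file.puts(node) }
machine_file.close # flushes the content; the file stays on disk until unlinked

# Read the file back: one host name per line.
machine_file_lines = File.readlines(machine_file.path, chomp: true)
puts machine_file_lines

machine_file.unlink # remove the temporary file once it is no longer needed
```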
We will need to set up SSH options for OAR. We can do it with the OARSSHopts helper class provided by ruby-cute:
[6] pry(main)> grid5000_opt = Cute::OARSSHopts.new(:keys => "~/my_ssh_jobkey")
=> {:user=>"oar", :keys=>"~/my_ssh_jobkey", :port=>6667}
Now, we can communicate with our nodes using SSH. Let's send the machine file using SCP.
From the pry console, let's load the SCP module to transfer files:
[12] pry(main)> require 'net/scp'
Then, copy-paste the following code in pry console:
Net::SCP.start(nodes.first, "oar", grid5000_opt) do |scp|
scp.upload! machine_file.path, "/tmp/machine_file"
end
The previous code sends the machine file to the first node. We can check this by opening an SSH connection to the node. Here, to illustrate the use of temporary files, let's type the following:
[6] pry(main)> edit -t
and copy-paste the following code:
Net::SSH.start(nodes.first, "oar", grid5000_opt) do |ssh|
puts ssh.exec!("cat /tmp/machine_file")
end
If we save and quit the editor, the code will be evaluated in the Pry context, which will generate the following output:
[12] pry(main)> edit -t
#<Net::SSH::Connection::Channel:0x00000001247150>
graphene-80.nancy.grid5000.fr
graphene-81.nancy.grid5000.fr
=> nil
We confirmed the existence of the file in the first reserved node. Now let's download, compile and execute the benchmark. Create a Ruby file called netpipe:
[12] pry(main)> edit -n netpipe.rb
With the following content:
Net::SSH.start(nodes.first, "oar", grid5000_opt) do |ssh|
netpipe_url = "https://fossies.org/linux/privat/NetPIPE-3.7.2.tar.gz"
ssh.exec!("mkdir -p netpipe_exp")
ssh.exec!("wget -O ~/netpipe_exp/NetPIPE.tar.gz #{netpipe_url}")
ssh.exec!("cd netpipe_exp && tar -zvxf NetPIPE.tar.gz")
ssh.exec!("cd netpipe_exp/NetPIPE-3.7.2 && make mpi")
puts ssh.exec!("mpirun --mca plm_rsh_agent \"oarsh\" -machinefile /tmp/machine_file ~/netpipe_exp/NetPIPE-3.7.2/NPmpi")
end
Then, execute the created script:
[16] pry(main)> play netpipe.rb
#<Net::SSH::Connection::Channel:0x000000021679f0>
Permission denied (publickey,keyboard-interactive).
--------------------------------------------------------------------------
A daemon (pid 4615) died unexpectedly with status 255 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
=> nil
We got an error related to the SSH keys because oarsh cannot find the appropriate key files.
We can fix this problem by prefixing the mpirun command with export OAR_JOB_KEY_FILE=~/my_ssh_jobkey.
Now the code will look like this:
Net::SSH.start(nodes.first, "oar", grid5000_opt) do |ssh|
netpipe_url = "https://fossies.org/linux/privat/NetPIPE-3.7.2.tar.gz"
ssh.exec!("mkdir -p netpipe_exp")
ssh.exec!("wget -O ~/netpipe_exp/NetPIPE.tar.gz #{netpipe_url}")
ssh.exec!("cd netpipe_exp && tar -zvxf NetPIPE.tar.gz")
ssh.exec!("cd netpipe_exp/NetPIPE-3.7.2 && make mpi")
puts ssh.exec!("export OAR_JOB_KEY_FILE=~/my_ssh_jobkey;mpirun --mca plm_rsh_agent \"oarsh\" -machinefile /tmp/machine_file ~/netpipe_exp/NetPIPE-3.7.2/NPmpi")
end
After running the script, the output of the benchmark is shown in the pry console:
[34] pry(main)> play netpipe.rb
#<Net::SSH::Connection::Channel:0x00000002edc6d0>
0: adonis-9
1: adonis-10
Now starting the main loop
0: 1 bytes 32103 times --> 4.58 Mbps in 1.67 usec
1: 2 bytes 59994 times --> 9.22 Mbps in 1.65 usec
2: 3 bytes 60440 times --> 13.79 Mbps in 1.66 usec
3: 4 bytes 40180 times --> 18.34 Mbps in 1.66 usec
4: 6 bytes 45076 times --> 27.07 Mbps in 1.69 usec
5: 8 bytes 29563 times --> 36.16 Mbps in 1.69 usec
6: 12 bytes 37023 times --> 53.84 Mbps in 1.70 usec
7: 13 bytes 24500 times --> 57.97 Mbps in 1.71 usec
8: 16 bytes 26977 times --> 71.61 Mbps in 1.70 usec
9: 19 bytes 32995 times --> 84.65 Mbps in 1.71 usec
10: 21 bytes 36882 times --> 93.27 Mbps in 1.72 usec
11: 24 bytes 38808 times --> 106.69 Mbps in 1.72 usec
12: 27 bytes 41271 times --> 119.77 Mbps in 1.72 usec
The latency is given by the last column for a 1 byte message; the maximum throughput is given by the last line. We can try to perform the same test without using Infiniband, in order to observe the difference in bandwidth and latency:
Net::SSH.start(nodes.first, "oar", grid5000_opt) do |ssh|
netpipe_url = "https://fossies.org/linux/privat/NetPIPE-3.7.2.tar.gz"
ssh.exec!("mkdir -p netpipe_exp")
ssh.exec!("wget -O ~/netpipe_exp/NetPIPE.tar.gz #{netpipe_url}")
ssh.exec!("cd netpipe_exp && tar -zvxf NetPIPE.tar.gz")
ssh.exec!("cd netpipe_exp/NetPIPE-3.7.2 && make mpi")
mpi_command = "export OAR_JOB_KEY_FILE=~/my_ssh_jobkey;"
mpi_command+= "mpirun --mca plm_rsh_agent \"oarsh\" --mca btl self,sm,tcp -machinefile /tmp/machine_file ~/netpipe_exp/NetPIPE-3.7.2/NPmpi"
ssh.exec(mpi_command)
end
We can slightly modify the previous script to write the result into a file.
We need to use ssh.exec! to capture the output of the commands.
Net::SSH.start(nodes.first, "oar", grid5000_opt) do |ssh|
netpipe_url = "https://fossies.org/linux/privat/NetPIPE-3.7.2.tar.gz"
ssh.exec!("mkdir -p netpipe_exp")
ssh.exec!("wget -O ~/netpipe_exp/NetPIPE.tar.gz #{netpipe_url}")
ssh.exec!("cd netpipe_exp && tar -zvxf NetPIPE.tar.gz")
ssh.exec!("cd netpipe_exp/NetPIPE-3.7.2 && make mpi")
File.open("output_netpipe.txt", 'w') do |f|
f.puts ssh.exec!("export OAR_JOB_KEY_FILE=~/my_ssh_jobkey; mpirun --mca plm_rsh_agent \"oarsh\" -machinefile /tmp/machine_file ~/netpipe_exp/NetPIPE-3.7.2/NPmpi")
end
end
We can check the results by doing:
[16] pry(main)> .cat output_netpipe.txt
0: adonis-9
1: adonis-10
Now starting the main loop
0: 1 bytes 31441 times --> 4.62 Mbps in 1.65 usec
1: 2 bytes 60550 times --> 9.24 Mbps in 1.65 usec
2: 3 bytes 60580 times --> 13.87 Mbps in 1.65 usec
3: 4 bytes 40404 times --> 18.39 Mbps in 1.66 usec
4: 6 bytes 45183 times --> 27.22 Mbps in 1.68 usec
5: 8 bytes 29729 times --> 36.17 Mbps in 1.69 usec
6: 12 bytes 37039 times --> 54.01 Mbps in 1.70 usec
7: 13 bytes 24578 times --> 58.40 Mbps in 1.70 usec
8: 16 bytes 27177 times --> 71.87 Mbps in 1.70 usec
9: 19 bytes 33116 times --> 85.00 Mbps in 1.71 usec
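Since the latency is read from the 1 byte line and the maximum throughput from the highest Mbps value, both numbers can be extracted from the saved output with a few lines of Ruby. This is a minimal sketch that parses three lines copied from the output above; the regular expression is an assumption based on the line layout shown in this tutorial, not on a documented NetPIPE format.

```ruby
# A few lines copied from the NetPIPE output shown above.
sample_output = <<~OUT
  0: adonis-9
  1: adonis-10
  Now starting the main loop
  0: 1 bytes 31441 times --> 4.62 Mbps in 1.65 usec
  1: 2 bytes 60550 times --> 9.24 Mbps in 1.65 usec
  9: 19 bytes 33116 times --> 85.00 Mbps in 1.71 usec
OUT

# Captures: message size (bytes), throughput (Mbps), time (usec).
line_pattern = /^\s*\d+:\s+(\d+) bytes\s+\d+ times -->\s+([\d.]+) Mbps in\s+([\d.]+) usec/

# Lines that do not match (host names, banner) are dropped by compact.
measurements = sample_output.each_line.map { |line| line_pattern.match(line) }.compact.map do |m|
  { bytes: m[1].to_i, mbps: m[2].to_f, usec: m[3].to_f }
end

latency_usec = measurements.find { |m| m[:bytes] == 1 }[:usec]
max_mbps     = measurements.map { |m| m[:mbps] }.max

puts "latency (1 byte): #{latency_usec} usec"
puts "max throughput:   #{max_mbps} Mbps"
```

To process a real run, replace sample_output with File.read("output_netpipe.txt").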
Once finished, we could release the job:
[34] pry(main)> $g5k.release(job)
=> ""
At the end of the experiment you can use the hist command to see what you have done so far.
This can help you assemble everything into a single script.
[22] pry(main)> hist
1: edit -n find_infiniband.rb
3: play find_infiniband.rb
4: sites_infiniband
5: .ls ~/my_ssh*
6: .scp ~/my_ssh* nancy:~/
7: job = $g5k.reserve(:site => "nancy", :resources => "{ib_rate=20}/nodes=2/core=1",:walltime => '01:00:00', :keys => "~/my_ssh_jobkey" )
8: nodes = job["assigned_nodes"]
9: machine_file = Tempfile.open('machine_file')
10: nodes.each{ |node| machine_file.puts node }
11: machine_file.close
12: grid5000_opt = OARSSHopts.new(:keys => "~/my_ssh_jobkey")
13: require 'net/scp'
14: Net::SCP.start(nodes.first, "oar", grid5000_opt) do |scp|
15: scp.upload! machine_file.path, "/tmp/machine_file"
16: end
17: edit -n netpipe.rb
18: play netpipe.rb
19: edit -n netpipe.rb
20: play netpipe.rb
Running NAS benchmarks in Grid'5000: getting acquainted with parallel command execution
In this experiment, we will run the NAS benchmarks in Grid'5000 and we will script a scalability test for one of the benchmarks. The NAS Parallel Benchmarks (NPB) are a set of benchmarks targeting performance evaluation of highly parallel supercomputers. These benchmarks gather parallel kernels and three simulated applications. They mimic the workload of large scale computational fluid dynamics applications. The objective of this tutorial is to perform a scalability test of the NAS benchmarks. We are going to study how increasing the number of computing units used during the computation reduces the execution time of the application. This experiment has the following requirements:
- 2 or 4 nodes from any Grid'5000 site
- Use of standard environment (no deploy)
- NAS MPI benchmark
- An MPI runtime (OpenMPI or MPICH)
If you have not created a directory for the tutorial, create it and launch the pry console from there:
$ mkdir ruby-cute-tutorial
$ cd ruby-cute-tutorial
$ cute
First, let's find the necessary nodes for our experiment. As resources in Grid'5000 can be very busy, we are going to script a loop that will explore all Grid'5000 sites and find the first site that can provide us with the required nodes.
Open an editor from the pry console:
[5] pry(main)> edit -n find_nodes.rb
and type the following code:
sites = $g5k.site_uids
job = {}
sites.each do |site|
job = $g5k.reserve(:site => site, :cluster => 1, :nodes => 4, :wait => false, :walltime => "01:00:00", :type => :allow_classic_ssh)
begin
job = $g5k.wait_for_job(job, :wait_time => 60)
puts "Nodes assigned #{job['assigned_nodes']}"
break
rescue Cute::G5K::EventTimeout
puts "We waited too long in site #{site} let's release the job and try in another site"
$g5k.release(job)
end
end
Here, we use the method site_uids to get all available sites.
Then, a job is submitted using the method reserve.
We ask for 4 nodes in a given site and we set the parameter wait to false, which makes the method return immediately.
Then, we use wait_for_job with a timeout (the wait_time parameter). If the timeout is reached, a Cute::G5K::EventTimeout exception is raised,
which we catch in order to release the submitted job and try again in another site.
Let's execute the script using the play command:
[8] pry(main)> play find_nodes.rb
2015-12-08 16:50:35.582 => Reserving resources: /nodes=4,walltime=01:00 (type: ) (in grenoble)
2015-12-08 16:50:36.465 => Waiting for reservation 1702197
2015-12-08 16:50:41.587 => Reservation 1702197 should be available at 2015-12-08 16:50:37 +0100 (0 s)
2015-12-08 16:50:41.587 => Reservation 1702197 ready
Nodes assigned ["edel-10.grenoble.grid5000.fr", "edel-11.grenoble.grid5000.fr", "edel-12.grenoble.grid5000.fr", "edel-13.grenoble.grid5000.fr"]
=> nil
The variable job is updated in the Pry context.
Up to version 0.10 of ruby-cute, when no job types were specified, the type allow_classic_ssh was activated which enabled the access via default SSH to the reserved machines. You now have to specify it manually.
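The "try each site until one answers in time" pattern of find_nodes.rb can be sketched without the Grid'5000 API. In the sketch below, try_site is a hypothetical stand-in for reserve followed by wait_for_job, and the standard Timeout::Error plays the role of Cute::G5K::EventTimeout; the list of busy sites and the node names are invented for the illustration.

```ruby
require 'timeout'

# Hypothetical stand-in for reserve + wait_for_job: raises Timeout::Error
# (playing the role of Cute::G5K::EventTimeout) when a site is busy.
def try_site(site, busy_sites)
  raise Timeout::Error, "no free nodes in #{site}" if busy_sites.include?(site)
  { "site" => site, "assigned_nodes" => ["node-1.#{site}.example.org"] }
end

# Return the first job obtained, or nil when every site timed out.
def first_available(sites, busy_sites)
  sites.each do |site|
    begin
      return try_site(site, busy_sites)
    rescue Timeout::Error => e
      puts "Skipping #{site}: #{e.message}"
    end
  end
  nil
end

job = first_available(["grenoble", "lille", "lyon", "nancy"], ["grenoble", "lille"])
puts "Got nodes in #{job['site']}: #{job['assigned_nodes']}"
```

Returning nil when no site succeeds makes that case explicit. In find_nodes.rb, by contrast, job would still hold the last job (already released) if every site timed out, so it may be worth checking for that case before going further.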
Let's explore the available modules for the parallel execution of commands in several remote machines. The following example shows how to use the TakTuk module.
nodes = job["assigned_nodes"]
Cute::TakTuk.start(nodes) do |tak|
tak.exec("hostname")
end
Which generates as output:
edel-10.grenoble.grid5000.fr/output/0:edel-10.grenoble.grid5000.fr
edel-12.grenoble.grid5000.fr/output/0:edel-12.grenoble.grid5000.fr
edel-13.grenoble.grid5000.fr/output/0:edel-13.grenoble.grid5000.fr
edel-10.grenoble.grid5000.fr/status/0:0
edel-11.grenoble.grid5000.fr/output/0:edel-11.grenoble.grid5000.fr
edel-12.grenoble.grid5000.fr/status/0:0
edel-13.grenoble.grid5000.fr/status/0:0
edel-11.grenoble.grid5000.fr/status/0:0
The following example shows how to use the Net::SSH::Multi module.
Net::SSH::Multi.start do |session|
nodes.each{ |node| session.use node }
session.exec("hostname")
end
If we type that into pry console we will get:
[edel-10.grenoble.grid5000.fr] edel-10.grenoble.grid5000.fr
[edel-11.grenoble.grid5000.fr] edel-11.grenoble.grid5000.fr
[edel-12.grenoble.grid5000.fr] edel-12.grenoble.grid5000.fr
[edel-13.grenoble.grid5000.fr] edel-13.grenoble.grid5000.fr
It is possible to capture the output of the executed command by adding ! to exec method. For example, let's find the number of cores available in the reserved machines:
results = {}
Net::SSH::Multi.start do |session|
nodes.each{ |node| session.use node }
results = session.exec!("nproc")
end
The exec! method will return a Hash that looks like this:
[27] pry(main)> results
=> {"edel-10.grenoble.grid5000.fr"=>{:stdout=>"8", :status=>0},
"edel-11.grenoble.grid5000.fr"=>{:stdout=>"8", :status=>0},
"edel-12.grenoble.grid5000.fr"=>{:stdout=>"8", :status=>0},
"edel-13.grenoble.grid5000.fr"=>{:stdout=>"8", :status=>0}}
Where the Hash keys are the names of the machines and the values correspond to the output of the commands. Then, we can easily get the total number of cores by typing:
[11] pry(main)> num_cores = results.values.inject(0){ |sum, item| sum+item[:stdout].to_i}
=> 32
Another way to do that is to use the information given by the G5K API regarding the submitted job:
[36] pry(main)> job["resources_by_type"]["cores"].length
=> 32
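The aggregation above can be reproduced offline from the Hash shown earlier. This sketch copies that Hash literally and uses sum with a block, which is equivalent to the inject call; it also lists the hosts whose command returned a non-zero status, a check that is not part of the original steps.

```ruby
# Hash copied from the exec!("nproc") result shown above.
results = {
  "edel-10.grenoble.grid5000.fr" => { :stdout => "8", :status => 0 },
  "edel-11.grenoble.grid5000.fr" => { :stdout => "8", :status => 0 },
  "edel-12.grenoble.grid5000.fr" => { :stdout => "8", :status => 0 },
  "edel-13.grenoble.grid5000.fr" => { :stdout => "8", :status => 0 }
}

# Total number of cores: convert each stdout string to an Integer and add them up.
num_cores = results.values.sum { |result| result[:stdout].to_i }

# Hosts where the command failed (empty here).
failed_hosts = results.reject { |_host, result| result[:status] == 0 }.keys

puts "total cores: #{num_cores}"
puts "failed hosts: #{failed_hosts.inspect}"
```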
Let's create a machine file that we will need later on for our experiments:
machine_file = Tempfile.open('machine_file')
nodes.each{ |node| machine_file.puts node }
machine_file.close
After creating the machine file, we need to send it to the other machines. Additionally, we need to download and compile the benchmark. Let's write a small script that helps us perform the aforementioned tasks. Open the editor in the pry console:
[17] pry(main)> edit -n NAS-expe.rb
Then, type:
SOURCE_NAS = "http://public.rennes.grid5000.fr/~ddelabroye/NPB3.3.tar"
`wget #{SOURCE_NAS} -O /tmp/NAS.tar`
Cute::TakTuk.start(nodes) do |tak|
tak.put(machine_file.path, "machine_file")
tak.put("/tmp/NAS.tar", "/tmp/NAS.tar")
tak.exec!("cd /tmp/; tar -xvf NAS.tar")
puts tak.exec!("make lu NPROCS=#{num_cores} CLASS=A MPIF77=mpif77 -C /tmp/NPB3.3/NPB3.3-MPI/")
end
We can observe in the previous snippet of code that TakTuk module can be used to transfer files to several remote nodes. put and exec methods can be used in the same block. Finally, execute the script:
[102] pry(main)> play NAS-expe.rb
We can check if each node has the generated binary and the machine file:
Net::SSH::Multi.start do |session|
nodes.each{ |node| session.use node }
session.exec("ls /tmp/NPB3.3/NPB3.3-MPI/bin/")
session.exec("ls ~/machine*")
end
After typing it into the pry console we will get something like:
[genepi-27.grenoble.grid5000.fr] lu.A.32
[genepi-29.grenoble.grid5000.fr] lu.A.32
[genepi-29.grenoble.grid5000.fr] /home/cruizsanabria/machine_file
[genepi-19.grenoble.grid5000.fr] lu.A.32
[genepi-19.grenoble.grid5000.fr] /home/cruizsanabria/machine_file
[genepi-2.grenoble.grid5000.fr] lu.A.32
[genepi-2.grenoble.grid5000.fr] /home/cruizsanabria/machine_file
[genepi-27.grenoble.grid5000.fr] /home/cruizsanabria/machine_file
Which confirms the presence of both files on the nodes.
We can get the path of the binary by typing the following into the pry console.
Net::SSH::Multi.start do |session|
nodes.each{ |node| session.use node }
results = session.exec!("find /tmp/ -name lu.A.32")
end
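The find command reports "Permission denied" errors for the parts of /tmp that it cannot read, so its exit status is 1, but the path of the binary is still returned: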
[32] pry(main)> results
=> {"genepi-27.grenoble.grid5000.fr"=>{:stdout=>"/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32", :stderr=>": Permission denied", :status=>1},
"genepi-29.grenoble.grid5000.fr"=>{:stdout=>"/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32", :stderr=>": Permission denied", :status=>1},
"genepi-19.grenoble.grid5000.fr"=>{:stderr=>": Permission denied", :stdout=>"/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32", :status=>1},
"genepi-2.grenoble.grid5000.fr"=>{:stdout=>"/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32", :stderr=>": Permission denied", :status=>1}}
Then, we can assign this to a new variable:
[33] pry(main)> lu_path = results.values.first[:stdout]
=> "/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32"
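Taking the first value assumes that every node reports the same path. A small sketch, using two entries copied from the result above, shows one way to verify that assumption before relying on lu_path; the check itself is an addition and not part of the original steps.

```ruby
# Two entries copied from the find result shown above.
results = {
  "genepi-27.grenoble.grid5000.fr" =>
    { :stdout => "/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32", :stderr => ": Permission denied", :status => 1 },
  "genepi-29.grenoble.grid5000.fr" =>
    { :stdout => "/tmp/NPB3.3/NPB3.3-MPI/bin/lu.A.32", :stderr => ": Permission denied", :status => 1 }
}

# Collect the distinct paths reported by the nodes.
paths = results.values.map { |result| result[:stdout] }.uniq

# Stop early if the nodes disagree, instead of silently using the first answer.
raise "binary path differs between nodes: #{paths.inspect}" unless paths.size == 1

lu_path = paths.first
puts lu_path
```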
The setup of the experiment is done. It is time to execute the benchmark by typing the following into the pry console:
Net::SSH.start(nodes.first) do |ssh|
results = ssh.exec!("mpirun --mca btl self,sm,tcp -np 32 --machinefile machine_file #{lu_path}")
end
Let's now perform a scalability test of the LU application for 2, 4, 8, 16, 32 processes. Open the editor:
[100] pry(main)> edit -n scalability_NAS.rb
And copy-paste the following script:
num_cores = [2,4,8,16,32]
Cute::TakTuk.start(nodes) do |tak|
num_cores.each do |cores|
puts tak.exec!("make lu NPROCS=#{cores} CLASS=A MPIF77=mpif77 -C /tmp/NPB3.3/NPB3.3-MPI/")
end
results = tak.exec!("find /tmp/ -name lu.A.*")
end
binaries = results.values.first[:output].split("\n")
expe_res = {}
Net::SSH.start(nodes.first) do |ssh|
binaries.each do |binary|
processes = /A\.(\d*)/.match(binary)[1]
expe_res[processes]= {}
result = ssh.exec!("mpirun --mca btl self,sm,tcp -np #{processes} --machinefile machine_file #{binary}")
expe_res[processes][:output]= result
expe_res[processes][:time] =result.split("\n").select{ |t| t["Time in"]}.first
end
end
Then, we execute it:
[102] pry(main)> play scalability_NAS.rb
It will take approximately 2 to 3 minutes to run. Once it finishes, a new Hash called expe_res will be defined, which we can use to print the results:
num_cores.each{ |cores| puts "#{cores} cores: #{expe_res[cores.to_s][:time]}"}
It will generate:
[107] pry(main)> num_cores.each{ |cores| puts "#{cores} cores: #{expe_res[cores.to_s][:time]}"}
2 cores: Time in seconds = 42.93
4 cores: Time in seconds = 26.50
8 cores: Time in seconds = 12.39
16 cores: Time in seconds = 7.01
32 cores: Time in seconds = 6.00
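To quantify the scalability, the times above can be turned into a speedup and a parallel efficiency relative to the 2 core run (there is no single core run to use as a baseline). This is a sketch of the arithmetic only, using the five timing lines printed above: speedup is T(2) / T(n), and efficiency is the speedup divided by n / 2.

```ruby
# Timing lines copied from the output above, keyed like expe_res.
times = {
  "2"  => "Time in seconds = 42.93",
  "4"  => "Time in seconds = 26.50",
  "8"  => "Time in seconds = 12.39",
  "16" => "Time in seconds = 7.01",
  "32" => "Time in seconds = 6.00"
}

# Extract the number after "=" and convert it to a Float.
seconds = times.transform_values { |line| line[/=\s*([\d.]+)/, 1].to_f }

base_cores = 2
base_time  = seconds[base_cores.to_s]

report = seconds.map do |cores, time|
  speedup    = base_time / time
  efficiency = speedup / (cores.to_i / base_cores.to_f)
  { cores: cores.to_i, speedup: speedup, efficiency: efficiency }
end

report.each do |row|
  puts format("%2d cores: speedup %.2f, efficiency %.2f",
              row[:cores], row[:speedup], row[:efficiency])
end
```

An efficiency well below 1 for the larger runs indicates that adding cores no longer pays off proportionally for this problem size (CLASS=A).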
Finally, we can use the hist command to try to assemble all we have done so far into a script.
Once finished, we could release the job:
[34] pry(main)> $g5k.release(job)
=> ""
Performing network measurements within a reserved VLAN
In this experiment, we will perform network measurements between two nodes located in different Grid'5000 sites. The network measurements will be carried out in an isolated VLAN. We will first reserve two nodes located in two different Grid'5000 sites in deploy mode and we will ask for two routed VLANs. Once the nodes are ready, an environment will be deployed and the application iperf will be installed on all nodes. Then, we will perform some network measurements between the nodes. Finally, we will query Kwapi using the G5K metrology API to get the network traffic generated during our experiment.
This experiment has the following requirements:
- Two nodes in two different G5K sites
- Environment deployment
- VLAN reservation
- Iperf application
- Access to Network traffic data.
If you have not created a directory for the tutorial, create it and launch the pry console from there:
$ mkdir ruby-cute-tutorial
$ cd ruby-cute-tutorial
$ cute
Let's create a small script that will help us with the reservation of nodes.
Open the editor from pry:
[35] pry(main)> edit -n multisite.rb
and type:
jobs = {}
threads = []
["nancy","rennes"].each do |site|
threads << Thread.new do
jobs[site] = job = $g5k.reserve(:site => site, :nodes => 1,
:env => 'jessie-x64-min',
:vlan => :routed)
end
end
threads.each{ |t| t.join}
In the script, we have chosen Nancy and Rennes sites. You are encouraged to try other sites as the number of routed VLANs is limited in each site. For the purpose of this tutorial you have to choose a site where Kwapi is available: Grenoble, Nancy, Rennes, Lyon, Nantes. We use the method reserve with parameter env for specifying the environment we want to deploy. This will automatically submit a deploy job and it will deploy the specified environment. The parameter vlan will additionally reserve a VLAN and pass it to Kadeploy to setup the VLAN. After executing this small script we got:
[36] pry(main)> play multisite.rb
2016-01-20 12:48:15.010 => Reserving resources: {type='kavlan'}/vlan=1+/nodes=1,walltime=01:00 (type: deploy) (in nancy)
2016-01-20 12:48:15.010 => Reserving resources: {type='kavlan'}/vlan=1+/nodes=1,walltime=01:00 (type: deploy) (in rennes)
2016-01-20 12:48:16.145 => Waiting for reservation 740698
2016-01-20 12:48:16.246 => Waiting for reservation 802917
2016-01-20 12:48:21.270 => Reservation 740698 should be available at 2016-01-20 12:48:17 +0100 (0 s)
2016-01-20 12:48:26.344 => Reservation 740698 should be available at 2016-01-20 12:48:17 +0100 (0 s)
2016-01-20 12:48:26.404 => Reservation 802917 should be available at 2016-01-20 12:48:13 +0100 (0 s)
2016-01-20 12:48:26.404 => Reservation 802917 ready
2016-01-20 12:48:26.541 => Found VLAN with uid = 4
2016-01-20 12:48:26.541 => Creating deployment
2016-01-20 12:48:27.256 => Waiting for 1 deployment
2016-01-20 12:48:31.296 => Waiting for 1 deployment
2016-01-20 12:48:31.406 => Reservation 740698 should be available at 2016-01-20 12:48:17 +0100 (0 s)
2016-01-20 12:48:31.406 => Reservation 740698 ready
2016-01-20 12:48:31.469 => Found VLAN with uid = 4
2016-01-20 12:48:31.469 => Creating deployment
2016-01-20 12:48:31.869 => Waiting for 1 deployment
2016-01-20 12:48:35.414 => Waiting for 1 deployment
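The threading pattern used in multisite.rb (one thread per site, then a join on all of them) can be tried without touching the testbed. In this sketch fake_reserve is a hypothetical stand-in for $g5k.reserve, and the Mutex around the shared Hash is a precaution added here that is not present in the original script.

```ruby
# Hypothetical stand-in for $g5k.reserve: waits briefly, then returns a job-like Hash.
def fake_reserve(site)
  sleep 0.1
  { "site" => site, "state" => "running" }
end

jobs = {}
lock = Mutex.new

# Start one thread per site so that both reservations progress at the same time.
threads = ["nancy", "rennes"].map do |site|
  Thread.new do
    job = fake_reserve(site)
    lock.synchronize { jobs[site] = job } # protect the shared Hash
  end
end

# Wait for every reservation to complete before using the results.
threads.each(&:join)

puts jobs.keys.sort.inspect
```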
Once multisite.rb has finished, the variable jobs is defined and contains the job information for each site.
In this variable, we can find information related to the deployment.
[44] pry(main)> jobs["nancy"]["deploy"]
=> [{"created_at"=>1450439620,
"environment"=>"jessie-x64-min",
"key"=>"https://api.grid5000.fr/3.0/sites/nancy/files/cruizsanabria-key-84f3f1dbb1279bc1bddcd618e26c960307d653c5",
"nodes"=>["graphite-4.nancy.grid5000.fr"],
"result"=>{"graphite-4.nancy.grid5000.fr"=>{"macro"=>nil, "micro"=>nil, "state"=>"OK"}},
"site_uid"=>"nancy",
"status"=>"terminated",
"uid"=>"D-b026879e-b185-4e20-8bc5-ea0842a6954b",
"updated_at"=>1450439860,
"user_uid"=>"cruizsanabria",
"vlan"=>14,
"links"=>
[{"rel"=>"self", "href"=>"/3.0/sites/nancy/deployments/D-b026879e-b185-4e20-8bc5-ea0842a6954b", "type"=>"application/vnd.grid5000.item+json"},
{"rel"=>"parent", "href"=>"/3.0/sites/nancy", "type"=>"application/vnd.grid5000.item+json"}]}]
Two important pieces of information are the status of the whole process and the state of each node. We can use this information to check whether the deployment has finished successfully on all nodes. This data structure is used by the method check_deployment. Let's check the documentation of this method:
[16] pry(main)> show-doc Cute::G5K::API#check_deployment
From: /home/cruizsanabria/Repositories/ruby-cute/lib/cute/g5k_api.rb @ line 1198:
Owner: Cute::G5K::API
Visibility: public
Signature: check_deployment(deploy_info)
Number of lines: 10
It returns an array of machines that did not deploy successfully
= Example
It can be used to try a new deploy:
badnodes = g5k.check_deployment(job["deploy"].last)
g5k.deploy(job,:nodes => badnodes, :env => 'wheezy-x64-base')
g5k.wait_for_deploy(job)
return [Array] machines that did not deploy successfully
param deploy_info [Hash] deployment structure information
We can use this method with the jobs we have just submitted (the output will probably be long, so you may need to scroll up to see what is shown here):
[47] pry(main)> jobs.each{ |site,job| puts "all nodes OK in site: #{site}" if $g5k.check_deployment(job["deploy"].last).empty?}
all nodes OK in site: rennes
all nodes OK in site: nancy
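To make the kind of check involved more concrete, the sketch below filters a deployment structure by node state. It is an illustration based only on the "result" layout shown earlier, not the implementation of check_deployment; the second node and its "KO" state are invented to show a failure.

```ruby
# Deployment structure reduced to the fields used here. The first node is copied
# from the output above; the second one is a hypothetical failed node.
deploy_info = {
  "status" => "terminated",
  "result" => {
    "graphite-4.nancy.grid5000.fr" => { "macro" => nil, "micro" => nil, "state" => "OK" },
    "graphite-5.nancy.grid5000.fr" => { "macro" => nil, "micro" => nil, "state" => "KO" }
  }
}

# Names of the nodes whose state is anything other than "OK".
def failed_nodes(deploy_info)
  deploy_info["result"].reject { |_node, info| info["state"] == "OK" }.keys
end

bad_nodes = failed_nodes(deploy_info)
puts "nodes to redeploy: #{bad_nodes.inspect}"
```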
Now, the reserved nodes are in a VLAN; within this VLAN a DHCP server will assign new IP addresses to the nodes. You can configure your own if you want (please refer to the KaVLAN tutorial if you want to know more). We can get the newly assigned names by doing:
nodes = []
jobs.each{ |site,job| nodes.push($g5k.get_vlan_nodes(job))}
After putting that into pry we will get something like this:
[50] pry(main)> nodes
=> [["paranoia-6-kavlan-16.rennes.grid5000.fr"], ["graphite-4-kavlan-14.nancy.grid5000.fr"]]
[51] pry(main)> nodes.flatten
=> ["paranoia-6-kavlan-16.rennes.grid5000.fr", "graphite-4-kavlan-14.nancy.grid5000.fr"]
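The names returned above follow a visible pattern: the host part of the regular name is suffixed with -kavlan- and the VLAN number. The sketch below reproduces that pattern by hand. The helper kavlan_name is hypothetical and only mirrors what can be read from the output; in an experiment, get_vlan_nodes remains the method to use.

```ruby
# Hypothetical helper: derives the in-VLAN host name from the regular one,
# following the pattern visible in the output above
# (paranoia-6.rennes.grid5000.fr + VLAN 16 -> paranoia-6-kavlan-16.rennes.grid5000.fr).
def kavlan_name(node, vlan_id)
  # Split only once, so that the whole domain stays in one piece.
  host, domain = node.split(".", 2)
  "#{host}-kavlan-#{vlan_id}.#{domain}"
end

puts kavlan_name("paranoia-6.rennes.grid5000.fr", 16)
puts kavlan_name("graphite-4.nancy.grid5000.fr", 14)
```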
Now, let's install the iperf application in order to perform our network measurements.
Copy-paste the following code into pry:
nodes = nodes.flatten
Net::SSH::Multi.start do |session|
nodes.each{ |node| session.use("root@#{node}") }
session.exec!("apt-get update")
session.exec("DEBIAN_FRONTEND=noninteractive apt-get install -q -y iperf")
end
You should get something like this:
[paranoia-6-kavlan-16.rennes.grid5000.fr] Reading package lists...
[paranoia-6-kavlan-16.rennes.grid5000.fr] Building dependency tree...
[graphite-4-kavlan-14.nancy.grid5000.fr] Reading package lists...
[graphite-4-kavlan-14.nancy.grid5000.fr] Building dependency tree...
[paranoia-6-kavlan-16.rennes.grid5000.fr]
[paranoia-6-kavlan-16.rennes.grid5000.fr] Reading state information...
[paranoia-6-kavlan-16.rennes.grid5000.fr] The following NEW packages will be installed:
[paranoia-6-kavlan-16.rennes.grid5000.fr] iperf
[paranoia-6-kavlan-16.rennes.grid5000.fr] 0 upgraded, 1 newly installed, 0 to remove and 8 not upgraded.
[paranoia-6-kavlan-16.rennes.grid5000.fr] Need to get 51.4 kB of archives.
[paranoia-6-kavlan-16.rennes.grid5000.fr] After this operation, 179 kB of additional disk space will be used.
[paranoia-6-kavlan-16.rennes.grid5000.fr] Get:1 http://ftp.debian.org/debian/ jessie/main iperf amd64 2.0.5+dfsg1-2 [51.4 kB]
[graphite-4-kavlan-14.nancy.grid5000.fr]
[graphite-4-kavlan-14.nancy.grid5000.fr] Reading state information...
[graphite-4-kavlan-14.nancy.grid5000.fr] The following NEW packages will be installed:
[graphite-4-kavlan-14.nancy.grid5000.fr] iperf
You can check whether the application has been successfully installed by typing the following into the pry console:
Net::SSH::Multi.start do |session|
nodes.each{ |node| session.use("root@#{node}") }
session.exec("iperf --version")
end
Which will generate:
[65] pry(main)* end
[paranoia-6-kavlan-16.rennes.grid5000.fr] iperf version 2.0.5 (08 Jul 2010) pthreads
[graphite-4-kavlan-14.nancy.grid5000.fr] iperf version 2.0.5 (08 Jul 2010) pthreads
=> nil
Let's perform some iperf tests by writing a small script. Open the editor:
[76] pry(main)> edit -n iperf_test.rb
and type:
results = {}
Net::SSH::Multi.start do |session|
session.group :server do
session.use("root@#{nodes[0]}")
end
session.group :client do
session.use("root@#{nodes[1]}")
end
session.with(:server).exec("iperf -s &")
# bandwidth
results[:bandwidth]= session.with(:client).exec!("iperf -c #{nodes[0]}")
# bi-directional bandwidth measurement
results[:bidi]= session.with(:client).exec!("iperf -c #{nodes[0]} -r")
# TCP window size
results[:window]= session.with(:client).exec!("iperf -c #{nodes[0]} -w 2000")
# shutdown server
session.with(:server).exec("skill iperf")
end
Then, we execute it with the play command:
[77] pry(main)> play iperf_test.rb
[paranoia-6-kavlan-16.rennes.grid5000.fr] ------------------------------------------------------------
[paranoia-6-kavlan-16.rennes.grid5000.fr] Server listening on TCP port 5001
[paranoia-6-kavlan-16.rennes.grid5000.fr] TCP window size: 85.3 KByte (default)
[paranoia-6-kavlan-16.rennes.grid5000.fr] ------------------------------------------------------------
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 4] local 10.27.204.71 port 5001 connected with 10.19.200.240 port 32769
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ ID] Interval Transfer Bandwidth
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 4] 0.0-10.0 sec 1.12 GBytes 957 Mbits/sec
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 5] local 10.27.204.71 port 5001 connected with 10.19.200.240 port 32770
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 5] 0.0-10.0 sec 1.10 GBytes 947 Mbits/sec
[paranoia-6-kavlan-16.rennes.grid5000.fr] ------------------------------------------------------------
[paranoia-6-kavlan-16.rennes.grid5000.fr] Client connecting to 10.19.200.240, TCP port 5001
[paranoia-6-kavlan-16.rennes.grid5000.fr] TCP window size: 85.0 KByte (default)
[paranoia-6-kavlan-16.rennes.grid5000.fr] ------------------------------------------------------------
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 5] local 10.27.204.71 port 47604 connected with 10.19.200.240 port 5001
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 5] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 4] local 10.27.204.71 port 5001 connected with 10.19.200.240 port 32771
[paranoia-6-kavlan-16.rennes.grid5000.fr] [ 4] 0.0-10.7 sec 2.25 MBytes 1.77 Mbits/sec
The variable results is now defined; it contains the results of each test.
Let's print the results. Type the following into the pry console:
results.each do |test, res|
puts "Results of test: #{test}"
res.each { |node,r| puts r[:stdout]}
end
Which will give us:
Results of test: bandwidth
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
Results of test: bidi
[ 4] 0.0-10.0 sec 1.12 GBytes 957 Mbits/sec
Results of test: window
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.7 sec 2.25 MBytes 1.77 Mbits/sec
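If you want to compare the tests numerically rather than read the raw text, the report lines can be parsed. The following is a minimal sketch, assuming the iperf 2 report format shown above; the constant and helper names (IPERF_REPORT, FACTORS, bandwidth_mbits) are hypothetical and the regular expression has only been checked against lines of that form.

```ruby
# Matches iperf 2 report lines such as
#   [ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec
# and captures the bandwidth value and its unit prefix (none, K, M or G).
IPERF_REPORT = /\[\s*\d+\]\s+[\d.]+-\s*[\d.]+\s+sec\s+[\d.]+\s+\w+\s+([\d.]+)\s+([KMG]?)bits\/sec/

# Multipliers to convert each unit prefix to Mbits/sec.
FACTORS = { "" => 1e-6, "K" => 1e-3, "M" => 1.0, "G" => 1e3 }

# Hypothetical helper: returns the bandwidths (in Mbits/sec) found in an
# iperf output; lines that are not report lines (headers, separators) are skipped.
def bandwidth_mbits(output)
  output.each_line.map { |line|
    m = IPERF_REPORT.match(line)
    m && m[1].to_f * FACTORS[m[2]]
  }.compact
end

sample = "[ ID] Interval Transfer Bandwidth\n" \
         "[ 3] 0.0-10.0 sec 1.12 GBytes 958 Mbits/sec\n"
puts bandwidth_mbits(sample).inspect
```

In the pry session, the helper could be applied to each r[:stdout] of the results variable instead of the sample string.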
Now let's use Kwapi to look at the network traffic that we generated during our experiment.
Ruby-cute offers the get_metric method to consult the G5K Metrology API.
In order to carry out a query and get the values of a specific probe, we have to know the time interval of the values and the name of the probe.
Let's get the values for the metric network_in.
We can get the names of all the probes specific to this metric by typing:
probes = $g5k.get_metric("rennes",:metric => "network_in").uids
Please replace the first parameter with the site you have used in the experiment.
If you type that in pry you will get:
[13] pry(main)> probes
=> ["parasilo-11-eth0",
"parasilo-11-eth1",
"paravance-48-eth1",
"paravance-48-eth0",
"paravance-2-eth0",
"paravance-2-eth1",
"paranoia-4",
"paranoia-5",
"paranoia-6",
"paranoia-7",
"paravance-72-eth0",
In order to choose the right probes, we need to get the real names of the machines and not the ones assigned by the VLAN. We can consult the job information:
nodes_normal = []
jobs.each{ |site,job| nodes_normal.push(job["assigned_nodes"])}
Which will give us an Array of Arrays that we can flatten by doing:
[97] pry(main)> nodes_normal.flatten!
=> ["paranoia-6.rennes.grid5000.fr", "graphite-4.nancy.grid5000.fr"]
As we are going to fetch the data for Rennes (the first node), we could do:
[68] pry(main)> probe_expe = probes.select{ |p| p == nodes_normal[0].split(".")[0] }
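The selection above keeps the probes whose name equals the short host name. The probe list also contains names with an interface suffix (for example paravance-48-eth0), so a slightly more general matching may be useful on other clusters. The sketch below is one possible way to do it; the helper probes_for and the "-eth" suffix rule are assumptions inferred from the names listed earlier, not part of Ruby-Cute.

```ruby
# Hypothetical helper: returns the probes that belong to a node.
# A probe matches if it equals the short host name ("paranoia-6") or if it is
# the short host name followed by an interface suffix ("paravance-48-eth0").
def probes_for(node, probes)
  short = node.split(".").first
  probes.select { |p| p == short || p.start_with?("#{short}-eth") }
end

# A subset of the probe names listed earlier.
probes = ["parasilo-11-eth0", "parasilo-11-eth1", "paravance-48-eth1",
          "paravance-48-eth0", "paranoia-4", "paranoia-5", "paranoia-6", "paranoia-7"]

p probes_for("paranoia-6.rennes.grid5000.fr", probes)
p probes_for("paravance-48.rennes.grid5000.fr", probes)
```

Matching on the exact short name or on the "-eth" prefix avoids selecting a different machine whose name merely starts with the same characters.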
So, at this point we have the probe we want to query. The next step is to get the start time of the interval; we can choose, for example, the time at which the deployments finished:
deploy_end = []
jobs.each{ |site,job| deploy_end.push(job["deploy"].last["updated_at"])}
Therefore, we could choose the maximum timestamp from the ones returned:
start = deploy_end.max
Now, we can proceed by performing the query:
$g5k.get_metric("rennes",:metric => "network_in",:query => {:from => start, :to => start+3600, :only => probe_expe.first})
An Array is returned. We can then open an editor and write a small script that writes these values into a file:
[33] pry(main)> edit -n get_results.rb
and type:
raw_data = $g5k.get_metric("rennes",:metric => "network_in",
:query => {:from => start, :to => start+3600, :only => probe_expe.first})
network_in = raw_data.map{ |r| r["values"]}.flatten
time = raw_data.map{ |r| r["timestamps"]}.flatten
values = Hash[time.zip(network_in)]
File.open("network_in-values.txt",'w+') do |f|
f.puts("time\tbytes")
values.each{ |k,v| f.puts("#{k}\t#{v}")}
end
and execute it with:
pry(main)> play get_results.rb
=> {1453293153.732378=>4498405298908,
1453293155.183099=>4498405298908,
1453293156.582115=>4498405298908,
1453293157.924968=>4498405298908,
1453293159.28666=>4498405298908,
1453293160.655534=>4498405298908,
1453293161.998718=>4498405299219,
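The values printed above barely change from one timestamp to the next, which suggests a cumulative byte counter rather than a rate. Assuming that interpretation holds for network_in, the traffic rate can be derived from consecutive samples. The helper rates below is a hypothetical sketch and the sample numbers are made up so that the arithmetic is easy to follow.

```ruby
# Hypothetical helper: turns {timestamp => counter} samples into
# [timestamp, bytes per second] pairs, one per consecutive pair of samples.
# It assumes the counter is cumulative and never reset during the interval.
def rates(samples)
  samples.sort_by(&:first).each_cons(2).map do |(t0, v0), (t1, v1)|
    [t1, (v1 - v0) / (t1 - t0).to_f]
  end
end

# Made-up samples: no traffic for 2 seconds, then 600 bytes in 2 seconds.
values = { 100.0 => 1000, 102.0 => 1000, 104.0 => 1600 }

rates(values).each { |t, r| puts "#{t}\t#{r}" }
```

Applied to the values Hash built in get_results.rb, this would give one rate per sampling interval, which is usually easier to plot than the raw counter.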
We can release the nodes:
[58] pry(main)> jobs.values.each{ |j| $g5k.release(j)}
Conclusions
This tutorial has shown how complex experiments can be scripted using the Ruby language. We saw that, in the context of Grid'5000, Ruby-Cute offers useful methods for accessing the platform's services and executing commands in parallel. The aim of this tutorial was to give you some ideas for coding your experiments using Ruby-Cute, and we hope it will be useful for them.