Class: ProxmoxWaiter
- Inherits: Object
- Defined in: lib/hybrid_platforms_conductor/hpc_plugins/provisioner/proxmox/proxmox_waiter.rb
Overview
Serve Proxmox reservation requests, like a waiter in a restaurant ;-) Multi-process safe.
Constant Summary
- FUTEX_TIMEOUT = 600
  Integer: Timeout in seconds to get the futex. Take into account that some processes can be lengthy while the futex is taken:
  - POST/DELETE operations in the Proxmox API require tasks to be performed, which can take a few seconds depending on the load.
  - The Proxmox API sometimes fails to respond when containers are being locked temporarily (we have a 30 secs timeout for each one).
- RETRY_QUEUE_WAIT = 30
  Integer: Maximum time in seconds to wait before retrying to get the futex when we are not first in the queue (a random duration up to this value is applied; see the sketch after this list).
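To make the retry behaviour concrete, here is a minimal sketch of how these two constants could govern lock acquisition, assuming a plain flock-based lock file. The with_futex helper and its internals are illustrative only; the class's actual locking mechanism may differ.
```ruby
require 'timeout'

# Illustrative sketch: FUTEX_TIMEOUT bounds the total wait for the lock,
# and RETRY_QUEUE_WAIT caps the randomized sleep between attempts so that
# competing processes don't all retry in lockstep.
def with_futex(futex_file)
  Timeout.timeout(FUTEX_TIMEOUT) do
    File.open(futex_file, File::RDWR | File::CREAT) do |lock|
      # Try to take an exclusive lock without blocking; flock returns false
      # when another process already holds it.
      sleep(rand(RETRY_QUEUE_WAIT)) until lock.flock(File::LOCK_EX | File::LOCK_NB)
      yield
    end
  end
end
```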
Instance Method Summary
- #create(vm_info) ⇒ Object
  Reserve resources for a new container.
- #destroy(vm_info) ⇒ Object
  Destroy a VM.
- #initialize(config_file, proxmox_user, proxmox_password, proxmox_realm) ⇒ ProxmoxWaiter (constructor)
  Constructor.
Constructor Details
#initialize(config_file, proxmox_user, proxmox_password, proxmox_realm) ⇒ ProxmoxWaiter
Constructor
- Parameters:
  - config_file (String): Path to a JSON file containing the ProxmoxWaiter configuration (a sample configuration is shown after this parameter list). Here is the file structure:
    - proxmox_api_url (String): Proxmox API URL.
    - futex_file (String): Path to the file serving as a futex.
    - logs_dir (String): Path to the directory containing logs [default: '.'].
    - api_max_retries (Integer): Maximum number of API retries.
    - api_wait_between_retries_secs (Integer): Number of seconds to wait between API retries.
    - pve_nodes (Array<String>): List of PVE nodes allowed to spawn new containers [default: all].
    - vm_ips_list (Array<String>): The list of IPs that are available for the Proxmox containers.
    - vm_ids_range ([Integer, Integer]): Minimum and maximum reservable VM ID.
    - coeff_ram_consumption (Integer): Importance coefficient to assign to the RAM consumption when selecting available PVE nodes.
    - coeff_disk_consumption (Integer): Importance coefficient to assign to the disk consumption when selecting available PVE nodes.
    - expiration_period_secs (Integer): Number of seconds defining the expiration period.
    - expire_stopped_vm_timeout_secs (Integer): Number of seconds before stopped VMs are considered expired.
    - limits (Hash): Limits to be taken into account while reserving resources. Each property is optional, and no property means no limit:
      - nbr_vms_max (Integer): Maximum number of VMs we can reserve.
      - cpu_loads_thresholds ([Float, Float, Float]): CPU load thresholds above which a PVE node should not be used (as soon as one of the load values is greater than one of those thresholds, the node is discarded).
      - ram_percent_used_max (Float): Maximum percentage (between 0 and 1) of RAM that can be reserved on a PVE node.
      - disk_percent_used_max (Float): Maximum percentage (between 0 and 1) of disk that can be reserved on a PVE node.
  - proxmox_user (String): Proxmox user to be used to connect to the API.
  - proxmox_password (String): Proxmox password to be used to connect to the API.
  - proxmox_realm (String): Proxmox realm to be used to connect to the API.
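For illustration, a configuration file following this structure might look like the following sketch. Every value here is made up (node names, IPs, thresholds), shown as a Ruby hash serialized to JSON:
```ruby
require 'json'

# Hypothetical ProxmoxWaiter configuration; all values are examples only.
config = {
  'proxmox_api_url' => 'https://my-proxmox.my-domain.com:8006',
  'futex_file' => '/tmp/proxmox_waiter.futex',
  'logs_dir' => '/var/log/proxmox_waiter',
  'api_max_retries' => 3,
  'api_wait_between_retries_secs' => 5,
  'pve_nodes' => %w[pve_node_1 pve_node_2],
  'vm_ips_list' => %w[192.168.0.100 192.168.0.101 192.168.0.102],
  'vm_ids_range' => [1000, 1100],
  'coeff_ram_consumption' => 10,
  'coeff_disk_consumption' => 1,
  'expiration_period_secs' => 24 * 60 * 60,
  'expire_stopped_vm_timeout_secs' => 60 * 60,
  'limits' => {
    'nbr_vms_max' => 5,
    'cpu_loads_thresholds' => [10.0, 10.0, 10.0],
    'ram_percent_used_max' => 0.75,
    'disk_percent_used_max' => 0.75
  }
}
File.write('proxmox_waiter.json', JSON.pretty_generate(config))
```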
```ruby
# File 'lib/hybrid_platforms_conductor/hpc_plugins/provisioner/proxmox/proxmox_waiter.rb', line 46

def initialize(config_file, proxmox_user, proxmox_password, proxmox_realm)
  @config = JSON.parse(File.read(config_file))
  @proxmox_user = proxmox_user
  @proxmox_password = proxmox_password
  @proxmox_realm = proxmox_realm
  # Keep a memory of non-debug stopped containers, so that we can guess if they are expired or not after some time.
  # Time when we noticed a given container is stopped, per creation date, per VM ID, per PVE node.
  # We add the creation date as a VM ID can be reused (with a different creation date) and we want to make sure we don't think a newly created VM is here for longer than it should.
  # Hash< String,   Hash< Integer, Hash< String,        Time                 > > >
  # Hash< pve_node, Hash< vm_id,   Hash< creation_date, time_seen_as_stopped > > >
  @non_debug_stopped_containers = {}
  @log_file = "#{@config['logs_dir'] || '.'}/proxmox_waiter_#{Time.now.utc.strftime('%Y%m%d%H%M%S')}_pid_#{Process.pid}_#{File.basename(config_file, '.json')}.log"
  FileUtils.mkdir_p File.dirname(@log_file)
end
```
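A hypothetical instantiation, reusing the sample configuration file from above (user, password, and realm are made up):
```ruby
# Hypothetical usage; credentials and paths are examples only.
waiter = ProxmoxWaiter.new(
  'proxmox_waiter.json',
  'machines_user',
  'SecretPassword!',
  'pam'
)
```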
Instance Method Details
#create(vm_info) ⇒ Object
Reserve resources for a new container, checking resource availability first.
- Parameters:
  - vm_info (Hash<String,Object>): The VM info to be created, using the same properties as LXC container creation through the Proxmox API.
- Result:
  - Hash<Symbol, Object> or Symbol: Reserved resource info, or a Symbol in case of error. The following properties are set as resource info:
    - pve_node (String): Node on which the container has been created.
    - vm_id (Integer): The VM ID.
    - vm_ip (String): The VM IP.
    Possible error codes returned are:
    - not_enough_resources: There are no free resources available to be reserved.
    - no_available_ip: There is no available IP to be reserved.
    - no_available_vm_id: There is no available VM ID to be reserved.
    - exceeded_number_of_vms: There are already too many VMs running.
```ruby
# File 'lib/hybrid_platforms_conductor/hpc_plugins/provisioner/proxmox/proxmox_waiter.rb', line 77

def create(vm_info)
  log "Ask to create #{vm_info}"
  # Extract the required resources from the desired VM info
  nbr_cpus = vm_info['cpulimit']
  ram_mb = vm_info['memory']
  disk_gb = Integer(vm_info['rootfs'].split(':').last)
  reserved_resource = nil
  start do
    pve_node_scores = pve_scores_for(nbr_cpus, ram_mb, disk_gb)
    # Check if we are not exceeding hard-limits:
    # * the number of vms to be created
    # * the free IPs
    # * the free VM IDs
    # In such case, even when free resources on PVE nodes are enough to host the new container, we still need to clean-up before.
    nbr_vms = nbr_vms_handled_by_us
    if nbr_vms >= @config['limits']['nbr_vms_max'] || free_ips.empty? || free_vm_ids.empty?
      log 'Hitting at least 1 hard-limit. Check if we can destroy expired containers.'
      log "[ Hard limit reached ] - Already #{nbr_vms} are created (max is #{@config['limits']['nbr_vms_max']})." if nbr_vms >= @config['limits']['nbr_vms_max']
      log '[ Hard limit reached ] - No more available IPs.' if free_ips.empty?
      log '[ Hard limit reached ] - No more available VM IDs.' if free_vm_ids.empty?
      clean_up_done = false
      # Check if we can remove some expired ones
      @config['pve_nodes'].each do |pve_node|
        if api_get("nodes/#{pve_node}/lxc").any? { |lxc_info| is_vm_expired?(pve_node, Integer(lxc_info['vmid'])) }
          destroy_expired_vms_on(pve_node)
          clean_up_done = true
        end
      end
      if clean_up_done
        nbr_vms = nbr_vms_handled_by_us
        if nbr_vms >= @config['limits']['nbr_vms_max']
          log "[ Hard limit reached ] - Still too many running VMs after clean-up: #{nbr_vms}."
          reserved_resource = :exceeded_number_of_vms
        elsif free_ips.empty?
          log '[ Hard limit reached ] - Still no available IP'
          reserved_resource = :no_available_ip
        elsif free_vm_ids.empty?
          log '[ Hard limit reached ] - Still no available VM ID'
          reserved_resource = :no_available_vm_id
        end
      else
        log 'Could not find any expired VM to destroy.'
        # There was nothing to clean. So wait for other processes to destroy their containers.
        reserved_resource =
          if nbr_vms >= @config['limits']['nbr_vms_max']
            :exceeded_number_of_vms
          elsif free_ips.empty?
            :no_available_ip
          else
            :no_available_vm_id
          end
      end
    end
    if reserved_resource.nil?
      # Select the best node, first keeping expired VMs if possible.
      # This is the index of the scores to be checked: if we can choose without recycling VMs, do it by considering score index 0.
      score_idx =
        if pve_node_scores.all? { |_pve_node, pve_node_scores| pve_node_scores[0].nil? }
          # No node was available without removing expired VMs.
          # Therefore we consider only scores without expired VMs.
          log 'No PVE node has enough free resources without removing eventual expired VMs'
          1
        else
          0
        end
      selected_pve_node, selected_pve_node_score = pve_node_scores.inject([nil, nil]) do |(best_pve_node, best_score), (pve_node, pve_node_scores)|
        if pve_node_scores[score_idx].nil? || (!best_score.nil? && pve_node_scores[score_idx] >= best_score)
          [best_pve_node, best_score]
        else
          [pve_node, pve_node_scores[score_idx]]
        end
      end
      if selected_pve_node.nil?
        # No PVE node can host our request.
        log 'Could not find any PVE node with enough free resources'
        reserved_resource = :not_enough_resources
      else
        log "[ #{selected_pve_node} ] - PVE node selected with score #{selected_pve_node_score}"
        # We know on which PVE node we can instantiate our new container.
        # We have to purge expired VMs on this PVE node before reserving a new creation.
        destroy_expired_vms_on(selected_pve_node) if score_idx == 1
        # Now select the correct VM ID and VM IP.
        vm_id_or_error, ip = reserve_on(selected_pve_node, nbr_cpus, ram_mb, disk_gb)
        if ip.nil?
          # We have an error
          reserved_resource = vm_id_or_error
        else
          # Create the container for real
          completed_vm_info = vm_info.dup
          completed_vm_info['vmid'] = vm_id_or_error
          completed_vm_info['net0'] = "#{completed_vm_info['net0']},ip=#{ip}/32"
          completed_vm_info['description'] = "#{completed_vm_info['description']}creation_date: #{Time.now.utc.strftime('%FT%T')}\n"
          log "[ #{selected_pve_node}/#{vm_id_or_error} ] - Create LXC container"
          wait_for_proxmox_task(selected_pve_node, @proxmox.post("nodes/#{selected_pve_node}/lxc", completed_vm_info))
          reserved_resource = {
            pve_node: selected_pve_node,
            vm_id: vm_id_or_error,
            vm_ip: ip
          }
        end
      end
    end
  end
  reserved_resource
end
```
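A hypothetical call, reusing the waiter instance from the constructor example above. The properties mirror the Proxmox LXC creation API, and all values (template, hostname, sizes) are made up for illustration:
```ruby
# Hypothetical usage; all values are examples only.
result = waiter.create(
  'ostemplate' => 'local:vztmpl/debian-10-standard_10.7-1_amd64.tar.gz',
  'hostname' => 'test-node.my-domain.com',
  'cpulimit' => 2,
  'memory' => 2048,           # MB
  'rootfs' => 'local-lvm:10', # 10 GB
  'net0' => 'name=eth0,bridge=vmbr0,gw=192.168.0.1',
  'description' => "node: test-node\nenvironment: test\n"
)
if result.is_a?(Symbol)
  puts "Could not reserve a container: #{result}"
else
  puts "Container #{result[:vm_id]} created on #{result[:pve_node]} with IP #{result[:vm_ip]}"
end
```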
#destroy(vm_info) ⇒ Object
Destroy a VM.
- Parameters:
  - vm_info (Hash<String,Object>): The VM info to be destroyed:
    - vm_id (Integer): The VM ID.
    - node (String): The node for which this VM has been created.
    - environment (String): The environment for which this VM has been created.
- Result:
  - Hash<Symbol, Object> or Symbol: Released resource info, or a Symbol in case of error. The following properties are set as resource info:
    - pve_node (String): Node on which the container has been released (only set if the container was found).
    Possible error codes returned are: none.
```ruby
# File 'lib/hybrid_platforms_conductor/hpc_plugins/provisioner/proxmox/proxmox_waiter.rb', line 197

def destroy(vm_info)
  log "Ask to destroy #{vm_info}"
  found_pve_node = nil
  start do
    vm_id_str = vm_info['vm_id'].to_s
    # Destroy the VM ID
    # Find which PVE node hosts this VM
    unless @config['pve_nodes'].any? do |pve_node|
             api_get("nodes/#{pve_node}/lxc").any? do |lxc_info|
               if lxc_info['vmid'] == vm_id_str
                 # Make sure this VM is still used for the node and environment we want.
                 # It could have been deleted manually and re-affected to another node/environment automatically, and in this case we should not remove it.
                 # NOTE: this identifier was garbled in the extracted listing; vm_metadata is an assumed reconstruction of the helper reading a container's metadata.
                 metadata = vm_metadata(pve_node, vm_info['vm_id'])
                 if metadata[:node] == vm_info['node'] && metadata[:environment] == vm_info['environment']
                   destroy_vm_on(pve_node, vm_info['vm_id'])
                   found_pve_node = pve_node
                   true
                 else
                   log "[ #{pve_node}/#{vm_info['vm_id']} ] - This container is not hosting the node/environment to be destroyed: #{metadata[:node]}/#{metadata[:environment]} != #{vm_info['node']}/#{vm_info['environment']}"
                   false
                 end
               else
                 false
               end
             end
           end
      log "Could not find any PVE node hosting VM #{vm_info['vm_id']}"
    end
  end
  reserved_resource = {}
  reserved_resource[:pve_node] = found_pve_node unless found_pve_node.nil?
  reserved_resource
end
```
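A hypothetical call, matching the create example above (the VM ID, node, and environment values are made up):
```ruby
# Hypothetical usage; values are examples only.
released = waiter.destroy(
  'vm_id' => 1042,
  'node' => 'test-node',
  'environment' => 'test'
)
if released.key?(:pve_node)
  puts "Container released from #{released[:pve_node]}"
else
  puts 'No PVE node was hosting this VM'
end
```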