Class: ContentData::ContentData
- Inherits: Object (hierarchy: Object < ContentData::ContentData)
- Defined in: lib/content_data/content_data.rb
Overview
A Content Data (CD) object holds file information as contents and instances. The file information retrieved from hardware is: checksum, size, modification time, server, device and path. These attributes are divided into content attributes and instance attributes:
- checksum (unique) and size are content attributes
- modification time, server, device and path are instance attributes
The relationship between a content and its instances is 1:many, meaning that a content can have instances on many servers. A content also has a time attribute, whose value is the modification time of the first instance added. This can be changed with the unify_time method, which sets the time attributes of a content and all its instances to the minimum time among them. Different files (instances) with the same content (checksum) are grouped together under that content. Interface methods include:
- iteration over contents and instances info,
- unify time, add/remove instance, queries, merge, remove directory and more.
Content info data structure:
@contents_info = { Checksum -> [size, *instances*, content_modification_time] }
*instances* = {[server,path] -> instance_modification_time }
Notes:
1. content_modification_time is the instance_modification_time of the first
instance that was added to @contents_info
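The layout above can be sketched with plain Ruby hashes. This is only an illustration of the data structure, not the library's code: `add_instance_sketch` and the sample values are hypothetical, and the real add_instance also handles checksum changes and size mismatches.

```ruby
# Hypothetical stand-ins for the two internal hashes (not the library's objects).
contents_info  = {}  # checksum => [size, { [server, path] => instance_mtime }, content_mtime]
instances_info = {}  # [server, path] => checksum (reverse index for instance queries)

# Simplified add: the first instance added for a checksum fixes the content time.
def add_instance_sketch(contents_info, instances_info, checksum, size, server, path, mtime)
  location = [server, path]
  entry = (contents_info[checksum] ||= [size, {}, mtime])
  entry[1][location] = mtime
  instances_info[location] = checksum
end

add_instance_sketch(contents_info, instances_info, 'abc123', 10, 'server_a', '/data/f1', 200)
add_instance_sketch(contents_info, instances_info, 'abc123', 10, 'server_b', '/backup/f1', 150)

size, instances, content_mtime = contents_info['abc123']
puts "size=#{size} instances=#{instances.length} content_mtime=#{content_mtime}"
```

Note that the content time stays at 200, the time of the first instance added, even though a later instance is older; this is the situation unify_time is meant to normalize.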
Constant Summary
- CHUNK_SIZE = 5000
Instance Method Summary
- #==(other) ⇒ Object
- #add_instance(checksum, size, server, path, modification_time) ⇒ Object
- #checksum_instances_size(checksum) ⇒ Object
- #clone_contents_info ⇒ Object
- #clone_instances_info ⇒ Object
- #content_each_instance(checksum, &block) ⇒ Object
  Iterator over the instances of a specific content. The block is provided with: checksum, size, content modification time, instance modification time, server and file path.
- #content_exists(checksum) ⇒ Object
- #contents_size ⇒ Object
- #each_content(&block) ⇒ Object
  Iterator over the @contents_info data structure (not including instances). The block is provided with: checksum, size and content modification time.
- #each_instance(&block) ⇒ Object
  Iterator over the @contents_info data structure (including instances). The block is provided with: checksum, size, content modification time, instance modification time, server and file path.
- #empty? ⇒ Boolean
- #from_file(filename) ⇒ Object
  Loads the db from a file, using chunks for better memory performance. TODO: validation that the file indeed contains ContentData is missing.
- #get_instance_mod_time(checksum, location) ⇒ Object
- #get_query(variable, params) ⇒ Object
  This method is experimental and shouldn't be used. nil is used to define +/- infinity for the from/to method arguments; from/to values are exclusive in the condition calculations. Take care with the '==' operation, which is used for object comparison. TODO: simplify conditions.
- #initialize(other = nil) ⇒ ContentData
  Constructor. Returns a new instance of ContentData.
- #instance_exists(path, server) ⇒ Object
- #instances_size ⇒ Object
- #read_contents_chunk(filename, file, chunk_size) ⇒ Object
- #read_instances_chunk(filename, file, chunk_size) ⇒ Object
- #remove_content(checksum) ⇒ Object
- #remove_directory(dir_to_remove, server) ⇒ Object
  Removes all instance records located under the input param dir_to_remove.
- #remove_instance(server, path) ⇒ Object
  Removes an instance record from both @contents_info and @instances_info.
- #reset_load_from_file(file_name, file_io, err_msg) ⇒ Object
- #to_file(filename) ⇒ Object
  Writes content data to a file.
- #to_file_contents_chunk(file, contents_enum, chunk_size) ⇒ Object
- #to_file_instances_chunk(file, contents_enum, chunk_size) ⇒ Object
- #to_s ⇒ Object
- #unify_time ⇒ Object
  For each content, all time fields (content and instances) are replaced with the minimum time found among them.
- #unique_id ⇒ ID
  Content Data unique identification.
- #validate(params = nil) ⇒ Boolean
  Validates the index against the file system, checking that all instances hold correct data about the files they represent.
Constructor Details
#initialize(other = nil) ⇒ ContentData
Returns a new instance of ContentData.
# File 'lib/content_data/content_data.rb', line 34
def initialize(other = nil)
  if other.nil?
    @contents_info = {}  # Checksum --> [size, paths-->time(instance), time(content)]
    @instances_info = {}  # location --> checksum to optimize instances query
  else
    @contents_info = other.clone_contents_info
    @instances_info = other.clone_instances_info  # location --> checksum to optimize instances query
  end
end
Instance Method Details
#==(other) ⇒ Object
# File 'lib/content_data/content_data.rb', line 226
def ==(other)
  return false if other.nil?
  return false if @contents_info.length != other.contents_size
  other.each_instance { |checksum, size, content_mod_time, instance_mod_time, server, path|
    return false if instance_exists(path, server) != other.instance_exists(path, server)
    local_content_info = @contents_info[checksum]
    return false if local_content_info.nil?
    return false if local_content_info[0] != size
    return false if local_content_info[2] != content_mod_time
    #check instances
    local_instances = local_content_info[1]
    return false if other.checksum_instances_size(checksum) != local_instances.length
    location = [server, path]
    local_instance_mod_time = local_instances[location]
    return false if local_instance_mod_time.nil?
    return false if local_instance_mod_time != instance_mod_time
  }
  true
end
#add_instance(checksum, size, server, path, modification_time) ⇒ Object
# File 'lib/content_data/content_data.rb', line 152
def add_instance(checksum, size, server, path, modification_time)
  location = [server, path]

  # file was changed but remove_instance was not called
  if (@instances_info.include?(location) && @instances_info[location] != checksum)
    Log.warning("#{server}:#{path} file already exists with different checksum")
    remove_instance(server, path)
  end

  content_info = @contents_info[checksum]
  if content_info.nil?
    @contents_info[checksum] = [size,
                                {location => modification_time},
                                modification_time]
  else
    if size != content_info[0]
      Log.warning('File size different from content size while same checksum')
      Log.warning("instance location:server:'#{location[0]}' path:'#{location[1]}'")
      Log.warning("instance mod time:'#{modification_time}'")
    end
    #override file if needed
    content_info[0] = size
    instances = content_info[1]
    instances[location] = modification_time
  end
  @instances_info[location] = checksum
end
#checksum_instances_size(checksum) ⇒ Object
# File 'lib/content_data/content_data.rb', line 139
def checksum_instances_size(checksum)
  content_info = @contents_info[checksum]
  return 0 if content_info.nil?
  content_info[1].length
end
#clone_contents_info ⇒ Object
# File 'lib/content_data/content_data.rb', line 60
def clone_contents_info
  clone_contents_info = {}
  contents_info_enum = @contents_info.each_key
  loop {
    checksum = contents_info_enum.next rescue break
    instances = @contents_info[checksum]
    size = instances[0]
    content_time = instances[2]
    instances_db = instances[1]
    instances_db_cloned = {}
    instances_db_enum = instances_db.each_key
    loop {
      location = instances_db_enum.next rescue break
      instance_mtime = instances_db[location]
      instances_db_cloned[[location[0].clone,location[1].clone]]=instance_mtime
    }
    clone_contents_info[checksum] = [size,
                                     instances_db_cloned,
                                     content_time]
  }
  clone_contents_info
end
#clone_instances_info ⇒ Object
# File 'lib/content_data/content_data.rb', line 50
def clone_instances_info
  clone_instances_info = {}
  instances_info_enum = @instances_info.each_key
  loop {
    location = instances_info_enum.next rescue break
    clone_instances_info[[location[0].clone, location[1].clone]] = @instances_info[location].clone
  }
  clone_instances_info
end
#content_each_instance(checksum, &block) ⇒ Object
Iterator over the instances of a specific content. The block is provided with: checksum, size, content modification time, instance modification time, server and file path.
# File 'lib/content_data/content_data.rb', line 118
def content_each_instance(checksum, &block)
  content_info = @contents_info[checksum]
  instances_db_enum = content_info[1].each_key
  loop {
    location = instances_db_enum.next rescue break
    # provide the block with: checksum, size, content modification time,instance modification time,
    #   server and path.
    instance_modification_time = content_info[1][location]
    block.call(checksum,content_info[0], content_info[2], instance_modification_time,
               location[0], location[1])
  }
end
#content_exists(checksum) ⇒ Object
# File 'lib/content_data/content_data.rb', line 184
def content_exists(checksum)
  @contents_info.has_key?(checksum)
end
#contents_size ⇒ Object
# File 'lib/content_data/content_data.rb', line 131
def contents_size()
  @contents_info.length
end
#each_content(&block) ⇒ Object
Iterator over the @contents_info data structure (not including instances). The block is provided with: checksum, size and content modification time.
# File 'lib/content_data/content_data.rb', line 85
def each_content(&block)
  contents_enum = @contents_info.each_key
  loop {
    checksum = contents_enum.next rescue break
    content_val = @contents_info[checksum]
    # provide checksum, size and content modification time to the block
    block.call(checksum,content_val[0], content_val[2])
  }
end
#each_instance(&block) ⇒ Object
Iterator over the @contents_info data structure (including instances). The block is provided with: checksum, size, content modification time, instance modification time, server and file path.
# File 'lib/content_data/content_data.rb', line 98
def each_instance(&block)
  contents_enum = @contents_info.each_key
  loop {
    checksum = contents_enum.next rescue break
    content_info = @contents_info[checksum]
    content_info_enum = content_info[1].each_key
    loop {
      location = content_info_enum.next rescue break
      # provide the block with: checksum, size, content modification time,instance modification time,
      #   server and path.
      instance_modification_time = content_info[1][location]
      block.call(checksum,content_info[0], content_info[2], instance_modification_time,
                 location[0], location[1])
    }
  }
end
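The traversal in each_instance amounts to a nested walk over the content hash and each content's instance hash. A minimal sketch on a plain hash follows; `each_instance_sketch` and the sample data are hypothetical, and it uses block destructuring instead of the explicit enumerators in the listing.

```ruby
# Hypothetical helper: walks a hash shaped like @contents_info and yields the
# same six values, in the same order, that each_instance passes to its block.
def each_instance_sketch(contents_info)
  contents_info.each do |checksum, (size, instances, content_mtime)|
    instances.each do |(server, path), instance_mtime|
      yield checksum, size, content_mtime, instance_mtime, server, path
    end
  end
end

contents = {
  'c1' => [5, { ['s', '/a'] => 100, ['s', '/b'] => 90 }, 100]
}

each_instance_sketch(contents) do |checksum, size, content_mtime, instance_mtime, server, path|
  puts "#{checksum} size=#{size} content=#{content_mtime} instance=#{instance_mtime} #{server}:#{path}"
end
```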
#empty? ⇒ Boolean
# File 'lib/content_data/content_data.rb', line 180
def empty?
  @contents_info.empty?
end
#from_file(filename) ⇒ Object
Loads the db from a file, using chunks for better memory performance. TODO: validation that the file indeed contains ContentData is missing.
# File 'lib/content_data/content_data.rb', line 331
def from_file(filename)
  # read first line (number of contents)
  # calculate line number (number of instances)
  # read number of instances.
  # loop over instances lines (using chunks) and add instances

  File.open(filename, 'r') { |file|
    # Get number of contents (at first line)
    number_of_contents = file.gets  # this gets the next line or return nil at EOF
    unless (number_of_contents and number_of_contents.match(/^[\d]+$/))  # check that line is of Number format
      return reset_load_from_file(filename, file, "number of contents should be a number. We got:#{number_of_contents}")
    end
    number_of_contents = number_of_contents.to_i
    # advance file lines over all contents. We need only the instances data to build the content data object
    # use chunks and GC
    contents_chunks = number_of_contents / CHUNK_SIZE
    contents_chunks += 1 if (contents_chunks * CHUNK_SIZE < number_of_contents)
    chunk_index = 0
    while chunk_index < contents_chunks
      chunk_size = CHUNK_SIZE
      if chunk_index + 1 == contents_chunks
        # update last chunk size
        chunk_size = number_of_contents - (chunk_index * CHUNK_SIZE)
      end
      return unless read_contents_chunk(filename, file, chunk_size)
      GC.start
      chunk_index += 1
    end

    # get number of instances
    number_of_instances = file.gets
    unless (number_of_instances and number_of_instances.match(/^[\d]+$/))  # check that line is of Number format
      return reset_load_from_file(filename, file, "number of instances should be a Number. We got:#{number_of_instances}")
    end
    number_of_instances = number_of_instances.to_i
    # read in instances chunks and GC
    instances_chunks = number_of_instances / CHUNK_SIZE
    instances_chunks += 1 if (instances_chunks * CHUNK_SIZE < number_of_instances)
    chunk_index = 0
    while chunk_index < instances_chunks
      chunk_size = CHUNK_SIZE
      if chunk_index + 1 == instances_chunks
        # update last chunk size
        chunk_size = number_of_instances - (chunk_index * CHUNK_SIZE)
      end
      return unless read_instances_chunk(filename, file, chunk_size)
      GC.start
      chunk_index += 1
    end
  }
end
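The chunk arithmetic used in from_file (a ceiling division, with the last chunk shortened to the remainder) can be isolated as follows. `chunk_sizes` and `CHUNK_SIZE_SKETCH` are hypothetical names introduced for this illustration only.

```ruby
CHUNK_SIZE_SKETCH = 5000  # mirrors the documented CHUNK_SIZE constant

# Returns the size of each chunk needed to cover `total` lines,
# following the same steps as the loops in from_file.
def chunk_sizes(total, chunk_size = CHUNK_SIZE_SKETCH)
  chunks = total / chunk_size
  chunks += 1 if chunks * chunk_size < total
  (0...chunks).map do |i|
    # only the last chunk may be shorter than chunk_size
    i + 1 == chunks ? total - i * chunk_size : chunk_size
  end
end

p chunk_sizes(12_000)  # => [5000, 5000, 2000]
p chunk_sizes(10_000)  # => [5000, 5000]
p chunk_sizes(0)       # => []
```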
#get_instance_mod_time(checksum, location) ⇒ Object
# File 'lib/content_data/content_data.rb', line 145
def get_instance_mod_time(checksum, location)
  content_info = @contents_info[checksum]
  return nil if content_info.nil?
  instances = content_info[1]
  instance_time = instances[location]
end
#get_query(variable, params) ⇒ Object
This method is experimental and shouldn't be used. nil is used to define +/- infinity for the from/to method arguments; from/to values are exclusive in the condition calculations. Take care with the '==' operation, which is used for object comparison: if needed, the user should define their own '==' implementation. TODO: simplify conditions.
# File 'lib/content_data/content_data.rb', line 592
def get_query(variable, params)
  raise RuntimeError.new 'This method is experimental and shouldn\'t be used'

  exact = params['exact'].nil? ? Array.new : params['exact']
  from = params['from']
  to = params ['to']
  is_inside = params['is_inside']

  unless ContentInstance.new.instance_variable_defined?("@#{attribute}")
    raise ArgumentError "#{variable} isn't a ContentInstance variable"
  end

  if (exact.nil? && from.nil? && to.nil?)
    raise ArgumentError 'At least one of the argiments {exact, from, to} must be defined'
  end

  if (!(from.nil? || to.nil?) && from.kind_of?(to.class))
    raise ArgumentError 'to and from arguments should be comparable one with another'
  end

  # FIXME add support for from/to for Strings
  if ((!from.nil? && !from.kind_of?(Numeric.new.class))\
      || (!to.nil? && to.kind_of?(Numeric.new.class)))
    raise ArgumentError 'from and to options supported only for numeric values'
  end

  if (!exact.empty? && (!from.nil? || !to.nil?))
    raise ArgumentError 'exact and from/to options are mutually exclusive'
  end

  result_index = ContentData.new
  instances.each_value do |instance|
    is_match = false
    var_value = instance.instance_variable_get("@#{variable}")

    if exact.include? var_value
      is_match = true
    elsif (from.nil? || var_value > from) && (to.nil? || var_value < to)
      is_match = true
    end

    if (is_match && is_inside) || (!is_match && !is_inside)
      checksum = instance.checksum
      result_index.add_content(contents[checksum]) unless result_index.content_exists(checksum)
      result_index.add_instance instance
    end
  end
  result_index
end
#instance_exists(path, server) ⇒ Object
# File 'lib/content_data/content_data.rb', line 188
def instance_exists(path, server)
  @instances_info.has_key?([server, path])
end
#instances_size ⇒ Object
# File 'lib/content_data/content_data.rb', line 135
def instances_size()
  @instances_info.length
end
#read_contents_chunk(filename, file, chunk_size) ⇒ Object
# File 'lib/content_data/content_data.rb', line 383
def read_contents_chunk(filename, file, chunk_size)
  chunk_index = 0
  while chunk_index < chunk_size
    return reset_load_from_file(filename, file, "Expecting content line but " +
                                "reached end of file after line #{$.}") unless file.gets
    chunk_index += 1
  end
  true
end
#read_instances_chunk(filename, file, chunk_size) ⇒ Object
# File 'lib/content_data/content_data.rb', line 393
def read_instances_chunk(filename, file, chunk_size)
  chunk_index = 0
  while chunk_index < chunk_size
    instance_line = file.gets
    return reset_load_from_file(filename, file, "Expected to read Instance line but reached EOF") unless instance_line

    parameters = instance_line.split(',')
    # bugfix: if file name consist a comma then parsing based on comma separating fails
    if (parameters.size > 5)
      (4..parameters.size-2).each do |i|
        parameters[3] = [parameters[3], parameters[i]].join(",")
      end
      (4..parameters.size-2).each do |i|
        parameters.delete_at(4)
      end
    end

    add_instance(parameters[0],
                 parameters[1].to_i,
                 parameters[2],
                 parameters[3],
                 parameters[4].to_i)
    chunk_index += 1
  end
  true
end
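The comma handling in read_instances_chunk treats the first three fields and the last field as fixed and folds everything in between back into the path. A compact sketch of that idea is shown below; `parse_instance_line` is a hypothetical helper, not part of the library, and like the listing it assumes that server names never contain commas.

```ruby
# Parses "checksum,size,server,path,mtime" where the path may contain commas.
# Returns [checksum, size, server, path, mtime], or nil for a malformed line.
def parse_instance_line(line)
  fields = line.chomp.split(',')
  return nil if fields.size < 5
  checksum = fields[0]
  size     = fields[1].to_i
  server   = fields[2]
  path     = fields[3..-2].join(',')  # re-join the pieces of a comma-containing path
  mtime    = fields[-1].to_i
  [checksum, size, server, path, mtime]
end

p parse_instance_line("abc,10,srv,/tmp/a,b.txt,1700\n")
# => ["abc", 10, "srv", "/tmp/a,b.txt", 1700]
```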
#remove_content(checksum) ⇒ Object
# File 'lib/content_data/content_data.rb', line 246
def remove_content(checksum)
  content_info = @contents_info[checksum]
  if content_info
    content_info[1].each_key { |location|
      @instances_info.delete(location)
    }
    @contents_info.delete(checksum)
  end
end
#remove_directory(dir_to_remove, server) ⇒ Object
Removes all instance records located under the input param dir_to_remove. Found records are removed from both @contents_info and @instances_info. The input params server and dir_to_remove are checked against each instance's unique key (called location). Contents that become empty after their instances are removed are removed as well.
# File 'lib/content_data/content_data.rb', line 210
def remove_directory(dir_to_remove, server)
  contents_enum = @contents_info.each_key
  loop {
    checksum = contents_enum.next rescue break
    instances = @contents_info[checksum][1]
    instances.each_key { |location|
      if location[0] == server and location[1].scan(dir_to_remove).size > 0
        instances.delete(location)
        @instances_info.delete(location)
      end
    }
    @contents_info.delete(checksum) if instances.empty?
  }
end
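The same removal can be sketched on plain hashes with Hash#delete_if. `remove_directory_sketch` and the sample data are hypothetical. The sketch uses String#include?, which matches the literal-substring behaviour of `scan` with a String argument in the listing, so a directory name is matched anywhere in the path, not only as a prefix.

```ruby
# contents_info:  checksum => [size, { [server, path] => mtime }, content_mtime]
# instances_info: [server, path] => checksum
def remove_directory_sketch(contents_info, instances_info, dir, server)
  contents_info.delete_if do |_checksum, (_size, instances, _content_mtime)|
    instances.delete_if do |(inst_server, path), _mtime|
      hit = inst_server == server && path.include?(dir)
      instances_info.delete([inst_server, path]) if hit
      hit
    end
    instances.empty?  # drop the content once its last instance is gone
  end
end

contents = {
  'c1' => [5, { ['s1', '/data/a/f'] => 10, ['s1', '/keep/f'] => 11 }, 10],
  'c2' => [7, { ['s1', '/data/a/g'] => 12 }, 12],
  'c3' => [9, { ['s2', '/data/a/h'] => 13 }, 13]
}
index = {}
contents.each { |checksum, (_size, instances, _t)| instances.each_key { |loc| index[loc] = checksum } }

remove_directory_sketch(contents, index, '/data/a', 's1')
p contents.keys  # => ["c1", "c3"]
p index.keys     # => [["s1", "/keep/f"], ["s2", "/data/a/h"]]
```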
#remove_instance(server, path) ⇒ Object
Removes an instance record from both @contents_info and @instances_info. The input params server and path form the instance's unique key (called location). The content is removed as well if it becomes empty after the instance is removed.
# File 'lib/content_data/content_data.rb', line 195
def remove_instance(server, path)
  location = [server, path]
  checksum = @instances_info[location]
  content_info = @contents_info[checksum]
  return nil if content_info.nil?
  instances = content_info[1]
  instances.delete(location)
  @contents_info.delete(checksum) if instances.empty?
  @instances_info.delete(location)
end
#reset_load_from_file(file_name, file_io, err_msg) ⇒ Object
# File 'lib/content_data/content_data.rb', line 419
def reset_load_from_file(file_name, file_io, err_msg)
  Log.error("unexpected error reading file:#{file_name}\nError message:#{err_msg}")
  @contents_info = {}  # Checksum --> [size, paths-->time(instance), time(content)]
  @instances_info = {}  # location --> checksum to optimize instances query
  file_io.close
  nil
end
#to_file(filename) ⇒ Object
Writes content data to a file. Writing is done in chunks (for both contents and instances) to maximize the effect of GC: the temporary memory of each chunk is garbage-collected. Each chunk is written at a deeper stack level; otherwise GC keeps the temporary objects alive as part of the stack context.
# File 'lib/content_data/content_data.rb', line 277
def to_file(filename)
  content_data_dir = File.dirname(filename)
  FileUtils.makedirs(content_data_dir) unless File.directory?(content_data_dir)
  File.open(filename, 'w') { |file|
    file.write("#{@contents_info.length}\n")
    contents_enum = @contents_info.each_key
    content_chunks = @contents_info.length / CHUNK_SIZE + 1
    chunks_counter = 0
    while chunks_counter < content_chunks
      to_file_contents_chunk(file,contents_enum, CHUNK_SIZE)
      GC.start
      chunks_counter += 1
    end
    file.write("#{@instances_info.length}\n")
    contents_enum = @contents_info.each_key
    chunks_counter = 0
    while chunks_counter < content_chunks
      to_file_instances_chunk(file,contents_enum, CHUNK_SIZE)
      GC.start
      chunks_counter += 1
    end
  }
end
#to_file_contents_chunk(file, contents_enum, chunk_size) ⇒ Object
# File 'lib/content_data/content_data.rb', line 301
def to_file_contents_chunk(file, contents_enum, chunk_size)
  chunk_counter = 0
  while chunk_counter < chunk_size
    checksum = contents_enum.next rescue return
    content_info = @contents_info[checksum]
    file.write("#{checksum},#{content_info[0]},#{content_info[2]}\n")
    chunk_counter += 1
  end
end
#to_file_instances_chunk(file, contents_enum, chunk_size) ⇒ Object
# File 'lib/content_data/content_data.rb', line 311
def to_file_instances_chunk(file, contents_enum, chunk_size)
  chunk_counter = 0
  while chunk_counter < chunk_size
    checksum = contents_enum.next rescue return
    content_info = @contents_info[checksum]
    instances_db_enum = content_info[1].each_key
    loop {
      location = instances_db_enum.next rescue break
      # provide the block with: checksum, size, content modification time,instance modification time,
      #   server and path.
      instance_modification_time = content_info[1][location]
      file.write("#{checksum},#{content_info[0]},#{location[0]},#{location[1]},#{instance_modification_time}\n")
    }
    chunk_counter += 1
    break if chunk_counter == chunk_size
  end
end
#to_s ⇒ Object
# File 'lib/content_data/content_data.rb', line 256
def to_s
  return_str = ""
  contents_str = ""
  instances_str = ""
  each_content { |checksum, size, content_mod_time|
    contents_str << "%s,%d,%d\n" % [checksum, size, content_mod_time]
  }
  each_instance { |checksum, size, content_mod_time, instance_mod_time, server, path|
    instances_str << "%s,%d,%s,%s,%d\n" % [checksum, size, server, path, instance_mod_time]
  }
  return_str << "%d\n" % [@contents_info.length]
  return_str << contents_str
  return_str << "%d\n" % [@instances_info.length]
  return_str << instances_str
  return_str
end
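The text layout produced by to_s is: a content count, one line per content, an instance count, then one line per instance. The chunk writers used by to_file emit the same line shapes. A sketch on a plain hash follows; `serialize_sketch` and the sample data are hypothetical.

```ruby
# contents_info: checksum => [size, { [server, path] => mtime }, content_mtime]
def serialize_sketch(contents_info)
  content_lines = contents_info.map do |checksum, (size, _instances, content_mtime)|
    "#{checksum},#{size},#{content_mtime}"
  end
  instance_lines = contents_info.flat_map do |checksum, (size, instances, _content_mtime)|
    instances.map { |(server, path), mtime| "#{checksum},#{size},#{server},#{path},#{mtime}" }
  end
  lines = [content_lines.length] + content_lines + [instance_lines.length] + instance_lines
  lines.join("\n") + "\n"
end

contents = { 'c1' => [5, { ['s', '/a'] => 100, ['s', '/b'] => 90 }, 100] }
print serialize_sketch(contents)
# 1
# c1,5,100
# 2
# c1,5,s,/a,100
# c1,5,s,/b,90
```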
#unify_time ⇒ Object
For each content, all time fields (content and instances) are replaced with the minimum time found among them.
# File 'lib/content_data/content_data.rb', line 429
def unify_time()
  contents_enum = @contents_info.each_key
  loop {
    checksum = contents_enum.next rescue break
    content_info = @contents_info[checksum]
    min_time_per_checksum = content_info[2]
    instances = content_info[1]
    instances_enum = instances.each_key
    loop {
      location = instances_enum.next rescue break
      instance_mod_time = instances[location]
      if instance_mod_time < min_time_per_checksum
        min_time_per_checksum = instance_mod_time
      end
    }
    # update all instances with min time
    instances_enum = instances.each_key
    loop {
      location = instances_enum.next rescue break
      instances[location] = min_time_per_checksum
    }
    # update content time with min time
    content_info[2] = min_time_per_checksum
  }
end
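The effect of unify_time on one content can be sketched on a plain hash: take the minimum over the content time and all instance times, then write it back everywhere. `unify_time_sketch` and the sample data are hypothetical.

```ruby
# contents_info: checksum => [size, { [server, path] => mtime }, content_mtime]
def unify_time_sketch(contents_info)
  contents_info.each_value do |entry|
    instances = entry[1]
    min_time = ([entry[2]] + instances.values).min
    # iterate over a copy of the keys while overwriting the values
    instances.keys.each { |location| instances[location] = min_time }
    entry[2] = min_time
  end
end

contents = { 'c1' => [5, { ['s', '/a'] => 100, ['s', '/b'] => 90 }, 100] }
unify_time_sketch(contents)
p contents['c1']
# => [5, {["s", "/a"]=>90, ["s", "/b"]=>90}, 90]   (hash formatting varies by Ruby version)
```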
#unique_id ⇒ ID
Content Data unique identification.
# File 'lib/content_data/content_data.rb', line 46
def unique_id
  @instances_info.hash
end
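Since unique_id is simply Hash#hash of @instances_info, it reflects only the location-to-checksum mapping, not sizes or modification times. Ruby computes Hash#hash from the hash's content, and String hashing is seeded per process, so the value is likely only comparable within a single run. The sketch below uses hypothetical sample hashes to show the content-based behaviour.

```ruby
# Two separately built hashes with the same location => checksum content.
a = { ['server_a', '/data/f1'] => 'abc123' }
b = { ['server_a', '/data/f1'] => 'abc123' }
# Same location, different checksum.
c = { ['server_a', '/data/f1'] => 'def456' }

puts a.hash == b.hash  # equal content gives equal hash values
puts a.hash == c.hash  # different content is expected (not guaranteed) to differ
```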
#validate(params = nil) ⇒ Boolean
Validates the index against the file system, checking that all instances hold correct data about the files they represent.
There are two levels of validation, controlled by the instance_check_level system parameter:
- shallow - quick; tests the instance for file existence and attributes.
- deep - can take more time; in addition to the shallow checks, recalculates the hash sum.
# File 'lib/content_data/content_data.rb', line 467
def validate(params = nil)
  # used to answer whether specific param was set
  param_exists = Proc.new do |param|
    !(params.nil? || params[param].nil?)
  end

  # used to process method parameters centrally
  process_params = Proc.new do |values|
    if param_exists.call(:failed)
      info = values[:details]
      unless info.nil?
        checksum = info[0]
        content_mtime = info[1]
        size = info[2]
        inst_mtime = info[3]
        server = info[4]
        file_path = info[5]
        params[:failed].add_instance(checksum, size, server, file_path, inst_mtime)
      end
    end
  end

  is_valid = true
  contents_enum = @contents_info.each_key
  loop {
    checksum = contents_enum.next rescue break
    instances = @contents_info[checksum]
    content_size = instances[0]
    content_mtime = instances[2]
    instances_enum = instances[1].each_key
    loop {
      unique_path = instances_enum.next rescue break
      instance_mtime = instances[1][unique_path]
      instance_info = [checksum, content_mtime, content_size, instance_mtime]
      instance_info.concat(unique_path)
      unless check_instance(instance_info)
        is_valid = false

        unless params.nil? || params.empty?
          process_params.call({:details => instance_info})
        end
      end
    }
  }
  is_valid
end
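The check_instance helper called by validate is not shown in this listing. As an assumption about what a shallow check might involve (file existence plus size and modification time), here is a minimal sketch; `shallow_check_sketch` is hypothetical and is not the library's implementation.

```ruby
require 'tempfile'

# Hypothetical shallow check: the file must exist and match the recorded
# size and modification time (compared as integer seconds).
def shallow_check_sketch(path, expected_size, expected_mtime)
  return false unless File.exist?(path)
  File.size(path) == expected_size && File.mtime(path).to_i == expected_mtime
end

Tempfile.create('cd_sketch') do |f|
  f.write('hello')
  f.flush
  mtime = File.mtime(f.path).to_i
  puts shallow_check_sketch(f.path, 5, mtime)  # recorded attributes match
  puts shallow_check_sketch(f.path, 6, mtime)  # recorded size is wrong
end
```

A deep check would additionally recompute the file's checksum and compare it with the recorded one, which is why it can take considerably longer.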