Class: Ole::Storage
- Inherits: Object
  - Object
  - Ole::Storage
- Defined in: lib/ole/storage.rb, lib/ole/file_system.rb, lib/ole/property_set.rb
Overview
Introduction
Ole::Storage is a class intended to abstract away the details of accessing OLE2 structured storage files, such as those produced by Microsoft Office, e.g. *.doc and *.msg files.
Usage
Usage should be fairly straightforward:
# get the parent ole storage object
ole = Ole::Storage.open 'myfile.msg', 'r+'
# => #<Ole::Storage io=#<File:myfile.msg> root=#<Dirent:"Root Entry">>
# read some data
ole.root[1].read 4
# => "\001\000\376\377"
# get the top level root object and output a tree structure for
# debugging
puts ole.root.to_tree
# =>
- #<Dirent:"Root Entry" size=3840 time="2006-11-03T00:52:53Z">
|- #<Dirent:"__nameid_version1.0" size=0 time="2006-11-03T00:52:53Z">
| |- #<Dirent:"__substg1.0_00020102" size=16 data="CCAGAAAAAADAAA...">
...
|- #<Dirent:"__substg1.0_8002001E" size=4 data="MTEuMA==">
|- #<Dirent:"__properties_version1.0" size=800 data="AAAAAAAAAAABAA...">
\- #<Dirent:"__recip_version1.0_#00000000" size=0 time="2006-11-03T00:52:53Z">
|- #<Dirent:"__substg1.0_0FF60102" size=4 data="AAAAAA==">
...
# write some data, and finish up (note that open is 'r+', so this overwrites
# but doesn't truncate)
ole.root["\001CompObj"].open { |f| f.write "blah blah" }
ole.close
Thanks
- The code contained in this project was initially based on chicago’s libole (source available at prdownloads.sf.net/chicago/ole.tgz).
- It was later augmented with some corrections by inspecting pole, and (purely for header definitions) gsf.
- The property set parsing code came from the Apache Java project POIFS.
- The excellent idea of using a pseudo file system style interface, by providing #file and #dir methods which mimic File and Dir, was borrowed (along with almost unchanged tests!) from Thomas Sondergaard’s rubyzip.
TODO
- The custom header cruft for Header and Dirent needs some love.
- I have a number of classes doing load/save combos: Header, AllocationTable, Dirent, and, in a manner of speaking, but arguably different, Storage itself. They have differing APIs which would be nice to rethink. AllocationTable::Big must be created aot now, as it is used for all subsequent reads.
Defined Under Namespace
Classes: AllocationTable, DirClass, Dirent, FileClass, FormatError, Header, PropertySetSectionProxy, RangesIOMigrateable, RangesIOResizeable
Constant Summary
- VERSION = '1.2.6'
Instance Attribute Summary
- #bbat ⇒ Object (readonly)
  Low level internals, you probably shouldn’t need to mess with these.
- #close_parent ⇒ Object (readonly)
  The underlying io object to/from which the ole object is serialized, whether we should close it, and whether it is writeable.
- #dirents ⇒ Object (readonly)
  The tree structure in its original flattened form.
- #header ⇒ Object (readonly)
  Low level internals, you probably shouldn’t need to mess with these.
- #io ⇒ Object (readonly)
  The underlying io object to/from which the ole object is serialized, whether we should close it, and whether it is writeable.
- #params ⇒ Object (readonly)
  options used at creation time.
- #root ⇒ Object (readonly)
  The top of the ole tree structure.
- #sb_file ⇒ Object (readonly)
  Low level internals, you probably shouldn’t need to mess with these.
- #sbat ⇒ Object (readonly)
  Low level internals, you probably shouldn’t need to mess with these.
- #writeable ⇒ Object (readonly)
  The underlying io object to/from which the ole object is serialized, whether we should close it, and whether it is writeable.
Class Method Summary
- .open(arg, mode = nil, params = {}) ⇒ Object
Instance Method Summary
- #bat_for_size(size) ⇒ Object
- #clear ⇒ Object
- #close ⇒ Object
- #dir ⇒ Object
- #dirent_from_path(path) ⇒ Object
  tries to get a dirent for path.
- #file ⇒ Object
- #flush ⇒ Object
  the flush method is the main “save” method.
- #initialize(arg, mode = nil, params = {}) ⇒ Storage (constructor)
  maybe include an option hash, and allow :close_parent => true, to be more general.
- #inspect ⇒ Object
- #load ⇒ Object
  load document from file.
- #repack(temp = :file) ⇒ Object
  could be useful with mis-behaving ole documents.
- #repack_using_io(temp_io) ⇒ Object
- #summary_information ⇒ Object (also: #summary_info)
  this will be changed to use with_property_set.
- #with_property_set(guid, filenames = nil) ⇒ Object
  i’m thinking - search for a property set in filenames containing a section with guid guid.
Constructor Details
#initialize(arg, mode = nil, params = {}) ⇒ Storage
maybe include an option hash, and allow :close_parent => true, to be more general. arg should be either a file name or an IO object, and the IO needs to be seekable.
# File 'lib/ole/storage.rb', line 86
def initialize arg, mode=nil, params={}
  params, mode = mode, nil if Hash === mode
  params = {:update_timestamps => true}.merge(params)
  @params = params

  # get the io object
  @close_parent, @io = if String === arg
    mode ||= 'rb'
    [true, open(arg, mode)]
  else
    raise ArgumentError, 'unable to specify mode string with io object' if mode
    [false, arg]
  end
  # do we have this file opened for writing? don't know of a better way to tell
  # (unless we parse the mode string in the open case)
  # hmmm, note that in ruby 1.9 this doesn't work anymore. which is all the more
  # reason to use mode string parsing when available, and fall back to something like
  # io.writeable? otherwise.
  @writeable = begin
    if mode
      IO::Mode.new(mode).writeable?
    else
      @io.flush
      true
    end
  rescue IOError
    false
  end
  # silence undefined warning in clear
  @sb_file = nil
  # if the io object has data, we should load it, otherwise start afresh
  # this should be based on the mode string rather.
  @io.size > 0 ? load : clear
end
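The argument handling in the constructor can be tried in isolation. The following is a minimal sketch, not part of the library: `normalize_open_args` is a hypothetical helper that mirrors only the first few lines of the constructor (swapping a hash passed in the mode position, defaulting the mode to 'rb' for paths, and rejecting a mode string alongside an io object) without opening anything.

```ruby
require 'stringio'

# Hypothetical helper (not in the library) mirroring the constructor's
# argument handling. Returns [close_parent, mode, params].
def normalize_open_args(arg, mode = nil, params = {})
  # allow the options hash to be passed in the mode position
  params, mode = mode, nil if Hash === mode
  params = { :update_timestamps => true }.merge(params)
  if String === arg
    # a path: the storage would open it itself, so it owns the handle
    [true, mode || 'rb', params]
  else
    raise ArgumentError, 'unable to specify mode string with io object' if mode
    # an existing io: the caller keeps ownership
    [false, nil, params]
  end
end

p normalize_open_args('myfile.msg')
p normalize_open_args(StringIO.new, { :update_timestamps => false })
```

The first element corresponds to the close_parent attribute: it is true only when the storage opened the file itself.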
Instance Attribute Details
#bbat ⇒ Object (readonly)
Low level internals, you probably shouldn’t need to mess with these
# File 'lib/ole/storage.rb', line 82
def bbat
  @bbat
end
#close_parent ⇒ Object (readonly)
The underlying io object to/from which the ole object is serialized, whether we should close it, and whether it is writeable
# File 'lib/ole/storage.rb', line 80
def close_parent
  @close_parent
end
#dirents ⇒ Object (readonly)
The tree structure in its original flattened form. only valid after #load, or #flush.
# File 'lib/ole/storage.rb', line 77
def dirents
  @dirents
end
#header ⇒ Object (readonly)
Low level internals, you probably shouldn’t need to mess with these
# File 'lib/ole/storage.rb', line 82
def header
  @header
end
#io ⇒ Object (readonly)
The underlying io object to/from which the ole object is serialized, whether we should close it, and whether it is writeable
# File 'lib/ole/storage.rb', line 80
def io
  @io
end
#params ⇒ Object (readonly)
options used at creation time
# File 'lib/ole/storage.rb', line 73
def params
  @params
end
#root ⇒ Object (readonly)
The top of the ole tree structure
# File 'lib/ole/storage.rb', line 75
def root
  @root
end
#sb_file ⇒ Object (readonly)
Low level internals, you probably shouldn’t need to mess with these
# File 'lib/ole/storage.rb', line 82
def sb_file
  @sb_file
end
#sbat ⇒ Object (readonly)
Low level internals, you probably shouldn’t need to mess with these
# File 'lib/ole/storage.rb', line 82
def sbat
  @sbat
end
#writeable ⇒ Object (readonly)
The underlying io object to/from which the ole object is serialized, whether we should close it, and whether it is writeable
# File 'lib/ole/storage.rb', line 80
def writeable
  @writeable
end
Class Method Details
.open(arg, mode = nil, params = {}) ⇒ Object
# File 'lib/ole/storage.rb', line 121
def self.open arg, mode=nil, params={}
  ole = new arg, mode, params
  if block_given?
    begin
      yield ole
    ensure
      ole.close
    end
  else
    ole
  end
end
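The block form of .open follows the same pattern as File.open: the object is closed when the block exits, even on an exception, and the block's value is returned. A minimal sketch of that pattern, using a hypothetical `Resource` class as a stand-in for the storage object:

```ruby
# Hypothetical stand-in for a storage object; only #close matters here.
class Resource
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  # Same shape as the class method above: yield and always close when a
  # block is given, otherwise hand the open object back to the caller.
  def self.open
    res = new
    return res unless block_given?
    begin
      yield res
    ensure
      res.close
    end
  end
end

seen = nil
result = Resource.open { |r| seen = r; :done }
p result        # the block's value
p seen.closed   # closed automatically after the block

manual = Resource.open
p manual.closed # still open; the caller must close it
manual.close
```

Without a block, the caller is responsible for calling close, as in the Usage example at the top of this page.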
Instance Method Details
#bat_for_size(size) ⇒ Object
# File 'lib/ole/storage.rb', line 358
def bat_for_size size
  # note >=, not > previously.
  size >= @header.threshold ? @bbat : @sbat
end
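The size check above picks the big or small allocation table for a stream. A small sketch of the boundary behaviour follows; the cutoff of 4096 bytes is an assumption for illustration (the real value is read from the file header as @header.threshold), and `table_for_size` is a hypothetical name.

```ruby
# Assumed cutoff for this sketch; the library reads it from the header.
THRESHOLD = 4096

# Returns which allocation table would serve a stream of the given size.
def table_for_size(size, threshold = THRESHOLD)
  # note >=: a stream exactly at the threshold uses the big table
  size >= threshold ? :bbat : :sbat
end

[0, 4095, 4096, 100_000].each do |size|
  puts "#{size} bytes -> #{table_for_size(size)}"
end
```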
#clear ⇒ Object
# File 'lib/ole/storage.rb', line 320
def clear
  # initialize to equivalent of loading an empty ole document.
  Log.warn 'creating new ole storage object on non-writable io' unless @writeable
  @header = Header.new
  @bbat = AllocationTable::Big.new self
  @root = Dirent.new self, :type => :root, :name => 'Root Entry'
  @dirents = [@root]
  @root.idx = 0
  @sb_file.close if @sb_file
  @sb_file = RangesIOResizeable.new @bbat, :first_block => AllocationTable::EOC
  @sbat = AllocationTable::Small.new self
  # throw everything else the hell away
  @io.truncate 0
end
#close ⇒ Object
# File 'lib/ole/storage.rb', line 192
def close
  @sb_file.close
  flush if @writeable
  @io.close if @close_parent
end
#dir ⇒ Object
# File 'lib/ole/file_system.rb', line 61
def dir
  @dir ||= DirClass.new self
end
#dirent_from_path(path) ⇒ Object
tries to get a dirent for path. return nil if it doesn’t exist (change it)
# File 'lib/ole/file_system.rb', line 67
def dirent_from_path path
  dirent = @root
  path = file.expand_path path
  path = path.sub(/^\/*/, '').sub(/\/*$/, '').split(/\/+/)
  until path.empty?
    return nil if dirent.file?
    return nil unless dirent = dirent/path.shift
  end
  dirent
end
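The path handling above strips leading and trailing slashes, splits on runs of slashes, and then descends one component at a time, giving up with nil when a component is missing or when it would have to descend into a file. A minimal sketch of the same walk over a plain nested hash; `TREE`, `split_path` and `lookup` are hypothetical names, with hashes standing in for directories and strings for files.

```ruby
# Hypothetical tree for illustration: hashes are directories, strings are files.
TREE = {
  '__nameid_version1.0' => { '__substg1.0_00020102' => 'data' },
  '__properties_version1.0' => 'props'
}

# Same normalisation as in the method above.
def split_path(path)
  path.sub(/^\/*/, '').sub(/\/*$/, '').split(/\/+/)
end

def lookup(tree, path)
  node = tree
  parts = split_path(path)
  until parts.empty?
    return nil unless node.is_a?(Hash) # cannot descend into a file
    node = node[parts.shift]
    return nil if node.nil?            # missing component
  end
  node
end

p split_path('/__nameid_version1.0//__substg1.0_00020102/')
p lookup(TREE, '/__nameid_version1.0/__substg1.0_00020102')
p lookup(TREE, '/__properties_version1.0/child')
```

An empty path, or one made only of slashes, yields no components, so the lookup returns the root itself.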
#file ⇒ Object
# File 'lib/ole/file_system.rb', line 57
def file
  @file ||= FileClass.new self
end
#flush ⇒ Object
the flush method is the main “save” method. all file contents are always written directly to the file by the RangesIO objects, all this method does is write out all the file meta data - dirents, allocation tables, file header etc.
maybe add an option to zero the padding, and any remaining avail blocks in the allocation table.
TODO: long and overly complex. simplify and test better. eg, perhaps move serialization of bbat to AllocationTable::Big.
# File 'lib/ole/storage.rb', line 208
def flush
  # update root dirent, and flatten dirent tree
  @root.name = 'Root Entry'
  @root.first_block = @sb_file.first_block
  @root.size = @sb_file.size
  @dirents = @root.flatten

  # serialize the dirents using the bbat
  RangesIOResizeable.open @bbat, 'w', :first_block => @header.dirent_start do |io|
    @dirents.each { |dirent| io.write dirent.to_s }
    padding = (io.size / @bbat.block_size.to_f).ceil * @bbat.block_size - io.size
    io.write 0.chr * padding
    @header.dirent_start = io.first_block
  end

  # serialize the sbat
  # perhaps the blocks used by the sbat should be marked with BAT?
  RangesIOResizeable.open @bbat, 'w', :first_block => @header.sbat_start do |io|
    io.write @sbat.to_s
    @header.sbat_start = io.first_block
    @header.num_sbat = @bbat.chain(@header.sbat_start).length
  end

  # create RangesIOResizeable hooked up to the bbat. use that to claim bbat blocks using
  # truncate. then when its time to write, convert that chain and some chunk of blocks at
  # the end, into META_BAT blocks. write out the chain, and those meta bat blocks, and its
  # done.
  # this is perhaps not good, as we reclaim all bat blocks here, which
  # may include the sbat we just wrote. FIXME
  @bbat.map! do |b|
    b == AllocationTable::BAT || b == AllocationTable::META_BAT ? AllocationTable::AVAIL : b
  end

  # currently we use a loop. this could be better, but basically,
  # the act of writing out the bat, itself requires blocks which get
  # recorded in the bat.
  #
  # i'm sure that there'd be some simpler closed form solution to this. solve
  # recursive func:
  #
  #   num_mbat_blocks = ceil(max((mbat_len - 109) * 4 / block_size, 0))
  #   bbat_len = initial_bbat_len + num_mbat_blocks
  #   mbat_len = ceil(bbat_len * 4 / block_size)
  #
  # the actual bbat allocation table is itself stored throughout the file, and that chain
  # is stored in the initial blocks, and the mbat blocks.
  num_mbat_blocks = 0
  io = RangesIOResizeable.new @bbat, 'w', :first_block => AllocationTable::EOC
  # truncate now, so that we can simplify size calcs - the mbat blocks will be appended in a
  # contiguous chunk at the end.
  # hmmm, i think this truncate should be matched with a truncate of the underlying io. if you
  # delete a lot of stuff, and free up trailing blocks, the file size never shrinks. this can
  # be fixed easily, add an io truncate
  @bbat.truncate!
  before = @io.size
  @io.truncate @bbat.block_size * (@bbat.length + 1)
  while true
    # get total bbat size. equivalent to @bbat.to_s.length, but for the factoring in of
    # the mbat blocks. we can't just add the mbat blocks directly to the bbat, as as this iteration
    # progresses, more blocks may be needed for the bat itself (if there are no more gaps), and the
    # mbat must remain contiguous.
    bbat_data_len = ((@bbat.length + num_mbat_blocks) * 4 / @bbat.block_size.to_f).ceil * @bbat.block_size
    # now storing the excess mbat blocks also increases the size of the bbat:
    new_num_mbat_blocks = ([bbat_data_len / @bbat.block_size - 109, 0].max * 4 / @bbat.block_size.to_f).ceil
    if new_num_mbat_blocks != num_mbat_blocks
      # need more space for the mbat.
      num_mbat_blocks = new_num_mbat_blocks
    elsif io.size != bbat_data_len
      # need more space for the bat
      # this may grow the bbat, depending on existing available blocks
      io.truncate bbat_data_len
    else
      break
    end
  end

  # now extract the info we want:
  ranges = io.ranges
  bbat_chain = @bbat.chain io.first_block
  # the extra mbat data is a set of contiguous blocks at the end
  io.close
  bbat_chain.each { |b| @bbat[b] = AllocationTable::BAT }
  # tack on the mbat stuff
  @header.mbat_start = @bbat.length # need to record this here before tacking on the mbat
  @header.num_bat = bbat_chain.length
  num_mbat_blocks.times { @bbat << AllocationTable::META_BAT }

  # now finally write the bbat, using a not resizable io.
  # the mode here will be 'r', which allows write atm.
  RangesIO.open(@io, :ranges => ranges) { |f| f.write @bbat.to_s }

  # this is the mbat. pad it out.
  bbat_chain += [AllocationTable::AVAIL] * [109 - bbat_chain.length, 0].max
  @header.num_mbat = num_mbat_blocks
  if num_mbat_blocks == 0
    @header.mbat_start = AllocationTable::EOC
  else
    # write out the mbat blocks now. first of all, where are they going to be?
    mbat_data = bbat_chain[109..-1]
    q = @bbat.block_size / 4
    mbat_data += [AllocationTable::AVAIL] *((mbat_data.length / q.to_f).ceil * q - mbat_data.length)
    ranges = @bbat.ranges((0...num_mbat_blocks).map { |i| @header.mbat_start + i })
    RangesIO.open(@io, :ranges => ranges) { |f| f.write mbat_data.pack('V*') }
  end

  # now seek back and write the header out
  @io.seek 0
  @io.write @header.to_s + bbat_chain[0, 109].pack('V*')
  @io.flush
end
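The loop in flush exists because the allocation table must describe the blocks it is itself stored in, so its size depends on its own size. The sketch below isolates that fixed-point iteration under simplifying assumptions that are not the library's: `bat_layout` is a hypothetical function, there are no free gaps to reuse, every bat and mbat block is appended at the end, and each table entry is a 32-bit integer (matching the `* 4 / block_size` arithmetic above). The constant 109 is the number of bat block indices the header holds, as used in the listing.

```ruby
HEADER_BAT_SLOTS = 109 # bat block indices that fit in the header itself

# Simplified model: `data_blocks` blocks are already in use, there are no
# free gaps, and blocks needed for the tables are appended at the end.
# Returns [bat_blocks, mbat_blocks] once the sizes stop changing.
def bat_layout(data_blocks, block_size = 512)
  entries_per_block = block_size / 4 # each entry is a 32-bit integer
  bat_blocks = 0
  mbat_blocks = 0
  loop do
    # every block needs one entry: data, bat and mbat blocks alike
    total = data_blocks + bat_blocks + mbat_blocks
    new_bat = (total / entries_per_block.to_f).ceil
    # bat block indices beyond the first 109 spill into mbat blocks
    new_mbat = ([new_bat - HEADER_BAT_SLOTS, 0].max / entries_per_block.to_f).ceil
    break if new_bat == bat_blocks && new_mbat == mbat_blocks
    bat_blocks, mbat_blocks = new_bat, new_mbat
  end
  [bat_blocks, mbat_blocks]
end

p bat_layout(100)    # => [1, 0]
p bat_layout(128)    # => [2, 0]
p bat_layout(20_000) # => [158, 1]
```

The second case shows why a single pass is not enough: 128 data blocks exactly fill one table block, but recording that table block needs a 129th entry, which forces a second table block.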
#inspect ⇒ Object
# File 'lib/ole/storage.rb', line 363
def inspect
  "#<#{self.class} io=#{@io.inspect} root=#{@root.inspect}>"
end
#load ⇒ Object
load document from file.
TODO: implement various allocationtable checks, maybe as a AllocationTable#fsck function :)
- reterminate any chain not ending in EOC. compare file size with actually allocated blocks per file.
- pass through all chain heads looking for collisions, and making sure nothing points to them (ie they are really heads). in both sbat and mbat
- we know the locations of the bbat data, and mbat data. ensure that there are placeholder blocks in the bat for them.
- maybe a check of excess data. if there is data outside the bbat.truncate.length + 1 * block_size, (eg what is used for truncate in #flush), then maybe add some sort of message about that. it will be automatically thrown away at close time.
# File 'lib/ole/storage.rb', line 144
def load
  # we always read 512 for the header block. if the block size ends up being different,
  # what happens to the 109 fat entries. are there more/less entries?
  @io.rewind
  header_block = @io.read 512
  @header = Header.new header_block

  # create an empty bbat.
  @bbat = AllocationTable::Big.new self
  mbat_blocks = (0...@header.num_mbat).map { |i| i + @header.mbat_start }
  bbat_chain = (header_block[Header::SIZE..-1] + @bbat.read(mbat_blocks)).unpack 'V*'
  # am i using num_bat in the right way?
  @bbat.load @bbat.read(bbat_chain[0, @header.num_bat])

  # get block chain for directories, read it, then split it into chunks and load the
  # directory entries. semantics changed - used to cut at first dir where dir.type == 0
  @dirents = @bbat.read(@header.dirent_start).scan(/.{#{Dirent::SIZE}}/mo).
    map { |str| Dirent.new self, str }.reject { |d| d.type_id == 0 }

  # now reorder from flat into a tree
  # links are stored in some kind of balanced binary tree
  # check that everything is visited at least, and at most once
  # similarly with the blocks of the file.
  # was thinking of moving this to Dirent.to_tree instead.
  class << @dirents
    def to_tree idx=0
      return [] if idx == Dirent::EOT
      d = self[idx]
      d.children = to_tree d.child
      raise FormatError, "directory #{d.inspect} used twice" if d.idx
      d.idx = idx
      to_tree(d.prev) + [d] + to_tree(d.next)
    end
  end

  @root = @dirents.to_tree.first
  Log.warn "root name was #{@root.name.inspect}" unless @root.name == 'Root Entry'
  unused = @dirents.reject(&:idx).length
  Log.warn "#{unused} unused directories" if unused > 0

  # FIXME i don't currently use @header.num_sbat which i should
  # hmm. nor do i write it. it means what exactly again?
  # which mode to use here?
  @sb_file = RangesIOResizeable.new @bbat, :first_block => @root.first_block, :size => @root.size
  @sbat = AllocationTable::Small.new self
  @sbat.load @bbat.read(@header.sbat_start)
end
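The to_tree step in load turns the flat directory array into a tree: each entry points to a previous sibling, a next sibling and a first child, and an in-order walk of the sibling links yields the children in order. A minimal sketch with hypothetical types: `Entry` is a plain Struct rather than the library's Dirent, and `EOT` is set to -1 here purely for illustration. Unlike the listing, the sketch checks for reuse before recursing, so that a cyclic link raises instead of recursing without end.

```ruby
EOT = -1 # hypothetical "no entry" marker for this sketch

Entry = Struct.new(:name, :prev, :next, :child, :children, :idx)

# In-order walk of the sibling tree rooted at entries[idx]. Fills in each
# entry's children list and raises if an entry is reached twice.
def to_tree(entries, idx = 0)
  return [] if idx == EOT
  d = entries[idx]
  raise ArgumentError, "entry #{d.name.inspect} used twice" if d.idx
  d.idx = idx
  d.children = to_tree(entries, d.child)
  to_tree(entries, d.prev) + [d] + to_tree(entries, d.next)
end

entries = [
  Entry.new('Root Entry', EOT, EOT, 2), # child link points at 'b'
  Entry.new('a', EOT, EOT, EOT),
  Entry.new('b', 1, 3, EOT),            # 'a' before it, 'c' after it
  Entry.new('c', EOT, EOT, EOT)
]

root = to_tree(entries).first
p root.children.map(&:name) # => ["a", "b", "c"]
```

Entries whose idx is still nil after the walk were never reached, which is what the "unused directories" warning in the listing counts.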
#repack(temp = :file) ⇒ Object
could be useful with mis-behaving ole documents. or to just clean them up.
# File 'lib/ole/storage.rb', line 336
def repack temp=:file
  case temp
  when :file
    Tempfile.open 'ole-repack' do |io|
      io.binmode
      repack_using_io io
    end
  when :mem;  StringIO.open(&method(:repack_using_io))
  else raise ArgumentError, "unknown temp backing #{temp.inspect}"
  end
end
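The temp argument only selects where the scratch copy lives: a temporary file on disk or an in-memory buffer. A minimal sketch of that dispatch using only the standard library; `with_temp_io` is a hypothetical helper that yields the scratch io to a block instead of repacking anything.

```ruby
require 'tempfile'
require 'stringio'

# Hypothetical helper: yields a scratch io of the requested kind and
# returns the block's value.
def with_temp_io(temp = :file)
  case temp
  when :file
    Tempfile.open('scratch') do |io|
      io.binmode
      yield io
    end
  when :mem
    StringIO.open { |io| yield io }
  else
    raise ArgumentError, "unknown temp backing #{temp.inspect}"
  end
end

[:file, :mem].each do |kind|
  echoed = with_temp_io(kind) do |io|
    io.write 'abc'
    io.rewind
    io.read
  end
  puts "#{kind}: #{echoed}"
end
```

The :mem backing avoids touching the disk, at the cost of holding the whole copy in memory.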
#repack_using_io(temp_io) ⇒ Object
# File 'lib/ole/storage.rb', line 348
def repack_using_io temp_io
  @io.rewind
  IO.copy @io, temp_io
  clear
  Storage.open temp_io, nil, @params do |temp_ole|
    #temp_ole.root.type = :dir
    Dirent.copy temp_ole.root, root
  end
end
#summary_information ⇒ Object Also known as: summary_info
this will be changed to use with_property_set
# File 'lib/ole/property_set.rb', line 152
def summary_information
  dirent = root["\005SummaryInformation"]
  dirent.open do |io|
    propset = Types::PropertySet.new(io)
    sections = propset.sections
    # this will maybe get wrapped up as
    # section = propset[guid]
    # maybe taking it one step further, i'd hide the section thing,
    # and let you use composite keys, like
    # propset[4, guid] eg in MAPI, and just propset.doc_author.
    section = sections.find do |s|
      s.guid == Types::PropertySet::FMTID_SummaryInformation
    end
    return PropertySetSectionProxy.new(dirent, sections.index(section))
  end
end
#with_property_set(guid, filenames = nil) ⇒ Object
i’m thinking - search for a property set in filenames containing a section with guid guid. then yield it. can read/write to it in the block. propsets themselves can have guids, but they are often all null.
# File 'lib/ole/property_set.rb', line 134
def with_property_set guid, filenames=nil
end