Class: Ole::Storage

Inherits:
Object
Defined in:
lib/ole/storage.rb,
lib/ole/file_system.rb,
lib/ole/property_set.rb

Overview

Introduction

Ole::Storage is a class intended to abstract away the details of accessing OLE2 structured storage files, such as those produced by Microsoft Office, e.g. *.doc and *.msg files.

Usage

Usage should be fairly straightforward:

# get the parent ole storage object
ole = Ole::Storage.open 'myfile.msg', 'r+'
# => #<Ole::Storage io=#<File:myfile.msg> root=#<Dirent:"Root Entry">>
# read some data
ole.root[1].read 4
# => "\001\000\376\377"
# get the top level root object and output a tree structure for
# debugging
puts ole.root.to_tree
# =>
- #<Dirent:"Root Entry" size=3840 time="2006-11-03T00:52:53Z">
  |- #<Dirent:"__nameid_version1.0" size=0 time="2006-11-03T00:52:53Z">
  |  |- #<Dirent:"__substg1.0_00020102" size=16 data="CCAGAAAAAADAAA...">
  ...
  |- #<Dirent:"__substg1.0_8002001E" size=4 data="MTEuMA==">
  |- #<Dirent:"__properties_version1.0" size=800 data="AAAAAAAAAAABAA...">
  \- #<Dirent:"__recip_version1.0_#00000000" size=0 time="2006-11-03T00:52:53Z">
     |- #<Dirent:"__substg1.0_0FF60102" size=4 data="AAAAAA==">
     ...
# write some data, and finish up (note that open is 'r+', so this overwrites
# but doesn't truncate)
ole.root["\001CompObj"].open { |f| f.write "blah blah" }
ole.close

Thanks

  • The code contained in this project was initially based on chicago’s libole (source available at prdownloads.sf.net/chicago/ole.tgz).

  • It was later augmented with some corrections by inspecting pole, and (purely for header definitions) gsf.

  • The property set parsing code came from the Apache Java project POIFS.

  • The excellent idea of a pseudo file system style interface, providing #file and #dir methods which mimic File and Dir, was borrowed (along with almost unchanged tests!) from Thomas Sondergaard’s rubyzip.

TODO

  • The custom header cruft for Header and Dirent needs some love.

  • A number of classes do load/save combos: Header, AllocationTable, Dirent and, in a manner of speaking but arguably differently, Storage itself. They have differing APIs, which would be nice to rethink. AllocationTable::Big must currently be created ahead of time, as it is used for all subsequent reads.

Defined Under Namespace

Classes: AllocationTable, DirClass, Dirent, FileClass, FormatError, Header, PropertySetSectionProxy, RangesIOMigrateable, RangesIOResizeable

Constant Summary

VERSION =
'1.2.6'

Constructor Details

#initialize(arg, mode = nil, params = {}) ⇒ Storage

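arg should be either a file name or an IO object; an IO object needs to be seekable. A mode string may only be given together with a file name (the default is 'rb'), and a params hash may be passed in place of mode. Possible future change: accept :close_parent => true in the options hash, to be more general.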



# File 'lib/ole/storage.rb', line 86

def initialize arg, mode=nil, params={}
	params, mode = mode, nil if Hash === mode
	params = {:update_timestamps => true}.merge(params)
	@params = params
	
	# get the io object
	@close_parent, @io = if String === arg
		mode ||= 'rb'
		[true, open(arg, mode)]
	else
		raise ArgumentError, 'unable to specify mode string with io object' if mode
		[false, arg]
	end
	# do we have this file opened for writing? don't know of a better way to tell
	# (unless we parse the mode string in the open case)
	# hmmm, note that in ruby 1.9 this doesn't work anymore. which is all the more
	# reason to use mode string parsing when available, and fall back to something like
	# io.writeable? otherwise.
	@writeable = begin
		if mode
			IO::Mode.new(mode).writeable?
		else
			@io.flush
			true
		end
	rescue IOError
		false
	end
	# silence undefined warning in clear
	@sb_file = nil
	# if the io object has data, we should load it, otherwise start afresh
	# this should be based on the mode string rather.
	@io.size > 0 ? load : clear
end
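
The argument handling at the top of the constructor can be tried out in isolation. The sketch below is only an illustration: normalize_open_args is a hypothetical helper, not part of the library, and it mirrors just the first few lines of #initialize without opening any file.

```ruby
require 'stringio'

# Hypothetical helper, not part of Ole::Storage: mirrors how #initialize
# interprets its arguments, without touching the file system.
def normalize_open_args(arg, mode = nil, params = {})
  # a Hash passed in the mode position is really the params hash
  params, mode = mode, nil if Hash === mode
  params = { :update_timestamps => true }.merge(params)

  if String === arg
    # file name: default to binary read-only, and we own the io we open
    mode ||= 'rb'
    close_parent = true
  else
    # io object: the caller already chose a mode when opening it
    raise ArgumentError, 'unable to specify mode string with io object' if mode
    close_parent = false
  end
  { :mode => mode, :params => params, :close_parent => close_parent }
end

p normalize_open_args('myfile.msg', :update_timestamps => false)
p normalize_open_args(StringIO.new)
```

Passing both an IO object and a mode string raises ArgumentError, which matches the check in the listing above.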

Instance Attribute Details

#bbat ⇒ Object (readonly)

Low-level internals; you probably shouldn’t need to mess with these.



# File 'lib/ole/storage.rb', line 82

def bbat
  @bbat
end

#close_parent ⇒ Object (readonly)

Whether the underlying io object should be closed when the storage is closed.



# File 'lib/ole/storage.rb', line 80

def close_parent
  @close_parent
end

#dirents ⇒ Object (readonly)

The tree structure in its original flattened form. Only valid after #load or #flush.



# File 'lib/ole/storage.rb', line 77

def dirents
  @dirents
end

#header ⇒ Object (readonly)

Low-level internals; you probably shouldn’t need to mess with these.



# File 'lib/ole/storage.rb', line 82

def header
  @header
end

#io ⇒ Object (readonly)

The underlying io object to/from which the ole object is serialized.



# File 'lib/ole/storage.rb', line 80

def io
  @io
end

#params ⇒ Object (readonly)

Options used at creation time.



# File 'lib/ole/storage.rb', line 73

def params
  @params
end

#root ⇒ Object (readonly)

The top of the ole tree structure



# File 'lib/ole/storage.rb', line 75

def root
  @root
end

#sb_file ⇒ Object (readonly)

Low-level internals; you probably shouldn’t need to mess with these.



# File 'lib/ole/storage.rb', line 82

def sb_file
  @sb_file
end

#sbat ⇒ Object (readonly)

Low-level internals; you probably shouldn’t need to mess with these.



# File 'lib/ole/storage.rb', line 82

def sbat
  @sbat
end

#writeable ⇒ Object (readonly)

Whether the underlying io object is writeable.



# File 'lib/ole/storage.rb', line 80

def writeable
  @writeable
end

Class Method Details

.open(arg, mode = nil, params = {}) ⇒ Object



# File 'lib/ole/storage.rb', line 121

def self.open arg, mode=nil, params={}
	ole = new arg, mode, params
	if block_given?
		begin   yield ole
		ensure; ole.close
		end
	else ole
	end
end
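
The block form follows the same convention as File.open: the object is yielded and then closed even if the block raises. A minimal sketch of that pattern, using a stand-in class (FakeStorage is hypothetical and does no I/O), might look like this:

```ruby
# Stand-in resource, purely illustrative; it is not Ole::Storage.
class FakeStorage
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  # Same shape as Ole::Storage.open: with a block, yield and always close;
  # without one, hand the still-open object back to the caller.
  def self.open
    storage = new
    return storage unless block_given?
    begin
      yield storage
    ensure
      storage.close
    end
  end
end

seen = nil
begin
  FakeStorage.open do |s|
    seen = s
    raise IOError, 'simulated failure inside the block'
  end
rescue IOError
  # the error still propagates, but the resource was closed first
end
puts seen.closed

unclosed = FakeStorage.open
puts unclosed.closed
```

Without a block the caller is responsible for calling close, as in the usage example in the overview.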

Instance Method Details

#bat_for_size(size) ⇒ Object



# File 'lib/ole/storage.rb', line 358

def bat_for_size size
	# note >=, not > previously.
	size >= @header.threshold ? @bbat : @sbat
end
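
The selection rule is a single comparison, with the boundary value going to the big allocation table. The sketch below assumes a cutoff of 4096 bytes, which is the usual mini-stream cutoff in OLE2 files; the library itself reads the value from @header.threshold. table_for_size is a hypothetical stand-in that returns a label instead of a table object.

```ruby
# Assumed cutoff for the sketch; the real value comes from the file header.
THRESHOLD = 4096

# Hypothetical stand-in for #bat_for_size: returns a label rather than an
# AllocationTable instance.
def table_for_size(size, threshold = THRESHOLD)
  # note >=: a stream of exactly threshold bytes uses the big table
  size >= threshold ? :bbat : :sbat
end

[0, 4095, 4096, 100_000].each do |size|
  puts "#{size} bytes -> #{table_for_size(size)}"
end
```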

#clear ⇒ Object



# File 'lib/ole/storage.rb', line 320

def clear
	# initialize to equivalent of loading an empty ole document.
	Log.warn 'creating new ole storage object on non-writable io' unless @writeable
	@header = Header.new
	@bbat = AllocationTable::Big.new self
	@root = Dirent.new self, :type => :root, :name => 'Root Entry'
	@dirents = [@root]
	@root.idx = 0
	@sb_file.close if @sb_file
	@sb_file = RangesIOResizeable.new @bbat, :first_block => AllocationTable::EOC
	@sbat = AllocationTable::Small.new self
	# throw everything else the hell away
	@io.truncate 0
end

#close ⇒ Object



# File 'lib/ole/storage.rb', line 192

def close
	@sb_file.close
	flush if @writeable
	@io.close if @close_parent
end

#dir ⇒ Object



# File 'lib/ole/file_system.rb', line 61

def dir
	@dir ||= DirClass.new self
end

#dirent_from_path(path) ⇒ Object

Tries to get the dirent for path. Returns nil if it doesn’t exist (this behaviour may change).



# File 'lib/ole/file_system.rb', line 67

def dirent_from_path path
	dirent = @root
	path = file.expand_path path
	path = path.sub(/^\/*/, '').sub(/\/*$/, '').split(/\/+/)
	until path.empty?
		return nil if dirent.file?
		return nil unless dirent = dirent/path.shift
	end
	dirent
end
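
The lookup above strips leading and trailing slashes, splits on runs of slashes, and walks down one name at a time. The same walk can be sketched on a toy tree. The Hash-based layout and the name entry_from_path are assumptions made for this example, and unlike the real method the sketch does not call expand_path first.

```ruby
# Toy tree: directories are Hashes, files are Strings. The library walks
# Dirent objects instead.
TREE = {
  '__nameid_version1.0' => { '__substg1.0_00020102' => 'file data' },
  '__properties_version1.0' => 'more file data'
}

# Hypothetical counterpart of #dirent_from_path for the toy tree.
def entry_from_path(tree, path)
  entry = tree
  parts = path.sub(/^\/*/, '').sub(/\/*$/, '').split(/\/+/)
  until parts.empty?
    # cannot descend into a file
    return nil unless Hash === entry
    # missing name: give up with nil rather than raising
    return nil unless (entry = entry[parts.shift])
  end
  entry
end

p entry_from_path(TREE, '/__nameid_version1.0//__substg1.0_00020102/')
p entry_from_path(TREE, '/__properties_version1.0/child')
p entry_from_path(TREE, '/missing')
```

An empty path, or one made only of slashes, yields no components, so the root itself is returned.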

#file ⇒ Object



# File 'lib/ole/file_system.rb', line 57

def file
	@file ||= FileClass.new self
end

#flush ⇒ Object

The flush method is the main “save” method. File contents are always written directly to the file by the RangesIO objects; all this method does is write out the file metadata: dirents, allocation tables, file header, etc.

Possible addition: an option to zero the padding and any remaining available blocks in the allocation table.

TODO: long and overly complex. Simplify and test better, e.g. perhaps move serialization of the bbat to AllocationTable::Big.



# File 'lib/ole/storage.rb', line 208

def flush
	# update root dirent, and flatten dirent tree
	@root.name = 'Root Entry'
	@root.first_block = @sb_file.first_block
	@root.size = @sb_file.size
	@dirents = @root.flatten

	# serialize the dirents using the bbat
	RangesIOResizeable.open @bbat, 'w', :first_block => @header.dirent_start do |io|
		@dirents.each { |dirent| io.write dirent.to_s }
		padding = (io.size / @bbat.block_size.to_f).ceil * @bbat.block_size - io.size
		io.write 0.chr * padding
		@header.dirent_start = io.first_block
	end

	# serialize the sbat
	# perhaps the blocks used by the sbat should be marked with BAT?
	RangesIOResizeable.open @bbat, 'w', :first_block => @header.sbat_start do |io|
		io.write @sbat.to_s
		@header.sbat_start = io.first_block
		@header.num_sbat = @bbat.chain(@header.sbat_start).length
	end

	# create RangesIOResizeable hooked up to the bbat. use that to claim bbat blocks using
	# truncate. then when its time to write, convert that chain and some chunk of blocks at
	# the end, into META_BAT blocks. write out the chain, and those meta bat blocks, and its
	# done.
	# this is perhaps not good, as we reclaim all bat blocks here, which
	# may include the sbat we just wrote. FIXME
	@bbat.map! do |b|
		b == AllocationTable::BAT || b == AllocationTable::META_BAT ?
			AllocationTable::AVAIL : b
	end
	
	# currently we use a loop. this could be better, but basically,
	# the act of writing out the bat, itself requires blocks which get
	# recorded in the bat.
	#
	# i'm sure that there'd be some simpler closed form solution to this. solve
	# recursive func:
	#
	#   num_mbat_blocks = ceil(max((mbat_len - 109) * 4 / block_size, 0))
	#   bbat_len = initial_bbat_len + num_mbat_blocks
	#   mbat_len = ceil(bbat_len * 4 / block_size)
	#
	# the actual bbat allocation table is itself stored throughout the file, and that chain
	# is stored in the initial blocks, and the mbat blocks.
	num_mbat_blocks = 0
	io = RangesIOResizeable.new @bbat, 'w', :first_block => AllocationTable::EOC
	# truncate now, so that we can simplify size calcs - the mbat blocks will be appended in a
	# contiguous chunk at the end.
	# hmmm, i think this truncate should be matched with a truncate of the underlying io. if you
	# delete a lot of stuff, and free up trailing blocks, the file size never shrinks. this can
	# be fixed easily, add an io truncate
	@bbat.truncate!
	before = @io.size
	@io.truncate @bbat.block_size * (@bbat.length + 1)
	while true
		# get total bbat size. equivalent to @bbat.to_s.length, but for the factoring in of
		# the mbat blocks. we can't just add the mbat blocks directly to the bbat, as as this iteration
		# progresses, more blocks may be needed for the bat itself (if there are no more gaps), and the
		# mbat must remain contiguous.
		bbat_data_len = ((@bbat.length + num_mbat_blocks) * 4 / @bbat.block_size.to_f).ceil * @bbat.block_size
		# now storing the excess mbat blocks also increases the size of the bbat:
		new_num_mbat_blocks = ([bbat_data_len / @bbat.block_size - 109, 0].max * 4 / @bbat.block_size.to_f).ceil
		if new_num_mbat_blocks != num_mbat_blocks
			# need more space for the mbat.
			num_mbat_blocks = new_num_mbat_blocks
		elsif io.size != bbat_data_len
			# need more space for the bat
			# this may grow the bbat, depending on existing available blocks
			io.truncate bbat_data_len
		else
			break
		end
	end

	# now extract the info we want:
	ranges = io.ranges
	bbat_chain = @bbat.chain io.first_block
	# the extra mbat data is a set of contiguous blocks at the end
	io.close
	bbat_chain.each { |b| @bbat[b] = AllocationTable::BAT }
	# tack on the mbat stuff
	@header.mbat_start = @bbat.length # need to record this here before tacking on the mbat
	@header.num_bat = bbat_chain.length
	num_mbat_blocks.times { @bbat << AllocationTable::META_BAT }

	# now finally write the bbat, using a not resizable io.
	# the mode here will be 'r', which allows write atm. 
	RangesIO.open(@io, :ranges => ranges) { |f| f.write @bbat.to_s }

	# this is the mbat. pad it out.
	bbat_chain += [AllocationTable::AVAIL] * [109 - bbat_chain.length, 0].max
	@header.num_mbat = num_mbat_blocks
	if num_mbat_blocks == 0
		@header.mbat_start = AllocationTable::EOC
	else
		# write out the mbat blocks now. first of all, where are they going to be?
		mbat_data = bbat_chain[109..-1]
		q = @bbat.block_size / 4
		mbat_data += [AllocationTable::AVAIL] *((mbat_data.length / q.to_f).ceil * q - mbat_data.length)
		ranges = @bbat.ranges((0...num_mbat_blocks).map { |i| @header.mbat_start + i })
		RangesIO.open(@io, :ranges => ranges) { |f| f.write mbat_data.pack('V*') }
	end

	# now seek back and write the header out
	@io.seek 0
	@io.write @header.to_s + bbat_chain[0, 109].pack('V*')
	@io.flush
end
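
The sizing loop in the middle of #flush is a fixed-point iteration: storing the allocation table takes blocks, and those blocks need entries in the table themselves. The arithmetic can be sketched on plain integers, leaving out the RangesIOResizeable bookkeeping. bat_layout is a hypothetical helper, the 512-byte block size is an assumed default, and the entries-per-block arithmetic follows the formula in the source comment rather than any external specification.

```ruby
# Number of bbat block pointers stored directly after the header, as used
# in the listing above.
HEADER_BAT_SLOTS = 109

# Hypothetical helper: given the number of blocks used by everything except
# the allocation table itself, find how many bat and mbat blocks are needed.
def bat_layout(data_blocks, block_size = 512)
  entries_per_block = block_size / 4 # each entry is a 32-bit block index
  bat = 0
  mbat = 0
  loop do
    # the table needs an entry for every block, including its own
    new_bat = ((data_blocks + bat + mbat) / entries_per_block.to_f).ceil
    # pointers beyond the header slots spill into mbat blocks
    new_mbat = ([new_bat - HEADER_BAT_SLOTS, 0].max / entries_per_block.to_f).ceil
    # stop once another pass changes nothing
    break if new_bat == bat && new_mbat == mbat
    bat = new_bat
    mbat = new_mbat
  end
  { :bat_blocks => bat, :mbat_blocks => mbat }
end

p bat_layout(100)
p bat_layout(20_000)
```

Both counts only ever grow from one pass to the next, so the loop settles after a few iterations. The real method additionally has to reuse free blocks and keep the mbat blocks contiguous at the end of the file.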

#inspect ⇒ Object



# File 'lib/ole/storage.rb', line 363

def inspect
	"#<#{self.class} io=#{@io.inspect} root=#{@root.inspect}>"
end

#load ⇒ Object

Load the document from file.

TODO: implement various allocation table checks, maybe as an AllocationTable#fsck function :)

  1. Reterminate any chain not ending in EOC. Compare file size with actually allocated blocks per file.

  2. Pass through all chain heads looking for collisions, and make sure nothing points to them (i.e. they really are heads), in both sbat and mbat.

  3. We know the locations of the bbat data and mbat data. Ensure that there are placeholder blocks in the bat for them.

  4. Maybe a check for excess data. If there is data beyond (bbat.truncate.length + 1) * block_size (i.e. what is used for truncate in #flush), then maybe emit some sort of message about it. It will be automatically thrown away at close time.



# File 'lib/ole/storage.rb', line 144

def load
	# we always read 512 for the header block. if the block size ends up being different,
	# what happens to the 109 fat entries. are there more/less entries?
	@io.rewind
	header_block = @io.read 512
	@header = Header.new header_block

	# create an empty bbat.
	@bbat = AllocationTable::Big.new self
	mbat_blocks = (0...@header.num_mbat).map { |i| i + @header.mbat_start }
	bbat_chain = (header_block[Header::SIZE..-1] + @bbat.read(mbat_blocks)).unpack 'V*'
	# am i using num_bat in the right way?
	@bbat.load @bbat.read(bbat_chain[0, @header.num_bat])
	
	# get block chain for directories, read it, then split it into chunks and load the
	# directory entries. semantics changed - used to cut at first dir where dir.type == 0
	@dirents = @bbat.read(@header.dirent_start).scan(/.{#{Dirent::SIZE}}/mo).
		map { |str| Dirent.new self, str }.reject { |d| d.type_id == 0 }

	# now reorder from flat into a tree
	# links are stored in some kind of balanced binary tree
	# check that everything is visited at least, and at most once
	# similarly with the blocks of the file.
	# was thinking of moving this to Dirent.to_tree instead.
	class << @dirents
		def to_tree idx=0
			return [] if idx == Dirent::EOT
			d = self[idx]
			d.children = to_tree d.child
			raise FormatError, "directory #{d.inspect} used twice" if d.idx
			d.idx = idx
			to_tree(d.prev) + [d] + to_tree(d.next)
		end
	end

	@root = @dirents.to_tree.first
	Log.warn "root name was #{@root.name.inspect}" unless @root.name == 'Root Entry'
	unused = @dirents.reject(&:idx).length
	Log.warn "#{unused} unused directories" if unused > 0

	# FIXME i don't currently use @header.num_sbat which i should
	# hmm. nor do i write it. it means what exactly again?
	# which mode to use here?
	@sb_file = RangesIOResizeable.new @bbat, :first_block => @root.first_block, :size => @root.size
	@sbat = AllocationTable::Small.new self
	@sbat.load @bbat.read(@header.sbat_start)
end
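
The step that turns the flat dirent list into a tree can be sketched on its own. Each entry carries prev and next indices (siblings, stored as a binary tree) and a child index (the root of the sibling tree one level down). In the sketch below Node is a hypothetical stand-in for Dirent, and EOT is assumed to be 0xffffffff, the "no entry" marker. Unlike the listing above, the sketch performs the used-twice check before recursing, so a cyclic link raises instead of recursing without end.

```ruby
# Assumed value of the "no entry" marker.
EOT = 0xffffffff

# Hypothetical stand-in for Dirent with only the fields the walk needs.
Node = Struct.new(:name, :prev, :next, :child, :children, :idx)

# Returns the siblings rooted at idx, in order, with children filled in.
def to_tree(nodes, idx = 0)
  return [] if idx == EOT
  d = nodes[idx]
  # every entry may be visited at most once
  raise ArgumentError, "entry #{d.name.inspect} used twice" if d.idx
  d.idx = idx
  d.children = to_tree(nodes, d.child)
  # in-order walk of the sibling tree: left subtree, this entry, right subtree
  to_tree(nodes, d.prev) + [d] + to_tree(nodes, d.next)
end

nodes = [
  Node.new('Root Entry', EOT, EOT, 2),
  Node.new('A', EOT, EOT, EOT),
  Node.new('B', 1, 3, EOT),
  Node.new('C', EOT, EOT, EOT)
]
root = to_tree(nodes).first
puts root.children.map(&:name).join(', ')
```

Entries whose idx is still nil after the walk were never reached from the root, which is what the "unused directories" warning in the listing counts.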

#repack(temp = :file) ⇒ Object

Could be useful with misbehaving OLE documents, or just to clean them up.



# File 'lib/ole/storage.rb', line 336

def repack temp=:file
	case temp
	when :file
		Tempfile.open 'ole-repack' do |io|
			io.binmode
			repack_using_io io
		end
	when :mem;  StringIO.open(&method(:repack_using_io))
	else raise ArgumentError, "unknown temp backing #{temp.inspect}"
	end
end

#repack_using_io(temp_io) ⇒ Object



# File 'lib/ole/storage.rb', line 348

def repack_using_io temp_io
	@io.rewind
	IO.copy @io, temp_io
	clear
	Storage.open temp_io, nil, @params do |temp_ole|
		#temp_ole.root.type = :dir
		Dirent.copy temp_ole.root, root
	end
end

#summary_information ⇒ Object Also known as: summary_info

This will be changed to use with_property_set.



# File 'lib/ole/property_set.rb', line 152

def summary_information
	dirent = root["\005SummaryInformation"]
	dirent.open do |io|
		propset = Types::PropertySet.new(io)
		sections = propset.sections
		# this will maybe get wrapped up as
		# section = propset[guid]
		# maybe taking it one step further, i'd hide the section thing,
		# and let you use composite keys, like
		# propset[4, guid] eg in MAPI, and just propset.doc_author.
		section = sections.find do |s|
			s.guid == Types::PropertySet::FMTID_SummaryInformation
		end
		return PropertySetSectionProxy.new(dirent, sections.index(section))
	end
end

#with_property_set(guid, filenames = nil) ⇒ Object

Not yet implemented (the method body is empty). The idea: search the property sets in filenames for one containing a section with GUID guid, then yield it, so that it can be read from and written to in the block. Property sets themselves can have GUIDs, but they are often all null.



# File 'lib/ole/property_set.rb', line 134

def with_property_set guid, filenames=nil
end