Module: FormatParser::ZIPParser::OfficeFormats
- Included in:
- FormatParser::ZIPParser
- Defined in:
- lib/parsers/zip_parser/office_formats.rb
Overview
Based on an unscientific sample of 63 documents I could find on my hard drive, all docx/pptx/xlsx files contain, at the minimum, the following files:
[Content_Types].xml
_rels/.rels
docProps/core.xml
docProps/app.xml
Additionally, per file type, they contain the following:
word/document.xml (docx)
xl/workbook.xml (xlsx)
ppt/presentation.xml (pptx)
These are sufficient to say with certainty that a ZIP is in fact an Office document. The same unscientific sample also revealed that I have come to dislike MS Office so much that I keep only 63 documents on my entire workstation.
We do not perform the actual decoding of the Office documents here, because to read their contents we need to:
- inflate the compressed part files (potential for deflate bombs)
- parse the document XML (potential for XML parser exploitation)
which are real threats and require adequate mitigation. For our purposes the token detection of specific filenames should be enough to say with certainty that a document is an Office document, and not just a ZIP.
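As a rough illustration of filename-only detection, the sketch below checks a set of entry names against the marker files listed above without inflating or parsing anything. `SKETCH_MARKER_FILES`, `looks_like_office?` and the sample entry names are assumptions made for this example and are not part of the module.

```ruby
require 'set'

# Hypothetical copy of the marker files listed in the overview.
SKETCH_MARKER_FILES = Set.new([
  '[Content_Types].xml',
  '_rels/.rels',
  'docProps/core.xml',
  'docProps/app.xml',
])

# Returns true when every marker file is present among the entry names.
# Only names are inspected; no entry is decompressed and no XML is parsed.
def looks_like_office?(entry_names)
  SKETCH_MARKER_FILES.subset?(Set.new(entry_names))
end

# Illustrative entry listings, not taken from real archives.
docx_like = SKETCH_MARKER_FILES.to_a + ['word/document.xml', 'word/styles.xml']
plain_zip = ['readme.txt', 'data/values.xml']

puts looks_like_office?(docx_like)
puts looks_like_office?(plain_zip)
```

A missing marker file is enough to fail the check, so an archive that only happens to contain `word/document.xml` is still treated as a plain ZIP.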
Constant Summary
- OFFICE_MARKER_FILES =
  Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])
Instance Method Summary
- #office_document?(filenames_set) ⇒ Boolean
- #office_file_format_and_mime_type_from_entry_set(filenames_set) ⇒ Object
Instance Method Details
#office_document?(filenames_set) ⇒ Boolean
# File 'lib/parsers/zip_parser/office_formats.rb', line 36

def office_document?(filenames_set)
  OFFICE_MARKER_FILES.subset?(filenames_set)
end
#office_file_format_and_mime_type_from_entry_set(filenames_set) ⇒ Object
# File 'lib/parsers/zip_parser/office_formats.rb', line 40

def office_file_format_and_mime_type_from_entry_set(filenames_set)
  if filenames_set.include?('word/document.xml')
    [:docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document']
  elsif filenames_set.include?('xl/workbook.xml')
    [:xlsx, 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']
  elsif filenames_set.include?('ppt/presentation.xml')
    [:pptx, 'application/vnd.openxmlformats-officedocument.presentationml.presentation']
  else
    [:unknown, 'application/zip']
  end
end
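The two methods are presumably meant to be used together: first confirm the marker files are present, then map the entry set to a format. The sketch below shows one way a caller might do that. The module body is a stand-in copied from the listings on this page so the example runs without the gem, and `EntryClassifier` with its `classify` method is hypothetical; the real call site in `FormatParser::ZIPParser` may look different.

```ruby
require 'set'

# Stand-in copy of the documented module, reproduced so the sketch is runnable.
module OfficeFormats
  OFFICE_MARKER_FILES = Set.new([
    '[Content_Types].xml',
    '_rels/.rels',
    'docProps/core.xml',
    'docProps/app.xml',
  ])

  def office_document?(filenames_set)
    OFFICE_MARKER_FILES.subset?(filenames_set)
  end

  def office_file_format_and_mime_type_from_entry_set(filenames_set)
    if filenames_set.include?('word/document.xml')
      [:docx, 'application/vnd.openxmlformats-officedocument.wordprocessingml.document']
    elsif filenames_set.include?('xl/workbook.xml')
      [:xlsx, 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet']
    elsif filenames_set.include?('ppt/presentation.xml')
      [:pptx, 'application/vnd.openxmlformats-officedocument.presentationml.presentation']
    else
      [:unknown, 'application/zip']
    end
  end
end

# Hypothetical caller that mixes the module in.
class EntryClassifier
  include OfficeFormats

  # Returns [format, mime_type] for an Office archive, or nil when the
  # marker files are missing and the archive is just a ZIP.
  def classify(filenames)
    names = Set.new(filenames)
    return nil unless office_document?(names)

    office_file_format_and_mime_type_from_entry_set(names)
  end
end

markers = OfficeFormats::OFFICE_MARKER_FILES.to_a
format, mime_type = EntryClassifier.new.classify(markers + ['xl/workbook.xml'])
puts "#{format} #{mime_type}"
```

Because the branches are checked in order, an archive containing more than one of the main part files is reported as the first match (docx, then xlsx, then pptx), and an archive that has the marker files but none of the main parts comes back as `[:unknown, 'application/zip']`.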