Class: TaliaCore::ActiveSourceParts::Xml::GenericReader
- Extended by:
- TaliaCore::ActiveSourceParts::Xml::GenericReaderImportStatements::Handlers, TaliaUtil::IoHelper
- Includes:
- GenericReaderAddStatements, GenericReaderHelpers, GenericReaderImportStatements, TaliaUtil::IoHelper, TaliaUtil::Progressable, TaliaUtil::UriHelper
- Defined in:
- lib/talia_core/active_source_parts/xml/generic_reader.rb
Overview
Superclass for importers/readers of generic XML files. Users can create subclasses of this class to import almost any XML format; see the SourceReader class for a simple example.
The result of the “import” is an array of hashes (available through #sources) which contains all the data from the import file in a standardized format. These hashes can then be processed by the ActiveSource class to create the actual sources.
Writing XML importers
Writing an importer is quite easy: all it takes is to subclass this class and then describe the structure of the XML elements using the methods defined here.
The reader subclass should declare handlers for the various XML tags that are in the file. See GenericReaderImportStatements for an explanation of how the handlers work and how they are declared. This module also contains methods to retrieve data from the XML in order to use it in the import.
The GenericReaderAddStatements module contains the methods that are used to add data to the source that is currently being imported.
In addition to the SourceReader class that can be used as an example, the other modules also contain some code examples for the mechanism.
There are also some GenericReaderHelpers that can be used during the import.
Using an Importer
An importer is usually used indirectly, through ActiveSource.create_from_xml. For direct use, call the sources_from_url or sources_from methods; these are the entry points for the import process.
Result of the import operation
The result of an import is an Array that contains a number of hashes. Each of those can be passed to ActiveSource.new to create a new source object with the given attributes.
Progress Reporting
The class implements the TaliaUtil::Progressable interface, and if a progressor object is assigned, it will report the progress to it during the import operation.
Direct Known Subclasses
Defined Under Namespace
Classes: State
Class Attribute Summary
-
.create_handlers ⇒ Object
readonly
Returns the registered handlers.
Class Method Summary
-
.can_use_root ⇒ Object
Set the reader to allow the use of root elements for import.
-
.sources_from(source, progressor = nil, base_url = nil) ⇒ Object
Read the sources from the given IO stream.
-
.sources_from_url(url, options = nil, progressor = nil) ⇒ Object
Read the sources from the given URL. See the IoHelper class for help on the options.
-
.use_root ⇒ Object
True if the reader should also check the root element, instead of only checking the children.
Instance Method Summary
-
#add_source_with_check(source_attribs) ⇒ Object
This will add the given source to the global result.
-
#base_file_url ⇒ Object
This is the “base” for resolving file URLs.
-
#base_file_url=(new_base_url) ⇒ Object
Assign a new base_file_url.
-
#call_handler(element) ⇒ Object
Call the handler method for the given element.
-
#check_objects(objects) ⇒ Object
Pass in a list of elements that are to be used as objects in RDF triples.
-
#chk_create ⇒ Object
Checks if the current status has an attribute hash, which means that there is a “current” source being created at the moment.
-
#create_handlers ⇒ Object
Returns a hash with all handlers that “create” (that is, they create a new source when called).
-
#initialize(source) ⇒ GenericReader
constructor
Create a new reader.
-
#read_children_of(element, progress = nil, &block) ⇒ Object
Read source data from each child of the given element using read_source.
-
#read_children_with_progress(element, &block) ⇒ Object
As read_children_of, but using the standard progressor of the reader.
-
#read_source(element, &block) ⇒ Object
Read a single source from an XML element.
-
#set_element(predicate, object, required) ⇒ Object
Add a property to the source that is currently being imported.
-
#sources ⇒ Object
Build a list of sources.
-
#use_root ⇒ Object
Same as use_root of the current class.
Methods included from TaliaUtil::IoHelper
base_for, file_url, open_from_url, open_generic
Methods included from TaliaCore::ActiveSourceParts::Xml::GenericReaderImportStatements::Handlers
Methods included from GenericReaderHelpers
#current_is_a?, #get_absolute_file_url, #join_files, #join_url, #parse_date, #source_exists?, #to_iso8601
Methods included from GenericReaderAddStatements
#add, #add_date, #add_date_interval, #add_file, #add_i18n, #add_rel
Methods included from GenericReaderImportStatements
#add_part, #add_source, #all_elements, #from_attribute, #from_element, #nested
Methods included from TaliaUtil::UriHelper
Methods included from TaliaUtil::Progressable
#progressor, #progressor=, #run_with_progress
Constructor Details
#initialize(source) ⇒ GenericReader
Create a new reader. This parses the XML contained in the source and makes the resulting XML document available to the reader.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 129
def initialize(source)
  @doc = Hpricot.XML(source)
end
Class Attribute Details
.create_handlers ⇒ Object (readonly)
Returns the registered handlers
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 105
def create_handlers
  @create_handlers
end
Class Method Details
.can_use_root ⇒ Object
Set the reader to allow the use of root elements for import
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 94
def can_use_root
  @use_root = true
end
.sources_from(source, progressor = nil, base_url = nil) ⇒ Object
Read the sources from the given IO stream. You may specify a base URL to help the reader decide where files should be opened from.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 86
def sources_from(source, progressor = nil, base_url=nil)
  reader = self.new(source)
  reader.base_file_url = base_url if(base_url)
  reader.progressor = progressor
  reader.sources
end
.sources_from_url(url, options = nil, progressor = nil) ⇒ Object
Read the sources from the given URL. See the IoHelper class for help on the options. A progressor may be supplied, to which the importer will report its progress.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 80
def sources_from_url(url, options = nil, progressor = nil)
  open_generic(url, options) { |io| sources_from(io, progressor, url) }
end
.use_root ⇒ Object
True if the reader should also check the root element, instead of only checking the children
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 100
def use_root
  @use_root || false
end
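Both can_use_root and use_root work on @use_root of the class object itself. Assuming they are defined as class methods (as the leading dot in the signatures suggests), the flag is a class-level instance variable, which Ruby does not share with subclasses. The standalone sketch below illustrates this with hypothetical class names (BaseReader, RootReader, ChildReader) that are not part of TaliaCore:

```ruby
# Standalone sketch; BaseReader, RootReader and ChildReader are hypothetical
# stand-ins, not TaliaCore classes.
class BaseReader
  class << self
    # Opt in to handling the root element
    def can_use_root
      @use_root = true
    end

    # Class-level instance variable: set per class, false by default
    def use_root
      @use_root || false
    end
  end

  # Instance-level shortcut that delegates to the class
  def use_root
    self.class.use_root
  end
end

class RootReader < BaseReader
  can_use_root
end

class ChildReader < RootReader
end

puts BaseReader.use_root      # false
puts RootReader.use_root      # true
puts RootReader.new.use_root  # true
puts ChildReader.use_root     # false, the flag is not inherited
```

Under this assumption, a subclass of a root-enabled reader would have to call can_use_root again to get the same behaviour.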
Instance Method Details
#add_source_with_check(source_attribs) ⇒ Object
This will add the given source to the global result. source_attribs is a hash with the attributes of one source. If that source already exists in the global results, the two versions will be merged:
-
If the property is a list of values (an Array) in both the new and the old version, these lists will be joined.
-
Otherwise, the old property will be overwritten by the new one
The source_attribs must contain a URI, and they *must not* change a type field that is anything other than nil or TaliaCore::SourceTypes::DummySource.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 187
def add_source_with_check(source_attribs)
  assit_kind_of(Hash, source_attribs)
  # Check if we have a URI
  if((uri = source_attribs['uri']).blank?)
    raise(RuntimeError, "Problem reading from XML: Source without URI (#{source_attribs.inspect})")
  else
    source_attribs['uri'] = irify(uri) # "Irify" the URI (see UriHelper module)
    @sources[uri] ||= {} # This is the hash in the global result for our uri
    @sources[uri].each do |key, value| # Loop through existing results
      next unless(new_value = source_attribs.delete(key)) # Skip all existing that are not in the new attributes
      # Assert that we don't change a type away from DummySource - this would indicate some problem w/ the data
      assit(!((key.to_sym == :type) && (value != 'TaliaCore::SourceTypes::DummySource') && (value != new_value)), "Type should not change during import, may be a format problem. (From #{value} to #{new_value})")
      if(new_value.is_a?(Array) && value.is_a?(Array))
        # If both new and old are Array-types, the new elements will be appended
        # and duplicates will be removed
        @sources[uri][key] = (value + new_value).uniq
      else
        # Otherwise just replace the old value with the new one
        @sources[uri][key] = new_value
      end
    end
    # Everything that is only in the new attributes can be merged in
    @sources[uri].merge!(source_attribs)
  end
end
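The merge rule described above can be tried out on plain hashes. The following is a minimal sketch, not the library code: merge_source is a hypothetical helper, the example.org URIs are made up, and the URI check, the type assertion and the skipping of nil values are left out.

```ruby
# Standalone sketch of the merge rule; merge_source is a hypothetical helper,
# not part of TaliaCore, and works on plain hashes.
def merge_source(existing, incoming)
  merged = existing.dup
  incoming.each do |key, new_value|
    old_value = merged[key]
    if old_value.is_a?(Array) && new_value.is_a?(Array)
      # Both versions are lists: join them and drop duplicates
      merged[key] = (old_value + new_value).uniq
    else
      # Otherwise the new value replaces the old one
      merged[key] = new_value
    end
  end
  merged
end

old_attribs = {
  'uri' => 'http://example.org/a',
  'type' => 'TaliaCore::SourceTypes::DummySource',
  'http://example.org/rel' => ['<http://example.org/b>']
}
new_attribs = {
  'uri' => 'http://example.org/a',
  'type' => 'TaliaCore::Collection',
  'http://example.org/rel' => ['<http://example.org/b>', '<http://example.org/c>']
}

result = merge_source(old_attribs, new_attribs)
puts result['type']                            # TaliaCore::Collection
puts result['http://example.org/rel'].inspect  # ["<http://example.org/b>", "<http://example.org/c>"]
```

Here the type changes away from DummySource, which is the one type change the assertion in the listing permits.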
#base_file_url ⇒ Object
This is the “base” for resolving file URLs. If a file URL is found to be relative, it will be relative to this URL.
If no base URL was specified this will use the file system path to TALIA_ROOT.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 168
def base_file_url
  @base_file_url ||= TALIA_ROOT
end
#base_file_url=(new_base_url) ⇒ Object
Assign a new base_file_url
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 173
def base_file_url=(new_base_url)
  @base_file_url = base_for(new_base_url)
end
#call_handler(element) ⇒ Object
Call the handler method for the given element. Pass in the XML element to read from.
This saves the @current State object before calling the handler, and restores it after the call is complete. Thus nested calls will have their own state, but the state will be restored once you return to the parent handler.
If a block is given, that block will be executed as the handler. Otherwise the system checks for the “<element.name>_handler” method, and calls it. (See also element_handler)
If no block is given and no handler is found, a warning is logged and false is returned.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 266
def call_handler(element)
  handler_name = "#{element.name}_handler".to_sym
  if(self.respond_to?(handler_name) || block_given?)
    parent_state = @current # Save the state for recursive calls
    attributes = nil
    begin
      creating = (create_handlers[handler_name] || block_given?)
      @current = State.new
      @current.attributes = creating ? {} : nil
      @current.element = element
      block_given? ? yield : self.send(handler_name)
      attributes = @current.attributes
    ensure
      @current = parent_state # Reset the state to previous value
    end
    attributes
  else
    TaliaCore.logger.warn("Unknown element in import: #{element.name}")
    false
  end
end
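The two mechanisms described here, dispatch to a “<element.name>_handler” method and saving/restoring the current state around nested calls, can be seen in isolation in the sketch below. MiniReader, Element and State are hypothetical stand-ins for the reader, the Hpricot element and the State class. Unlike the listing, the sketch always creates an attribute hash and does not log unknown elements.

```ruby
# Standalone sketch; MiniReader, Element and State are hypothetical stand-ins.
Element = Struct.new(:name, :children)
State = Struct.new(:element, :attributes)

class MiniReader
  attr_reader :log

  def initialize
    @current = nil
    @log = []
  end

  def call_handler(element)
    handler_name = "#{element.name}_handler".to_sym
    return false unless respond_to?(handler_name) || block_given?

    parent_state = @current # remember the caller's state
    begin
      @current = State.new(element, {})
      block_given? ? yield : send(handler_name)
      @current.attributes
    ensure
      @current = parent_state # always restore, even if the handler raises
    end
  end

  def book_handler
    @current.attributes['title'] = 'outer'
    # Each nested call gets its own state
    @current.element.children.each { |child| call_handler(child) }
    @log << "back in #{@current.element.name}"
  end

  def chapter_handler
    @current.attributes['title'] = 'inner'
    @log << "in #{@current.element.name}"
  end
end

reader = MiniReader.new
book = Element.new('book', [Element.new('chapter', [])])
attributes = reader.call_handler(book)

puts attributes['title']   # outer
puts reader.log.inspect    # ["in chapter", "back in book"]
puts reader.call_handler(Element.new('unknown', [])).inspect # false
```

The nested chapter handler writes to its own attribute hash, so the attributes of the enclosing book are untouched once control returns to it.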
#check_objects(objects) ⇒ Object
Pass in a list of elements that are to be used as objects in RDF triples. This method will check the objects and remove any blank ones (which should not be added).
If no non-blank element is found in the input, this will always return nil.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 324
def check_objects(objects)
  if(objects.kind_of?(Array))
    objects.reject! { |obj| obj.blank? }
    (objects.size == 0) ? nil : objects
  else
    objects.blank? ? nil : objects
  end
end
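The filtering can be tried outside of Rails with the sketch below. It assumes a simplified, hypothetical blank_value? helper in place of ActiveSupport's blank? (it only treats nil and empty or whitespace-only strings as blank). Note that, as in the listing, reject! modifies the array that was passed in.

```ruby
# Standalone sketch; blank_value? is a hypothetical replacement for
# ActiveSupport's blank?, covering only nil and empty/whitespace strings.
def blank_value?(obj)
  obj.nil? || (obj.respond_to?(:strip) && obj.strip.empty?)
end

def check_objects(objects)
  if objects.is_a?(Array)
    objects.reject! { |obj| blank_value?(obj) } # mutates the caller's array
    objects.empty? ? nil : objects
  else
    blank_value?(objects) ? nil : objects
  end
end

values = ['Title', '', nil, '  ']
puts check_objects(values).inspect     # ["Title"]
puts values.inspect                    # ["Title"], the input was modified in place
puts check_objects(['', nil]).inspect  # nil
puts check_objects('single').inspect   # "single"
```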
#chk_create ⇒ Object
Checks if the current status has an attribute hash, which means that there is a “current” source being created at the moment.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 290
def chk_create
  raise(RuntimeError, "Illegal operation when not creating a source") unless(@current.attributes)
end
#create_handlers ⇒ Object
Returns a hash with all handlers that “create” (that is, they create a new source when called). This is taken from the class’ create_handlers accessor.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 216
def create_handlers
  @handlers ||= (self.class.create_handlers || {})
end
#read_children_of(element, progress = nil, &block) ⇒ Object
Read source data from each child of the given element using read_source. Optionally reports the progress to the given progressor.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 240
def read_children_of(element, progress = nil, &block)
  element.children.each do |element|
    progress.inc if(progress)
    next unless(element.is_a?(Hpricot::Elem)) # only use XML elements
    read_source(element, &block)
  end
end
#read_children_with_progress(element, &block) ⇒ Object
As read_children_of, but using the standard progressor of the reader.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 231
def read_children_with_progress(element, &block)
  run_with_progress('Xml Read', element.children.size) do |prog|
    read_children_of(element, prog, &block)
  end
end
#read_source(element, &block) ⇒ Object
Read a single source from an XML element. Pass in the XML element and an (optional) block. This will call the handler (or block, see call_handler) and add the result to the global result set using add_source_with_check.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 225
def read_source(element, &block)
  attribs = call_handler(element, &block)
  add_source_with_check(attribs) if(attribs)
end
#set_element(predicate, object, required) ⇒ Object
Add a property to the source that is currently being imported. If no object is given, the method just exits, unless required is set, in which case an error will be raised for an empty object.
Database properties will be added as a single string, while other (semantic) properties will always be added into an array (even if there is just a single object).
This is the base code for adding elements, which is used for the add_* methods in GenericReaderAddStatements. This method should not usually be used directly.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 302
def set_element(predicate, object, required)
  chk_create
  object = check_objects(object)
  if(!object)
    raise(ArgumentError, "No object given, but is required for #{predicate}.") if(required)
    return
  end
  predicate = predicate.respond_to?(:uri) ? predicate.uri.to_s : predicate.to_s
  if(ActiveSource.db_attr?(predicate))
    assit(!object.is_a?(Array))
    @current.attributes[predicate] = object
  else
    @current.attributes[predicate] ||= []
    @current.attributes[predicate] << object
  end
end
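The difference between database properties (stored as a single value) and semantic properties (always collected in an array) can be illustrated as follows. This is a sketch under assumptions: DB_ATTRIBUTES and store_property are hypothetical names, and the list of database attributes is made up for the example. In the listing above that decision is delegated to ActiveSource.db_attr?.

```ruby
# Standalone sketch; DB_ATTRIBUTES and store_property are hypothetical.
# The real check is ActiveSource.db_attr?, whose attribute list may differ.
DB_ATTRIBUTES = %w[uri type].freeze

def store_property(attributes, predicate, object)
  predicate = predicate.to_s
  if DB_ATTRIBUTES.include?(predicate)
    attributes[predicate] = object            # database property: single value
  else
    (attributes[predicate] ||= []) << object  # semantic property: always a list
  end
  attributes
end

attributes = {}
store_property(attributes, :uri, 'http://example.org/a')
store_property(attributes, 'http://example.org/author', 'Alice')
store_property(attributes, 'http://example.org/author', 'Bob')

puts attributes['uri']                                # http://example.org/a
puts attributes['http://example.org/author'].inspect  # ["Alice", "Bob"]
```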
#sources ⇒ Object
Build a list of sources. This will return an array of hashes, and each hash can be used to create a new source with ActiveSource.new.
The result will be cached and once read, subsequent calls will return the same set of “sources” again.
*Example of Result*:
[
{
'uri' => 'http://foobar.com',
'type' => 'TaliaCore::Collection',
'http://rdfbar/foo' => '<http://taliainstall/otherthing>'
},
{
'uri' => 'http://taliainstall/otherthing',
'type' => 'TaliaCore::SourceTypes::DummySource'
}
]
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 152
def sources
  return @sources if(@sources)
  @sources = {}
  if(use_root && self.respond_to?("#{@doc.root.name}_handler".to_sym))
    run_with_progress('XmlRead', 1) { read_source(@doc.root) }
  else
    read_children_with_progress(@doc.root)
  end
  @sources.values
end
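Reading the listing as shown, the first call returns @sources.values (an Array), while the early return on later calls hands back @sources itself, which is the Hash keyed by URI. The sketch below shows one possible way to cache the result so that every call yields the same Array. CachedReader, @source_list and the stubbed read_all step are hypothetical and only illustrate the caching idea; this is not the library's implementation.

```ruby
# Standalone sketch; CachedReader and read_all are hypothetical.
class CachedReader
  attr_reader :read_count

  def initialize
    @read_count = 0
  end

  # Memoize the Array that is returned, not the internal Hash
  def sources
    @source_list ||= begin
      @sources = {}
      read_all
      @sources.values
    end
  end

  private

  # Stub standing in for the real import step
  def read_all
    @read_count += 1
    @sources['http://example.org/a'] = { 'uri' => 'http://example.org/a' }
  end
end

reader = CachedReader.new
first = reader.sources
second = reader.sources

puts first.class           # Array
puts first.equal?(second)  # true
puts reader.read_count     # 1
```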
#use_root ⇒ Object
Same as use_root of the current class.
# File 'lib/talia_core/active_source_parts/xml/generic_reader.rb', line 249
def use_root
  self.class.use_root
end