Class: Traject::ExperimentalNokogiriStreamingReader
- Inherits: Object
  - Object
    - Traject::ExperimentalNokogiriStreamingReader
- Includes: Enumerable
- Defined in: lib/traject/experimental_nokogiri_streaming_reader.rb
Overview
An EXPERIMENTAL HALF-FINISHED implementation of a streaming/pull reader using Nokogiri. Not ready for use, not stable API, could go away.
This was my first try at a NokogiriReader implementation. It didn't work out, at least not without a lot more work. I think we'd need to redo it to build the Nokogiri::XML::Nodes by hand as the source is traversed, instead of relying on #outer_xml: because outer_xml returns a string, each record ends up being parsed twice, with the expected 50% performance hit. Peccadilloes in Nokogiri's JRuby namespace handling don't help.
All in all, it's possible something could be made to work here with a lot more effort, but it's also possible Nokogiri's antipathy to namespaces would keep getting in the way.
Defined Under Namespace
Classes: PathTracker
Instance Attribute Summary
- #clipboard ⇒ Object (readonly): Returns the value of attribute clipboard.
- #input_stream ⇒ Object (readonly): Returns the value of attribute input_stream.
- #path_tracker ⇒ Object (readonly): Returns the value of attribute path_tracker.
- #settings ⇒ Object (readonly): Returns the value of attribute settings.
Instance Method Summary
- #default_namespaces ⇒ Object
- #each ⇒ Object
- #each_record_xpath ⇒ Object
- #extra_xpath_hooks ⇒ Object
- #initialize(input_stream, settings) ⇒ ExperimentalNokogiriStreamingReader (constructor): A new instance of ExperimentalNokogiriStreamingReader.
Constructor Details
#initialize(input_stream, settings) ⇒ ExperimentalNokogiriStreamingReader
Returns a new instance of ExperimentalNokogiriStreamingReader.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 17
def initialize(input_stream, settings)
  @settings = Traject::Indexer::Settings.new settings
  @input_stream = input_stream
  @clipboard = Traject::Util.is_jruby? ? Concurrent::Map.new : Concurrent::Hash.new

  if each_record_xpath
    @path_tracker = PathTracker.new(each_record_xpath,
                                    clipboard: self.clipboard,
                                    namespaces: default_namespaces,
                                    extra_xpath_hooks: extra_xpath_hooks)
  end

  default_namespaces # trigger validation
  validate_limited_xpath(each_record_xpath, key_name: "each_record_xpath")
end
Instance Attribute Details
#clipboard ⇒ Object (readonly)
Returns the value of attribute clipboard.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15
def clipboard
  @clipboard
end
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15
def input_stream
  @input_stream
end
#path_tracker ⇒ Object (readonly)
Returns the value of attribute path_tracker.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15
def path_tracker
  @path_tracker
end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 15
def settings
  @settings
end
Instance Method Details
#default_namespaces ⇒ Object
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 77
def default_namespaces
  @default_namespaces ||= (settings["nokogiri.namespaces"] || {}).tap { |ns|
    unless ns.kind_of?(Hash)
      raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
    end
  }
end
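The lookup above memoizes the "nokogiri.namespaces" setting, defaults it to an empty hash, and raises if anything other than a Hash was configured. A minimal sketch of that behavior follows, using a plain Ruby Hash in place of Traject::Indexer::Settings; the class name NamespaceSettingsSketch and the sample namespace mapping are hypothetical and exist only for illustration.

```ruby
# Hypothetical stand-in for the reader: only the namespaces lookup is modeled,
# and settings is a plain Hash rather than Traject::Indexer::Settings.
class NamespaceSettingsSketch
  def initialize(settings)
    @settings = settings
  end

  # Same shape as the documented method: default to {}, validate, memoize.
  def default_namespaces
    @default_namespaces ||= (@settings["nokogiri.namespaces"] || {}).tap do |ns|
      unless ns.kind_of?(Hash)
        raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
      end
    end
  end
end

# A prefix => URI mapping passes validation and is returned unchanged.
configured = NamespaceSettingsSketch.new(
  "nokogiri.namespaces" => { "oai" => "http://www.openarchives.org/OAI/2.0/" }
)
puts configured.default_namespaces.inspect

# A missing setting falls back to an empty hash.
puts NamespaceSettingsSketch.new({}).default_namespaces.inspect

# Anything that is not a Hash is rejected when the method is first called.
begin
  NamespaceSettingsSketch.new("nokogiri.namespaces" => "oai").default_namespaces
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Because validation only happens when the method is called, the constructor calls default_namespaces explicitly ("trigger validation") so that a bad setting fails at construction time rather than partway through reading.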
#each ⇒ Object
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 85
def each
  unless each_record_xpath
    # forget streaming, just read it and return it once, done.
    yield Nokogiri::XML.parse(input_stream)
    return
  end

  reader = Nokogiri::XML::Reader(input_stream)

  reader.each do |reader_node|
    if reader_node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
      path_tracker.push(reader_node)

      if path_tracker.match?
        yield path_tracker.current_node_doc
      end
      path_tracker.run_extra_xpath_hooks

      if reader_node.self_closing?
        path_tracker.pop
      end
    end

    if reader_node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
      path_tracker.pop
    end
  end
end
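The loop above keeps a stack of open elements: push on every start element, test for a match, and pop either immediately (self-closing element, for which no end-element event follows) or on the matching end element. The sketch below reproduces only that bookkeeping in plain Ruby over a hand-written event list, so it runs without Nokogiri. SimplePathTracker, matched_event_indexes, and the event tuples are all hypothetical; the real PathTracker also deals with namespaces, node documents, and hooks, none of which is modeled here.

```ruby
# Hypothetical, simplified tracker: matches only absolute paths made of plain
# element names such as "/records/record" (no namespaces, no predicates).
class SimplePathTracker
  def initialize(target_path)
    @target = target_path.split("/").reject(&:empty?)
    @stack = []
  end

  def push(name)
    @stack.push(name)
  end

  def pop
    @stack.pop
  end

  # True when the currently open elements are exactly the target path.
  def match?
    @stack == @target
  end
end

# Each event is [type, element_name, self_closing]; a pull parser would emit
# the equivalent information while reading the input stream.
def matched_event_indexes(events, target_path)
  tracker = SimplePathTracker.new(target_path)
  matches = []
  events.each_with_index do |(type, name, self_closing), index|
    case type
    when :element
      tracker.push(name)
      matches << index if tracker.match?
      tracker.pop if self_closing # no end-element event will follow
    when :end_element
      tracker.pop
    end
  end
  matches
end

SKETCH_EVENTS = [
  [:element,     "records", false],
  [:element,     "record",  false], # /records/record
  [:element,     "title",   false],
  [:end_element, "title",   false],
  [:end_element, "record",  false],
  [:element,     "record",  true],  # self-closing /records/record
  [:element,     "other",   false],
  [:element,     "record",  false], # /records/other/record, not a match
  [:end_element, "record",  false],
  [:end_element, "other",   false],
  [:end_element, "records", false]
].freeze

puts matched_event_indexes(SKETCH_EVENTS, "/records/record").inspect
```

Skipping the pop for a self-closing element would leave a stale entry on the stack and shift every later path by one level, which is why the documented loop checks self_closing? right after the match test.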
#each_record_xpath ⇒ Object
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 34
def each_record_xpath
  @each_record_xpath ||= settings["nokogiri.each_record_xpath"]
end
#extra_xpath_hooks ⇒ Object
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 38
def extra_xpath_hooks
  @extra_xpath_hooks ||= begin
    (settings["nokogiri_reader.extra_xpath_hooks"] || {}).tap do |hash|
      hash.each_pair do |limited_xpath, callable|
        validate_limited_xpath(limited_xpath, key_name: "nokogiri_reader.extra_xpath_hooks")
      end
    end
  end
end
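The method above reads "nokogiri_reader.extra_xpath_hooks" as a hash from limited XPath to callable, validates every key, and returns the hash itself. The following is a rough sketch of that shape under several assumptions: the rules enforced by validate_limited_xpath are not shown on this page, so sketch_validate_limited_xpath is a hypothetical stand-in that only rejects predicates, and the (node, clipboard) arguments passed to the callable are likewise an assumption rather than a documented signature.

```ruby
# Hypothetical validator: the real validate_limited_xpath rules are not shown
# on this page, so this stand-in only rejects paths containing predicates.
def sketch_validate_limited_xpath(xpath, key_name:)
  if xpath.include?("[")
    raise ArgumentError, "#{key_name}: unsupported xpath: #{xpath.inspect}"
  end
end

# Same shape as the documented method, minus memoization: default to {},
# validate each key, and return the hash unchanged.
def sketch_extra_xpath_hooks(settings)
  (settings["nokogiri_reader.extra_xpath_hooks"] || {}).tap do |hash|
    hash.each_pair do |limited_xpath, _callable|
      sketch_validate_limited_xpath(limited_xpath, key_name: "nokogiri_reader.extra_xpath_hooks")
    end
  end
end

SKETCH_HOOK_SETTINGS = {
  "nokogiri_reader.extra_xpath_hooks" => {
    # Assumed callable arguments (node, clipboard); not confirmed by this page.
    "/oai:OAI-PMH/oai:ListRecords/oai:resumptionToken" =>
      lambda { |node, clipboard| clipboard[:resumption_token] = node }
  }
}.freeze

hooks = sketch_extra_xpath_hooks(SKETCH_HOOK_SETTINGS)
clipboard = {}
# Simulate the tracker firing the hook; a String stands in for the XML node.
hooks.each_value { |callable| callable.call("token-123", clipboard) }
puts clipboard.inspect
```

Read this way, a hook lets a caller capture data that lives outside the per-record elements (a paging token, for example) into the shared clipboard while the stream is being read.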