Class: Traject::ExperimentalNokogiriStreamingReader::PathTracker
- Inherits:
-
Object
- Object
- Traject::ExperimentalNokogiriStreamingReader::PathTracker
- Defined in:
- lib/traject/experimental_nokogiri_streaming_reader.rb
Overview
Initialized with a specification (a very small subset of XPath) for which records to yield on each match. Tests whether a Nokogiri::XML::Reader node matches the spec.
A spec can be floating, such as '//record', or anchored to the root: '/body/head/meta' is the same thing as './body/head/meta' or 'head/meta'.
Elements can (and must, to match) have XML namespaces, if and only if those namespaces are registered with the nokogiri.namespaces setting.
Sadly, JRuby Nokogiri has an incompatibility with true Nokogiri: it doesn't preserve our namespaces on outer_xml, so in JRuby we have to track them ourselves and then do yet another parse in Nokogiri. This may make the reader even less performant on JRuby.
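The two kinds of spec described above can be illustrated with a minimal sketch. The helper `sketch_parse_spec` is hypothetical: the class's own `parse_path` is not shown on this page, so the normalization rules below (a leading `//` means floating; a leading `./`, `/`, or nothing means anchored) are assumptions made only for illustration.

```ruby
# Hypothetical helper, NOT the library's parse_path. It returns a
# [normalized_path, floating] pair for a spec string.
def sketch_parse_spec(str_spec)
  if str_spec.start_with?("//")
    # Floating spec: may match at any depth of the document.
    ["/" + str_spec.delete_prefix("//"), true]
  else
    # Anchored spec: strip an optional leading "./" or "/" and
    # re-add a single leading slash (assumed normalization).
    ["/" + str_spec.sub(%r{\A\.?/}, ""), false]
  end
end

floating_spec  = sketch_parse_spec("//record")
anchored_specs = ["/body/head/meta", "./body/head/meta", "body/head/meta"]
                 .map { |spec| sketch_parse_spec(spec) }
```

Under these assumptions all three anchored spellings normalize to the same path string, which is what lets the tracker compare specs against `current_path` with plain string operations.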
Instance Attribute Summary
-
#clipboard ⇒ Object
readonly
Returns the value of attribute clipboard.
-
#current_path ⇒ Object
readonly
Returns the value of attribute current_path.
-
#extra_xpath_hooks ⇒ Object
readonly
Returns the value of attribute extra_xpath_hooks.
-
#inverted_namespaces ⇒ Object
readonly
Returns the value of attribute inverted_namespaces.
-
#namespaces_stack ⇒ Object
readonly
Returns the value of attribute namespaces_stack.
-
#path_spec ⇒ Object
readonly
Returns the value of attribute path_spec.
Instance Method Summary
- #current_node_doc ⇒ Object
-
#fix_namespaces(doc) ⇒ Object
No-op unless running on JRuby, in which case we use our namespace stack to correctly add namespaces to the Nokogiri::XML::Document, because in JRuby outer_xml on the Reader doesn't do it for us.
- #floating? ⇒ Boolean
-
#initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {}) ⇒ PathTracker
constructor
A new instance of PathTracker.
- #is_jruby? ⇒ Boolean
- #match? ⇒ Boolean
- #match_path?(path_to_match, floating:) ⇒ Boolean
-
#pop ⇒ Object
Removes the last slash-separated component from current_path.
-
#push(reader_node) ⇒ Object
Adds a component to the slash-separated current_path, with namespace prefix.
- #run_extra_xpath_hooks ⇒ Object
Constructor Details
#initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {}) ⇒ PathTracker
Returns a new instance of PathTracker.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 133

def initialize(str_spec, clipboard:, namespaces: {}, extra_xpath_hooks: {})
  @inverted_namespaces = namespaces.invert
  @clipboard = clipboard
  # We're guessing using a string will be more efficient than an array
  @current_path = ""
  @floating = false

  @path_spec, @floating = parse_path(str_spec)

  @namespaces_stack = []

  @extra_xpath_hooks = extra_xpath_hooks.collect do |path, callable|
    , floating = parse_path(path)
    {
      path: ,
      floating: floating,
      callable: callable
    }
  end
end
Instance Attribute Details
#clipboard ⇒ Object (readonly)
Returns the value of attribute clipboard.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def clipboard
  @clipboard
end
#current_path ⇒ Object (readonly)
Returns the value of attribute current_path.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def current_path
  @current_path
end
#extra_xpath_hooks ⇒ Object (readonly)
Returns the value of attribute extra_xpath_hooks.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def extra_xpath_hooks
  @extra_xpath_hooks
end
#inverted_namespaces ⇒ Object (readonly)
Returns the value of attribute inverted_namespaces.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def inverted_namespaces
  @inverted_namespaces
end
#namespaces_stack ⇒ Object (readonly)
Returns the value of attribute namespaces_stack.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def namespaces_stack
  @namespaces_stack
end
#path_spec ⇒ Object (readonly)
Returns the value of attribute path_spec.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 132

def path_spec
  @path_spec
end
Instance Method Details
#current_node_doc ⇒ Object
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 195

def current_node_doc
  return nil unless @current_node

  # yeah, sadly we got to have nokogiri parse it again
  fix_namespaces(Nokogiri::XML.parse(@current_node.outer_xml))
end
#fix_namespaces(doc) ⇒ Object
No-op unless running on JRuby, in which case we use our namespace stack to correctly add namespaces to the Nokogiri::XML::Document, because in JRuby outer_xml on the Reader doesn't do it for us.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 241

def fix_namespaces(doc)
  if is_jruby?
    # Only needed in jruby, nokogiri's jruby implementation isn't weird
    # around namespaces in exactly the same way as MRI. We need to keep
    # track of the namespaces in outer contexts ourselves, and then see
    # if they are needed ourselves. :(
    namespaces = namespaces_stack.compact.reduce({}, :merge)
    default_ns = namespaces.delete("xmlns")

    namespaces.each_pair do |attrib, uri|
      ns_prefix = attrib.sub(/\Axmlns:/, '')

      # gotta make sure it's actually used in the doc to not add it
      # unnecessarily. GAH.
      if doc.xpath("//*[starts-with(name(), '#{ns_prefix}:')][1]").empty? &&
          doc.xpath("//@*[starts-with(name(), '#{ns_prefix}:')][1]").empty?
        next
      end
      doc.root.add_namespace_definition(ns_prefix, uri)
    end

    if default_ns
      doc.root.default_namespace = default_ns
      # OMG nokogiri, really?
      default_ns = doc.root.namespace
      doc.xpath("//*[namespace-uri()='']").each do |node|
        node.namespace = default_ns
      end
    end
  end
  return doc
end
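The first step of the method above, collapsing the stack of per-element namespace declarations into one hash, can be tried in isolation. This is a minimal sketch in plain Ruby: the URIs are made up, and the `nil` entry is an assumption about what an element without declarations might contribute (suggested only by the `compact` call).

```ruby
# Outermost element first; each entry models one element's declarations.
namespaces_stack = [
  { "xmlns" => "http://example.org/default",
    "xmlns:dc" => "http://example.org/dc-outer" },
  nil, # assumed: an element that declared nothing
  { "xmlns:dc" => "http://example.org/dc-inner" }
]

# Later (inner) declarations win because Hash#merge prefers the argument.
merged = namespaces_stack.compact.reduce({}, :merge)

# The default namespace is split off and handled separately.
default_ns = merged.delete("xmlns")

# Remaining keys become bare prefixes, as the sub(/\Axmlns:/, '') call does.
prefixes = merged.transform_keys { |attrib| attrib.sub(/\Axmlns:/, "") }
```

Because the merge runs from the outermost element inward, a prefix redeclared on an inner element shadows the outer binding, which matches ordinary XML namespace scoping.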
#floating? ⇒ Boolean
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 212

def floating?
  !!@floating
end
#is_jruby? ⇒ Boolean
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 170

def is_jruby?
  Traject::Util.is_jruby?
end
#match? ⇒ Boolean
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 216

def match?
  match_path?(path_spec, floating: floating?)
end
#match_path?(path_to_match, floating:) ⇒ Boolean
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 220

def match_path?(path_to_match, floating:)
  if floating?
    current_path.end_with?(path_to_match)
  else
    current_path == path_to_match
  end
end
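The difference between the two comparison modes can be seen with a stand-alone sketch. `sketch_match?` is a hypothetical helper that takes the current path as an argument instead of reading tracker state, and it decides from its `floating:` keyword alone, which is an assumption about the intended behavior rather than a copy of the method above.

```ruby
# Hypothetical helper: a floating spec only has to match the tail of the
# current path, while an anchored spec has to equal the whole path.
def sketch_match?(current_path, path_to_match, floating:)
  if floating
    current_path.end_with?(path_to_match)
  else
    current_path == path_to_match
  end
end

tail_match     = sketch_match?("/collection/record", "/record", floating: true)
anchored_miss  = sketch_match?("/collection/record", "/record", floating: false)
anchored_match = sketch_match?("/collection/record", "/collection/record", floating: false)
```

Keeping the path as a single string means both modes reduce to one built-in String comparison per node.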
#pop ⇒ Object
Removes the last slash-separated component from current_path.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 203

def pop
  current_path.slice!( current_path.rindex('/')..-1 )
  @current_node = nil

  if is_jruby?
    namespaces_stack.pop
  end
end
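The in-place truncation that pop relies on can be checked with a plain string. This is a small sketch with a made-up path, not tied to any tracker instance.

```ruby
# String.new gives a mutable string even under frozen string literals.
current_path = String.new("/collection/record/title")

# rindex finds the last slash; slice! removes from there to the end
# and returns the removed component.
removed = current_path.slice!(current_path.rindex("/")..-1)
```

After the call `removed` holds the trailing component including its slash, and `current_path` is shortened in place without allocating a new string.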
#push(reader_node) ⇒ Object
Adds a component to the slash-separated current_path, with namespace prefix.
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 175

def push(reader_node)
  namespace_prefix = reader_node.namespace_uri && inverted_namespaces[reader_node.namespace_uri]

  # gah, reader_node.name has the namespace prefix in there
  node_name = reader_node.name.gsub(/[^:]+:/, '')

  node_str = if namespace_prefix
    namespace_prefix + ":" + node_name
  else
    reader_node.name
  end

  current_path << ("/" + node_str)

  if is_jruby?
    namespaces_stack << reader_node.namespaces
  end
  @current_node = reader_node
end
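How the registered prefix replaces the document's own prefix can be sketched without Nokogiri. Everything here is illustrative: `fake_reader_node` is a hypothetical stand-in that models only the `name` and `namespace_uri` methods push reads, `sketch_component` is a hypothetical helper, and the URI is made up.

```ruby
# Hypothetical stand-in for a Nokogiri::XML::Reader node.
fake_reader_node = Struct.new(:name, :namespace_uri)

# Prefix => URI, the shape the namespaces: argument takes; inverted for lookup.
namespaces = { "oai" => "http://example.org/oai" }
inverted   = namespaces.invert

# Builds one path component the way push does: registered prefix plus
# local name when the URI is known, otherwise the node name unchanged.
def sketch_component(node, inverted)
  prefix = node.namespace_uri && inverted[node.namespace_uri]
  local  = node.name.gsub(/[^:]+:/, "")
  prefix ? "#{prefix}:#{local}" : node.name
end

current_path = String.new
# The document used prefix "x", but the registered prefix is "oai".
current_path << "/" + sketch_component(fake_reader_node.new("x:record", "http://example.org/oai"), inverted)
current_path << "/" + sketch_component(fake_reader_node.new("title", nil), inverted)
```

The point of the lookup by URI is that a spec written with the registered prefix matches regardless of which prefix the source document happens to use.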
#run_extra_xpath_hooks ⇒ Object
# File 'lib/traject/experimental_nokogiri_streaming_reader.rb', line 228

def run_extra_xpath_hooks
  return unless @current_node

  extra_xpath_hooks.each do |hook_spec|
    if match_path?(hook_spec[:path], floating: hook_spec[:floating])
      hook_spec[:callable].call(current_node_doc, clipboard)
    end
  end
end
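The hook dispatch above can be imitated with plain hashes and lambdas. In this sketch `sketch_run_hooks` is a hypothetical helper, a plain String stands in for the parsed node document, the clipboard is assumed to be hash-like, and each hook's own floating flag is assumed to select the comparison.

```ruby
# Hook specs in the shape the constructor builds: path, floating, callable.
clipboard = {}
hooks = [
  { path: "/header/title", floating: true,
    callable: ->(doc, clip) { clip[:title] = doc } }
]

# Hypothetical helper: calls every hook whose path matches the current path.
def sketch_run_hooks(current_path, node_doc, hooks, clipboard)
  hooks.each do |hook|
    matched = if hook[:floating]
      current_path.end_with?(hook[:path])
    else
      current_path == hook[:path]
    end
    hook[:callable].call(node_doc, clipboard) if matched
  end
end

sketch_run_hooks("/collection/header/title", "<title>Maps</title>", hooks, clipboard)
sketch_run_hooks("/collection/record", "<record/>", hooks, clipboard) # no hook fires
```

A hook like this lets a value seen outside the yielded records be stashed on the clipboard for later use.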