Class: Traject::NokogiriReader
- Inherits: Object (hierarchy: Object > Traject::NokogiriReader)
- Includes: Enumerable
- Defined in: lib/traject/nokogiri_reader.rb
Overview
A Traject reader which reads XML and yields zero to many Nokogiri::XML::Document objects as source records in the traject pipeline.
It processes the entire input document with Nokogiri::XML.parse (DOM parsing),
so it holds the entire input document in RAM until iteration completes.
(A separate ExperimentalStreamingNokogiriReader is available, but it is
experimental, half-finished, and problematic, may disappear or change in
backwards-incompatible ways at any time, and is not recommended for production use.)
You can have it yield the entire input XML as a single traject source record
(the default), or you can use the setting nokogiri.each_record_xpath to split
the source up into separate records to yield into the traject pipeline. Each one
will be its own Nokogiri::XML::Document.
Settings
- nokogiri.namespaces: Sets namespace prefixes that can be used in other settings, including extract_xpath from NokogiriMacros. (This is the key that #default_namespaces reads.)
- nokogiri.each_record_xpath: If set to a string XPath, takes all matching nodes from the input doc and yields them individually as source records to the pipeline. If you need to use namespaces here, you need to have them registered with nokogiri.namespaces. If your source docs use namespaces, you DO need to use them in your each_record_xpath.
- nokogiri.strict_mode: If set to true or "true", asks Nokogiri to parse in 'strict' mode: it will raise a Nokogiri::XML::SyntaxError if the XML is not well-formed, instead of trying to take its best-guess correction. See https://nokogiri.org/tutorials/ensuring_well_formed_markup.html
- nokogiri_reader.extra_xpath_hooks: Experimental and in progress, see below.
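As a rough illustration of the shape these settings take, here is a plain Hash of the kind the constructor accepts as its second argument. The key `nokogiri.namespaces` is taken from the `#default_namespaces` source listing further down this page; the `oai` prefix and the record XPath are example values borrowed from the OAI-PMH use case discussed below, not defaults.

```ruby
# Example values only; none of these are defaults of the reader.
reader_settings = {
  # Prefix => namespace URI, usable in the XPath settings below.
  "nokogiri.namespaces" => {
    "oai" => "http://www.openarchives.org/OAI/2.0/"
  },
  # Every node matching this XPath becomes its own source record.
  "nokogiri.each_record_xpath" => "//oai:record",
  # Raise on malformed XML instead of letting Nokogiri repair it.
  "nokogiri.strict_mode" => true
}

# With the traject gem loaded, the Hash would be handed to the reader as:
#   Traject::NokogiriReader.new(File.open("harvest.xml"), reader_settings)
# ("harvest.xml" is a hypothetical file name.)
```

In a traject configuration file the same keys would be supplied with `provide`, in the style of the hook example below.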
nokogiri_reader.extra_xpath_hooks: For handling nodes outside of your each_record_xpath
What if you want to use each_record_xpath to yield certain nodes as source documents, but there is additional information you need in other parts of the input document? This came up when developing the OaiPmhNokogiriReader, which yields "//oai:record" as pipeline source documents, but also needed to look at "//oai:resumptionToken" to harvest the entire result set.
There is a semi-finished, in-progress feature that meets that use case; it is unclear whether it will meet all use cases for this general issue.
The setting nokogiri_reader.extra_xpath_hooks can be set to a Hash where the keys are XPaths (any
namespaces used must be registered with nokogiri.namespaces, the key that #default_namespaces reads)
and each value is a lambda, proc, or other callable object taking two arguments.
provide "nokogiri_reader.extra_xpath_hooks", {
  "//oai:resumptionToken" =>
    lambda do |node, clipboard|
      clipboard[:resumption_token] = node.text
    end
}
The first arg is the matching node. The second arg is a thread-safe Hash "clipboard" that you can store things in, and later access via reader.clipboard. Why a clipboard? Whatever you extract in the hook has to be stored somewhere that is safe to write to in the middle of nokogiri processing.
There's no great thread-safe way to get reader.clipboard in a normal nokogiri pipeline, though: the reader can change during multi-file handling, so there can be a race condition if you try naively (don't!). That is why this feature needs some work for general applicability. The OaiPmhReader manually creates its readers outside the usual nokogiri flow, so it can use it.
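The two-argument hook contract can be illustrated without Nokogiri at all. In this sketch `FakeNode`, the `matches` table, and the dispatch loop are stand-ins invented for the example; how the reader actually finds nodes and invokes hooks is not shown on this page, and the real clipboard is a Concurrent::Hash or Concurrent::Map rather than a plain Hash.

```ruby
# Stand-in for a matched Nokogiri node: the hook below only needs #text.
FakeNode = Struct.new(:text)

# Same shape as the setting value: XPath string => callable(node, clipboard).
hooks = {
  "//oai:resumptionToken" => ->(node, clipboard) {
    clipboard[:resumption_token] = node.text
  }
}

# Plain Hash standing in for the reader's thread-safe clipboard.
clipboard = {}

# Hypothetical lookup of which nodes matched each XPath.
matches = { "//oai:resumptionToken" => [FakeNode.new("token-123")] }

# Call every hook once per matching node, passing the shared clipboard.
hooks.each_pair do |xpath, callable|
  matches.fetch(xpath, []).each { |node| callable.call(node, clipboard) }
end

token = clipboard[:resumption_token]
```

Code that created the reader itself, as the OAI-PMH reader is described as doing, would then read the stored value back from `reader.clipboard`.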
Instance Attribute Summary
- #clipboard ⇒ Object (readonly): Returns the value of attribute clipboard.
- #input_stream ⇒ Object (readonly): Returns the value of attribute input_stream.
- #path_tracker ⇒ Object (readonly): Returns the value of attribute path_tracker.
- #settings ⇒ Object (readonly): Returns the value of attribute settings.
Instance Method Summary
- #default_namespaces ⇒ Object
- #each ⇒ Object
- #each_record_xpath ⇒ Object
- #extra_xpath_hooks ⇒ Object
- #initialize(input_stream, settings) ⇒ NokogiriReader (constructor): A new instance of NokogiriReader.
Constructor Details
#initialize(input_stream, settings) ⇒ NokogiriReader
Returns a new instance of NokogiriReader.
# File 'lib/traject/nokogiri_reader.rb', line 66
def initialize(input_stream, settings)
  @settings = Traject::Indexer::Settings.new settings
  @input_stream = input_stream
  @clipboard = Traject::Util.is_jruby? ? Concurrent::Map.new : Concurrent::Hash.new

  default_namespaces # trigger validation
  validate_xpath(each_record_xpath, key_name: "each_record_xpath") if each_record_xpath
  extra_xpath_hooks.each_pair do |xpath, _callable|
    validate_xpath(xpath, key_name: "extra_xpath_hooks")
  end
end
Instance Attribute Details
#clipboard ⇒ Object (readonly)
Returns the value of attribute clipboard.
# File 'lib/traject/nokogiri_reader.rb', line 64
def clipboard
  @clipboard
end
#input_stream ⇒ Object (readonly)
Returns the value of attribute input_stream.
# File 'lib/traject/nokogiri_reader.rb', line 64
def input_stream
  @input_stream
end
#path_tracker ⇒ Object (readonly)
Returns the value of attribute path_tracker.
# File 'lib/traject/nokogiri_reader.rb', line 64
def path_tracker
  @path_tracker
end
#settings ⇒ Object (readonly)
Returns the value of attribute settings.
# File 'lib/traject/nokogiri_reader.rb', line 64
def settings
  @settings
end
Instance Method Details
#default_namespaces ⇒ Object
# File 'lib/traject/nokogiri_reader.rb', line 86
def default_namespaces
  @default_namespaces ||= (settings["nokogiri.namespaces"] || {}).tap { |ns|
    unless ns.kind_of?(Hash)
      raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
    end
  }
end
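The validation in `#default_namespaces` can be tried in isolation. The sketch below copies the check into a free-standing helper, `checked_namespaces`, which is a hypothetical name used only here; it takes a plain Hash in place of the reader's settings object and leaves out the memoization.

```ruby
# Hypothetical stand-alone version of the check performed by #default_namespaces.
def checked_namespaces(settings)
  # A missing setting falls back to an empty Hash; anything else must be a Hash.
  (settings["nokogiri.namespaces"] || {}).tap do |ns|
    unless ns.kind_of?(Hash)
      raise ArgumentError, "nokogiri.namespaces must be a hash, not: #{ns.inspect}"
    end
  end
end

# No setting given: an empty Hash comes back.
empty = checked_namespaces({})

# A well-formed setting is returned unchanged.
valid = checked_namespaces("nokogiri.namespaces" => { "oai" => "http://www.openarchives.org/OAI/2.0/" })

# A String where a Hash is expected raises ArgumentError.
error_message = begin
  checked_namespaces("nokogiri.namespaces" => "oai")
  nil
rescue ArgumentError => e
  e.message
end
```

Since the constructor calls `default_namespaces` to trigger validation, a malformed setting of this kind surfaces when the reader is created rather than partway through iteration.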
#each ⇒ Object
# File 'lib/traject/nokogiri_reader.rb', line 94
def each
  config_proc = if settings["nokogiri.strict_mode"]
    proc { |config| config.strict }
  end

  whole_input_doc = Nokogiri::XML.parse(input_stream, &config_proc)

  if each_record_xpath
    whole_input_doc.xpath(each_record_xpath, default_namespaces).each do |matching_node|
      # We want to take the matching node, and make it into root in a new Nokogiri document.
      # This is tricky to do as performant as possible (we want to re-use the existing libxml node),
      # while preserving namespaces properly (especially in jruby). Some uses of noko api that seem
      # like they should work don't, esp in jruby.
      child_doc = Nokogiri::XML::Document.new

      reparent_node_to_root(child_doc, matching_node)

      yield child_doc

      child_doc = nil # hopefully make things easier on the GC.
    end
  else
    # caller wants whole doc as a traject source record
    yield whole_input_doc
  end

  run_extra_xpath_hooks(whole_input_doc)

ensure
  # hopefully make things easier on the GC.
  whole_input_doc = nil
end
#each_record_xpath ⇒ Object
# File 'lib/traject/nokogiri_reader.rb', line 78
def each_record_xpath
  @each_record_xpath ||= settings["nokogiri.each_record_xpath"]
end
#extra_xpath_hooks ⇒ Object
# File 'lib/traject/nokogiri_reader.rb', line 82
def extra_xpath_hooks
  @extra_xpath_hooks ||= settings["nokogiri_reader.extra_xpath_hooks"] || {}
end