Class: Cacofonix::Normaliser

Inherits:
Object
Defined in:
lib/cacofonix/utils/normaliser.rb

Overview

A standalone class that can be used to normalise ONIX files into a standardised form. If you’re accepting ONIX files from a wide range of suppliers, you’re guaranteed to get all sorts of dialects.

This will create a new file that:

  • is UTF-8 encoded

  • uses reference tags, not short

  • has no named entities (ndash, etc.) other than &amp;, &lt; and &gt;
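
One way to spot-check the first property on a normalised file is to test whether its bytes form valid UTF-8. The following is a minimal sketch; `valid_utf8_file?` is a hypothetical helper and is not part of Cacofonix.

```ruby
# Hypothetical helper (not part of Cacofonix): returns true when the file's
# bytes are a valid UTF-8 sequence.
def valid_utf8_file?(path)
  # Read raw bytes, then ask Ruby whether they are well-formed as UTF-8.
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end
```

Note that plain ASCII files also pass this check, since ASCII is a subset of UTF-8.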

Usage:

Cacofonix::Normaliser.process("oldfile.xml", "newfile.xml")

Dependencies:

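At this stage the class depends on several external apps, all commonly available on *nix systems: xsltproc, isutf8, iconv and sed. Note that the constructor only checks for xsltproc and tr, and the methods documented here shell out to xsltproc, tr, cat and which.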

Constructor Details

#initialize(oldfile, newfile = nil) ⇒ Normaliser

NB: the newfile argument is deprecated; pass the destination path to normalise_to_path instead.

Raises:

  • (ArgumentError)


# File 'lib/cacofonix/utils/normaliser.rb', line 41

def initialize(oldfile, newfile = nil)
  raise ArgumentError, "#{oldfile} does not exist" unless File.file?(oldfile)
  raise "xsltproc app not found" unless app_available?("xsltproc")
  raise "tr app not found"       unless app_available?("tr")

  @oldfile = oldfile
  @newfile = newfile
  @curfile = next_tempfile
  FileUtils.cp(@oldfile, @curfile)
  @head    = File.open(@oldfile, "r") { |f| f.read(1024) }
end

Class Method Details

.process(oldfile, newfile) ⇒ Object

Normalises oldfile and saves the result as newfile. oldfile is left untouched.



# File 'lib/cacofonix/utils/normaliser.rb', line 34

def process(oldfile, newfile)
  self.new(oldfile).normalise_to_path(newfile)
end

Instance Method Details

#app_available?(app) ⇒ Boolean

Checks whether the specified app is available on the system.

Returns:

  • (Boolean)


# File 'lib/cacofonix/utils/normaliser.rb', line 87

def app_available?(app)
  !`which #{app}`.strip.empty?
end
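
The check above shells out to `which`, so it assumes `which` itself is installed. As an illustration only, the same question can be answered in pure Ruby by scanning the directories listed in PATH. `executable_on_path?` and its `path` parameter are hypothetical names, not part of Cacofonix.

```ruby
# Hypothetical helper (not part of Cacofonix): looks for an executable file
# named +app+ in each directory of a PATH-style string.
def executable_on_path?(app, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).any? do |dir|
    candidate = File.join(dir, app)
    # Must be a regular file that the current user may execute.
    File.file?(candidate) && File.executable?(candidate)
  end
end
```

Passing the search path as an argument keeps the helper easy to exercise against a temporary directory.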

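#next_tempfile ⇒ Object

Generates a temp filename. A Tempfile is created to obtain a unique path, then closed and unlinked, so only the path is returned.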



# File 'lib/cacofonix/utils/normaliser.rb', line 93

def next_tempfile
  p = nil
  Tempfile.open("onix") do |tf|
    p = tf.path
    tf.close!
  end
  p
end
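
The behaviour of this method can be seen in isolation with the standalone sketch below. `unused_temp_path` is a hypothetical name used for illustration; it mirrors the approach above and makes explicit that the returned path no longer exists on disk once the method returns.

```ruby
require "tempfile"

# Hypothetical standalone version of the approach above: create a Tempfile
# to reserve a unique name, remember its path, then close and unlink it.
def unused_temp_path(prefix = "onix")
  tf = Tempfile.new(prefix)
  path = tf.path # capture first: path is nil after close!
  tf.close!      # closes the handle and deletes the file
  path
end
```

Because the file is removed before the path is used, another process could in principle claim the same name in the meantime, so this pattern suits trusted environments best.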

#normalise_to_path(newfile) ⇒ Object

Normalises the input file and copies the result to newfile. Raises an ArgumentError if a file already exists at newfile.

Raises:

  • (ArgumentError)


# File 'lib/cacofonix/utils/normaliser.rb', line 58

def normalise_to_path(newfile)
  raise ArgumentError, "#{newfile} already exists" if File.file?(newfile)
  @curfile = normalise_to_tempfile
  FileUtils.cp(@curfile, newfile)
end

#normalise_to_tempfile ⇒ Object

Processes oldfile and puts the normalised result in a tempfile, returning the path to that tempfile.



# File 'lib/cacofonix/utils/normaliser.rb', line 67

def normalise_to_tempfile
  src = @curfile

  # remove short tags
  if @head.include?("ONIXmessage")
    dest = next_tempfile
    to_reference_tags(src, dest)
    src = dest
  end

  # remove control chars
  dest = next_tempfile
  remove_control_chars(src, dest)
  dest
end
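
The short-tag test above relies on the root element name: ONIX 2.1 short-tag files use `ONIXmessage` (lowercase m), while reference-tag files use `ONIXMessage`. The sketch below isolates that heuristic, reading the same first 1024 bytes as the constructor. `short_tags?` and `HEAD_BYTES` are hypothetical names, and the nil guard for empty files is an addition not present in the code above.

```ruby
# Number of leading bytes inspected, matching the constructor's read(1024).
HEAD_BYTES = 1024

# Hypothetical helper (not part of Cacofonix): guesses whether an ONIX file
# uses short tags by looking for the short-tag root element name.
def short_tags?(path)
  head = File.open(path, "r") { |f| f.read(HEAD_BYTES) }
  # read(n) returns nil for an empty file, so guard before searching.
  !head.nil? && head.include?("ONIXmessage")
end
```

The match is case-sensitive, which is what distinguishes the two dialects.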

#remove_control_chars(src, dest) ⇒ Object

XML files shouldn’t contain low ASCII control chars. Strip them.



# File 'lib/cacofonix/utils/normaliser.rb', line 117

def remove_control_chars(src, dest)
  inpath = File.expand_path(src)
  outpath = File.expand_path(dest)
  `cat #{inpath} | tr -d "\\000-\\010\\013\\014\\016-\\037" > #{outpath}`
end
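
The `tr` character set above deletes bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F, which leaves tab, line feed and carriage return in place. For illustration, the same filter can be sketched in pure Ruby without a subshell. `strip_control_chars` is a hypothetical name, and unlike the `tr` pipeline it reads the whole file into memory.

```ruby
# Hypothetical helper (not part of Cacofonix): removes the same bytes as
#   tr -d "\000-\010\013\014\016-\037"
# while keeping tab (0x09), line feed (0x0A) and carriage return (0x0D).
def strip_control_chars(src, dest)
  data = File.binread(src) # raw bytes, no encoding conversion
  cleaned = data.delete("\x00-\x08\x0B\x0C\x0E-\x1F")
  File.binwrite(dest, cleaned)
end
```

Working on raw bytes means multi-byte UTF-8 sequences pass through unchanged, since none of their bytes fall in the deleted ranges.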

#run ⇒ Object

Deprecated: use normalise_to_path with a destination path instead.



# File 'lib/cacofonix/utils/normaliser.rb', line 54

def run
  normalise_to_path(@newfile)
end

#to_reference_tags(src, dest) ⇒ Object

Uses an XSLT stylesheet provided by EDItEUR to convert a file from short tags to reference (long) tags.

More detail here:

http://www.editeur.org/files/ONIX%203/ONIX%20tagname%20converter%20v2.htm


# File 'lib/cacofonix/utils/normaliser.rb', line 108

def to_reference_tags(src, dest)
  inpath = File.expand_path(src)
  outpath = File.expand_path(dest)
  xsltpath = File.dirname(__FILE__) + "/../../../support/switch-onix-2.1-short-to-reference.xsl"
  `xsltproc -o #{outpath} #{xsltpath} #{inpath}`
end
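
The command above interpolates paths directly into a shell string, so a path containing spaces or shell metacharacters would be split or misinterpreted. A minimal sketch of one way to build the same command with escaping, using Ruby's standard Shellwords library, follows. `xsltproc_command` is a hypothetical helper, not part of Cacofonix, and it only builds the string without running it.

```ruby
require "shellwords"

# Hypothetical helper (not part of Cacofonix): builds the xsltproc command
# line with every argument shell-escaped.
def xsltproc_command(src, dest, xslt)
  args = [
    "xsltproc",
    "-o", File.expand_path(dest), # output file
    File.expand_path(xslt),       # stylesheet
    File.expand_path(src)         # input file
  ]
  Shellwords.join(args)
end
```

The resulting string can be passed to backticks or `system` in place of the interpolated version.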