Class: Hermeneutics::Entities

Inherits:
Object
Defined in:
lib/hermeneutics/escape.rb

Overview

Translate HTML and XML character entities: “&” to “&amp;” and vice versa.

What actually happens

HTML pages usually arrive with special characters encoded as entities: &lt; for < and &euro; for €.

Further, they may contain a meta tag in the header like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8" />                        (HTML5)

or

<?xml version="1.0" encoding="UTF-8" ?>         (XHTML)

When the charset is utf-8 and the file contains the byte sequence “\303\244”/“\xc3\xa4”, the character “ä” is displayed.

When the charset is iso8859-15 and the file contains the byte sequence “\344”/“\xe4”, the character “ä” is displayed, too.

The sequence “&auml;” will produce an “ä” in any case.
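
The byte-level statements above can be checked with core String methods. This is a minimal sketch; the constant names are illustrative and not part of Hermeneutics.

```ruby
# "ä" as it is stored under two different charsets.
UTF8_AUML   = "\xc3\xa4".dup.force_encoding("UTF-8")       # two bytes
LATIN9_AUML = "\xe4".dup.force_encoding("ISO-8859-15")     # one byte

# Octal and hexadecimal escapes describe the same bytes.
OCTAL_AUML  = "\303\244".dup.force_encoding("UTF-8")

puts UTF8_AUML.bytes.inspect      # [195, 164]
puts LATIN9_AUML.bytes.inspect    # [228]
puts OCTAL_AUML == UTF8_AUML      # true

# Transcoding the ISO-8859-15 byte to UTF-8 yields the same character.
puts LATIN9_AUML.encode("UTF-8") == UTF8_AUML   # true
```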

What you should do

When generating your own HTML pages, you will always be safe if you emit only entity references such as &auml; and &euro;, or their numeric forms &#x00e4; and &#x20ac;.
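
As an illustration of the numeric form, the following sketch builds a hexadecimal character reference from a character's code point. The helper name numeric_entity is hypothetical and not part of this class.

```ruby
# Hypothetical helper: numeric character reference for a single character.
def numeric_entity(char)
  # Transcode first so that ord is the Unicode code point.
  format("&#x%04x;", char.encode("UTF-8").ord)
end

puts numeric_entity("ä")   # &#x00e4;
puts numeric_entity("€")   # &#x20ac;
```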

What this module does

This module translates strings to an HTML-masked version. The encoding is not changed, and you may ask to keep 8-bit characters unmasked.

Examples

Entities.encode "<"                           #=> "&lt;"
Entities.decode "&lt;"                        #=> "<"
Entities.encode "äöü"                         #=> "&auml;&ouml;&uuml;"
Entities.decode "&auml;&ouml;&uuml;"          #=> "äöü"

Constant Summary

SPECIAL_ASC =

{
  '"' => "quot",    "&" => "amp",     "<" => "lt",      ">" => "gt",
}
RE_ASC =
/[#{SPECIAL_ASC.keys.map { |x| Regexp.quote x }.join}]/
SPECIAL =
{
  "\u00a0" => "nbsp",
                    "¡" => "iexcl",   "¢" => "cent",    "£" => "pound",   "€" => "euro",    "¥" => "yen",     "Š" => "Scaron",
                                                                          "¤" => "curren",                    "¦" => "brvbar",
  "§" => "sect",    "š" => "scaron",  "©" => "copy",    "ª" => "ordf",    "«" => "laquo",   "¬" => "not",     "­" => "shy",
                    "¨" => "uml",
  "®" => "reg",     "¯" => "macr",

  "°" => "deg",     "±" => "plusmn",  "²" => "sup2",    "³" => "sup3",                      "µ" => "micro",   "¶" => "para",
                                                                          "´" => "acute",
  "·" => "middot",                    "¹" => "sup1",    "º" => "ordm",    "»" => "raquo",   "Œ" => "OElig",   "œ" => "oelig",
                    "¸" => "cedil",                                                         "¼" => "frac14",  "½" => "frac12",
  "Ÿ" => "Yuml",    "¿" => "iquest",
  "¾" => "frac34",

  "À" => "Agrave",  "Á" => "Aacute",  "Â" => "Acirc",   "Ã" => "Atilde",  "Ä" => "Auml",    "Å" => "Aring",   "Æ" => "AElig",
  "Ç" => "Ccedil",  "È" => "Egrave",  "É" => "Eacute",  "Ê" => "Ecirc",   "Ë" => "Euml",    "Ì" => "Igrave",  "Í" => "Iacute",
  "Î" => "Icirc",   "Ï" => "Iuml",
  "Ð" => "ETH",     "Ñ" => "Ntilde",  "Ò" => "Ograve",  "Ó" => "Oacute",  "Ô" => "Ocirc",   "Õ" => "Otilde",  "Ö" => "Ouml",
  "×" => "times",   "Ø" => "Oslash",  "Ù" => "Ugrave",  "Ú" => "Uacute",  "Û" => "Ucirc",   "Ü" => "Uuml",    "Ý" => "Yacute",
  "Þ" => "THORN",   "ß" => "szlig",

  "à" => "agrave",  "á" => "aacute",  "â" => "acirc",   "ã" => "atilde",  "ä" => "auml",    "å" => "aring",   "æ" => "aelig",
  "ç" => "ccedil",  "è" => "egrave",  "é" => "eacute",  "ê" => "ecirc",   "ë" => "euml",    "ì" => "igrave",  "í" => "iacute",
  "î" => "icirc",   "ï" => "iuml",
  "ð" => "eth",     "ñ" => "ntilde",  "ò" => "ograve",  "ó" => "oacute",  "ô" => "ocirc",   "õ" => "otilde",  "ö" => "ouml",
  "÷" => "divide",  "ø" => "oslash",  "ù" => "ugrave",  "ú" => "uacute",  "û" => "ucirc",   "ü" => "uuml",    "ý" => "yacute",
  "þ" => "thorn",   "ÿ" => "yuml",

  "‚" => "bsquo",   "‘" => "lsquo",   "„" => "bdquo",   "“" => "ldquo",   "‹" => "lsaquo",  "›" => "rsaquo",
  "–" => "ndash",   "—" => "mdash",   "‰" => "permil",  "…" => "hellip",  "†" => "dagger",  "‡" => "Dagger",
}.update SPECIAL_ASC
NAMES =
SPECIAL.invert
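
How the three kinds of constants relate can be shown on a small excerpt. This is only a sketch: the names ASC_EXCERPT, RE_EXCERPT and NAMES_EXCERPT are made up for the example, and the table is cut down to the four ASCII entries.

```ruby
# A small table in the shape of SPECIAL_ASC: character => entity name.
ASC_EXCERPT = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }.freeze

# A character class matching any key, built the same way as RE_ASC.
RE_EXCERPT = /[#{ASC_EXCERPT.keys.map { |x| Regexp.quote x }.join}]/

# Inverting the table maps entity names back to characters, like NAMES.
NAMES_EXCERPT = ASC_EXCERPT.invert.freeze

puts "a < b".scan(RE_EXCERPT).inspect   # ["<"]
puts NAMES_EXCERPT["amp"]               # &
```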


Constructor Details

#initialize(keep_8bit: nil) ⇒ Entities

new( keep_8bit: bool)     -> ent

Creates an Entities converter.

ent = Entities.new keep_8bit: true


# File 'lib/hermeneutics/escape.rb', line 123

def initialize keep_8bit: nil
  @keep_8bit = keep_8bit
end

Instance Attribute Details

#keep_8bit ⇒ Object

Returns the keep_8bit setting that was given to the constructor.

# File 'lib/hermeneutics/escape.rb', line 114

def keep_8bit
  @keep_8bit
end

Class Method Details

.decode(str) ⇒ Object

Entities.decode( str)       -> str

Replace HTML-style masks by normal characters:

Entities.decode "&lt;"                       #=> "<"
Entities.decode "&auml;&ouml;&uuml;"         #=> "äöü"

Unmasked 8-bit characters (“ä” instead of “&auml;”) are kept, and the result is brought into a single, uniform encoding.

s = "ä &ouml; ü"
s.encode! "utf-8"
Entities.decode s                            #=> "ä ö ü"

s = "\xe4 &ouml; \xfc &#x20ac;"
s.force_encoding "iso-8859-15"
Entities.decode s                            #=> "ä ö ü €"
                                                 (in iso8859-15)


# File 'lib/hermeneutics/escape.rb', line 197

def decode str
  str.gsub /&(.+?);/ do
    (named_decode $1) or (numeric_decode $1) or $&
  end
end
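
The helpers named_decode and numeric_decode are not shown on this page. The following standalone sketch illustrates the same lookup order (entity name first, numeric reference second, otherwise leave the text untouched). All names ending in _sketch are hypothetical, the name table is a tiny excerpt, and the sketch assumes UTF-8 input.

```ruby
# Excerpt of a name => character table (the role NAMES plays above).
NAMES_SKETCH = {
  "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"',
  "auml" => "ä", "ouml" => "ö", "uuml" => "ü",
}.freeze

# Decode "#x20ac" or "#8364" style references; nil when ref is neither.
def numeric_sketch(ref)
  case ref
  when /\A#x(\h+)\z/i then [$1.hex].pack("U")
  when /\A#(\d+)\z/   then [$1.to_i].pack("U")
  end
end

def decode_sketch(str)
  str.gsub(/&(.+?);/) do
    name = $1
    # Unknown entities are passed through unchanged via $&.
    NAMES_SKETCH[name] || numeric_sketch(name) || $&
  end
end

puts decode_sketch("&lt;&auml;&gt; &#x20ac; &#228; &bogus;")
# <ä> € ä &bogus;
```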

.encode(str) ⇒ Object



# File 'lib/hermeneutics/escape.rb', line 173

def encode str
  std.encode str
end

.std ⇒ Object

Returns the shared default instance, creating it on first use.

# File 'lib/hermeneutics/escape.rb', line 169

def std
  @std ||= new
end
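
The class-level encode delegates to this memoized default instance. The pattern can be sketched in isolation as follows; the class Masker and its simplified encode are invented for the example and stand in for the real converter.

```ruby
class Masker
  class << self
    # Lazily created default instance, shared by the class-level helpers.
    def std
      @std ||= new
    end

    # Class-level convenience method that forwards to the default instance.
    def encode(str)
      std.encode str
    end
  end

  # Deliberately simplified: masks only "<".
  def encode(str)
    str.gsub("<", "&lt;")
  end
end

puts Masker.encode("<")              # &lt;
puts Masker.std.equal?(Masker.std)   # true
```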

Instance Method Details

#decode(str) ⇒ Object



# File 'lib/hermeneutics/escape.rb', line 161

def decode str
  self.class.decode str
end

#encode(str) ⇒ Object

ent.encode( str)      -> str

Create a string whose characters are masked HTML-style:

ent = Entities.new
ent.encode "&<\""    #=> "&amp;&lt;&quot;"
ent.encode "äöü"     #=> "&auml;&ouml;&uuml;"

The result has the same encoding as the source, even when it contains no 8-bit characters. 8-bit characters can only remain in the result when keep_8bit is set.

ent = Entities.new keep_8bit: true

uml = "<ä>".encode "UTF-8"
ent.encode uml             #=> "&lt;\xc3\xa4&gt;" in UTF-8

uml = "<ä>".encode "ISO-8859-1"
ent.encode uml             #=> "&lt;\xe4&gt;"     in ISO-8859-1


# File 'lib/hermeneutics/escape.rb', line 148

def encode str
  r = str.new_string
  r.gsub! RE_ASC do |x| "&#{SPECIAL_ASC[ x]};" end
  unless @keep_8bit then
    r.gsub! /[^\0-\x7f]/ do |c|
      c.encode! __ENCODING__
      s = SPECIAL[ c] || ("#x%04x" % c.ord)
      "&#{s};"
    end
  end
  r
end
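
The method above relies on String#new_string, which is not a core Ruby method and is assumed here to come from a supporting library. A standalone sketch of the same steps using only core methods might look like this; encode_sketch and the two small tables are hypothetical, and String#dup replaces new_string.

```ruby
ASC_SKETCH   = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }.freeze
NAMED_SKETCH = { "ä" => "auml", "ö" => "ouml", "ü" => "uuml" }.freeze

def encode_sketch(str, keep_8bit: false)
  r = str.dup
  # Step 1: always mask the four ASCII specials.
  r.gsub!(/["&<>]/) { |x| "&#{ASC_SKETCH[x]};" }
  # Step 2: unless 8-bit characters are kept, mask everything beyond ASCII.
  unless keep_8bit
    r.gsub!(/[^\0-\x7f]/) do |c|
      u = c.encode("UTF-8")   # look up by Unicode character
      "&#{NAMED_SKETCH[u] || format('#x%04x', u.ord)};"
    end
  end
  r
end

puts encode_sketch("<ä>")   # &lt;&auml;&gt;
puts encode_sketch("€")     # &#x20ac;

# The source encoding is preserved in both modes.
iso = "<ä>".encode("ISO-8859-1")
puts encode_sketch(iso, keep_8bit: true).encoding   # ISO-8859-1
```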