Class: Hermeneutics::Entities
- Inherits:
- Object
- Defined in:
- lib/hermeneutics/escape.rb
Overview
Translate HTML and XML character entities: “&” to “&amp;” and vice versa.
What actually happens
HTML pages usually come in with characters encoded as &lt; for < and &euro; for €.
Further, they may contain a meta tag in the header like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta charset="utf-8" /> (HTML5)
or
<?xml version="1.0" encoding="UTF-8" ?> (XHTML)
When charset is utf-8 and the file contains the byte sequence “\303\244”/“\xc3\xa4”, the character “ä” will be displayed.
When charset is iso8859-15 and the file contains the byte sequence “\344”/“\xe4”, the character “ä” will be displayed, too.
The sequence “&auml;” will produce an “ä” in any case.
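The byte sequences named above can be checked directly in Ruby. This is a minimal sketch using only core String methods; the variable names are illustrative and not part of the library.

```ruby
# The character "ä" (U+00E4), written as an escape so the example does not
# depend on the encoding of the source file.
utf8   = "\u00e4".encode "UTF-8"
latin9 = "\u00e4".encode "ISO-8859-15"

utf8_bytes   = utf8.bytes     # two bytes in UTF-8
latin9_bytes = latin9.bytes   # a single byte in ISO-8859-15

# Reading the ISO-8859-15 byte under the wrong charset gives an invalid string.
misread = latin9.dup.force_encoding "UTF-8"

puts utf8_bytes.map   { |b| "\\x%02x" % b }.join
puts latin9_bytes.map { |b| "\\x%02x" % b }.join
puts misread.valid_encoding?
```

The same character therefore needs a different byte sequence depending on the declared charset, which is why an entity is the safer choice.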
What you should do
When generating your own HTML pages, you will always be safe if you only produce entities such as &auml; and &euro;, or &#x00e4; and &#x20ac; respectively.
What this module does
This class translates strings to an HTML-masked version. The encoding is not changed, and you may ask to keep 8-bit characters unmasked.
Examples
Entities.encode "<" #=> "&lt;"
Entities.decode "&lt;" #=> "<"
Entities.encode "äöü" #=> "&auml;&ouml;&uuml;"
Entities.decode "&auml;&ouml;&uuml;" #=> "äöü"
Constant Summary
- SPECIAL_ASC =
{ '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt", }
- RE_ASC =
/[#{SPECIAL_ASC.keys.map { |x| Regexp.quote x }.join}]/
- SPECIAL =
{ "\u00a0" => "nbsp", "¡" => "iexcl", "¢" => "cent", "£" => "pound", "€" => "euro", "¥" => "yen", "Š" => "Scaron", "¤" => "curren", "¦" => "brvbar", "§" => "sect", "š" => "scaron", "©" => "copy", "ª" => "ordf", "«" => "laquo", "¬" => "not", "" => "shy", "¨" => "uml", "®" => "reg", "¯" => "macr", "°" => "deg", "±" => "plusmn", "²" => "sup2", "³" => "sup3", "µ" => "micro", "¶" => "para", "´" => "acute", "·" => "middot", "¹" => "sup1", "º" => "ordm", "»" => "raquo", "Œ" => "OElig", "œ" => "oelig", "¸" => "cedil", "¼" => "frac14", "½" => "frac12", "Ÿ" => "Yuml", "¿" => "iquest", "¾" => "frac34", "À" => "Agrave", "Á" => "Aacute", "Â" => "Acirc", "Ã" => "Atilde", "Ä" => "Auml", "Å" => "Aring", "Æ" => "AElig", "Ç" => "Ccedil", "È" => "Egrave", "É" => "Eacute", "Ê" => "Ecirc", "Ë" => "Euml", "Ì" => "Igrave", "Í" => "Iacute", "Î" => "Icirc", "Ï" => "Iuml", "Ð" => "ETH", "Ñ" => "Ntilde", "Ò" => "Ograve", "Ó" => "Oacute", "Ô" => "Ocirc", "Õ" => "Otilde", "Ö" => "Ouml", "×" => "times", "Ø" => "Oslash", "Ù" => "Ugrave", "Ú" => "Uacute", "Û" => "Ucirc", "Ü" => "Uuml", "Ý" => "Yacute", "Þ" => "THORN", "ß" => "szlig", "à" => "agrave", "á" => "aacute", "â" => "acirc", "ã" => "atilde", "ä" => "auml", "å" => "aring", "æ" => "aelig", "ç" => "ccedil", "è" => "egrave", "é" => "eacute", "ê" => "ecirc", "ë" => "euml", "ì" => "igrave", "í" => "iacute", "î" => "icirc", "ï" => "iuml", "ð" => "eth", "ñ" => "ntilde", "ò" => "ograve", "ó" => "oacute", "ô" => "ocirc", "õ" => "otilde", "ö" => "ouml", "÷" => "divide", "ø" => "oslash", "ù" => "ugrave", "ú" => "uacute", "û" => "ucirc", "ü" => "uuml", "ý" => "yacute", "þ" => "thorn", "ÿ" => "yuml", "‚" => "bsquo", "‘" => "lsquo", "„" => "bdquo", "“" => "ldquo", "‹" => "lsaquo", "›" => "rsaquo", "–" => "ndash", "—" => "mdash", "‰" => "permil", "…" => "hellip", "†" => "dagger", "‡" => "Dagger", }.update SPECIAL_ASC
- NAMES =
SPECIAL.invert
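How these tables relate can be shown with a small excerpt. This is a minimal sketch; the lowercase variable names are illustrative stand-ins for the constants, and the two-entry table is an assumption made to keep the example short.

```ruby
# Excerpt of the tables, built the same way as the constants above.
special_asc = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }
special     = { "\u00e4" => "auml", "\u20ac" => "euro" }.update special_asc
names       = special.invert

# Character class matching any ASCII character that needs masking.
re_asc = /[#{special_asc.keys.map { |x| Regexp.quote x }.join}]/

euro_name = special["\u20ac"]   # character -> entity name
lt_char   = names["lt"]         # entity name -> character
first_hit = ("a<b" =~ re_asc)   # index of the first character to mask

p euro_name, lt_char, first_hit
```

Encoding looks characters up in the forward table, while decoding uses the inverted one.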
Instance Attribute Summary
-
#keep_8bit ⇒ Object
Class Method Summary
-
.decode(str) ⇒ Object
:call-seq: Entities.decode( str) -> str.
- .encode(str) ⇒ Object
- .std ⇒ Object
Instance Method Summary
- #decode(str) ⇒ Object
-
#encode(str) ⇒ Object
:call-seq: ent.encode( str) -> str.
-
#initialize(keep_8bit: nil) ⇒ Entities
constructor
:call-seq: new( keep_8bit: bool) -> ent.
Constructor Details
Instance Attribute Details
#keep_8bit ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 114
def keep_8bit
  @keep_8bit
end
Class Method Details
.decode(str) ⇒ Object
:call-seq:
Entities.decode( str) -> str
Replace HTML-style masks by normal characters:
Entities.decode "&lt;" #=> "<"
Entities.decode "&auml;&ouml;&uuml;" #=> "äöü"
Unmasked 8-bit characters (“ä” instead of “&auml;”) are kept but converted to one uniform encoding.
s = "&auml; ö &uuml;"
s.encode! "utf-8"
Entities.decode s #=> "ä ö ü"
s = "\xe4 &ouml; \xfc &euro;"
s.force_encoding "iso-8859-15"
Entities.decode s #=> "ä ö ü €" (in iso8859-15)
# File 'lib/hermeneutics/escape.rb', line 197
def decode str
  str.gsub /&(.+?);/ do
    (named_decode $1) or (numeric_decode $1) or $&
  end
end
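The helpers named_decode and numeric_decode are not shown on this page. The following is a minimal sketch of how such a lookup chain could work, under stated assumptions: every name ending in _sketch or _SKETCH is hypothetical, the table is a small excerpt, and the result is always UTF-8, whereas the library adapts to the encoding of its input.

```ruby
# Hypothetical excerpt of a name table; the real NAMES is much larger.
NAMES_SKETCH = { "lt" => "<", "gt" => ">", "amp" => "&", "quot" => '"',
                 "auml" => "\u00e4", "euro" => "\u20ac" }

# Look up a named entity such as "auml"; returns nil when unknown.
def named_decode_sketch name
  NAMES_SKETCH[ name]
end

# Handle "#228" (decimal) and "#xe4" (hexadecimal); returns nil otherwise.
def numeric_decode_sketch ref
  case ref
  when /\A#(\d+)\z/   then [ $1.to_i].pack "U"
  when /\A#x(\h+)\z/i then [ $1.hex].pack "U"
  end
end

# Try the name table first, then numeric references; leave anything
# unrecognised exactly as it was found.
def decode_sketch str
  str.gsub( /&(.+?);/) do
    (named_decode_sketch $1) or (numeric_decode_sketch $1) or $&
  end
end

puts decode_sketch "&lt;&auml;&#xf6;&#252;&gt; &bogus;"
```

Falling back to `$&` means unknown sequences pass through untouched instead of being dropped.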
.encode(str) ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 173
def encode str
  std.encode str
end
.std ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 169
def std
  @std ||= new
end
Instance Method Details
#decode(str) ⇒ Object
# File 'lib/hermeneutics/escape.rb', line 161
def decode str
  self.class.decode str
end
#encode(str) ⇒ Object
:call-seq:
ent.encode( str) -> str
Create a string whose characters are masked in HTML style:
ent = Entities.new
ent.encode "&<\"" #=> "&amp;&lt;&quot;"
ent.encode "äöü" #=> "&auml;&ouml;&uuml;"
The result will be in the same encoding as the source, even if it does not contain any 8-bit characters. (8-bit characters can only remain in the result when keep_8bit is set.)
ent = Entities.new keep_8bit: true
uml = "<ä>".encode "UTF-8"
ent.encode uml #=> "&lt;\xc3\xa4&gt;" in UTF-8
uml = "<ä>".encode "ISO-8859-1"
ent.encode uml #=> "&lt;\xe4&gt;" in ISO-8859-1
# File 'lib/hermeneutics/escape.rb', line 148
def encode str
  r = str.new_string
  r.gsub! RE_ASC do |x| "&#{SPECIAL_ASC[ x]};" end
  unless @keep_8bit then
    r.gsub! /[^\0-\x7f]/ do |c|
      c.encode! __ENCODING__
      s = SPECIAL[ c] || ("#x%04x" % c.ord)
      "&#{s};"
    end
  end
  r
end
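The two passes of this method can be tried without the library. The following is a minimal sketch under stated assumptions: all names ending in _sketch or _SKETCH are hypothetical, the table is a two-entry excerpt, and `String#dup` stands in for the library's `new_string` helper, which is not defined on this page.

```ruby
SPECIAL_ASC_SKETCH = { '"' => "quot", "&" => "amp", "<" => "lt", ">" => "gt" }
SPECIAL_SKETCH     = { "\u00e4" => "auml", "\u20ac" => "euro" }.update SPECIAL_ASC_SKETCH
RE_ASC_SKETCH      = /[#{SPECIAL_ASC_SKETCH.keys.map { |x| Regexp.quote x }.join}]/

def encode_sketch str, keep_8bit: false
  r = str.dup                      # work on a copy, keeping the encoding
  # Pass 1: mask the ASCII specials. This runs first so that the "&" of
  # entities produced in pass 2 is not masked again.
  r.gsub!( RE_ASC_SKETCH) { |x| "&#{SPECIAL_ASC_SKETCH[ x]};" }
  unless keep_8bit
    # Pass 2: mask every non-ASCII character by name, or by hexadecimal
    # code point when the table has no name for it.
    r.gsub!( /[^\0-\x7f]/) do |c|
      u = c.encode "UTF-8"         # the table keys are UTF-8
      s = SPECIAL_SKETCH[ u] || ("#x%04x" % u.ord)
      "&#{s};"
    end
  end
  r
end

latin1 = "<\u00e4>".encode "ISO-8859-1"
puts encode_sketch( "<\u00e4> \u00f6")
puts encode_sketch( latin1).encoding
```

Because every replacement is plain ASCII, the result stays valid in the source encoding, which is how the output can keep the encoding of the input.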