UTF8 Utils

This library provides a means of cleaning UTF-8 strings that contain invalid byte sequences.

It provides functionality very similar to ActiveSupport's tidy_bytes method, but works for Ruby 1.8.6 - 1.9.x. Once I sort out any potentially embarrassing issues with it, I'll probably try patching it into ActiveSupport.

The Problem

Here's what happens when you try to access a string with invalid UTF-8 characters in Ruby 1.9:

ruby-1.9.1-p378 > "my messed up \x92 string".split(//)
ArgumentError: invalid byte sequence in UTF-8
        from (irb):3:in `split'
        from (irb):3
        from /Users/norman/.rvm/rubies/ruby-1.9.1-p378/bin/irb:17:in `<main>'

The Solution

ruby-1.9.1-p378 > "my messed up \x92 string".to_utf8_codepoints.tidy_bytes.to_s.split(//u)
 => ["m", "y", " ", "m", "e", "s", "s", "e", "d", " ", "u", "p", " ", "’", " ", "s", "t", "r", "i", "n", "g"]

Amazing in its brevity and elegance, huh? OK, maybe not really, but if you have some badly encoded data to clean up, it can save you from tearing your hair out.

Note that, like ActiveSupport, it naively assumes that any invalid UTF-8 bytes are either Windows CP1252 or ISO-8859-1. In practice this isn't a bad assumption, but it may not always work.
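To make that assumption concrete, here is a rough sketch of the same idea using only Ruby's standard library. It is not how utf8_utils is implemented: it assumes a modern Ruby (2.1 or later) with String#scrub, and the method name tidy_bytes_sketch is made up for illustration. Each invalid byte is re-read as Windows CP1252, falling back to ISO-8859-1 for the few bytes CP1252 leaves undefined.

```ruby
# Hypothetical helper, not part of utf8_utils. Assumes Ruby >= 2.1 (String#scrub).
def tidy_bytes_sketch(string)
  # Work on a copy tagged as UTF-8 so scrub can find the invalid sequences.
  string.dup.force_encoding(Encoding::UTF_8).scrub do |bad_bytes|
    bad_bytes.bytes.map { |byte|
      begin
        # First guess: the stray byte is a Windows CP1252 character.
        byte.chr(Encoding::Windows_1252).encode(Encoding::UTF_8)
      rescue EncodingError, RangeError
        # CP1252 leaves a handful of bytes undefined; treat those as ISO-8859-1.
        byte.chr(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
      end
    }.join
  end
end

cleaned = tidy_bytes_sketch("my messed up \x92 string")
puts cleaned.valid_encoding?
```

Valid input passes through unchanged, since scrub only calls the block for byte sequences that are not valid UTF-8. A string that was really in some other encoding (say, Shift JIS) would still come out as well-formed but wrong text, which is the "may not always work" part.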

Getting it

gem install utf8_utils

Using it

require "utf8_utils"

# Traverse codepoints
"hello-world".to_utf8_codepoints.each_codepoint do |codepoint|
  puts codepoint.valid?
end

# Tidy bytes
bad_string = "my messed up \x92 string"
good_string = bad_string.to_utf8_codepoints.tidy_bytes.to_s

API Docs

http://norman.github.com/utf8_utils

Credits

Created by Norman Clarke, with some code borrowed from ActiveSupport.

Copyright (c) 2010, released under the MIT license.