StringSplitter

Build Status Gem Version

NAME

StringSplitter - String#split on steroids

INSTALLATION

gem "string_splitter"

SYNOPSIS

require "string_splitter"

ss = StringSplitter.new

Same as String#split

ss.split("foo bar baz")
ss.split("foo bar baz", " ")
ss.split("foo bar baz", /\s+/)
# => ["foo", "bar", "baz"]

ss.split("foo", "")
ss.split("foo", //)
# => ["f", "o", "o"]

ss.split("", "...")
ss.split("", /.../)
# => []

Split at the first delimiter

ss.split("foo:bar:baz:quux", ":", at: 1)
ss.split("foo:bar:baz:quux", ":", select: 1)
# => ["foo", "bar:baz:quux"]

Split at the last delimiter

ss.split("foo:bar:baz:quux", ":", at: -1)
# => ["foo:bar:baz", "quux"]

Split at multiple delimiter positions

ss.split("1:2:3:4:5:6:7:8:9", ":", at: [1..3, -1])
# => ["1", "2", "3", "4:5:6:7:8", "9"]

Split at all but the first and last delimiters

ss.split("1:2:3:4:5:6", ":", except: [1, -1])
ss.split("1:2:3:4:5:6", ":", reject: [1, -1])
# => ["1:2", "3", "4", "5:6"]

Split from the right

ss.rsplit("1:2:3:4:5:6:7:8:9", ":", at: [1..3, -1])
# => ["1", "2:3:4:5:6", "7", "8", "9"]

Split with negative, descending, and infinite ranges

ss.split("1:2:3:4:5:6:7:8:9", ":", at: ..-3)
# => ["1", "2", "3", "4", "5", "6", "7:8:9"]

ss.split("1:2:3:4:5:6:7:8:9", ":", at: 4...)
# => ["1:2:3:4", "5", "6", "7", "8:9"]

ss.split("1:2:3:4:5:6:7:8:9", ":", at: [1, 5..3, -2..])
# => ["1", "2:3", "4", "5", "6:7", "8", "9"]

Full control via a block

result = ss.split("1:2:3:4:5:6:7:8", ":") do |split|
  split.pos % 2 == 0
end
# => ["1:2", "3:4", "5:6", "7:8"]
string = "banana".chars.sort.join # "aaabnn"

ss.split(string, "") do |split|
    split.rhs != split.lhs
end
# => ["aaa", "b", "nn"]

DESCRIPTION

Many languages have built-in split functions/methods for strings. They behave similarly (notwithstanding the occasional surprise), and handle a few common cases, e.g.:

  • limiting the number of splits
  • including the separator(s) in the results
  • removing (some) empty fields

But, because the API is squeezed into two overloaded parameters (the delimiter and the limit), achieving the desired results can be tricky. For instance, while String#split removes empty trailing fields (by default), it provides no way to remove all empty fields. Likewise, the cramped API means there's no way to, e.g., combine a limit (positive integer) with the option to preserve empty fields (negative integer), or use backreferences in a delimiter pattern without including its captured subexpressions in the result.

If split was being written from scratch, without the baggage of its legacy API, it's possible that some of these options would be made explicit rather than overloading the parameters. And, indeed, this is possible in some implementations, e.g. in Crystal:

":foo:bar:baz:".split(":", remove_empty: false)
# => ["", "foo", "bar", "baz", ""]

":foo:bar:baz:".split(":", remove_empty: true)
# => ["foo", "bar", "baz"]

StringSplitter takes this one step further by moving the configuration out of the method altogether and delegating the strategy — i.e. which splits should be accepted or rejected — to a block:

ss = StringSplitter.new

ss.split("foo:bar:baz", ":") { |split| split.index == 0 }
# => ["foo", "bar:baz"]

ss.split("foo:bar:baz:quux", ":") do |split|
  split.position == 1 || split.position == 3
end
# => ["foo", "bar:baz", "quux"]

As a shortcut, the common case of splitting (or not splitting) at one or more positions is supported by dedicated options:

ss.split("foo:bar:baz:quux", ":", select: [1, -1])
# => ["foo", "bar:baz", "quux"]

ss.split("foo:bar:baz:quux", ":", reject: [1, -1])
# => ["foo:bar", "baz:quux"]

WHY?

I wanted to split semi-structured output into fields without having to resort to a regex or a full-blown parser.

As an example, the nominally unstructured output of many Unix commands is often formatted in a way that's tantalizingly close to being machine-readable, apart from a few pesky exceptions, e.g.:

$ ls -l

-rw-r--r-- 1 user users   87 Jun 18 18:16 CHANGELOG.md
-rw-r--r-- 1 user users  254 Jun 19 21:21 Gemfile
drwxr-xr-x 3 user users 4096 Jun 19 22:56 lib
-rw-r--r-- 1 user users 8952 Jun 18 18:16 LICENSE.md
-rw-r--r-- 1 user users 3134 Jun 19 22:59 README.md

These lines can almost be parsed into an array of fields by splitting them on whitespace. The exception is the date (columns 6-8), i.e.:

line = "-rw-r--r-- 1 user users   87 Jun 18 18:16 CHANGELOG.md"
line.split

gives:

["-rw-r--r--", "1", "user", "users", "87", "Jun", "18", "18:16", "CHANGELOG.md"]

instead of:

["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", "CHANGELOG.md"]

One way to work around this is to parse the whole line, e.g.:

line.match(/^(\S+) \s+ (\d+) \s+ (\S+) \s+ (\S+) \s+ (\d+) \s+ (\S+ \s+ \d+ \s+ \S+) \s+ (.+)$/x)

But that requires us to specify everything. What we really want is a version of split which allows us to veto splitting for the 6th and 7th delimiters (and to stop after the 8th delimiter), i.e. control over which splits are accepted, rather than being restricted to the single, baked-in strategy provided by the limit parameter.

By providing a simple way to accept or reject each split, StringSplitter makes cases like this easy to handle, either via a block:

ss.split(line) do |split|
  case split.position when 1..5, 8 then true end
end
# => ["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", "CHANGELOG.md"]

Or via its option shortcut:

ss.split(line, at: [1..5, 8])
# => ["-rw-r--r--", "1", "user", "users", "87", "Jun 18 18:16", "CHANGELOG.md"]

CAVEATS

Differences from String#split

Unlike String#split, StringSplitter doesn't trim the string before splitting if the delimiter is omitted or a single space, e.g.:

" foo bar baz ".split          # => ["foo", "bar", "baz"]
" foo bar baz ".split(" ")     # => ["foo", "bar", "baz"]

ss.split(" foo bar baz ")      # => ["", "foo", "bar", "baz", ""]
ss.split(" foo bar baz ", " ") # => ["", "foo", "bar", "baz", ""]

String#split omits the nil values of unmatched optional captures:

"foo:bar:baz".scan(/(:)|(-)/)  # => [[":", nil], [":", nil]]
"foo:bar:baz".split(/(:)|(-)/) # => ["foo", ":", "bar", ":", "baz"]

StringSplitter preserves them by default (if include_captures is true, as it is by default), though they can be omitted from spread captures by passing :compact as the value of the spread_captures option:

s1 = StringSplitter.new(spread_captures: true)
s2 = StringSplitter.new(spread_captures: false)
s3 = StringSplitter.new(spread_captures: :compact)

s1.split("foo:bar:baz", /(:)|(-)/) # => ["foo", ":", nil, "bar", ":", nil, "baz"]
s2.split("foo:bar:baz", /(:)|(-)/) # => ["foo", [":", nil], "bar", [":", nil], "baz"]
s3.split("foo:bar:baz", /(:)|(-)/) # => ["foo", ":", "bar", ":", "baz"]

COMPATIBILITY

StringSplitter is tested and supported on all versions of Ruby supported by the ruby-core team, i.e., currently, Ruby 2.5 and above.

VERSION

0.7.3

SEE ALSO

Gems

  • rsplit - a reverse-split implementation (only works with string delimiters)

Articles

AUTHOR

chocolateboy

COPYRIGHT AND LICENSE

Copyright © 2018-2020 by chocolateboy.

This is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.