Top Level Namespace
Defined Under Namespace
Modules: Rblines
Constant Summary
- TOKENIZER =
This regular expression matches one token at a time: either a run of one or more characters that are neither parentheses nor whitespace (spaces, tabs, and line breaks), or a single parenthesis or punctuation mark (.?!-), together with any whitespace that immediately follows. Breaking it down further:
  - ( and ) indicate a capturing group
  - (?: ) is a non-capturing group, meaning it groups the pattern without capturing the matched text
  - [^()\s]+ matches one or more characters that are not parentheses or whitespace characters
  - | indicates an alternative pattern
  - [().?!-] matches any single character that is a parenthesis or punctuation mark (.?!-)
  - \s* matches zero or more whitespace characters (spaces, tabs, or line breaks) that follow the previous pattern
/((?:[^()\s]+|[().?!-])\s*)/
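As a rough illustration of the breakdown above, the sketch below applies the pattern to a made-up sentence. The constant is copied from this page; the `sample` string is hypothetical and chosen only to exercise both alternatives.

```ruby
# Pattern copied from the constant listing on this page.
TOKENIZER = /((?:[^()\s]+|[().?!-])\s*)/

# Hypothetical input, not taken from the library's own tests.
sample = "Is it (really) so? Yes."

# scan returns one single-element array per match (one capture group),
# so flatten turns the result into a flat list of strings.
tokens = sample.scan(TOKENIZER).flatten

p tokens
# => ["Is ", "it ", "(", "really", ") ", "so? ", "Yes."]
```

In this sample the `?` and `.` stay attached to the preceding word, because `[^()\s]+` already accepts them; only the parentheses come out as tokens of their own.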
- PARAGRAPH_PATTERN =
This pattern matches one or more newline characters (`\n`), each optionally followed by spaces. It is used to split the text into paragraphs.
  - (?:\n *) is a non-capturing group that must start with a \n and be followed by zero or more spaces.
  - ((?:\n *)+) is the previous non-capturing group repeated one or more times, wrapped in a capturing group.
/((?:\n *)+)/
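A small sketch of what the pattern does when passed to `String#split`, assuming a hypothetical input with a blank line and some indentation. Because the pattern has a capturing group, `split` keeps the matched separators in its result, which is presumably why `split_paragraphs` strips each piece and rejects the empty ones afterwards.

```ruby
# Pattern copied from the constant listing on this page.
PARAGRAPH_PATTERN = /((?:\n *)+)/

# Hypothetical input: a blank line plus indentation, then a single newline.
sample = "Hello\n\n  World\nThis is a test"

# With a capturing group in the pattern, split also returns the separators.
parts = sample.split(PARAGRAPH_PATTERN)

p parts
# => ["Hello", "\n\n  ", "World", "\n", "This is a test"]

# Stripping and dropping empties leaves only the paragraph text.
p parts.map(&:strip).reject(&:empty?)
# => ["Hello", "World", "This is a test"]
```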
- SPACE_PATTERN =
/(\s+)/
Instance Method Summary
- #concatenate_paragraphs_and_add_chr182(text) ⇒ String
  Split paragraphs and concatenate them.
- #split_paragraphs(text) ⇒ Array<String>
  Splits a string into a list of paragraphs.
- #tokenize_text(text) ⇒ Array<String>
  Tokenizes the text based on the TOKENIZER pattern.
Instance Method Details
#concatenate_paragraphs_and_add_chr182(text) ⇒ String
Splits the text into paragraphs and joins them with a pilcrow ‘¶’ (character 182) surrounded by single spaces. For example, if the text is “Hello\nWorld\nThis is a test”, the result will be: “Hello ¶ World ¶ This is a test”
# File 'lib/rblines/redlines.rb', line 53
def concatenate_paragraphs_and_add_chr182(text)
  split_paragraphs(text).join(" ¶ ")
end
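The following standalone sketch reproduces the two methods at the top level (outside the Rblines module, purely for illustration) to show the separator that the join actually produces. The input string is the example from the description.

```ruby
# Pattern and method bodies copied from this page; defined at the top level
# here only so the example runs on its own.
PARAGRAPH_PATTERN = /((?:\n *)+)/

def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN)
      .map(&:strip)
      .reject(&:empty?)
end

def concatenate_paragraphs_and_add_chr182(text)
  split_paragraphs(text).join(" ¶ ")
end

result = concatenate_paragraphs_and_add_chr182("Hello\nWorld\nThis is a test")

p result
# => "Hello ¶ World ¶ This is a test"

# The pilcrow is code point 182, which explains the method name.
p "¶".ord
# => 182
```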
#split_paragraphs(text) ⇒ Array<String>
Splits a string into a list of paragraphs. One or more `\n` characters separate the paragraphs; each paragraph is stripped of surrounding whitespace and empty entries are discarded. For example, if the text is “Hello\nWorld\nThis is a test”, the result will be:
["Hello", "World", "This is a test"]
# File 'lib/rblines/redlines.rb', line 41
def split_paragraphs(text)
  text.split(PARAGRAPH_PATTERN)
      .map(&:strip)
      .reject(&:empty?)
end
#tokenize_text(text) ⇒ Array<String>
Tokenizes the text based on the TOKENIZER pattern. Tokens are returned in order of appearance, and each token includes any whitespace that immediately follows it.
# File 'lib/rblines/redlines.rb', line 31
def tokenize_text(text)
  text.scan(TOKENIZER).flatten
end
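One property that follows from tokens carrying their trailing whitespace is that joining them gives back the original text, as long as the text does not begin with whitespace. The sketch below checks this on two hypothetical inputs; the method is redefined at the top level only to keep the example self-contained.

```ruby
# Pattern and method body copied from this page.
TOKENIZER = /((?:[^()\s]+|[().?!-])\s*)/

def tokenize_text(text)
  text.scan(TOKENIZER).flatten
end

text = "Hello (world)!"   # hypothetical input
tokens = tokenize_text(text)

p tokens
# => ["Hello ", "(", "world", ")", "!"]

# Trailing whitespace lives inside each token, so a plain join round-trips.
p tokens.join == text
# => true

# Leading whitespace is not part of any token and is therefore dropped.
p tokenize_text("  padded")
# => ["padded"]
```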