ActiveRecord support for ts_vectors
This small library extends ActiveRecord to support PostgreSQL's tsvector
type. PostgreSQL's tsvector
stores an array of keywords that can be efficiently queried. The tsvector
type is therefore particularly suitable for storing tags, since it's possible to store all tags in a single column instead of a table.
Requirements
- ActiveRecord
- PostgreSQL with
tsvector
support. (This has been included out of the box since 8.3.)
Usage
Add gem to your Gemfile
:
gem 'ar-tsvectors', :require => 'activerecord_tsvectors'
To test, add a text
column to your table in a migration:
add_column :posts, :tags, :text
Then extend your ActiveRecord model with a declaration:
class Post < ActiveRecord::Base
ts_vector :tags
end
Now you can assign tags:
post = Post.new
post. = ['pizza', 'pepperoni', 'Italian', 'mushrooms']
post.save
You can now query based on the tags:
Post.('pizza')
Post.(['pizza', 'pepperoni'])
For queries to perform well, you must add an index on the column. This is slightly more complicated; again, in a migration:
execute("create index index_posts_on_tags on posts using gin(to_tsvector('simple', tags))")
Options
The ts_vector
mixin takes the following options:
:configuration
- the name of the PostgreSQL text configuration to use. Defaults tosimple
, which is appropriate for keyword-type indexing.:format
- the format of the column, must be either:text
(the default) or:tsvector
.:normalize
- a proc that overrides the default normalization. It must take a single string and return a normalized copy of the string.
Methods
Declaring a tsvector
column tags
will dynamically add the following methods:
with_all_tags(t)
- returns a scope that searches fort
, which may be either a single value or an array of values. Only rows matching all values will be returned.without_all_tags(t)
- returns a scope that excludest
, which may be either a single value or an array of values. Only rows matching all values will be excluded.with_any_tags(t)
- returns a scope that searches fort
, which may be either a single value or an array of values. Rows matching at least one of the provided values will be returned.without_any_tags(t)
- returns a scope that excludest
, which may be either a single value or an array of values. Rows matching at least one of the provided values will be excluded.order_by_tags_rank(t, direction = 'DESC')
- returns a scope that orders the rows by the rank, ie. a score computed based on the overlap oft
and the row's value. See below for an example.
Ranking
It's trivial to rank results according to how many values match a given row:
@posts = Post.(["pizza", "pepperoni"]).
(["pizza", "pepperoni"])
This orders the rows such that rows matching both pizza
and pepperoni
will be ordered above rows matching either pizza
or pepperoni
but not both.
Normalization
The default behaviour is to downcase and strip leading and trailing spaces from assigned values. By default, matches are case-sensitive; so something tagged with PIZZA
will not match a search for pizza
. To ensure case insensitivity, you can provide a normalization function, like so:
class Post < ActiveRecord::Base
ts_vector :tags, :normalize => lambda { |v| v.downcase }
end
You can also use this to strip unwanted characters:
class Post < ActiveRecord::Base
ts_vector :tags, :normalize => lambda { |v| v.downcase.gsub(/[^\w]/, '') }
end
The values are normalized both when performing queries, and when assigning new values:
post.tags = [' WTF#$%^ &*??!']
post.tags
=> ['wtf']
Using ts_vector directly
It's possible to use the tsvector
type for the column directly. In other words, in your migration would have something like:
add_column :posts, :tags, :tsvector
This is supported by supplying a format option to the ts_vector
declaration in your model:
class Post < ActiveRecord::Base
ts_vector :tags, :format => :tsvector
end
When you do this, the way you define a table index is slightly different:
execute("create index index_posts_on_tags on posts using gin(tags)")
While using a tsvector
column works, it has some drawbacks:
The
tsvector
type is lossy. When you assign values to atsvector
, the text will be tokenized by PostgreSQL's text parser, which will remove all punctuation and quotes from the text. In this system, a quoted tag such as"New York"
will be split into two different tagsNew
andYork
, and it's no longer possible to search forNew York
as a distinct tag.Due to a limitation in ActiveRecord, stored column values (on
INSERT
andUPDATE
) are passed to PostgreSQL as strings, and are therefore not normalized using the text configuration's rules. This means that if you override the normalization function, you must make sure you always strip and downcase in addition to whatever other normalization you do, otherwise queries will potentially not match all rows. A forthcoming version of ActiveRecord will provide the plumbing that will allow us to solve this issue.