ActiveRecord support for ts_vectors
This small library extends ActiveRecord to support PostgreSQL's tsvector
type. PostgreSQL's tsvector
stores an array of keywords that can be efficiently queried. The tsvector
type is therefore particularly suitable for storing tags.
Requirements
- ActiveRecord
- PostgreSQL with
tsvector
support. (This has been included out of the box since 8.3.)
Usage
Add gem to your Gemfile
:
gem 'ar-tsvectors', :require => 'activerecord_tsvectors'
To test, add a tsvector
column to your table in a migration:
add_column :posts, :tags, :tsvector
Then extend your ActiveRecord model with a declaration:
class Post < ActiveRecord::Base
ts_vector :tags
end
Now you can assign tags:
post = Post.new
post. = ['pizza', 'pepperoni', 'Italian', 'mushrooms']
post.save
You can now query based on the tags:
Post.('pizza')
Post.(['pizza', 'pepperoni'])
Note that for queries to use indexes, you need to create an index on the column. This is slightly more complicated; again, in a migration:
execute("create index index_posts_on_tags on posts using gin(to_tsvector('simple', tags))")
Methods
Declaring a tsvector
column tags
will dynamically add the following methods:
with_all_tags(t)
- returns a scope that searches fort
, which may be either a single value or an array of values. Only rows matching all values will be returned.without_all_tags(t)
- returns a scope that excludest
, which may be either a single value or an array of values. Only rows matching all values will be excluded.with_any_tags(t)
- returns a scope that searches fort
, which may be either a single value or an array of values. Rows matching at least one of the provided values will be returned.without_any_tags(t)
- returns a scope that excludest
, which may be either a single value or an array of values. Rows matching at least one of the provided values will be excluded.order_by_tags_rank(t, direction = 'DESC')
- returns a scope that orders the rows by the rank, ie. a score computed based on the overlap oft
and the row's value. See below for an example.
Ranking
It's trivial to rank results according to how many values match a given row:
@posts = Post.(["pizza", "pepperoni"]).
(["pizza", "pepperoni"])
This orders the rows such that rows matching both pizza
and pepperoni
will be ordered above rows matching either pizza
or pepperoni
but not both.
Normalization
The default behaviour is to downcase and strip leading and trailing spaces from assigned values. By default, matches are case-sensitive; so something tagged with PIZZA
will not match a search for pizza
. To ensure case insensitivity, you can provide a normalization function, like so:
class Post < ActiveRecord::Base
ts_vector :tags, :normalize => lambda { |v| v.downcase }
end
You can also use this to strip unwanted characters:
class Post < ActiveRecord::Base
ts_vector :tags, :normalize => lambda { |v| v.downcase.gsub(/[^\w]/, '') }
end
The values are normalized both when performing queries, and when assigning new values:
post.tags = [' WTF#$%^ &*??!']
post.tags
=> ['wtf']
Limitations
Currently, the library will always use the built-in simple
configuration, which only performs basic normalization, and does not perform stemming.
Due to a limitation in ActiveRecord, stored column values (on INSERT
and UPDATE
) are passed to PostgreSQL as strings, and are therefore not normalized using the text configuration's rules. This means that if you override the normalization function, you must make sure you always strip and downcase in addition to whatever other normalization you do, otherwise queries will potentially not match all rows.
A forthcoming version of ActiveRecord will provide the plumbing that will allow us to solve this issue.