RDig

RDig provides an HTTP crawler and content extraction utilities to help building a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.

RDig depends on Ferret (>= 0.10.0) and, for parsing HTML, on either Hpricot (>= 0.4) or the RubyfulSoup library (>= 1.0.4). As I know no way to specify such an OR dependency in a gem specification, the gem depends on Hpricot. If this is a problem for you, install the gem with –force and manually do a gem install rubyful_soup.

basic usage

Index creation

create a config file based on the template in doc/examples
to create an index:
```
rdig -c CONFIGFILE
```
to run a query against the index (just to try it out)
```
rdig -c CONFIGFILE -q 'your query'
```
this will dump the first 10 search results to STDOUT

Handle search in your application:

require 'rdig'
require 'rdig_config'   # load your config file here
search_results = RDig.searcher.search(query)

see RDig::Search::Searcher for more information.

usage in rails

add to config/environment.rb :
```
require 'rdig'
require 'rdig_config'
```
place rdig_config.rb into config/ directory.
build index:
```
rdig -c config/rdig_config.rb
```

in your controller that handles the search form:

search_results = RDig.searcher.search(params[:query])
@results = search_results[:list]
@hitcount = search_results[:hitcount]

search result paging

Use the :first_doc and :num_docs options to implement paging through search results. (:num_docs is 10 by default, so without using these options only the first 10 results will be retrieved)

sample configuration

from doc/examples/config.rb. The tag_selector properties are called with a BeautifulSoup instance as parameter. See the RubyfulSoup Site for more info about this cool lib. You can also have a look at the html_content_extractor unit test.

:include:doc/examples/config.rb