How Tos

Please add your own tutorials or how-tos to this page.
(You might want to check out the FerretArticles section as well)


How to Integrate Ferret With Rails

acts_as_ferret

acts_as_ferret is the recommended way to integrate Ferret with Rails. Its maintainers have put up an svn repository and Trac at http://projects.jkraemer.net/acts_as_ferret/

Thanks to Kasper Weibel who started this great plugin. There's a page on this wiki as well: FerretOnRails !

using ferret with activerecord

Aslak Hellesoy has another piece;

 http://aslakhellesoy.com/articles/2005/11/18/using-ferret-with-activerecord

How to integrate Ferret with Rails on the Rails wiki (a little outdated)

Jan Prill has written up a great howto on integrating ferret with rails. You can check it out here:

 http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails

The recommended way to integrate, however, is acts_as_ferret.


How to index all files under a directory

Check out Brian McCallister's blog for a description of how to do this.
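As a rough illustration of the general idea (not the code from the blog post), here is a minimal sketch that walks a directory tree and builds one hash per text file. The method name each_text_document, the field names :file and :content, and the default extension list are all assumptions made for this example. Adding the hashes to an index is left to the caller.

```ruby
require 'find'

# Walks +root+ recursively and yields one hash per matching file.
# The field names (:file, :content) are hypothetical; use whatever
# fields your own index expects.
def each_text_document(root, extensions = %w[.txt .md])
  Find.find(root) do |path|
    next unless File.file?(path)
    next unless extensions.include?(File.extname(path).downcase)
    yield({ :file => path, :content => File.read(path) })
  end
end

# Possible usage, assuming +index+ is an index opened as shown
# elsewhere on this page:
#
#   each_text_document('/path/to/docs') { |doc| index << doc }
```

Reading each file fully into memory keeps the sketch short; for very large files you would probably want to read or convert them in chunks.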


How to create a persistent index.

I'll just quickly explain the :create and :create_if_missing options of Index::Index. To create a new index, use :create => true. This will create a new index regardless of whether an index already exists in the specified directory. So, if you are going to use this option, you should only use it the first time, i.e.:

index = Index::Index.new(:path=>'/tmp/ferret',:create=>true)
# Add fields etc. run searches etc.
index.close

# and in a new session
index = Index::Index.new(:path=>'/tmp/ferret')
# Add fields etc. run searches etc.
index.close

If you want Ferret to create an index only when one is missing, you can explicitly set :create_if_missing => true. This is the default behaviour in 0.1.1. If you want an exception thrown when there is no index, set :create_if_missing => false.


How to search within any field of a document.

# And if you want to search all fields by default;
index = Index::Index.new(:default_field => "*")
# Note :default_field is already an option and currently defaults to "" 
# but I'll probably make it default to "*" in the next version

# Now to search for foo in all fields in documents after Jan 1st 2005;
topdocs = index.search("foo AND created: >= 20050101")

Please see more on this  here.


How to use Ferret on an Existing Java Lucene Index

Unless it is very easy for you to reindex all of your documents using Lucene, I recommend you make a copy of your index. This hasn't been extensively tested so I can't guarantee you won't corrupt your index. So on *nix, that would be;

    cp -R /path/to/index /path/to/index_copy

Then simply open the index as you usually would in Ferret;

index = Index::Index.new(:path=>'/path/to/index_copy')
# Add fields etc. run searches etc.
index.close

Note: it appears this doesn't work if the Lucene index uses the ".cfs" compound file format. When I call IndexWriter.setUseCompoundFile(false) in the Java program, it works great.


How to use the C Indexer

Currently cFerret isn't really ready for release and it'll only run on Linux (it also compiles and passes the tests on Mac OS X v.4), but it is definitely usable if you know your way around C. If you want to index a lot of documents, it may be worth looking into. Indexing all of the text documents on my PC took over 20 minutes with Java Lucene and less than a minute with cFerret, and the indexes were identical. To get cFerret, you'll need subversion;

    svn co svn://www.davebalmain.com/cferret/trunk cferret

Then change directory into cferret and run make. If you've got gcc 4.0 or above you'll get a lot of warnings which you can ignore. (I do plan to fix that). Running make will just compile all of the object files and run the unit tests. If the tests don't all pass, please let me know at dbalmain@…. Then look at bench.c to see how to use it.


How to use Index::Index in Multi-Threaded Applications

The Index::Index class is thread-safe, so it should run well in threaded applications. One thing to note is that document numbers are ephemeral, i.e. they may change as an index is updated. Clients should thus not rely on a given document having the same number between requests. There are two possible solutions to this. You can synchronize on the index;

docs = [] # declared outside the block so it is still in scope afterwards
index.synchronize do
  topdocs = index.search("foo AND created: >= 20050101")
  topdocs.each {|doc, score| docs << index[doc]}
end

docs.each do |doc|
  # You can now do whatever you want with your documents
end

Or perhaps you can do this a little more easily with the block search method;

docs = []
index.search_each("foo AND created: >= 20050101") do |doc, score|
  docs << index[doc]
end

docs.each do |doc|
  # You can now do whatever you want with your documents
end

You can also use the synchronize shown in the first example to run transactions on the index.
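To make the transaction idea concrete without needing a real index, here is a small sketch built on Ruby's standard MonitorMixin. ToyIndex is a made-up in-memory stand-in (an assumption for illustration, not Ferret's class), and the sample document is invented. It shows a delete-and-re-add update performed under one lock, and that the document ends up with a new number afterwards.

```ruby
require 'monitor'

# Toy in-memory index; hypothetical, for illustration only.
class ToyIndex
  include MonitorMixin

  def initialize
    super() # sets up the monitor provided by MonitorMixin
    @docs = {}
    @next_num = 0
  end

  # Adds a document under the next free document number.
  def <<(doc)
    synchronize do
      @docs[@next_num] = doc
      @next_num += 1
    end
    self
  end

  def delete(num)
    synchronize { @docs.delete(num) }
  end

  def [](num)
    synchronize { @docs[num] }
  end

  def size
    synchronize { @docs.size }
  end

  # Returns the numbers of all documents whose +field+ equals +value+.
  def doc_nums_where(field, value)
    synchronize { @docs.select { |_, doc| doc[field] == value }.keys }
  end
end

index = ToyIndex.new
index << { :artist => 'David Grey', :title => 'Babylon' }

# Several threads attempt the same correction. Because the whole
# search/delete/add sequence runs inside one synchronize block,
# only one thread at a time can see and rewrite the document.
threads = 4.times.map do
  Thread.new do
    index.synchronize do
      index.doc_nums_where(:artist, 'David Grey').each do |num|
        doc = index[num].merge(:artist => 'David Gray')
        index.delete(num)
        index << doc # the document gets a new number here
      end
    end
  end
end
threads.each(&:join)
```

The monitor is reentrant, which is why the nested synchronize calls inside the outer block do not deadlock.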

Things get a little more complicated when you have multiple separate processes accessing an index, for example in a Rails web app served with multiple dispatch threads. To handle this, you need to make sure your index is flushed as soon as you perform an update, i.e. add or delete a document. Otherwise, other processes trying to update the index will time out while trying to get the write lock. Here is an example;

# Change all instances of the name David Grey to the correct David Gray
index.search_each('artist:"David Grey"') do |doc_num, score|
  document = index[doc_num]
  document[:artist] = "David Gray"
  index.delete(doc_num)
  index << document #Note that the document will now have a new document number
end
index.flush # <= this method should be called if you want other processes to be able to update the index.

How to not use the main Index::Index class

Right now it is fairly simple to use the Index::Index class. It handles most of the index updating and locking for you. The problem is that it does a lot of extra work to make sure that you are always searching the latest index. It is actually a lot more efficient to have one object for updating the index and as many others as you like for searching it. This gives you more control over what is going on in the index and leads to greater efficiency. The classes for this are Index::IndexReader, Index::IndexWriter, and Index::IndexSearcher. I'll cover the searcher here first; most of the others follow suit.

include Ferret::Search

#In the current release (0.1.3) there is a bug in the IndexSearcher. In the initialize function
#the line that calls FSDirectory.open should be changed to FSDirectory.new(args, true). I believe this is fixed in the dev build.
sr = Index::IndexSearcher.new("path/to/index")

#Then to use the searcher we can do something like:
#You can include options in the QueryParser if you want. I decided to leave them blank for now
qp = Ferret::QueryParser.new( ) 
#Need to get the field names out, or you won't be searching much
qp.field = sr.reader.get_field_names.to_a

#Then just search as follows
sr.search(qp.parse("whatever"))

That should allow you to get a read-only searcher up and running without requiring any write locks or the like. One important note on this: if an external process changes your index, you will need to reset the reader object to pick up those changes.

include Ferret::Store
sr.reader = Index::IndexReader.open(FSDirectory.get_directory("path/to/index"))

How to remove all documents from index

If you want to rebuild an index from scratch, you need to remove all documents from it first. You can do so with the following code.

index.size.times {|i| index.delete(i)}

How to use keys for document

Ferret has a very useful concept of document keys. You can think of the key as a document field that is unique across the index. Some code should help you understand a bit more. Let's imagine that we want to index a Document object.

document = Document.find(some_id) # Document is our business class that we want to index with Ferret

index << {:id => document.id, :text => document.text}

If you run this code you will have an indexed document, which is exactly what we need. But what happens if we run this code again? Then we would have two Document objects with the same id in our index, which is wrong. We need to store just one Document.

This is where Ferret's index keys can help. In the code below we declare that the key of the index is the id field, so after we execute the code we will have only one document in the index.

index = Index::Index.new(:key => :id)

index << {:id => 23, :data => "This is the data..."}

index << {:id => 23, :data => "This is the new data..."}

Remember also that we can fetch a document very quickly by its key (and I love Ferret for this feature)

index["23"] # Get document with key 23
index[112]  # Get document with internal number 112. It is NOT the key field. 
            # It is just the internal Ferret id. This number is subject to change
            # whenever the document is updated or other documents are deleted and
            # the index is optimized.

# Now we will remove by key
index.remove("23") # Remove the Document with id=23 from the index. The same as the following statement
index.remove("id:23")

How to index an IMAP directory

John Wells has written some lines of code to index via IMAP using Ferret. Code can be found in  this thread on the Ruby Forum.


How to do location-based searches (search by zip code)

You can find some example code posted on the tourbus blog at  http://blog.tourb.us/archives/ferret-and-location-based-searches


How to build a ferret index from documents with different mime types

The FerretHelper module and Ferret Finder utility

Stuart Rackham wrote Helpers and Utilities for indexing the filesystem. With his tools you can use commands like

$ ff -i ~/doc ~/projects  # Create new index of doc and projects directories
$ ff instantiation ruby   # Find docs with both words
$ ff "array ruby -python" # Find docs with array and ruby but not python
$ ff file:*ruby*.txt      # Find docs with file names like *ruby*.txt

His library uses the following tools for conversion to indexable text:

- PDF to text conversion with pdftotext
- HTML to text conversion with html2text
- Open Document to text conversion with odt2txt
- Word to text conversion with antiword
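To illustrate how such a conversion step might be wired up in your own code (this is not taken from Stuart Rackham's library), here is a small sketch that maps a file extension to a converter command line. The constant CONVERTERS and the method converter_command are hypothetical names, and the command forms shown are assumed typical invocations that print text to standard output; check each tool's man page on your system before relying on them.

```ruby
# Maps a file extension to a lambda that builds the converter's argv.
# The invocations are assumptions about common usage, not verified
# against any particular version of these tools.
CONVERTERS = {
  '.pdf'  => ->(path) { ['pdftotext', path, '-'] }, # '-' asks for stdout
  '.html' => ->(path) { ['html2text', path] },
  '.odt'  => ->(path) { ['odt2txt', path] },
  '.doc'  => ->(path) { ['antiword', path] }
}.freeze

# Returns the command (as an array) for converting +path+ to text,
# or nil when there is no known converter for its extension.
def converter_command(path)
  builder = CONVERTERS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Possible usage, assuming the tools are installed:
#
#   if (cmd = converter_command(path))
#     text = IO.popen(cmd, &:read)
#   end
```

Building the command as an array rather than a single string avoids shell quoting problems with file names that contain spaces.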

Converting these common document types for indexing is a task that everyone who wants to do desktop search will face. If that's interesting for you, you might want to have a look at RDig as well (following right underneath...)


How to crawl internet-sites, an intranet or the filesystem and index the crawled documents - RDig

Jens Kraemer came up with a great tool for crawling documents that reside on the internet, your intranet or the file system. Have a look at RDig: RDig provides an HTTP crawler and content extraction utilities to help build a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.


How to index word documents

antiword is a great tool for converting Word documents to text. You can use this to batch convert your Word documents to text so you can index them with Lucene. You can see a web demo of this in action at scattrbrain


How to make sure that the index gets valid UTF-8 text

Paul Battley has a good blog post on correcting UTF-8 text at  http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/. Basically you use the iconv library (a standard library) and do this;

require 'iconv'

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)
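On newer Rubies the iconv library is no longer part of the standard library (it was removed in Ruby 2.0 and lives on as a gem). As an alternative sketch, assuming Ruby 2.1 or later, String#scrub can drop the invalid bytes. The helper name clean_utf8 is made up for this example.

```ruby
# Returns a copy of +untrusted+ tagged as UTF-8 with every invalid
# byte sequence removed. Requires Ruby 2.1+ for String#scrub.
def clean_utf8(untrusted)
  untrusted.dup.force_encoding(Encoding::UTF_8).scrub('')
end

valid_string = clean_utf8("abc\xFFdef")
```

Passing a replacement such as '?' instead of the empty string keeps a visible marker where bytes were dropped, which can be useful when debugging where bad input comes from.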

How to launch DRb server on reboot (linux)

Many people have had a difficult time getting their DRb server to launch at reboot on newer Linux distributions. This is caused by a PATH issue that comes about when users have installed Ruby in /usr/local/bin and their Linux distribution uses SELinux. Here's a fix (and a startup script):

#!/bin/bash
#
# This script starts and stops the ferret DRb server
# chkconfig: 2345 89 36
# description: Ferret search engine for ruby apps.
#
# save the current directory
CURDIR=`pwd`
PATH=/usr/local/bin:$PATH

RORPATH="/path/to/ror_root"

case "$1" in
  start)
     cd $RORPATH
     echo "Starting ferret DRb server."
     FERRET_USE_LOCAL_INDEX=1 \
                script/runner -e production \
                vendor/plugins/acts_as_ferret/script/ferret_start
     ;;
  stop)
     cd $RORPATH
     echo "Stopping ferret DRb server."
     FERRET_USE_LOCAL_INDEX=1 \
                script/runner -e production \
                vendor/plugins/acts_as_ferret/script/ferret_stop
     ;;
  *)
     echo $"Usage: $0 {start, stop}"
     exit 1
     ;;
esac

cd $CURDIR

How to create synonym based searching

Most code listed here is based on examples from the "Lucene in Action" book along with examples of how to create filters/analyzers from the ferret mailing list. The wordnet_prolog_2_ferret.rb script is based on the Lucene program Syns2Index.java.

Creating the analyzer

The SynonymAnalyzer is fairly simple, like most analyzers. It is very similar to the StandardAnalyzer, with the few differences noted below.

A synonym engine must be supplied to the analyzer. The engine is required to do the lookup of a word and return the resulting synonyms. The SynonymAnalyzer also requires a SynonymTokenFilter that does most of the work and actually makes the calls to the specified synonym engine. Finally, unlike the StandardAnalyzer this class does not run tokens through the HyphenFilter because if there are hyphenated words that have synonyms, it would be nice to capture those.

class SynonymAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis
  
  def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
    @synonym_engine = synonym_engine
    @lower = lower
    @stop_words = stop_words
  end
  
  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = LowerCaseFilter.new(ts) if @lower
    ts = StopFilter.new(ts, @stop_words)
    ts = SynonymTokenFilter.new(ts, @synonym_engine)
  end
end

Creating the token filter

SynonymTokenFilter does the job of taking a token from the supplied token stream and injecting all the synonyms for that token.

There are a couple of interesting pieces of code here, starting with the 'next' method. The first thing it does is check the @synonym_stack to see if there are any synonyms left in it, and if so it returns one of those instead of the next token in the @token_stream. If @synonym_stack is empty, it fetches the next token and, if that is not nil, calls add_synonyms_to_stack.

The add_synonyms_to_stack method takes the supplied token, calls the get_synonyms method of the @synonym_engine, then loops over the results and adds them to the stack. While adding them to the stack it turns them into tokens that have the same start position and end position as the original token. It also makes sure to set the position increment to 0. That is very important because you want all the synonyms and the original token to have the same position.

class SynonymTokenFilter < Ferret::Analysis::TokenStream
  include Ferret::Analysis
  
  def initialize(token_stream, synonym_engine)
    @token_stream = token_stream
    @synonym_stack = []
    @synonym_engine = synonym_engine
  end
  
  def text=(text)
    @token_stream.text = text
  end
  
  def next
    return @synonym_stack.pop if @synonym_stack.size > 0
    
    if token = @token_stream.next
      add_synonyms_to_stack(token) unless token.nil?
    end
    
    return token
  end
  
  private
  def add_synonyms_to_stack(token)
    synonyms = @synonym_engine.get_synonyms(token.text)
    
    return if synonyms.nil?
    
    synonyms.each do |s|
      @synonym_stack.push(
        Token.new(s, token.start, token.end, 0))
    end
  end
end
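To see the order in which tokens come out of such a filter, here is a self-contained toy version of the same stack logic that does not need Ferret. ToyToken, ArrayTokenStream and ToySynonymFilter are stand-ins invented for this sketch, and a plain Hash plays the role of the synonym engine.

```ruby
# Toy stand-in for a token: text, offsets and position increment.
ToyToken = Struct.new(:text, :start_offset, :end_offset, :pos_inc)

# Toy token stream that hands out pre-built tokens one at a time.
class ArrayTokenStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def next
    @tokens.shift # nil once the stream is exhausted
  end
end

# Same control flow as the filter above, with a Hash as the engine.
class ToySynonymFilter
  def initialize(token_stream, synonyms)
    @token_stream = token_stream
    @synonyms = synonyms
    @stack = []
  end

  def next
    # Pending synonyms are emitted before reading further input.
    return @stack.pop unless @stack.empty?

    token = @token_stream.next
    return nil if token.nil?

    Array(@synonyms[token.text]).each do |syn|
      # Same offsets as the original, position increment of 0.
      @stack.push(ToyToken.new(syn, token.start_offset, token.end_offset, 0))
    end
    token
  end
end

stream = ArrayTokenStream.new([
  ToyToken.new('pet', 0, 3, 1),
  ToyToken.new('ferret', 4, 10, 1)
])
filter = ToySynonymFilter.new(
  stream, 'ferret' => ['black-footed ferret', 'mustela nigripes']
)

emitted = []
while (token = filter.next)
  emitted << [token.text, token.pos_inc]
end
# emitted now holds the original tokens, each followed by its synonyms.
```

Because the synonyms are kept on a stack, they are emitted in reverse order of how the engine returned them. Since they all share one position, the order should not matter for searching.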

Create the synonym engine

The WordnetSynonymEngine does the actual job of querying an existing ferret index for the synonyms for any word passed to get_synonyms. The engine creates a searcher object to use for every call to get_synonyms. The 'existing ferret index' mentioned previously is created by wordnet_prolog_2_ferret.rb that'll be described in the next section.

When get_synonyms is called it creates a simple TermQuery object on the "word" field in the index and returns the first result it finds from the @searcher's search_each method.

Any synonym engine must implement get_synonyms, and the results get_synonyms returns must be an array.

# Accesses a ferret index created from the wordnet synonym database
class WordnetSynonymEngine
  include Ferret::Search
  
  def initialize(wordnet_index_location)
    @searcher = Searcher.new(wordnet_index_location)
  end
  
  def get_synonyms(word)
    @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
      return @searcher[doc_id][:syn]
    end
    
    return nil
  end
end

The engine described above is based on the example in the "Lucene in Action" book; however, other engines can easily be created.

Here's an example of using a YAML based synonym engine.

# Accesses a YAML file for synonym lookup.
class YAMLSynonymEngine
  
  def initialize(index_location)
    @searcher = YAML.load_file(index_location)
  end
  
  def get_synonyms(word)
    return @searcher[word]
  end
end

This is a fairly simple class that loads the file specified by the index_location parameter into the @searcher variable. Any call to get_synonyms then just returns the result of a lookup on @searcher. If the word is not in the YAML data the lookup returns nil, otherwise it returns whatever is stored for that word. Again, the engines must return an array, so this YAML engine requires that each value in the YAML file be written as an inline collection. Here's a short example:

# Notice that multi-word keys must be in quotes.
ferret: ['black-footed ferret', 'mustela nigripes', 'ferret out']
'black-footed ferret': ['ferret', 'mustela nigripes']
'ferret out': ['ferret']
'mustela nigripes': ['black-footed ferret', 'ferret']
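As a quick check that a file in this shape really yields arrays, here is a small sketch that parses an inline copy of two of the entries above with Ruby's standard YAML library. The constant name SYNONYMS_YAML is made up for the example.

```ruby
require 'yaml'

# Inline copy of part of the example file, so the sketch runs on its own.
SYNONYMS_YAML = <<~DATA
  ferret: ['black-footed ferret', 'mustela nigripes', 'ferret out']
  'ferret out': ['ferret']
DATA

synonyms = YAML.load(SYNONYMS_YAML)

found   = synonyms['ferret'] # an Array of synonym strings
missing = synonyms['weasel'] # nil, since the word is not in the data
```

The nil result for unknown words matches what the token filter above expects, since it returns early when get_synonyms gives nil.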

Both engines work the same:

>> w = WordnetSynonymEngine.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/wordnet")
>> w.get_synonyms('ferret')
=> ["black-footed ferret", "mustela nigripes", "ferret out"]
>>
>> y = YAMLSynonymEngine.new("#{RAILS_ROOT}/extras/synonyms.yaml")
>> y.get_synonyms('ferret')
=> ["black-footed ferret", "mustela nigripes", "ferret out"]

Creating the Wordnet synonym index

Ferret version

This code is a port of the Syns2Index.java program into ruby with only a few minor changes to how it works. I did not want to exclude words with spaces in them so I removed any logic for that, and obviously I changed it so that it builds a ferret index instead of a Lucene index.

To use this script download the  prolog wordnet database and extract it. Run the script without any arguments to see the usage. The file you will want to use is 'wn_s.pl'.

The index is built on the idea that there are two fields. A "word" field and a "syn" field. The word field is the word to look up, and the syn field is an array of all the synonyms. When ferret returns the syn field it will return the array as it was indexed.

require 'rubygems'
require 'ferret'

def index(index_dir, word2nums, num2words)
  row = 0
  mod = 1
  
  # override the specific index if it already exists

  field_infos = Ferret::Index::FieldInfos.new()
  field_infos.add_field(:word, :index => :untokenized, :term_vector => :no)
  field_infos.add_field(:syn, :index => :no, :term_vector => :no)
  index = Ferret::Index::Index.new(:path => index_dir, :field_infos => field_infos)
  word2nums.each do |key, value|
    doc = {:word => key}
    n = index_word(word2nums, num2words, key, doc)
    if n > 0
      if ((row = row + 1) % mod) == 0
        puts "\nrow=#{row}/#{word2nums.size} doc=#{doc}"
        mod = mod * 2
      end
      index << doc
    end # else degenerate
  end  
end

# Given 2 maps fills a document for 1 word
def index_word(word2nums, num2words, key, doc)
  words = []
  word2nums[key].each do |value|
    words << num2words[value] unless num2words[value].nil?
  end
  words.flatten!
  words.uniq!
  
  num = 0
  words.delete(key) # remove itself
  
  doc[:syn] = []
  words.each do |value|
    num = num + 1
    doc[:syn] << value
  end

  num
end

def usage
  puts "ruby wordnet_prolog_2_ferret.rb <prolog file> <index dir>"
end

if ARGV.size.eql? 2
  @prolog_filename = ARGV[0]
  @index_dir = ARGV[1]
else
  usage;
  exit(1);
end

# make sure the prolog file is readable
unless File.readable?(@prolog_filename)
  puts "Error: cannot read Prolog file: #{@prolog_filename}"
  exit(1)
end

# exit if the target index directory already exists
if File.exist?(@index_dir)
  puts "Error: index directory already exists: #{@index_dir}"
  puts "Please specify the name of a non-existent directory"
  exit(1)
end

puts "Opening Prolog file #{@prolog_filename}"
File.open(@prolog_filename, "r") do |file|
  word2nums = {}
  num2words = {}
  rejected_words = 0
  
  mod = 1; # used for
  row = 1; # status updates
  
  puts "[1/2] Parsing #{@prolog_filename}"
  while (line = file.gets)
    # occasional progress
    if ((row = row +1) % mod) == 0 # periodically print out line we read in
      mod = mod * 2
      puts "\n#{row} #{line} word2num size: #{word2nums.size} num2words size: #{num2words.size} rejected words=#{rejected_words}"
    end
    
    # syntax check
    unless line[0..1] == "s("
      puts "OUCH: #{line}"
      exit(1);
    end
    
    # parse line
    line = line[2..-4]
    line_parts = line.split(',')
    line_parts[2] = line_parts[2].slice(1..-2).downcase # trim single quotes off word
    
    # 1/2: word2nums map
    # append to entry or add new one
    lis = word2nums[line_parts[2]]
    if lis.nil?
      word2nums[line_parts[2]] = [line_parts[0]]
    else
      lis << line_parts[0]
    end
    
    # 2/2: num2words map
    lis = num2words[line_parts[0]]
    if lis.nil?
      num2words[line_parts[0]] = [line_parts[2]]
    else
      lis << line_parts[2]
    end
  end
  
  puts "\n[2/2] Building index to store synonyms, map sizes are #{word2nums.size} and #{num2words.size}"
  index(@index_dir, word2nums, num2words)
end
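The two maps the script builds can be illustrated on a tiny sample. The sketch below is not the script itself: the sample lines and synset numbers are invented, the helper names parse_synsets and synonyms_for are made up, and a regular expression is used in place of the split-based parsing above. The lines follow the s(synset, w_num, 'word', type, sense, count) shape that the script expects.

```ruby
# Invented sample lines in the shape of wn_s.pl entries.
SAMPLE_LINES = [
  "s(100000001,1,'ferret',n,1,0).",
  "s(100000001,2,'black-footed ferret',n,1,0).",
  "s(100000002,1,'ferret',v,1,0).",
  "s(100000002,2,'ferret out',v,1,0).",
  "s(100000003,1,'rabbit',n,1,0)."
].freeze

# Builds word => [synset numbers] and synset number => [words].
def parse_synsets(lines)
  word2nums = Hash.new { |hash, key| hash[key] = [] }
  num2words = Hash.new { |hash, key| hash[key] = [] }

  lines.each do |line|
    match = line.match(/\As\((\d+),\d+,'(.*)',\w+,\d+,\d+\)\.\s*\z/)
    next unless match # skip anything that is not a synset line

    num = match[1]
    word = match[2].downcase
    word2nums[word] << num
    num2words[num] << word
  end

  [word2nums, num2words]
end

# All words sharing a synset with +word+, excluding the word itself.
def synonyms_for(word, word2nums, num2words)
  return [] unless word2nums.key?(word)

  word2nums[word].flat_map { |num| num2words[num] }.uniq - [word]
end

word2nums, num2words = parse_synsets(SAMPLE_LINES)
ferret_synonyms = synonyms_for('ferret', word2nums, num2words)
```

A word such as 'rabbit' that is alone in its synset ends up with no synonyms, which corresponds to the case the script skips when index_word returns 0.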

YAML Version

For completeness' sake, here is a quick version thrown together to create a YAML version of the wordnet database. Due to the length of the code I'm only including the methods that have changed. This script will take some time to complete and will use a lot of resources (over 250 megs of memory to create). The resulting YAML file will require a little over 50 megs of memory when loaded for searching.

The file that is output has a different format than the example YAML file listed above, but it works exactly the same.

require 'yaml' # instead of require 'ferret'

def index(index_dir, word2nums, num2words)
  row = 0
  mod = 1
  
  doc = {}
  word2nums.each do |key, value|
    n = index_word(word2nums, num2words, key, doc)
    if n > 0
      if ((row = row + 1) % mod) == 0
        puts "\nrow=#{row}/#{word2nums.size} doc_count=#{doc.size}"
        mod = mod * 2
      end
    end
  end
  
  File.open(index_dir, 'w') do |out|
    YAML.dump(doc, out)
  end
end

# Given 2 maps fills a document for 1 word
def index_word(word2nums, num2words, key, doc)
  words = []
  word2nums[key].each do |value|
    words << num2words[value] unless num2words[value].nil?
  end
  words.flatten!
  words.uniq!
  
  num = 0
  words.delete(key) # remove itself
  
  doc[key] = [] if words.size > 0
  words.each do |value|
    num = num + 1
    doc[key] << value
  end

  num
end

Integrating with acts_as_ferret and Rails

Using this with Rails and acts_as_ferret is easy. Store these files in your "#{RAILS_ROOT}/lib" directory so they are loaded by the Rails system when it starts up.

Then modify one of your existing aaf enabled models similar to the following:

class Test < ActiveRecord::Base
  acts_as_ferret(
    :fields => [:your, :fields, :here],
    :store_class_name => true,
    :ferret => {
      :or_default => false,
      :analyzer => SynonymAnalyzer.new(
#        YAMLSynonymEngine.new("#{RAILS_ROOT}/extras/synonyms.yaml"), [])
        WordnetSynonymEngine.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/wordnet"), [])
    }
  )
end

Move the indexes you created in the section above into the relevant areas. Delete the engine reference you don't want to use. Then you're all set up.

Outstanding Issues

There are still some issues that need to be taken care of:

1. You cannot do a synonym based search for words with spaces yet. Since the tokenizer breaks words up by spaces it will not find these in the index.

2. Currently aaf doesn't support different analyzers for searching and indexing, so doing this actually causes the synonym insertion to be done twice (once during indexing and again during query generation). Really it's only needed once: during indexing if you want it to be more transparent to the user, or during query generation if you want to give the user control over when to search for synonyms. More on this second option in a minute.

3. Currently I get errors when trying to use this with a ferret index running on DRb.

Above I mentioned allowing the user to control the search for synonyms. I was considering a construct of "%{word or words}" to add to the grammar. This would give the user the ability to do "rabbits %{ferret}" and the resulting query would look like:

rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out"

That would actually solve issues 1 and 2 above, since enclosing the synonym search in curly braces would allow for multi-word synonyms. It would also remove the need to index synonyms along with your documents upon insertion into the database, keeping the size of the index down as well.

None of that has been done as of yet, so for now the synonym searching is not as robust as it will hopefully become.
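Purely as a sketch of what the proposed "%{...}" construct could look like if done as a preprocessing step on the query string, here is a small example. It is not part of Ferret's query parser. The method name expand_synonyms is hypothetical, and a plain Hash stands in for the synonym engine.

```ruby
# Replaces every %{word or words} in +query+ with the word followed
# by its synonyms, joined with '|'. Multi-word terms are quoted.
def expand_synonyms(query, synonyms)
  query.gsub(/%\{([^}]+)\}/) do
    word = Regexp.last_match(1).strip
    terms = [word] + Array(synonyms[word])
    terms.map { |term| term.include?(' ') ? %("#{term}") : term }.join('|')
  end
end

synonyms = {
  'ferret' => ['black-footed ferret', 'mustela nigripes', 'ferret out']
}

expanded = expand_synonyms('rabbits %{ferret}', synonyms)
```

Words without an entry in the lookup are passed through unchanged, so a query with "%{unknownword}" simply searches for that word on its own.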