[[FerretToc]]

= How Tos =

Please add your own tutorials or how-tos to this page.[[BR]]
''(You might want to check out the FerretArticles section as well)''

----

== How to Integrate Ferret With Rails ==

=== acts_as_ferret ===

acts_as_ferret is the recommended way to integrate Ferret with Rails. Its maintainers have put up an svn repository and Trac at http://projects.jkraemer.net/acts_as_ferret/

Thanks to Kasper Weibel, who started this great plugin.

''' There's a page on this wiki as well: FerretOnRails ! '''

=== using ferret with activerecord ===

Aslak Hellesoy has another piece; http://aslakhellesoy.com/articles/2005/11/18/using-ferret-with-activerecord

=== How to integrate Ferret with rails on the rail wiki (a little outdated) ===

Jan Prill has written up a great howto on integrating Ferret with Rails. You can check it out here: http://wiki.rubyonrails.com/rails/pages/HowToIntegrateFerretWithRails . The recommended way for integration is acts_as_ferret.

------------------

== How to index all files under a directory ==

Check out Brian McCallister's [http://kasparov.skife.org/blog/src/ruby/ferret.html blog] for a description of how to do this.

------------------

== How to create a persistent index. ==

I'll just quickly explain the :create and :create_if_missing options of the index. To create a new index, use :create => true. This will create a new index, regardless of whether an index already exists in the specified directory. So, if you are going to use this option, you should only use it the first time, i.e.;

{{{
#!ruby
index = Index::Index.new(:path=>'/tmp/ferret',:create=>true)
# Add fields etc. run searches etc.
index.close

# and in a new session
index = Index::Index.new(:path=>'/tmp/ferret')
# Add fields etc. run searches etc.
index.close
}}}

If you want Ferret to create an index only when one is missing, you can explicitly set :create_if_missing => true. This is the default behaviour in 0.1.1.
If you want an exception thrown when there is no index, set :create_if_missing => false.

------------------

== How to search within any field of a document. ==

{{{
#!ruby
# And if you want to search all fields by default;
index = Index::Index.new(:default_field => "*")
# Note :default_field is already an option and currently defaults to ""
# but I'll probably make it default to "*" in the next version

# Now to search for foo in all fields in documents after Jan 1st 2005;
topdocs = index.search("foo AND created: >= 20050101")
}}}

Please see more on this [http://blog.davebalmain.com/articles/2005/10/23/searching-multiple-fields-in-ferret here].

--------------------

== How to use Ferret on an Existing Java Lucene Index ==

Unless it is very easy for you to reindex all of your documents using Lucene, I recommend you make a copy of your index. This hasn't been extensively tested so I can't guarantee you won't corrupt your index. So on *nix, that would be;

{{{
cp -R /path/to/index /path/to/index_copy
}}}

Then simply open the index as you usually would in Ferret;

{{{
#!ruby
index = Index::Index.new(:path=>'/path/to/index_copy')
# Add fields etc. run searches etc.
index.close
}}}

Note: it appears this doesn't work if the Lucene index uses the ".cfs" compound file format. When I call IndexWriter.setUseCompoundFile(false) in the Java program it works great.

----------------------

== How to use the C Indexer ==

Currently cFerret isn't really ready for release and it'll only run on Linux (it compiles and passes the tests on Mac OS X v.4), but it is definitely usable if you know your way around C. If you want to index a lot of documents, it may be worth looking into. Indexing all of the text documents on my PC took over 20 minutes with Java Lucene and less than a minute with cFerret, and the indexes were identical. To get cFerret, you'll need subversion;

{{{
svn co svn://www.davebalmain.com/cferret/trunk cferret
}}}

Then change directory into cferret and run make.
If you've got gcc 4.0 or above you'll get a lot of warnings which you can ignore (I do plan to fix that). Running make will just compile all of the object files and run the unit tests. If the tests don't all pass, please let me know at dbalmain@gmail.com. Then look at bench.c to see how to use it.

----------------------

== How to use Index::Index in Multi-Threaded Applications ==

The Index::Index class is thread-safe so it should run well in threaded applications. One thing to note is that document numbers are ephemeral, i.e. they may change as the index is updated. Clients should thus not rely on a given document having the same number between requests. There are two possible solutions to this. You can synchronize on the index;

{{{
#!ruby
# docs is declared outside the block so that it is still visible afterwards
docs = []
index.synchronize do
  topdocs = index.search("foo AND created: >= 20050101")
  topdocs.each {|doc, score| docs << index[doc]}
end
docs.each do |doc|
  # You can now do whatever you want with your documents
end
}}}

Or perhaps you can do this a little more easily with the block search method;

{{{
#!ruby
docs = []
index.search_each("foo AND created: >= 20050101") do |doc, score|
  docs << index[doc]
end
docs.each do |doc|
  # You can now do whatever you want with your documents
end
}}}

You can also use the synchronize call shown in the first example to run transactions on the index.

Things get a little more complicated when you have multiple separate processes accessing an index, for example in a Rails web app served by multiple dispatchers. To handle this, you need to make sure your index is flushed as soon as you perform an update, i.e. add or delete a document. Otherwise, other processes trying to update the index will time out while trying to get the write lock.
Here is an example;

{{{
#!ruby
# Change all instances of the name David Grey to the correct David Gray
index.search_each('artist:"David Grey"') do |doc_num, score|
  document = index[doc_num]
  document[:artist] = "David Gray"
  index.delete(doc_num)
  index << document # Note that the document will now have a new document number
end
index.flush # <= this method should be called if you want other processes to be able to update the index.
}}}

-----------------------

== How to not use the main Index::Index class ==

Right now it is fairly simple to use the Index::Index class. It handles most of the index updating and locking for you. The problem is, it is doing a lot of extra work to make sure that you are always searching on the latest index. It is actually a lot more efficient to have one object for updating the index and as many others as you like for searching the index. This gives you more control over what is going on in the index and leads to greater efficiency. The classes involved are Index::IndexReader, Index::IndexWriter, and Index::IndexSearcher. I'll cover the searcher here first, and most others will follow suit.

{{{
#!ruby
include Ferret::Search

# In the current release (0.1.3) there is a bug in the IndexSearcher. In the initialize function
# the line that calls FSDirectory.open should be changed to FSDirectory.new(args, true).
# I believe this is fixed in the dev build.
sr = Index::IndexSearcher.new("path/to/index")

# Then to use the searcher we can do something like:
# You can include options in the QueryParser if you want. I decided to leave them blank for now
qp = Ferret::QueryParser.new( )

# Need to get the field names out, or you won't be searching much
qp.field = sr.reader.get_field_names.to_a

# Then just search as follows
sr.search(qp.parse("whatever"))
}}}

That should allow you to get a read-only searcher up and running without requiring any write locks or such. One important note on this.
If you have an external process change your index, you will need to reset the reader object to get those changes.

{{{
#!ruby
include Ferret::Store
sr.reader = Index::IndexReader.open(FSDirectory.get_directory("path/to/index"))
}}}

-------------------

== How to remove all documents from index ==

If you want to rebuild an index from scratch, you need to remove all documents from it first. You can do that with the following code.

{{{
#!ruby
index.size.times {|i| index.delete(i)}
}}}

--------------------

== How to use keys for document ==

Ferret has a very useful concept of document keys. You can think of a key as a document field whose value is unique across the index. Some code may help you understand this a bit better. Let's imagine that we want to index a Document object.

{{{
#!ruby
document = Document.find(some_id) # Document is our business class that we want to index with Ferret
index << {:id => document.id, :text => document.text}
}}}

If you run this code, the document is indexed, which is exactly what we need. But what happens if we run this code again? We would then have two Document entries with the same id in our index, which is wrong: we need to store just one. This is where index ''keys'' help. In the code below we declare the ''id'' field to be the key of the index, so after the code runs there is only one document in the index.

{{{
#!ruby
index = Index::Index.new(:key => :id)
index << {:id => 23, :data => "This is the data..."}
index << {:id => 23, :data => "This is the new data..."}
}}}

Remember also that we can very quickly get a document by its key (and I love Ferret for this feature)

{{{
#!ruby
index["23"]   # Get document with key 23
index[112]    # Get document with internal number 112. It is NOT the key field.
              # It is just the internal Ferret id. This number is subject to change
              # whenever the document is updated or other documents are deleted and
              # the index is optimized.
# Now we will remove by key
index.remove("23")    # Remove Document with id=23 from index. The same as the following statement
index.remove("id:23")
}}}

---------------------

== How to index an IMAP directory ==

John Wells has written some code to index mail via IMAP using Ferret. The code can be found in [http://www.ruby-forum.com/topic/51242 this thread] on the Ruby Forum.

---------------------

== How to do location-based searches (search by zip code) ==

You can find some example code posted on the tourbus blog at [http://blog.tourb.us/archives/ferret-and-location-based-searches]

---------------------

== How to build a ferret index from documents with different mime types ==

''' The FerretHelper module and Ferret Finder utility '''

Stuart Rackham wrote [http://www.methods.co.nz/ff/ Helpers and Utilities] for indexing the filesystem. With his tools you are able to use commands like

{{{
$ ff -i ~/doc ~/projects    # Create new index of doc and projects directories
$ ff instantiation ruby     # Find docs with both words
$ ff "array ruby -python"   # Find docs with array and ruby but not python
$ ff file:*ruby*.txt        # Find docs with file names like *ruby*.txt
}}}

His library uses the following tools for conversion to indexable text:

 - PDF to text conversion with pdftotext
 - HTML to text conversion with html2text
 - Open Document to text conversion with odt2txt
 - Word to text conversion with antiword

Converting these common document types for indexing is a task that everyone who wants to do desktop search will face. If that's interesting for you, you might want to have a look at RDig as well (following right underneath...)

-----------------

== How to crawl internet-sites, an intranet or the filesystem and index the crawled documents - RDig ==

Jens Kraemer came up with a great tool for crawling documents that reside on the internet, your intranet or the file-system.
Have a look at [http://rdig.rubyforge.org/ RDig]: RDig provides an HTTP crawler and content extraction utilities to help build a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, the index can be built with a single call to rdig.

-----------------

== How to index word documents ==

[http://www.winfield.demon.nl/ antiword] is a great tool for converting word documents to text. You can use this to batch convert your word documents to text so you can index them with Lucene. You can see a web demo of this in action at [http://scattrbrain.com/stuff/word scattrbrain]

-----------------

== How to make sure that the index gets valid UTF-8 text ==

Paul Battley has a good blog post on correcting UTF-8 text at http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/. Basically you use the iconv library (a standard library) and do this;

{{{
#!ruby
require 'iconv'

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)
}}}

-----------------

== How to launch DRb server on reboot (linux) ==

Many people have had a difficult time getting their DRb server to launch at reboot on newer Linux distributions. This is caused by a PATH issue that comes about when users have installed Ruby in /usr/local/bin and their Linux distribution uses SELinux. Here's a fix (and a startup script):

{{{
#!/bin/bash
#
# This script starts and stops the ferret DRb server
# chkconfig: 2345 89 36
# description: Ferret search engine for ruby apps.
#
# save the current directory
CURDIR=`pwd`
PATH=/usr/local/bin:$PATH
RORPATH="/path/to/ror_root"

case "$1" in
  start)
    cd $RORPATH
    echo "Starting ferret DRb server."
    FERRET_USE_LOCAL_INDEX=1 \
                script/runner -e production \
                vendor/plugins/acts_as_ferret/script/ferret_start
    ;;
  stop)
    cd $RORPATH
    echo "Stopping ferret DRb server."
    FERRET_USE_LOCAL_INDEX=1 \
                script/runner -e production \
                vendor/plugins/acts_as_ferret/script/ferret_stop
    ;;
  *)
    echo $"Usage: $0 {start, stop}"
    exit 1
    ;;
esac
cd $CURDIR
}}}

-----------------

== How to create synonym based searching ==

Most code listed here is based on examples from the "Lucene in Action" book along with examples of how to create filters/analyzers from the Ferret mailing list. The wordnet_prolog_2_ferret.rb script is based on the Lucene program Syns2Index.java.

=== Creating the analyzer ===

The !SynonymAnalyzer is fairly simple, like most analyzers. It is very similar to the !StandardAnalyzer apart from a few differences noted below. A synonym engine must be supplied to the analyzer. The engine is required to do the lookup of a word and return the resulting synonyms. The !SynonymAnalyzer also requires a !SynonymTokenFilter that does most of the work and actually makes the calls to the specified synonym engine. Finally, unlike the !StandardAnalyzer, this class does not run tokens through the !HyphenFilter, because if there are hyphenated words that have synonyms it would be nice to capture those.

{{{
class SynonymAnalyzer < Ferret::Analysis::Analyzer
  include Ferret::Analysis

  def initialize(synonym_engine, stop_words = FULL_ENGLISH_STOP_WORDS, lower = true)
    @synonym_engine = synonym_engine
    @lower = lower
    @stop_words = stop_words
  end

  def token_stream(field, str)
    ts = StandardTokenizer.new(str)
    ts = LowerCaseFilter.new(ts) if @lower
    ts = StopFilter.new(ts, @stop_words)
    ts = SynonymTokenFilter.new(ts, @synonym_engine)
  end
end
}}}

=== Creating the token filter ===

!SynonymTokenFilter does the job of taking a token from the supplied token stream and injecting all the synonyms for that token. There are a couple of interesting pieces of code here, starting with the 'next' method. The first thing it does is check the @synonym_stack to see if there are any synonyms left in it, and if so it returns one of those instead of the next token in the @token_stream.
If @synonym_stack is empty then it proceeds to find the next token, and if that is not nil it calls add_synonyms_to_stack. The add_synonyms_to_stack method takes the supplied token, calls the get_synonyms method of the @synonym_engine and then loops over the results, adding them to the stack. While adding them to the stack it turns them into tokens that have the same start position and end position as the original token. It also makes sure to set the position increment to 0. That is very important because you want all the synonyms and the original token to have the same position.

{{{
class SynonymTokenFilter < Ferret::Analysis::TokenStream
  include Ferret::Analysis

  def initialize(token_stream, synonym_engine)
    @token_stream = token_stream
    @synonym_stack = []
    @synonym_engine = synonym_engine
  end

  def text=(text)
    @token_stream.text = text
  end

  def next
    # Hand out any pending synonyms before reading the next real token
    return @synonym_stack.pop if @synonym_stack.size > 0
    if token = @token_stream.next
      add_synonyms_to_stack(token) unless token.nil?
    end
    return token
  end

  private

  def add_synonyms_to_stack(token)
    synonyms = @synonym_engine.get_synonyms(token.text)
    return if synonyms.nil?
    synonyms.each do |s|
      # Same offsets as the original token, position increment 0
      @synonym_stack.push(Token.new(s, token.start, token.end, 0))
    end
  end
end
}}}

=== Create the synonym engine ===

The !WordnetSynonymEngine does the actual job of querying an existing ferret index for the synonyms of any word passed to get_synonyms. The engine creates a searcher object to use for every call to get_synonyms. The 'existing ferret index' mentioned previously is created by wordnet_prolog_2_ferret.rb, which is described in the next section. When get_synonyms is called it creates a simple !TermQuery object on the "word" field in the index and returns the first result it finds from the @searcher's search_each method. Any synonym engine must implement get_synonyms, and the results get_synonyms returns must be an array.
{{{
# Accesses a ferret index created from the wordnet synonym database
class WordnetSynonymEngine
  include Ferret::Search

  def initialize(wordnet_index_location)
    @searcher = Searcher.new(wordnet_index_location)
  end

  def get_synonyms(word)
    @searcher.search_each(TermQuery.new(:word, word)) do |doc_id, score|
      return @searcher[doc_id][:syn]
    end
    return nil
  end
end
}}}

The engine described above is based on the example in the "Lucene in Action" book; however, other engines can easily be created. Here's an example of a YAML based synonym engine.

{{{
require 'yaml'

# Accesses a YAML file for synonym lookup.
class YAMLSynonymEngine
  def initialize(index_location)
    @searcher = YAML.load_file(index_location)
  end

  def get_synonyms(word)
    return @searcher[word]
  end
end
}}}

This is a fairly simple class that loads the file specified by the index_location parameter into the @searcher variable. Any call to get_synonyms then just returns the result of looking the word up through @searcher's [] method. If the lookup doesn't find anything it returns nil, but if it does find something it returns that. Again, the engines must return an array, so this YAML engine requires that the YAML file be set up using inline collections. Here's a short example:

{{{
# Notice that multi-word keys must be in quotes.
ferret: ['black-footed ferret', 'mustela nigripes', 'ferret out']
'black-footed ferret': ['ferret', 'mustela nigripes']
'ferret out': ['ferret']
'mustela nigripes': ['black-footed ferret', 'ferret']
}}}

Both engines work the same:

{{{
>> w = WordnetSynonymEngine.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/wordnet")
>> w.get_synonyms('ferret')
=> ["black-footed ferret", "mustela nigripes", "ferret out"]
>>
>> y = YAMLSynonymEngine.new("#{RAILS_ROOT}/extras/synonyms.yaml")
>> y.get_synonyms('ferret')
=> ["black-footed ferret", "mustela nigripes", "ferret out"]
}}}

=== Creating the Wordnet synonym index ===

==== Ferret version ====

This code is a port of the Syns2Index.java program into Ruby with only a few minor changes to how it works.
I did not want to exclude words with spaces in them so I removed any logic for that, and obviously I changed it so that it builds a Ferret index instead of a Lucene index. To use this script, download the [http://wordnet.princeton.edu/obtain prolog wordnet database] and extract it. Run the script without any arguments to see the usage. The file you will want to use is 'wn_s.pl'. The index is built on the idea that there are two fields: a "word" field and a "syn" field. The word field is the word to look up, and the syn field is an array of all the synonyms. When Ferret returns the syn field it will return the array as it was indexed.

{{{
require 'rubygems'
require 'ferret'

def index(index_dir, word2nums, num2words)
  row = 0
  mod = 1

  # override the specific index if it already exists
  field_infos = Ferret::Index::FieldInfos.new()
  field_infos.add_field(:word, :index => :untokenized, :term_vector => :no)
  field_infos.add_field(:syn, :index => :no, :term_vector => :no)
  index = Ferret::Index::Index.new(:path => index_dir, :field_infos => field_infos)

  word2nums.each do |key, value|
    doc = {:word => key}
    n = index_word(word2nums, num2words, key, doc)
    if n > 0
      if ((row = row + 1) % mod) == 0
        puts "\nrow=#{row}/#{word2nums.size} doc=#{doc}"
        mod = mod * 2
      end
      index << doc
    end # else degenerate
  end
end

# Given 2 maps fills a document for 1 word
def index_word(word2nums, num2words, key, doc)
  words = []
  word2nums[key].each do |value|
    words << num2words[value] unless num2words[value].nil?
  end
  words.flatten!
  words.uniq!

  num = 0
  words.delete(key) # remove itself
  doc[:syn] = []
  words.each do |value|
    num = num + 1
    doc[:syn] << value
  end
  num
end

def usage
  puts "ruby wordnet_prolog_to_ferret.rb <prolog file> <index dir>"
end

if ARGV.size == 2
  @prolog_filename = ARGV[0]
  @index_dir = ARGV[1]
else
  usage
  exit(1)
end

# make sure the prolog file is readable
unless File.readable?(@prolog_filename)
  puts "Error: cannot read Prolog file: #{@prolog_filename}"
  exit(1)
end

# exit if the target index directory already exists
if File.exist?(@index_dir)
  puts "Error: index directory already exists: #{@index_dir}"
  puts "Please specify a name of a non-existent directory"
  exit(1)
end

puts "Opening Prolog file #{@prolog_filename}"

File.open(@prolog_filename, "r") do |file|
  word2nums = {}
  num2words = {}
  rejected_words = 0

  mod = 1 # used for
  row = 1 # status updates

  puts "[1/2] Parsing #{@prolog_filename}"
  while (line = file.gets)
    # occasional progress
    if ((row = row + 1) % mod) == 0 # periodically print out line we read in
      mod = mod * 2
      puts "\n#{row} #{line} word2num size: #{word2nums.size} num2words size: #{num2words.size} rejected words=#{rejected_words}"
    end

    # syntax check
    unless line[0..1] == "s("
      puts "OUCH: #{line}"
      exit(1)
    end

    # parse line
    line = line[2..-4]
    line_parts = line.split(',')
    line_parts[2] = line_parts[2].slice(1..-2).downcase # trim single quotes off word

    # 1/2: word2nums map
    # append to entry or add new one
    lis = word2nums[line_parts[2]]
    if lis.nil?
      word2nums[line_parts[2]] = [line_parts[0]]
    else
      lis << line_parts[0]
    end

    # 2/2: num2words map
    lis = num2words[line_parts[0]]
    if lis.nil?
      num2words[line_parts[0]] = [line_parts[2]]
    else
      lis << line_parts[2]
    end
  end

  puts "\n[2/2] Building index to store synonyms, map sizes are #{word2nums.size} and #{num2words.size}"
  index(@index_dir, word2nums, num2words)
end
}}}

==== YAML Version ====

For completeness' sake, here is a quick version thrown together to create a YAML version of the wordnet database. Due to the length of the code I'm only including the relevant methods that have changed. This script will take some time to complete and will use a lot of resources (over 250 megs of memory to create).
The resulting YAML file will require a little over 50 megs of memory when loaded for searching. The file that is output has a different format than the example YAML file listed above, but it works exactly the same.

{{{
require 'yaml' # instead of require 'ferret'

def index(index_dir, word2nums, num2words)
  row = 0
  mod = 1

  doc = {}
  word2nums.each do |key, value|
    n = index_word(word2nums, num2words, key, doc)
    if n > 0
      if ((row = row + 1) % mod) == 0
        puts "\nrow=#{row}/#{word2nums.size} doc_count=#{doc.size}"
        mod = mod * 2
      end
    end
  end

  File.open(index_dir, 'w') do |out|
    YAML.dump(doc, out)
  end
end

# Given 2 maps fills a document for 1 word
def index_word(word2nums, num2words, key, doc)
  words = []
  word2nums[key].each do |value|
    words << num2words[value] unless num2words[value].nil?
  end
  words.flatten!
  words.uniq!

  num = 0
  words.delete(key) # remove itself
  doc[key] = [] if words.size > 0
  words.each do |value|
    num = num + 1
    doc[key] << value
  end
  num
end
}}}

=== Integrating with acts_as_ferret and Rails ===

Using this with Rails and acts_as_ferret is easy. Store these files in your "#{RAILS_ROOT}/lib" directory so they are loaded by the Rails system when it starts up. Then modify one of your existing aaf enabled models similar to the following:

{{{
class Test < ActiveRecord::Base
  acts_as_ferret(
    :fields => [:your, :fields, :here],
    :store_class_name => true,
    :ferret => {
      :or_default => false,
      :analyzer => SynonymAnalyzer.new(
        # YAMLSynonymEngine.new("#{RAILS_ROOT}/extras/synonyms.yaml"), [])
        WordnetSynonymEngine.new("#{RAILS_ROOT}/index/#{ENV['RAILS_ENV']}/wordnet"), [])
    }
  )
end
}}}

Move the indexes you created in the section above into the relevant locations. Delete the engine reference you don't want to use. Then you're all set up.

=== Outstanding Issues ===

There are still some issues that need to be taken care of:

 1. You cannot do a synonym based search for words with spaces yet. Since the tokenizer breaks words up by spaces it will not find these in the index.
 2. Currently aaf doesn't support different analyzers for searching/indexing, so doing this actually causes the synonym insertion to be done twice (once during indexing and another time during query generation). Really it's only needed once: during indexing if you want it more transparent to the user, or during query generation if you want to be able to give the user control of when to search for synonyms. More on this second option in a minute.
 3. Currently I get errors when trying to use this with a ferret index running on DRb.

Above I mentioned allowing the user to control the search for synonyms. I was considering a construct of "%{word or words}" to add to the grammar. This would give the user the ability to do "rabbits %{ferret}" and the resulting query would look like:

{{{
rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out"
}}}

That would actually solve issue 1 and issue 2 above, since enclosing the synonym search in curly braces would allow for multi-word synonyms. It would also remove the need to inject synonyms when your documents are indexed, keeping the size of the index down as well. None of that has been done as of yet, so for now the synonym searching is not as robust as it will hopefully become.
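
Until something like that exists in the grammar, the "%{word or words}" idea could be approximated by rewriting the query string before it is handed to the query parser. What follows is only a rough sketch under that assumption: expand_synonyms, quote_term and HashSynonymEngine are made-up names for illustration and are not part of Ferret or acts_as_ferret. The only thing it relies on from above is that a synonym engine responds to get_synonyms and returns an array (or nil).

```ruby
# Hypothetical stand-in for a synonym engine; any object that responds to
# get_synonyms(word) and returns an array or nil would do.
class HashSynonymEngine
  def initialize(table)
    @table = table
  end

  def get_synonyms(word)
    @table[word]
  end
end

# Wrap multi-word terms in double quotes so they stay together as a phrase.
def quote_term(term)
  term =~ /\s/ ? "\"#{term}\"" : term
end

# Replace every %{...} marker with the enclosed text plus its synonyms,
# joined by '|' as in the example query above. This is plain string
# rewriting and does not touch the Ferret query grammar.
def expand_synonyms(query, engine)
  query.gsub(/%\{([^}]+)\}/) do
    phrase = Regexp.last_match(1).strip
    synonyms = engine.get_synonyms(phrase) || []
    ([phrase] + synonyms).uniq.map { |t| quote_term(t) }.join('|')
  end
end

engine = HashSynonymEngine.new(
  'ferret' => ['black-footed ferret', 'mustela nigripes', 'ferret out'])
puts expand_synonyms('rabbits %{ferret}', engine)
# prints: rabbits ferret|"black-footed ferret"|"mustela nigripes"|"ferret out"
```

Because the expansion happens on the query string, the text inside the braces is looked up as a whole, so multi-word entries can be found. Whether the resulting string parses the way you want still depends on the query parser you pass it to.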