[Java] Full-text search with Lucene

Sometimes you have to search large text data in your database. When you use like “%text to find%” SQL expressions database does very slow searches. You can improve it by creating full-text indexes on database’s columns. It may be very costly and in some cases you don’t want to create SQL-specific queries. The possible solution is to use external full-text search engine. Nowadays there are some popular engines, for example Lucene, Sphinx or Solr. In this article I emphasize on Lucene, free Java solution on Apache 2.0 License.

Important: you must have following jars (or newer versions) in classpath:

lucene-core-3.0.3.jar

The following code and comments will explain how to use it in simple searching.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

/**
 * Simple class that keeps article.
 */
class Article {
        public static final String AUTHOR_FIELD = "AUTHOR_FIELD";
        public static final String TEXT_FIELD = "TEXT_FIELD";
        public static final String ID_FIELD = "ID_FIELD";
        private Integer id;
        private String text;
        private String author;
        public Article(Integer id, String text, String author) {
                this.id = id;
                this.author = author;
                this.text = text;
        }
        public Integer getId() {
                return id;
        }
        public void setId(Integer id) {
                this.id = id;
        }
        public String getText() {
                return text;
        }
        public void setText(String text) {
                this.text = text;
        }
        public String getAuthor() {
                return author;
        }
        public void setAuthor(String author) {
                this.author = author;
        }
        /**
         * Creates document for Lucene searching.
         *
         * @return Lucene's document
         */
        public Document createDocument() {
                Document document = new Document();
                document.add(new Field(AUTHOR_FIELD, author, Store.YES,
                                Index.ANALYZED));
                document.add(new Field(TEXT_FIELD, text, Store.YES, Index.ANALYZED));
                document.add(new Field(ID_FIELD, id.toString(), Store.YES, Index.NO));
                return document;
        }
}

public class LuceneTest {
        /**
         * Directory to keep index data.
         * In this case we will use RAM directory,
         * which stands for keeping data in memory,
         * with no persitence.
         */
        private final Directory directory;
        /**
         * Analyzer for documents. Simple analyzer
         * will be enough for this case.
         */
        private final Analyzer analyzer;
        /**
         * The object for searching operations.
         */
        private IndexSearcher indexSearcher;
        public LuceneTest() throws Exception {
                directory = new RAMDirectory();
                analyzer = new SimpleAnalyzer();
                /**
                 * IndexWriter is used in writing data to index.
                 */
                IndexWriter indexWriter = new IndexWriter(directory, analyzer,
                                MaxFieldLength.LIMITED);
                /**
                 * Create some data to be search.
                 */
                for (Article art : new Article[] {
                                new Article(1,
                                                "lucene has the great speed",
                                                "lucene developer"),
                                new Article(2,
                                                "text-search engine - lucene - is for you",
                                                "pr"),
                                new Article(3,
                                                "let's check out what lucene can give to us!",
                                                "codesmuggler")
                }) {
                        /**
                         * Adding document to our index.
                         */
                        indexWriter.addDocument(art.createDocument());
                }
                /**
                 * Don't forget to close index writer!
                 */
                indexWriter.close();
        }
        private void ensureIndexSearcherOpened() throws Exception {
                if (indexSearcher == null) {
                        indexSearcher = new IndexSearcher(directory);
                }
        }
        public void searchAndPrint(String author, String text) throws Exception {
                ensureIndexSearcherOpened();
                /**
                 * Query we will use in searching.
                 */
                String q = Article.AUTHOR_FIELD + ":\"" + author + "\" OR "
                                + Article.TEXT_FIELD + ": \"" + text + "\"";
                System.out.println("\n" + q);
                /**
                 * Query parser is used for parsing string query.
                 * It uses the same analyzer and Lucene Version for
                 * parsing.
                 */
                Query query = new QueryParser(Version.LUCENE_30, "", analyzer)
                                        .parse(q);
                /**
                 * Searches and return max 3 docs from index.
                 */
                printDocs(indexSearcher.search(query, 3).scoreDocs);
        }
        private void printDocs(ScoreDoc[] docs) throws Exception {
                ensureIndexSearcherOpened();
                System.out.println("MATCHES:");
                int i = 1;
                for (ScoreDoc scoreDoc : docs) {
                        /**
                         * Get document from index!
                         */
                        Document doc = indexSearcher.doc(scoreDoc.doc);
                        /**
                         * Document could be null if in the same time
                         * someone has thrown out this document from index.
                         */
                        if (doc != null) {
                                System.out.println(i++ + ":");
                                System.out.println("AUTHOR: " + doc.get(Article.AUTHOR_FIELD));
                                System.out.println("TEXT: " + doc.get(Article.TEXT_FIELD));
                        }
                }
        }
        public void close() throws Exception {
                if (indexSearcher != null) {
                        indexSearcher.close();
                }
                if (directory != null) {
                        directory.close();
                }
                if (analyzer != null) {
                        analyzer.close();
                }
        }

        /**
         * Do some tests!
         */
        public static void main(String[] args) throws Exception {
                LuceneTest lucene = new LuceneTest();
                lucene.searchAndPrint("", "speed"); // 1 doc
                lucene.searchAndPrint("codesmuggler", ""); // 1 doc
                lucene.searchAndPrint("", "lucene"); // 3 docs
                lucene.searchAndPrint("unknown", "weird words"); // 0 docs
                lucene.close();
        }
}

And the out:

AUTHOR_FIELD:"" OR TEXT_FIELD: "speed"
MATCHES:
1:
AUTHOR: lucene developer
TEXT: lucene has the great speed

AUTHOR_FIELD:"codesmuggler" OR TEXT_FIELD: ""
MATCHES:
1:
AUTHOR: codesmuggler
TEXT: let's check out what lucene can give to us!

AUTHOR_FIELD:"" OR TEXT_FIELD: "lucene"
MATCHES:
1:
AUTHOR: lucene developer
TEXT: lucene has the great speed
2:
AUTHOR: pr
TEXT: text-search engine - lucene - is for you
3:
AUTHOR: codesmuggler
TEXT: let's check out what lucene can give to us!

AUTHOR_FIELD:"unknown" OR TEXT_FIELD: "weird words"
MATCHES:

As you see in the example, the usage of Lucene is very simple.
The pluses are:
+ very fast searching,
+ no like sql-engine-specific queries,
+ less time spend on database operation.
The minuses:
- another system to integrate,
- could be overkill if your database is very small.

This entry was posted in coding, enterprise edition, java and tagged , , , , , , , , . Bookmark the permalink.

One Response to [Java] Full-text search with Lucene

  1. Vincenzo says:

    Great article! It was very usefull for me.

    ps: new IndexWriter(directory, analyzer,MaxFieldLength.LIMITED);
    and
    new SimpleAnalyzer();

    are deprecated. ;)
    Thanks!
    Vincenzo

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

*


*

You may use these HTML tags and attributes: