Hello World 2.0: 2008-03 -- Java, Programming, and Other Technical Musings

I had been looking around for an open source Java library that would facilitate searching in PDF files when I discovered the solution of combining PDFBox and Apache Lucene.

PDFBox is an open source Java PDF library for working with PDF documents. It allows creation of new PDF documents, manipulation of existing documents, and - most importantly for this purpose - the ability to extract content from documents.

Apache Lucene, on the other hand, provides Java-based indexing and search technology.

It is not hard to see, then, that these two libraries can be used in combination for PDF searching in Java: PDFBox can be used to extract text from PDF documents, and Lucene can be used to search through the extracted text. In fact, it is easier than that, as PDFBox provides a utility that enables simple integration with Lucene. This utility is the org.pdfbox.searchengine.lucene.LucenePDFDocument class, which contains static methods for obtaining a Lucene document from a PDF file. The document can then be added to a Lucene index, which can be searched with an index searcher.
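As a rough mental model of the index-then-search step, here is a toy sketch that uses only the JDK. It is not how Lucene is implemented; the ToyInvertedIndex class, its methods, and the sample text (which stands in for text extracted by PDFBox) are all hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT Lucene's implementation. All names are hypothetical.
final class ToyInvertedIndex
{
    // Maps each lower-cased term to the ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // "Analyses" the text (lower-case, split on non-alphanumerics) and
    // records every resulting term against the document id.
    void addDocument(int docId, String text)
    {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
        {
            if (token.isEmpty())
            {
                continue;
            }

            Set<Integer> docIds = postings.get(token);

            if (docIds == null)
            {
                docIds = new HashSet<Integer>();
                postings.put(token, docIds);
            }

            docIds.add(docId);
        }
    }

    // Exact lookup of a single term. The term is used as-is, so it must
    // already be in the normalised (lower-case) form.
    Set<Integer> search(String term)
    {
        Set<Integer> docIds = postings.get(term);
        return (docIds != null) ? docIds : new HashSet<Integer>();
    }

    public static void main(String[] args)
    {
        ToyInvertedIndex index = new ToyInvertedIndex();

        // Stands in for the text that would be extracted from a PDF file.
        index.addDocument(0, "Searching PDF files with Lucene");

        System.out.println(index.search("lucene").isEmpty() ? "Not Found" : "Found");
        System.out.println(index.search("Lucene").isEmpty() ? "Not Found" : "Found");
    }
}
```

The second lookup fails because the search term is not normalised in the same way as the indexed text. A similar caveat applies to a Lucene TermQuery, which is not passed through the analyzer, so with a StandardAnalyzer (which lower-cases tokens) the search term should be given in lower case.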

A simple implementation that determines whether a specified term is present in a PDF file:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public final class SimplePdfSearch
{
    private static final String PDF_FILE_PATH = "/path/to/pdffile.pdf";
    private static final String SEARCH_TERM = "searchterm";

    public static final void main(String[] args) throws IOException
    {
        Directory directory = null;

        try
        {
            // Let PDFBox extract the text and wrap it in a Lucene document.
            File pdfFile = new File(PDF_FILE_PATH);
            Document document = LucenePDFDocument.getDocument(pdfFile);

            directory = new RAMDirectory();

            IndexWriter indexWriter = null;

            try
            {
                // The final argument (true) creates a new index rather than
                // appending to an existing one.
                Analyzer analyzer = new StandardAnalyzer();
                indexWriter = new IndexWriter(directory, analyzer, true);

                indexWriter.addDocument(document);
            }
            finally
            {
                // Closing the writer flushes the document to the index, so a
                // failure here is propagated rather than swallowed.
                if (indexWriter != null)
                {
                    indexWriter.close();
                }
            }

            IndexSearcher indexSearcher = null;

            try
            {
                indexSearcher = new IndexSearcher(directory);

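                // A TermQuery is not analysed, so the term must match the
                // indexed form exactly (StandardAnalyzer lower-cases tokens).
                Term term = new Term("contents", SEARCH_TERM);
                Query query = new TermQuery(term);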

                Hits hits = indexSearcher.search(query);

                System.out.println((hits.length() != 0) ? "Found" : "Not Found");
            }
            finally
            {
                if (indexSearcher != null)
                {
                    try
                    {
                        indexSearcher.close();
                    }
                    catch (IOException ignore)
                    {
                        // Ignore
                    }

                    indexSearcher = null;
                }
            }
        }
        finally
        {
            if (directory != null)
            {
                try
                {
                    directory.close();
                }
                catch (IOException ignore)
                {
                    // Ignore
                }

                directory = null;
            }
        }
    }
}

This code fragment demonstrates only the basic concept, and is not very useful per se, but it is not difficult to extend it to do some powerful searches by utilising the capabilities of Lucene. For example, different queries such as a phrase query or a fuzzy query can be used instead of a term query (see org.apache.lucene.search.Query), and a highlighter object can be used to extract the text fragments that contain the found term (see org.apache.lucene.search.highlight.Highlighter).
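To give a feel for what "fuzzy" means here: Lucene's FuzzyQuery bases its similarity measure on Levenshtein edit distance. The sketch below computes that distance with the JDK alone. It is an illustration of the idea, not Lucene's code; the ToyFuzzyMatch class and the maxEdits threshold are hypothetical simplifications, and Lucene exposes its own similarity setting instead.

```java
// Toy illustration of edit-distance matching; NOT Lucene's implementation.
// The class, method names, and maxEdits parameter are hypothetical.
final class ToyFuzzyMatch
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn a into b.
    static int editDistance(String a, String b)
    {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.length(); i++)
        {
            current[0] = i;

            for (int j = 1; j <= b.length(); j++)
            {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int substitution = previous[j - 1] + cost;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;

                current[j] = Math.min(substitution, Math.min(deletion, insertion));
            }

            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.length()];
    }

    // True if the two terms are within maxEdits edits of each other.
    static boolean fuzzyMatches(String indexedTerm, String searchTerm, int maxEdits)
    {
        return editDistance(indexedTerm, searchTerm) <= maxEdits;
    }

    public static void main(String[] args)
    {
        System.out.println(fuzzyMatches("lucene", "lucine", 1)); // prints "true"
        System.out.println(fuzzyMatches("lucene", "solr", 1));   // prints "false"
    }
}
```

A misspelt search term such as "lucine" would therefore still match an indexed "lucene", which an exact term query would miss.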

2008-03-12

Java Library to Search in PDF Files

2008-03-07
