Searching with Lucene's subproject Nutch

Some websites provide no search facility at all, or return results that are too poor to be useful. So I looked around for alternatives, found Nutch, and got it running.

Nutch is a subproject of Apache Lucene. Its job is twofold:

  1. Crawl certain specified webpages
  2. Provide a web application for the search

The first part is pretty straightforward and well documented. I decided to compile the sources myself from the Subversion trunk.
The crawl failed once because a file could not be found in the cache. Deleting the cache and recrawling solved this.
Setting up the web application and getting it to run correctly is not as trivial as the tutorial suggests, since there are some pitfalls.

  1. If you already have Tomcat installed, you will not want to install nutch-x.y.war as ROOT.war. Instead, copy nutch-x.y.war into the webapps directory and rename it to nutch.war. Then start and stop Tomcat so that the web archive is extracted.
  2. The web application does not know where your crawled indices are, so any search will find nothing. You have to specify the exact location in nutch-site.xml, which can be found under webapps/nutch/WEB-INF/classes. Add the following lines there to point to the correct directory:
    <configuration>
      <property>
        <name>searcher.dir</name>
        <value>/opt/nutch/crawl</value>
      </property>
    </configuration>
    

    Note that I have Nutch installed in /opt/nutch, and the index directory given to the crawl command is crawl.
    A pretty good article on how this works on Windows can be found at nutchinstall.blogspot.com.
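
If searches still come back empty after this change, a quick sanity check is to read searcher.dir back out of the deployed nutch-site.xml and confirm that the directory really exists. The following is only a minimal sketch of such a check in Python; the helper names read_searcher_dir and check_searcher_dir are my own and not part of Nutch, and the sketch assumes the file has the plain configuration/property/name/value layout shown above.

```python
import os
import xml.etree.ElementTree as ET


def read_searcher_dir(site_xml_path):
    """Return the searcher.dir value from a nutch-site.xml file, or None.

    Hypothetical helper: assumes <property> elements with <name> and
    <value> children, as in the snippet above.
    """
    root = ET.parse(site_xml_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "searcher.dir":
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


def check_searcher_dir(site_xml_path):
    """Return (configured_path, exists_as_directory) for searcher.dir."""
    path = read_searcher_dir(site_xml_path)
    return path, bool(path) and os.path.isdir(path)
```

Calling check_searcher_dir with the path to webapps/nutch/WEB-INF/classes/nutch-site.xml returns the configured directory together with a flag saying whether it exists on disk. A result of (None, False) means the property is missing, which matches the "search finds nothing" symptom described in pitfall 2.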
