Some websites offer no search facility at all, or return results that are unusable. So I looked around for alternatives, found Nutch and got it running.
Nutch is a subproject of Apache Lucene. Its task is twofold:
- Crawl certain specified webpages
- Provide a web application for the search
The first part is pretty straightforward and well documented. I decided to compile the sources myself from the Subversion trunk.
The crawl itself failed once because a file could not be found in the cache. Deleting the cache and recrawling solved this.
Setting up the web application and running it correctly is not as trivial as the tutorial suggests, since there are some pitfalls.
- If you have already installed Tomcat, you will not want to install nutch-x.y.war as ROOT.war. Instead, copy nutch-x.y.war into the webapps directory and rename it to nutch.war. Then start and stop Tomcat so that the web archive is extracted.
- The web application does not know the location of your crawled indices, so any search will find nothing. You have to specify the exact location in nutch-site.xml, which can be found under webapps/nutch/WEB-INF/classes. Add the following lines there to point to the correct directory:
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/opt/nutch/crawl</value>
  </property>
</configuration>
Note that I have Nutch installed in /opt/nutch and the index directory specified in the crawl command is crawl.
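If you would rather not edit nutch-site.xml by hand, the change above can be scripted. The following is only a minimal sketch in Python using the standard library. The function name set_searcher_dir is my own invention and not part of Nutch, and the sketch assumes the file has a plain <configuration> root as shown above. Be aware that rewriting the file this way drops any XML comments it contained.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def set_searcher_dir(site_xml: Path, crawl_dir: str) -> None:
    """Set (or add) the searcher.dir property in a nutch-site.xml file.

    Hypothetical helper, not part of Nutch. Comments in an existing
    file are not preserved by the default ElementTree parser.
    """
    if site_xml.exists():
        tree = ET.parse(str(site_xml))
        root = tree.getroot()
    else:
        # No file yet: start from an empty <configuration> element.
        root = ET.Element("configuration")
        tree = ET.ElementTree(root)

    # Update an existing searcher.dir property if there is one ...
    for prop in root.findall("property"):
        if prop.findtext("name") == "searcher.dir":
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = crawl_dir
            break
    else:
        # ... otherwise append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = "searcher.dir"
        ET.SubElement(prop, "value").text = crawl_dir

    tree.write(str(site_xml), encoding="utf-8", xml_declaration=True)
```

With my layout the call would be set_searcher_dir(Path("webapps/nutch/WEB-INF/classes/nutch-site.xml"), "/opt/nutch/crawl"), run from the Tomcat directory while Tomcat is stopped.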
A pretty good article on how this works on Windows can be found at nutchinstall.blogspot.com.