{"id":884,"date":"2010-04-10T19:13:09","date_gmt":"2010-04-10T18:13:09","guid":{"rendered":"http:\/\/sahits.ch\/blog\/?p=884"},"modified":"2010-04-10T19:13:09","modified_gmt":"2010-04-10T18:13:09","slug":"searching-with-lucenes-subproject-nutch","status":"publish","type":"post","link":"http:\/\/sahits.ch\/blog\/blog\/2010\/04\/10\/searching-with-lucenes-subproject-nutch\/","title":{"rendered":"Searching with Lucene's subproject Nutch"},"content":{"rendered":"<p>Some sites on the web do not provide any search facilities or their results are such that they are not usable. Therefore I was looking around for alternatives. I found <a href=\"http:\/\/lucene.apache.org\/nutch\/\">Nutch<\/a> and got it running.<br \/>\n<!--more--><br \/>\nNutch is a subproject of Apache Lucene. The task of Nutch is twofold:<\/p>\n<ol>\n<li>Crawl specified web pages<\/li>\n<li>Provide a web application for the search<\/li>\n<\/ol>\n<p>The first part is pretty straightforward and well <a href=\"http:\/\/lucene.apache.org\/nutch\/tutorial.html\">documented<\/a>. I decided to compile the sources myself from the Subversion trunk.<br \/>\nThe crawl itself failed once because a file was not found in the cache. Deleting the cache and recrawling solved this.<br \/>\nSetting up the web application and getting it to run correctly is not as trivial as the tutorial describes, since it has some pitfalls.<\/p>\n<ol>\n<li>If you have already installed Tomcat you will not want to install the nutch-x.y.war as ROOT.war. Instead copy the nutch-x.y.war into the webapps directory and rename it to nutch.war. Then start and stop Tomcat, so the web archive is extracted.<\/li>\n<li>The web application does not know where your crawled indices are, so any search will find nothing. You have to specify the exact location in nutch-site.xml, which can be found under webapps\/nutch\/WEB-INF\/classes. 
Add the following lines there to point to the correct directory:\n<pre>\r\n&lt;configuration&gt;\r\n  &lt;property&gt;\r\n    &lt;name&gt;searcher.dir&lt;\/name&gt;\r\n    &lt;value&gt;\/opt\/nutch\/crawl&lt;\/value&gt;\r\n  &lt;\/property&gt;\r\n&lt;\/configuration&gt;\r\n<\/pre>\n<p>Note that I have Nutch installed in \/opt\/nutch and the index directory specified in the crawl command is <code>crawl<\/code>.<br \/>\nA pretty good article on how this works on Windows can be found at <a href=\"http:\/\/nutchinstall.blogspot.com\/\">nutchinstall.blogspot.com<\/a>.<\/p>\n<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Some sites on the web do not provide any search facilities or their results are such that they are not usable. Therefore I was looking around for alternatives. I found Nutch and got it running.<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[138],"tags":[152,150,153,151,149],"class_list":["post-884","post","type-post","status-publish","format-standard","hentry","category-it","tag-apache","tag-crawl","tag-lucene","tag-nutch","tag-search"],"_links":{"self":[{"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/posts\/884","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/comments?post=884"}],"version-history":[{"count":4,"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/posts\/884\/revisions"}],"predecessor-version":[{"id":888,"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/posts\/884\/revisions\/888"}],"wp:attachment":[{"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/media?parent=884
"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/categories?post=884"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/sahits.ch\/blog\/wp-json\/wp\/v2\/tags?post=884"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}