Nutch Resource
Crawling
Configuration of Nutch Crawler
See also http://lucene.apache.org/nutch/tutorial8.html for more information.
- URLs to start with (seed URLs):
  - e.g. nutch-0.8.x/url/yanel-website.txt (http://yanel.wyona.org/)
  - e.g. nutch-0.8.x/url/yulup-website.txt (http://www.yulup.org/)
- The scope of the crawl, i.e. which URLs are parsed and followed (IMPORTANT: both files below need an "accept hosts" entry):
  - nutch-0.8.x/conf/crawl-urlfilter.txt (+^http://yanel.wyona.org/)
  - nutch-0.8.x/conf/regex-urlfilter.txt (+^http://yanel.wyona.org/)
- Depth of crawling: set in crawl.sh (e.g. DEPTH=5)
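Because a missing "accept hosts" entry in either filter file is easy to overlook, a small check can help before starting a crawl. The following is only a sketch: the function names has_accept_rule and check_filters are hypothetical helpers, not part of Nutch, and the check is a literal line match, so it will not recognise an equivalent rule written as a broader regular expression.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): report whether both URL filter
# files contain a literal accept rule for a given URL prefix.

has_accept_rule() {
    # $1 = filter file, $2 = URL prefix, e.g. http://yanel.wyona.org/
    # -F treats the rule as a fixed string, -x requires a whole-line match.
    grep -qxF -- "+^$2" "$1" 2>/dev/null
}

check_filters() {
    # $1 = Nutch directory (e.g. nutch-0.8.x), $2 = URL prefix
    for f in "$1/conf/crawl-urlfilter.txt" "$1/conf/regex-urlfilter.txt"; do
        if has_accept_rule "$f" "$2"; then
            echo "ok: $f"
        else
            echo "missing: $f"
        fi
    done
}
```

After sourcing these functions, a call such as `check_filters nutch-0.8.x http://yanel.wyona.org/` prints one line per filter file. Note that the check only tests whether the rule is present, not where it sits in the file. Filter rules are applied in order, so the accept entry presumably needs to come before any catch-all reject rule.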
Running Nutch Crawler
- sh crawl.sh
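The contents of crawl.sh are not shown on this page. As a rough illustration only, a wrapper of this kind might look like the sketch below, assuming it calls the `bin/nutch crawl` command described in the Nutch tutorial. The variables NUTCH_HOME, URL_DIR and CRAWL_DIR and both function names are assumptions made for this example; only DEPTH is taken from the configuration above.

```shell
#!/bin/sh
# Sketch of a crawl.sh-style wrapper; NOT the actual crawl.sh of this setup.
NUTCH_HOME="${NUTCH_HOME:-nutch-0.8.x}"   # assumed Nutch install directory
URL_DIR="${URL_DIR:-url}"                 # directory holding the seed files
CRAWL_DIR="${CRAWL_DIR:-crawl}"           # hypothetical output directory
DEPTH="${DEPTH:-5}"                       # number of link levels to follow

crawl_command() {
    # Print the command instead of running it, so it can be inspected first.
    echo "bin/nutch crawl $URL_DIR -dir $CRAWL_DIR -depth $DEPTH"
}

run_crawl() {
    # Run from the Nutch directory so that bin/nutch and conf/ are found.
    cd "$NUTCH_HOME" && bin/nutch crawl "$URL_DIR" -dir "$CRAWL_DIR" -depth "$DEPTH"
}

# Dry run by default; call run_crawl instead to start the actual crawl.
crawl_command
```

Printing the command first makes it easy to confirm the depth and directories before a potentially long crawl is started.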