Nutch hello world

Download and install Ant

Download and install Cygwin

Download HBase 0.94.14

Set JAVA_HOME in .bashrc
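
As a sketch, the JAVA_HOME entry in ~/.bashrc could look like the line below. The JDK path is a placeholder (an assumption, not taken from these notes); point it at the directory where the JDK is actually installed, written as a Cygwin-style /cygdrive path.

```shell
# Line for ~/.bashrc.
# ASSUMPTION: the JDK path is a placeholder; replace it with your real JDK directory.
export JAVA_HOME="/cygdrive/c/Java/jdk1.7.0"
```

Open a new Cygwin terminal (or run `source ~/.bashrc`) and check the value with `echo "$JAVA_HOME"`.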

Download a Nutch source package (apache-nutch-2.2.1 in these notes)

cd apache-nutch-2.2.1

Run ant

Now there is a directory runtime/local which contains a ready-to-use Nutch installation.

Customize your crawl properties

Add your agent name in the value field of the property in conf/nutch-site.xml, for example:
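
The property element itself is missing from these notes. Assuming the property meant is http.agent.name (the setting Nutch reads for the crawler's agent string), the sketch below prints a fragment that can be pasted inside the <configuration> element of conf/nutch-site.xml. The agent name "My Nutch Spider" is only a placeholder.

```shell
# Prints an XML <property> fragment for conf/nutch-site.xml.
# ASSUMPTIONS: the property name http.agent.name and the placeholder agent name.
AGENT_NAME="${AGENT_NAME:-My Nutch Spider}"

print_agent_property() {
  cat <<EOF
<property>
  <name>http.agent.name</name>
  <value>${AGENT_NAME}</value>
</property>
EOF
}

print_agent_property
```

Set AGENT_NAME before running the snippet to print the fragment with your own crawler name.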

Edit the file conf/regex-urlfilter.txt and replace the catch-all rule listed under the comment

accept anything else

with a regular expression matching the domain you wish to crawl. The replacement line should accept only URLs in that domain.
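
As a sketch of such a rule: in the stock file the catch-all rule should be `+.`, and a domain-limited replacement has the form held in FILTER_RULE below. The domain example.com is a placeholder, not the domain from the original notes. The loop only tries the rule against a few sample URLs with grep -E, which should agree with Nutch's Java regex matching for a pattern this simple.

```shell
# ASSUMPTION: example.com is a placeholder domain; substitute your own.
# A rule line is '+' (accept) or '-' (reject) followed by a regular expression.
FILTER_RULE='+^http://([a-z0-9]*\.)*example\.com/'

# Succeeds if the URL matches the regex part of FILTER_RULE.
# ${FILTER_RULE#?} drops the leading '+' so only the regex is passed to grep.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "${FILTER_RULE#?}"
}

# Try the rule against a few sample URLs.
for url in http://example.com/ http://www.example.com/docs/ http://other.org/; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The line to put in conf/regex-urlfilter.txt is the value of FILTER_RULE, including the leading `+`. As written the rule matches only http:// URLs; change the prefix to `^https?://` if https pages should be crawled too.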


Specify the GORA backend in $NUTCH_HOME/conf/nutch-site.xml

  • Ensure the HBase gora-hbase dependency is available in $NUTCH_HOME/ivy/ivy.xml

  • Ensure that HBaseStore is set as the default datastore in the configuration under $NUTCH_HOME/conf/. Further documentation for HBaseStore is available from the Apache Gora project.
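
A minimal sketch for the nutch-site.xml part of this step. The property name storage.datastore.default and the class org.apache.gora.hbase.store.HBaseStore should be the values Nutch 2.x expects, but neither is spelled out in these notes, so treat both as assumptions and confirm them against the Nutch and Gora documentation. The first function prints a fragment to paste inside <configuration>; the second is a plain text check that a file mentions the class.

```shell
# ASSUMPTIONS: the property name and class name below; verify against the Nutch/Gora docs.
HBASE_STORE='org.apache.gora.hbase.store.HBaseStore'

# Prints the <property> fragment for $NUTCH_HOME/conf/nutch-site.xml.
print_storage_property() {
  cat <<EOF
<property>
  <name>storage.datastore.default</name>
  <value>${HBASE_STORE}</value>
</property>
EOF
}

# Succeeds if the given file mentions the HBaseStore class.
# Plain text match only: a commented-out entry also counts.
mentions_hbase_store() {
  grep -qF "$HBASE_STORE" "$1"
}

print_storage_property
```

After editing, `mentions_hbase_store "$NUTCH_HOME/conf/nutch-site.xml" && echo configured` gives a quick sanity check before rebuilding.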

Run ant runtime

Configure SSH for Cygwin

Start HBase
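
A sketch of starting HBase from the Cygwin shell with the start-hbase.sh script that ships in HBase's bin directory. HBASE_HOME is an assumed variable pointing at the unpacked HBase 0.94.14 directory; these notes do not say where HBase was unpacked.

```shell
# Starts HBase from the directory given as $1 (default: $HBASE_HOME).
# ASSUMPTION: HBASE_HOME points at the unpacked HBase directory.
start_hbase() {
  hbase_dir="${1:-${HBASE_HOME:-}}"
  if [ -x "$hbase_dir/bin/start-hbase.sh" ]; then
    "$hbase_dir/bin/start-hbase.sh"
  else
    echo "start-hbase.sh not found under '$hbase_dir/bin'" >&2
    return 1
  fi
}
```

Call `start_hbase` once HBASE_HOME is set, or pass the directory explicitly, for example `start_hbase ~/hbase-0.94.14` (a hypothetical location). The matching bin/stop-hbase.sh script shuts HBase down again.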