Two major Java-based projects offer a web crawler implementation: Nutch and Heritrix. Nutch is an Apache Lucene subproject. Heritrix is the Internet Archive’s open source web…
Mostly, I'm an experimenter trying out random stuff. My varied interests include technology, the web, history, books, travel, movies, and the behavioral sciences.
Disclaimer: All opinions expressed here are my own and not those of my employer.