Welcome! You're probably here because you saw HooWWWer in your site's logs. Our show tonight includes:
HooWWWer is a web crawler. Crawlers download documents from web servers, store those deemed interesting, follow links on the pages to new documents, and so the cycle goes on... We're running it to feed our search engine research.
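To make the cycle concrete, here is a minimal sketch in C of that download/store/follow loop. The fetch and link-extraction functions are canned stand-ins (this is not HooWWWer's actual code), and the URLs are made up for illustration.

    /* A minimal sketch of the crawl cycle: download, store, follow links.
     * Fetching and link extraction are stubbed with canned data -- this is
     * NOT HooWWWer's code, just an illustration of the loop's shape. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_URLS 64

    static const char *queue[MAX_URLS];   /* URLs waiting to be crawled */
    static int head = 0, tail = 0;

    static const char *seen[MAX_URLS];    /* URLs already enqueued */
    static int nseen = 0;

    /* Enqueue a URL unless we have seen it before. */
    static void enqueue(const char *url)
    {
        for (int i = 0; i < nseen; i++)
            if (strcmp(seen[i], url) == 0)
                return;
        if (nseen < MAX_URLS && tail < MAX_URLS) {
            seen[nseen++] = url;
            queue[tail++] = url;
        }
    }

    /* Stand-in for an HTTP fetch; a real crawler downloads the page here. */
    static const char *fetch(const char *url)
    {
        printf("fetching %s\n", url);
        return "<html>...</html>";
    }

    /* Stand-in for HTML link extraction; returns a canned link list. */
    static int extract_links(const char *doc, const char **links, int max)
    {
        static const char *canned[] = {
            "http://www.example.com/a.html",
            "http://www.example.com/b.html",
        };
        int n = 0;
        (void)doc;
        while (n < max && n < 2) {
            links[n] = canned[n];
            n++;
        }
        return n;
    }

    int main(void)
    {
        enqueue("http://www.example.com/");     /* the seed URL */

        while (head < tail) {                   /* the crawl cycle */
            const char *url = queue[head++];
            const char *doc = fetch(url);       /* 1. download     */
            printf("storing %s\n", url);        /* 2. store        */
            const char *links[8];               /* 3. follow links */
            int n = extract_links(doc, links, 8);
            for (int i = 0; i < n; i++)
                enqueue(links[i]);
        }
        return 0;
    }

A real crawler replaces the stubs with HTTP and HTML-parsing code, and replaces the linear seen-list with something that scales to millions of URLs.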
HooWWWer is a fast distributed crawler, written in C and developed under Linux. It will be released under the GPL in the near future. I still feel like adding some features, and getting more test kilometerage to iron out bugs. Meanwhile, if you need to run a crawler, or just want to play with one, take a look at larbin, for instance. If you aren't afraid of C code, tweaking Makefiles, compiling from sources and so on, send me an email and I'll throw you a tarball.
Oops, sorry about that. Please let me know what went wrong so that I can fix it. Bugs are one issue; another is coping with every strange thing out there. I'm imperfect in many ways :)
My intention isn't to annoy anyone. It's in everybody's interest to have robots behave nicely. If you're annoyed, I'm sorry, but please consider the following issues.
I'm, above all, worried about the "sane manner" part. Insane robots should be taken offline immediately for repairs. If you are still annoyed, please see "How do I ban HooWWWer from my site?" below.
If you run a site named www.example.com, add the following lines as
http://www.example.com/robots.txt:

    User-agent: HooWWWer
    Disallow: /
For more information, please look at the robot exclusion standard.
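For the curious, here is a simplified sketch in C of the check a polite crawler performs against rules like the one above. It is an illustration only, not HooWWWer's actual parser: real robots.txt handling also deals with wildcard agents, multiple records, blank-line separation, and so on.

    /* Simplified robots.txt check: if our User-agent matches a record and
     * the request path starts with a Disallow prefix, skip the URL.
     * This is NOT HooWWWer's parser; it ignores wildcard agents and other
     * details of the real exclusion standard. */
    #include <stdio.h>
    #include <string.h>

    /* Return 1 if `path` is disallowed for `agent` by the robots.txt text. */
    static int disallowed(const char *txt, const char *agent, const char *path)
    {
        char buf[1024];
        int ours = 0;                     /* inside a record naming us? */

        strncpy(buf, txt, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';

        for (char *line = strtok(buf, "\n"); line; line = strtok(NULL, "\n")) {
            if (strncmp(line, "User-agent:", 11) == 0) {
                ours = strstr(line + 11, agent) != NULL;
            } else if (ours && strncmp(line, "Disallow:", 9) == 0) {
                char *prefix = line + 9;
                while (*prefix == ' ')
                    prefix++;
                if (*prefix && strncmp(path, prefix, strlen(prefix)) == 0)
                    return 1;             /* path falls under the ban */
            }
        }
        return 0;
    }

    int main(void)
    {
        const char *robots = "User-agent: HooWWWer\nDisallow: /\n";
        printf("%s\n", disallowed(robots, "HooWWWer", "/index.html")
                           ? "banned, moving on" : "allowed");
        return 0;
    }

Run against the example record above, this reports every path as banned, which is exactly what "Disallow: /" asks for.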
If the topic is search-related in general, please take a look at the contact page. For HooWWWer/crawling issues, please email crawler-info at hiit.FI.
Note that there's hardly any documentation. Contact me with questions and I'll write some stuff on a need-to-know basis...