HooWWWer

Welcome! You're probably here because you saw HooWWWer in your site's logs. Our show tonight includes:

News

What's this HooWWWer thing?

HooWWWer is a web crawler. Crawlers download documents from web servers, store those deemed interesting, follow the links on those pages to new documents, and so the cycle goes on... We're running it to feed our search engine research.
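
For the curious, the download/store/follow cycle can be sketched in a few lines of Python. This is only a toy illustration and not HooWWWer's code (HooWWWer is written in C and distributed). The names `TOY_WEB`, `crawl`, `fetch` and `is_interesting` are made up for this example, and the "web" is an in-memory dictionary so that nothing touches the network.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("search engines and crawlers",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("cat pictures", ["http://a.example/"]),
    "http://c.example/": ("crawler research notes", ["http://d.example/"]),
}


def crawl(seed, fetch, is_interesting, max_pages=100):
    """Breadth-first crawl starting from seed; returns the stored pages."""
    frontier = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}              # URLs already queued, so none is queued twice
    stored = {}                # URL -> text of the pages deemed interesting
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        page = fetch(url)      # "download" the document
        fetched += 1
        if page is None:       # missing or unreachable page: skip it
            continue
        text, links = page
        if is_interesting(text):
            stored[url] = text
        for link in links:     # follow links to new documents
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored


stored = crawl("http://a.example/", TOY_WEB.get, lambda text: "crawler" in text)
```

A real crawler also has to deal with politeness delays, robots.txt, malformed pages and much more, which is where most of the work goes.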

HooWWWer is a fast distributed crawler, written in C and developed under Linux. It will be released under the GPL in the near future. I still feel like adding some stuff and getting more test mileage to iron the bugs out. Meanwhile, if you need to run a crawler, or just want to play with one, take a look at larbin, for instance. If you aren't afraid of C code, tweaking Makefiles, compiling stuff from sources and so on, send me an email and I'll throw you a tarball.

There's this annoying thing it keeps doing...

Oops, sorry about that. Please let me know what goes wrong so that I can fix it. Bugs are one issue, another is coping with every strange thing out there. I'm imperfect in many ways :)

My intention isn't to annoy anyone. It's in everybody's interest to have robots behaving nicely. If you're annoyed, I'm sorry for that, but please consider the following issues.

Above all, I'm worried about the "sane manner" part. Insane robots should be taken offline immediately for repairs. If you are still annoyed, please read "How do I ban HooWWWer from my site?" below.

How do I ban HooWWWer from my site?

If you run a site named www.example.com, put the following lines in http://www.example.com/robots.txt:

User-agent: HooWWWer
Disallow: /

For more information, please look at the robot exclusion standard.
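
If you'd like to check the rule before deploying it, here is a small sketch using Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the bot name `SomeOtherBot` are made up for this example, and the user-agent token is assumed to be plain `HooWWWer`, as in the lines above.

```python
from urllib.robotparser import RobotFileParser

# The two lines from above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: HooWWWer
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text, no network access
    return parser.can_fetch(user_agent, url)


hoowwwer_ok = is_allowed(ROBOTS_TXT, "HooWWWer", "http://www.example.com/index.html")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "http://www.example.com/index.html")
```

With these two lines `hoowwwer_ok` comes out `False` and `other_ok` comes out `True`: the rule bans HooWWWer from the whole site and leaves other robots alone.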

How to reach you?

If the topic is search-related in general, please take a look at the contact page. For HooWWWer/crawling issues, please email crawler-info at hiit.FI.

Take a peek at HooWWWer

hoowwwer_3.0.3_beta.tgz

Note that there's hardly any documentation. Contact me with questions and I'll write some stuff on a need-to-know basis...
