Irchiver FAQ

1: So, what do you want?

We want to develop a useful open-source tool to aid collaborative work, that is, to make often chaotic IRC discussions more useful. But we can't develop the tool without some real-life data, i.e. actual IRC discussions.

2: Do you have any ideas how to collect the test data set while respecting our privacy?

Yes. We have a suggestion:

You'll always see the presence of the bots.
We collect data for a limited amount of time, say 14 days.
You can opt out. The bots come from one IP address, which is easy to ban.
You can give us a list of channels beforehand that shouldn't be logged.
We filter out all identity data, meaning hostnames and nicks.
We don't distribute the data, but we may show Google-like snippets of discussions as search results.
The logging period will be made explicit.
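
As a rough illustration of the channel opt-out list and the identity filtering mentioned above, here is a minimal Python sketch. It is not the project's actual code. The log-line layout ("<nick!user@host> text"), the EXCLUDED_CHANNELS set and the anonymize function are all assumptions made for this example. It drops the nick and host mask at the start of each line and replaces mentions of known nicks inside messages; hostnames typed inside a message body would need extra handling.

```python
import re

# Hypothetical opt-out list; the channel name is a placeholder.
EXCLUDED_CHANNELS = {"#private-example"}

# Assumed log-line layout: "<nick!user@host> message text".
LINE_RE = re.compile(r"^<(?P<nick>[^!>\s]+)(?:![^>\s]+)?>\s?(?P<text>.*)$")


def anonymize(channel, lines):
    """Return message texts without nicks or host masks ([] for excluded channels)."""
    if channel.lower() in EXCLUDED_CHANNELS:
        return []
    # Lines that do not match the assumed layout are skipped.
    parsed = [m for m in (LINE_RE.match(line) for line in lines) if m]
    nicks = {m.group("nick") for m in parsed}
    cleaned = []
    for m in parsed:
        text = m.group("text")
        # Replace mentions of known nicks inside the message body.
        for nick in nicks:
            pattern = r"(?<!\w)" + re.escape(nick) + r"(?!\w)"
            text = re.sub(pattern, "<nick>", text, flags=re.IGNORECASE)
        cleaned.append(text)
    return cleaned
```

For example, calling anonymize("#dev", ...) on two lines from alice and bob keeps only the message texts, with each mention of the other speaker replaced by "<nick>".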

At present (18th Nov 2003) we are collecting a test data set, which will be handled CONFIDENTIALLY. We will not use this data in public. It will be used only to evaluate our statistical models.

3: What are you going to do with the data?

We will run it through our text processing system, build the statistical models and start tweaking them. Eventually we hope to provide a publicly available prototype system that lets you search this one static data set.

4: Searches like what?

It depends on what works and what doesn't. Most likely you could type a query or give an example document, and the system would return IRC channels with matching topics. Or you could see how the topics on selected channels change over time.

Search would be based entirely on the collected test data (with nicks removed etc.), and we would not use or collect any other information for this purpose.
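
To give a feel for query-to-channel matching, here is a small Python sketch. The FAQ does not say which statistical models will be used, so this substitutes plain bag-of-words cosine similarity as a stand-in. The function names and the idea of one text blob per channel are assumptions made for this example only.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_channels(query, channel_texts):
    """Return (channel, score) pairs, best match first."""
    q = term_counts(query)
    scores = [(name, cosine(q, term_counts(text)))
              for name, text in channel_texts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

The query can be a few keywords or a whole example document, since both are reduced to term counts in the same way. A real system would probably weight rare terms more heavily.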

5: What then?

Even before publishing the prototype system we will provide public CVS access etc. Then we will see whether this kind of service would be useful. After that, everyone could use the code for their own purposes, and maybe we could even try to arrange a bigger, even real-time, reference system. We will never publish the data we've collected, only the code.