From irc.freenode.net #searchengine, 2003-10-29
All times are in UTC.

19:01:18 --- lilo has changed the topic to: Discussion of issues surrounding this research project: http://cosco.hiit.fi/irchiver/ .... the researcher will describe the project in a moment.
19:01:53 <lilo> okay
19:02:18 <tuulos> First: Hi everyone
19:02:25 <lilo> let me introduce Ville H Tuulos with the Complex Systems Computation Group at Helsinki Institute for Information Technology
19:02:42 <lilo> he's the researcher, and he'll describe what the project is about
19:02:53 <tuulos> I'll try to be brief
19:03:07 <tuulos> Ok. So.
19:03:32 <tuulos> Imagine a search engine for IRC, like Google
19:04:02 <tuulos> the problem with Google-like approach is that it depends heavily on link information on Web (PageRank) 
19:04:15 <tuulos> and you don't have it with IRC, of course
19:04:47 <tuulos> moreover IRC is highly dynamic compared to Web - discussion topics etc change all the time
19:05:19 <tuulos> so having just a keyword search is not enough
19:06:00 <tuulos> what we're doing is that we have various statistical models which try to capture changing themes in natural language, documents, chat, whatever
19:06:32 <tuulos> if you're in this field, words MPCA or ICA might say something to you
19:07:06 <tuulos> but as with statistics always, it needs huge amounts of data to behave nicely
19:07:23 <tuulos> and we're talking about *really* huge amount here, not just 100Mb
19:08:06 <tuulos> we have just ordered a SAN system having 1.5 teras of disc space, to start with
19:08:18 <tuulos> so this is the scale..
19:08:47 <tuulos> so we can't play with some artificial or small domain data
19:09:23 <tuulos> we really have to work with real life data, as it's the only way to get hands on enough data
19:09:46 <tuulos> like Google.. PageRank works lousily on small nets.
19:10:10 <tuulos> ok. This is the first part. Huge amounts of data.
19:10:47 <tuulos> the second thing is that as you might guess our models fail with crap
19:11:17 <tuulos> meaning, pure line noise
19:11:36 <tuulos> as IRC discussions sometimes are on some other networks / channels
19:12:24 <tuulos> Freenode is perfect with this respect: There's enough people (as you can see :) and the discussions aren't typically line noise
19:12:55 <tuulos> ok.. Let me hype a bit so this won't be just technicalities
19:13:18 <tuulos> what we would eventually like to have is a system which for example:
19:14:05 <tuulos> a) You could type a query, e.g. a linux related question and the system shows where there are discussions like this 
19:14:38 <tuulos> b) system could show evolution of topics real-time i.e. how the themes change
19:15:29 <tuulos> c) Having eventually multiple IRC networks, you could see how the topics spread over the world
19:16:41 <tuulos> d) I bet this one raises much discussion: You could see topics per user i.e. what are the main areas of expertise etc. of a nick.
19:17:05 <tuulos> I'm sure that you got the beef and you can imagine the rest (:
19:17:20 <lilo> okay, let's see
19:17:26 <tuulos> What we're doing will be totally open source. GPL'ed.
19:17:26 <lilo> so people are going to have questions
19:18:07 <lilo> okay, beginning to get questions
19:18:14 <lilo> tuulos: should we start with the basic questions?
19:18:24 <tuulos> ok
19:18:41 <lilo> okay, folks
19:19:10 <lilo> basic questions about how the information is captured, opt-in or opt-out, what channels are covered, who will see the information initially and eventually
19:19:13 <lilo> can we start there?
19:19:16 <lilo> stuff of that sort
19:19:27 <lilo> let me start going through the messages
19:19:58 <lilo> Q. will all channels be logged ?
19:21:03 <tuulos> I presented you the idea. Then it's up to you whether we'd like to have this kind of a system. And in which extent.
19:21:25 <tuulos> but short answer: probably not
19:21:50 <tuulos> ..and definitely you should be always able to protect your privacy
19:21:54 <lilo> tuulos: let's see
19:22:07 <tuulos> we are talking here about a public service, not a spying machine
19:22:20 <lilo> could we have individual channels opt-in?
19:22:59 <lilo> and if so, would it be possible to have individual users opt out?
19:23:33 <tuulos> The system should always respect privacy.
19:24:00 <lilo> is it *feasible* technically to have individual users opt out?  would that create problems in data collection?
19:24:24 <tuulos> yes. Sure. Consider a snapshot of IRC traffic. You can do grep -v tuulos etc.
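
The `grep -v tuulos` idea above could be sketched roughly as follows. A plain `grep -v` would also drop lines where someone else merely mentions the nick, so this sketch filters on the speaker instead. The function name `drop_opted_out`, the `SPEAKER` pattern, and the assumption that logs follow the layout of this transcript are illustrative, not part of the project.

```python
import re

# Matches the speaker of a message ("<nick>") or an action ("* nick") in the
# log layout used by this transcript; other layouts would need another pattern.
SPEAKER = re.compile(r"^\d{2}:\d{2}:\d{2} (?:<([^>]+)>|\* (\S+))")

def drop_opted_out(lines, opted_out):
    """Return the log lines whose speaker is not in the opted_out set."""
    blocked = {nick.lower() for nick in opted_out}
    kept = []
    for line in lines:
        match = SPEAKER.match(line)
        speaker = (match.group(1) or match.group(2)) if match else None
        # Lines without a recognizable speaker (topic or mode changes) are kept.
        if speaker is None or speaker.lower() not in blocked:
            kept.append(line)
    return kept

log = [
    "19:02:18 <tuulos> First: Hi everyone",
    "19:18:14 <lilo> tuulos: should we start with the basic questions?",
    "19:39:43 * lilo agrees",
]
filtered = drop_opted_out(log, {"tuulos"})
```

Note that the second line survives even though it addresses the opted-out nick, so a stricter opt-out would also have to scrub mentions from other people's messages.
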
19:25:18 <lilo> would user hosts be retained, or nicks only?
19:25:24 <lilo> or would nicks be expunged as well?
19:25:45 <tuulos> We're not interested in individual users
19:26:02 <tuulos> so we'll not save hosts or anything else personal, except nicks.
19:26:39 <tuulos> the situation is somehow the same as with DejaNews / Usenet
19:27:27 <lilo> hmmm
19:27:38 <dmwaters> how will you let channel founders decide whether to be involved or not?
19:27:38 <lilo> Q. spammers find ways to trick the systems, how can you prevent spammers from flooding irc to try to game the system?
19:28:23 <tuulos> well.. It's an arms race.
19:28:48 <lilo> so you'll be working on that throughout the project?
19:28:55 <tuulos> The point is anyway that the bigger the system (amount of data) the more difficult it's to trick
19:29:08 <tuulos> Faking PageRank is not trivial, for example
19:29:41 <tuulos> I guess we have to. We can't keep the system usable otherwise
19:30:00 <lilo> Q. will there be (or is there) a posted privacy policy?
19:30:14 <tuulos> no. We are here discussing about that (:
19:30:21 <tuulos> it's up to you
19:31:03 <lilo> I guess it sounds like the user asking the question is wondering if there'll be a general privacy policy w/r to the information you're logging
19:31:13 <lilo> i.e., how it will be used and what will be archived
19:31:54 <tuulos> One option is that we restrict this service to a limited area e.g. to some network / channels.
19:32:17 <tuulos> that makes sure that you can make the decision whether to be logged or not
19:32:23 <lilo> can an individual user opt out across the network?
19:33:07 <tuulos> well.. It disturbs the data.
19:33:44 <tuulos> You wouldn't get the whole flow of discussion in that case
19:34:14 <tuulos> so I guess that a channel-based system is better
19:35:10 <lilo> let's see
19:35:48 <lilo> Q. Another question:  Will "expertise" info, et cetera, be bound to nickserv IDs, or individual nicks?  Will all of my nicks be counted as me, or will tanuki be distinct from Phoon be distinct from Angus be distinct from tanuki_...
19:36:17 <tuulos> a tricky one.
19:36:54 <tuulos> how can you see that tanuki_ and tanuki are the same but tanuki and tanuko are not?
19:37:02 <tuulos> edit distance is the same
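
The point about edit distance can be checked with a small sketch. It uses the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1); the function name `edit_distance` is illustrative, and the nicks are the ones from the question above.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

with_underscore = edit_distance("tanuki", "tanuki_")  # one insertion
with_typo = edit_distance("tanuki", "tanuko")         # one substitution
```

Both pairs come out at the same distance, so string similarity alone cannot tell an alternate nick of the same person from a different person with a similar nick.
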
19:37:10 <lilo> that information is available from services, but I think you'd have to have a link into services
19:37:10 <tuulos> well. We have to think about that.
19:37:16 <lilo> the current services don't make that easy
19:37:21 <lilo> probably the new ones would make it easier
19:37:29 <lilo> Q. won't it, even if it respects the privacy of everyone, possibly be a huge source of info for social engineers and others?
19:37:44 <tuulos> but the first phase is anyway to have a channel-level search engine, which doesn't pay attention to individual nicks
19:38:13 <tuulos> well. These are public chats.
19:38:34 <lilo> it *is* worth noting that anyone on any client whatsoever can do what you propose to do, without asking
19:38:43 <lilo> it's not typically done
19:38:51 <tuulos> anyway you should pay attention to what you say here
19:39:26 <tuulos> and please note, that even though everyone says no to this idea, eventually really nasty people will do it by themselves, without asking
19:39:42 <lilo> that's a side issue, but definitely worth keeping in mind
19:39:43 * lilo agrees
19:40:21 * lilo reads message traffic again
19:41:14 <lilo> Q. what stage is this project in?  is there a peer-reviewed paper or a "state of the project" update available online?
19:41:30 <tuulos> http://cosco.hiit.fi/search/
19:41:54 <tuulos> We are currently updating the pages, sorry about lack of fanciness etc (:
19:43:37 <tuulos> we have the first models ready and we have e.g. harvested the whole Amazon.com and built models on that data
19:43:37 <tuulos> so we're not in a toy system stage
19:43:39 <lilo> Q. would the logs of conversations you capture be available themselves, or would it just be meta-data and channel pointers?
19:43:39 <tuulos> and some of you might know www.nutch.org -open source search engine
19:43:39 <lilo> oic
19:43:42 <tuulos> we have discussed with them about cooperation
19:44:24 <tuulos> we need raw data at first
19:44:58 <lilo> would this project produce an openly-available search page?  or would access be restricted to researchers?
19:45:01 <tuulos> meaning basically snapshot from IRC protocol traffic
19:45:47 <tuulos> well, making a global public search engine is not trivial. At first the code will be available. We aim at publishing something in early 2004.
19:46:16 <tuulos> but eventually of course we're interested in providing a reference service by ourselves
19:46:16 <tuulos> depending on the support we get etc.
19:47:23 <tuulos> definitely we want to make this a community-based project. Not just something academical which nobody uses.
19:48:31 <lilo> one thing I can do for you is to try to package up some of the privacy comments, tuulos 
19:48:43 <tuulos> thanks
19:48:49 <lilo> I've gotten a fairly large number of them, but in most cases there's no question attached to them
19:49:18 <lilo> they'll give you a feel for the flavor of the culture
19:49:56 <tuulos> I'm eager to hear your opinions
19:50:00 <lilo> I will comment that I have no idea at this point whether this is something that could be done here; my thought is just that tuulos should have some input as to some of the questions people will have, and the concerns, if he's going to do something of this sort
19:50:21 <lilo> if he does it here, or elsewhere, it needs to be set up in a way that people will be comfortable with
19:50:31 <lilo> so I hope this will be helpful in that respect
19:50:40 <lilo> let's see if I can find a few more questions
19:50:50 <tuulos> I totally agree with you.
19:51:32 <tuulos> but still I hope that you see my point: This is a Linus-type approach. We do it and see what happens. Then we discuss and make it better. Not just whine about something abstract. 
19:52:03 <tuulos> Without spreading the bots this evening, I wouldn't be here. 
19:52:13 <tuulos> Now I feel lucky that I'm here (:
19:52:42 <tuulos> I have to copypaste you something..
19:52:46 <tuulos> Person who say it cannot be done should not interrupt person doing it. --Chinese Proverb
19:53:54 <lilo> okay, any more questions about how the information will be used, or possible concerns?  I think we should hit those
19:54:08 <lilo> there have been some questions about technical aspects, but not so many
19:54:18 <tuulos> interesting
19:54:26 <lilo> we can probably refer people with technical questions directly to you, tuulos, after this ends
19:54:37 <tuulos> so privacy seems to be the main concern. Quite understandable.
19:54:50 <tuulos> sure
19:55:25 <lilo> Q: is it possible to ask a removal of something said?
19:55:40 <lilo> (that's a good question in that it suggests a whole range of concerns)
19:55:52 <tuulos> Probably not. But then again, it doesn't make any difference in a statistical model.
19:56:12 <lilo> I should re-ask a previous question
19:56:23 <lilo> how much of this would be available publicly, through a search engine?
19:56:36 <lilo> conversations? or just pointers to channels and topics?
19:56:53 <lilo> would the raw data be publicly available, say missing user@host info?
19:56:58 * lilo is trying to get a feel for this
19:57:11 <tuulos> at first I guess that the system would show just the broad topics of discussions, not nicks, not individual comments
19:57:26 <tuulos> we could even restrict the system to this stage, if needed
19:58:13 <tuulos> so we could filter out everything individual during logging and save just the words
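
One way to read "save just the words" is sketched below: timestamps and speaker nicks are discarded at logging time and only word counts are kept. The names `words_only`, `MESSAGE`, and `WORD`, and the simple tokenizer, are assumptions for illustration and say nothing about how the project's own models consume text.

```python
import re
from collections import Counter

# An ordinary channel message in this transcript's layout: time, <nick>, text.
MESSAGE = re.compile(r"^\d{2}:\d{2}:\d{2} <[^>]+> (.*)$")
# Deliberately crude tokenizer: runs of lowercase letters and apostrophes.
WORD = re.compile(r"[a-z']+")

def words_only(lines):
    """Count words in message bodies, discarding timestamps and speaker nicks."""
    counts = Counter()
    for line in lines:
        match = MESSAGE.match(line)
        if match:  # topic changes, mode changes, and actions are skipped
            counts.update(WORD.findall(match.group(1).lower()))
    return counts

sample = [
    "19:03:32 <tuulos> Imagine a search engine for IRC, like Google",
    "19:01:18 --- lilo has changed the topic",
]
counts = words_only(sample)
```

A nick that appears inside a message body, for example when one user addresses another, would still be counted as a word, so this alone does not remove everything individual.
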
19:59:12 <tuulos> I'd like to provide raw data for anyone interested, at least for research purposes
19:59:25 <lilo> there might be concerns with that
19:59:30 <tuulos> Neither I nor anyone else should be privileged in this sense. It's not privacy.
19:59:32 <lilo> again, I want to emphasize two things for the people here
19:59:42 <tuulos> you shouldn't trust me (:
19:59:54 <lilo> (1) I see the cultural problems with doing this, so I'm not necessarily advocating our doing it
20:00:09 <lilo>     (it would take a lot of concerns being addressed)
20:00:40 <lilo> (2) Despite that, it's worth noting that this is something anyone can do pretty invisibly on a medium-sized or larger channel, tuulos came to us in a very obvious way
20:01:06 * lilo appreciates the fact that tuulos is here talking rather than just doing it
20:01:42 <lilo> I believe he'll listen to your concerns, and not do it at all if it doesn't seem workable *to the users* as a result of those concerns
20:01:57 <lilo> but that doesn't say that somebody else may not do this on some scale without consulting us
20:02:05 <lilo> and in a not terribly visible way
20:02:11 <lilo> both things are worth thinking about
20:02:32 <lilo> as regards (1), I don't want to say "you can do this" without knowing that it can be done in a way that people will be comfortable with
20:02:44 <lilo> so, just things for people to bear in mind
20:03:24 <lilo> okay....I'm going to open up the channel....I will reiterate that nobody is going to be *doing* this without a lot more study
20:03:30 <lilo> at least, not if they ask our permission 8)
20:03:34 <tuulos> For me it seems that the most practical way would be to restrict logging only to some channels.
20:03:38 * lilo nods
20:03:53 <lilo> deciding how to allow opt-in would be one necessary challenge
20:04:19 <tuulos> thanks for your attention. I'm eager to hear your opinions / ideas etc. (tuulos@...)
20:04:23 <lilo> okay, that's pretty much it....thanks to everyone for coming in and asking questions, and feel free to stick around
20:04:42 <lilo> *try to put tuulos at ease* when I -m you 8) get your concerns across, but remember there's only one of him, and a lot of us :)
20:04:45 * lilo waves
20:04:45 --- lilo sets mode -m #searchengine