From irc.freenode.net #searchengine, 2003-10-29
All times are in UTC.

19:01:18 --- lilo has changed the topic to: Discussion of issues surrounding this research project: http://cosco.hiit.fi/irchiver/ .... the researcher will describe the project in a moment.
19:01:53 okay
19:02:18 First: Hi everyone
19:02:25 let me introduce Ville H Tuulos with the Complex Systems Computation Group at Helsinki Institute for Information Technology
19:02:42 he's the researcher, and he'll describe what the project is about
19:02:53 I'll try to be brief
19:03:07 Ok. So.
19:03:32 Imagine a search engine for IRC, like Google
19:04:02 the problem with Google-like approach is that it depends heavily on link information on Web (PageRank)
19:04:15 and you don't have it with IRC, of course
19:04:47 moreover IRC is highly dynamic compared to Web - discussion topics etc change all the time
19:05:19 so having just a keyword search is not enough
19:06:00 what we're doing is that we have various statistical models which try to capture changing themes in natural language, documents, chat, whatever
19:06:32 if you're in this field, words MPCA or ICA might say something to you
19:07:06 but as with statistics always, it needs huge amounts of data to behave nicely
19:07:23 and we're talking about *really* huge amount here, not just 100Mb
19:08:06 we have just ordered a SAN system having 1.5 teras of disc space, to start with
19:08:18 so this is the scale..
19:08:47 so we can't play with some artificial or small domain data
19:09:23 we really have to work with real life data, as it's the only way to get hands on enough data
19:09:46 like Google.. PageRank works lousily on small nets.
19:10:10 ok. This is the first part. Huge amounts of data.
19:10:47 the second thing is that as you might guess our models fail with crap
19:11:17 meaning, pure line noise
19:11:36 as IRC discussions sometimes are on some other networks / channels
19:12:24 Freenode is perfect with this respect: There's enough people (as you can see :) and the discussions aren't typically line noise
19:12:55 ok.. Let me hype a bit so this won't be just technicalities
19:13:18 what we would eventually like to have is a system which for example:
19:14:05 a) You could type a query, e.g. a linux related question and the system shows where there are discussions like this
19:14:38 b) system could show evolution of topics real-time i.e. how the themes change
19:15:29 c) Having eventually multiple IRC networks, you could see how the topics spread over the world
19:16:41 d) I bet this one raises much discussion: You could see topics per user i.e. what are the main areas of expertise etc. of a nick.
19:17:05 I'm sure that you got the beef and you can imagine the rest (:
19:17:20 okay, let's see
19:17:26 What we're doing will be totally open source. GPL'ed.
19:17:26 so people are going to have questions
19:18:07 okay, beginning to get questions
19:18:14 tuulos: should we start with the basic questions?
19:18:24 ok
19:18:41 okay, folks
19:19:10 basic questions about how the information is captured, opt-in or opt-out, what channels are covered, who will see the information initially and eventually
19:19:13 can we start there?
19:19:16 stuff of that sort
19:19:27 let me start going through the messages
19:19:58 Q. will all channels be logged?
19:21:03 I presented you the idea. Then it's up to you whether we'd like to have this kind of a system. And to what extent.
19:21:25 but short answer: probably not
19:21:50 ..and definitely you should be always able to protect your privacy
19:21:54 tuulos: let's see
19:22:07 we are talking here about a public service, not a spying machine
19:22:20 could we have individual channels opt-in?
19:22:59 and if so, would it be possible to have individual users opt out?
19:23:33 The system should always respect privacy.
19:24:00 is it *feasible* technically to have individual users opt out? would that create problems in data collection?
19:24:24 yes. Sure. Consider a snapshot of IRC traffic. You can do grep -v tuulos etc.
19:25:18 would user hosts be retained, or nicks only?
19:25:24 or would nicks be expunged as well?
19:25:45 We're not interested in individual users
19:26:02 so we'll not save hosts or anything else personal, except nicks.
19:26:39 the situation is somehow the same as with DejaNews / Usenet
19:27:27 hmmm
19:27:38 how will you let channel founders decide whether to be involved or not?
19:27:38 Q. spammers find ways to trick the systems, how can you prevent spammers from flooding irc to try to game the system?
19:28:23 well.. It's an arms race.
19:28:48 so you'll be working on that throughout the project?
19:28:55 The point is anyway that the bigger the system (amount of data) the more difficult it is to trick
19:29:08 Faking PageRank is not trivial, for example
19:29:41 I guess we have to. We can't keep the system usable otherwise
19:30:00 Q. will there be (or is there) a posted privacy policy?
19:30:14 no. We are here discussing about that (:
19:30:21 it's up to you
19:31:03 I guess it sounds like the user asking the question is wondering if there'll be a general privacy policy w/r to the information you're logging
19:31:13 i.e., how it will be used and what will be archived
19:31:54 One option is that we restrict this service to a limited area e.g. to some network / channels.
19:32:17 that makes sure that you can make the decision whether to be logged or not
19:32:23 can an individual user opt out across the network?
19:33:07 well.. It disturbs the data.
19:33:44 You wouldn't get the whole flow of discussion in that case
19:34:14 so I guess that a channel-based system is better
19:35:10 let's see
19:35:48 Q. Another question: Will "expertise" info, et cetera, be bound to nickserv IDs, or individual nicks? Will all of my nicks be counted as me, or will tanuki be distinct from Phoon be distinct from Angus be distinct from tanuki_...
19:36:17 a tricky one.
19:36:54 how can you see that tanuki_ and tanuki are the same but tanuki and tanuko are not?
19:37:02 edit distance is the same
19:37:10 that information is available from services, but I think you'd have to have a link into services
19:37:10 well. We have to think about that.
19:37:16 the current services don't make that easy
19:37:21 probably the new ones would make it easier
19:37:29 Q. won't it, even if it respects the privacy of everyone, possibly be a huge source of inf for social engineers and others?
19:37:44 but the first phase is anyway to have a channel-level search engine, which doesn't pay attention to individual nicks
19:38:13 well. These are public chats.
19:38:34 it *is* worth noting that anyone on any client whatsoever can do what you propose to do, without asking
19:38:43 it's not typically done
19:38:51 anyway you should pay attention to what you say here
19:39:26 and please note, that even though everyone says no to this idea, eventually really nasty people will do it by themselves, without asking
19:39:42 that's a side issue, but definitely worth keeping in mind
19:39:43 * lilo agrees
19:40:21 * lilo reads message traffic again
19:41:14 Q. what stage is this project in? is there a peer-reviewed paper or a "state of the project" update available online?
19:41:30 http://cosco.hiit.fi/search/
19:41:54 We are currently updating the pages, sorry about lack of fanciness etc (:
19:43:37 we have the first models ready and we have e.g. harvested the whole Amazon.com and built models on that data
19:43:37 so we're not in a toy system stage
19:43:39 Q. would the logs of conversations you capture be available themselves, or would it just be meta-data and channel pointers?
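
The point made at 19:36:54 and 19:37:02, that edit distance cannot tell an alternate nick from a different person, can be checked directly. The following is a minimal sketch using plain Levenshtein distance; the function name `edit_distance` is only illustrative and is not taken from the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("tanuki", "tanuki_"))  # 1
print(edit_distance("tanuki", "tanuko"))   # 1
```

Both pairs come out at distance 1, so string similarity alone cannot separate them; some outside source, such as the services link mentioned at 19:37:10, would be needed.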
19:43:39 and some of you might know www.nutch.org -open source search engine
19:43:39 oic
19:43:42 we have discussed with them about cooperation
19:44:24 we need raw data at first
19:44:58 would this project produce an openly-available search page? or would access be restricted to researchers?
19:45:01 meaning basically snapshot from IRC protocol traffic
19:45:47 well, making a global public search engine is not trivial. At first the code will be available. We aim at publishing something in early 2004.
19:46:16 but eventually of course we're interested in providing a reference service by ourselves
19:46:16 depending on the support we get etc.
19:47:23 definitely we want to make this a community-based project. Not just something academic which nobody uses.
19:48:31 one thing I can do for you is to try to package up some of the privacy comments, tuulos
19:48:43 thanks
19:48:49 I've gotten a fairly large number of them, but in most cases there's no question attached to them
19:49:18 they'll give you a feel for the flavor of the culture
19:49:56 I'm eager to hear your opinions
19:50:00 I will comment that I have no idea at this point whether this is something that could be done here; my thought is just that tuulos should have some input as to some of the questions people will have, and the concerns, if he's going to do something of this sort
19:50:21 if he does it here, or elsewhere, it needs to be set up in a way that people will be comfortable with
19:50:31 so I hope this will be helpful in that respect
19:50:40 let's see if I can find a few more questions
19:50:50 I totally agree with you.
19:51:32 but still I hope that you see my point: This is a Linus-type approach. We do it and see what happens. Then we discuss and make it better. Not just whine about something abstract.
19:52:03 Without spreading the bots this evening, I wouldn't be here.
19:52:13 Now I feel lucky that I'm here (:
19:52:42 I have to copypaste you something..
19:52:46 Person who say it cannot be done should not interrupt person doing it. --Chinese Proverb
19:53:54 okay, any more questions about how the information will be used, or possible concerns? I think we should hit those
19:54:08 there have been some questions about technical aspects, but not so many
19:54:18 interesting
19:54:26 we can probably refer people with technical questions directly to you, tuulos, after this ends
19:54:37 so privacy seems to be the main concern. Quite understandable.
19:54:50 sure
19:55:25 Q: is it possible to ask a removal of something said?
19:55:40 (that's a good question in that it suggests a whole range of concerns)
19:55:52 Probably not. But then again, it doesn't make any difference in a statistical model.
19:56:12 I should re-ask a previous question
19:56:23 how much of this would be available publicly, through a search engine?
19:56:36 conversations? or just pointers to channels and topics?
19:56:53 would the raw data be publicly available, say missing user@host info?
19:56:58 * lilo is trying to get a feel for this
19:57:11 at first I guess that the system would show just the broad topics of discussions, not nicks, not individual comments
19:57:26 we could even restrict the system to this stage, if needed
19:58:13 so we could filter out everything individual during logging and save just the words
19:59:12 I'd like to provide raw data for anyone interested, at least for research purposes
19:59:25 there might be concerns with that
19:59:30 I nor anyone else should be privileged in this sense. It's not privacy.
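
The idea at 19:58:13 of filtering out everything individual during logging, combined with the per-user opt-out discussed earlier, might look roughly like the sketch below. Everything here is an assumption for illustration: the line format `HH:MM:SS <nick> text`, the function name `words_only`, and the made-up sample lines with hypothetical nicks. None of it comes from the project's code.

```python
import re

# Assumed log line format: "HH:MM:SS <nick> message text" (hypothetical).
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) <([^>]+)> (.*)$")


def words_only(lines, opted_out=frozenset()):
    """Yield (timestamp, words) per message, with nicks removed.

    Messages from nicks in `opted_out` are dropped entirely.
    """
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip actions, mode changes, and anything unparsed
        timestamp, nick, text = m.groups()
        if nick.lower() in opted_out:
            continue
        yield timestamp, re.findall(r"[a-z0-9']+", text.lower())


# Made-up sample traffic with hypothetical nicks.
sample = [
    "19:24:24 <alice> how do I mount a usb disk?",
    "19:24:31 <bob> try mount /dev/sda1 /mnt",
    "19:24:40 * bob waves",
]

for ts, words in words_only(sample, opted_out={"bob"}):
    print(ts, words)
```

Unlike `grep -v nick`, which also drops other people's lines that merely mention the nick, this matches the speaker field only. It shares the drawback raised at 19:33:44: removing one speaker leaves gaps in the flow of the discussion.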
19:59:32 again, I want to emphasize two things for the people here
19:59:42 you shouldn't trust me (:
19:59:54 (1) I see the cultural problems with doing this, so I'm not necessarily advocating our doing it
20:00:09 (it would take a lot of concerns being addressed)
20:00:40 (2) Despite that, it's worth noting that this is something anyone can do pretty invisibly on a medium-sized or larger channel, tuulos came to us in a very obvious way
20:01:06 * lilo appreciates the fact that tuulos is here talking rather than just doing it
20:01:42 I believe he'll listen to your concerns, and not do it at all if it doesn't seem workable *to the users* as a result of those concerns
20:01:57 but that doesn't say that somebody else may not do this on some scale without consulting us
20:02:05 and in a not terribly visible way
20:02:11 both things are worth thinking about
20:02:32 as regards (1), I don't want to say "you can do this" without knowing that it can be done in a way that people will be comfortable with
20:02:44 so, just things for people to bear in mind
20:03:24 okay....I'm going to open up the channel....I will reiterate that nobody is going to be *doing* this without a lot more study
20:03:30 at least, not if they ask our permission 8)
20:03:34 For me it seems that the most practical way would be to restrict logging only to some channels.
20:03:38 * lilo nods
20:03:53 deciding how to allow opt-in would be one necessary challenge
20:04:19 thanks for your attention. I'm eager to hear your opinions / ideas etc. (tuulos@...)
20:04:23 okay, that's pretty much it....thanks to everyone for coming in and asking questions, and feel free to stick around
20:04:42 *try to put tuulos at ease* when I -m you 8) get your concerns across, but remember there's only one of him, and a lot of us :)
20:04:45 * lilo waves
20:04:45 --- lilo sets mode -m #searchengine