How can a specific user be identified in a long multi-user Internet chat log?
Here is an online programming contest we are planning to hold.
What are possible approaches to solving it?
From a random IRC (Internet Relay Chat) log, a small percentage of the user nicknames will be randomly deleted. The participant’s code must be able to fill in the missing user nicks. In other words, this event requires you to come up with an intelligent program that can figure out “who could have said what”.
It may be assumed that all communication will be in modern English, with or without punctuation.
For example -
Original chat:
...
<user1>: Hey!
<user2>: Hello! Where are you from, user1?
<user3>: Can anybody help me out with Gnome installation?
<user1>: India. user3, do you have the X Windows System installed?
<user2>: Cool. What is Gnome, user3?
<user3>: I don’t know. How do I check?
<user3>: Its a desktop environment, user2.
<user2>: Oh yeah! Just googled.
<user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
<user3>: Thanks!
<user5>: I’m root, obey me!
<user2>: Huh?!
<user3>: user2, you better start using Linux!
...
Only the following will be given to the participant.
Chat log with some nicks deleted:
...
<user1>: Hey!
<user2>: Hello! Where are you from, user1?
<user3>: Can anybody help me out with Gnome installation?
<user1>: India. user3, do you have the X Windows System installed?
<user2>: Cool. What is Gnome, user3?
<%%%>: I don’t know. How do I check?
<%%%>: Its a desktop environment, user2.
<user2>: Oh yeah! Just googled.
<user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
<user3>: Thanks!
<%%%>: I’m root, obey me!
<%%%>: Huh?!
<user3>: user2, you better start using Linux!
...
The participant’s code will have the task of replacing each "<%%%>" with the appropriate user nick. In ambiguous cases, like the random comment by <user5> in the above example (which could have been said by any other user too!), the code should indicate that the case is ambiguous.
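To make the input and output concrete, here is a minimal Python sketch of reading such a log and filling in the gaps. It assumes each line has the form `<nick>: message` with `<%%%>` marking a deleted nick, as in the example. The names `parse_log`, `fill_missing`, `score_fn`, the `margin` threshold and the `"?"` marker for ambiguous cases are all hypothetical choices for illustration, not part of the contest rules.

```python
import re

LINE_RE = re.compile(r"^<([^>]+)>:\s*(.*)$")
MISSING = "%%%"  # placeholder used in the log for a deleted nick


def parse_log(text):
    """Return a list of (nick_or_None, message); None marks a deleted nick."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            nick, message = match.groups()
            entries.append((None if nick == MISSING else nick, message))
    return entries


def fill_missing(entries, score_fn, margin=0.1):
    """Fill deleted nicks using score_fn(entries, index) -> {nick: score}.

    If the best score does not beat the runner-up by at least `margin`
    (an arbitrary threshold), the message is marked ambiguous with "?".
    """
    filled = []
    for index, (nick, message) in enumerate(entries):
        if nick is None:
            scores = score_fn(entries, index)
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            too_close = len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin
            nick = "?" if not ranked or too_close else ranked[0][0]
        filled.append((nick, message))
    return filled
```

Any scoring model can be plugged in as `score_fn`; the only contract assumed here is that higher scores mean a more likely author.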
Comments (2)
Two things spring to my mind: authorship attribution and chat disentanglement. Neither is exactly what you describe, but they both come pretty close.
Authorship attribution is the problem of determining which of a known set of authors wrote a particular document. Classic authorship attribution is typically applied to long texts (e.g. plays, novels, speeches), but people have been trying to do the same on shorter samples of text from internet sources. A good reference is probably anything written by Moshe Koppel with 'authorship' in the title, for example the recent paper Authorship Attribution in the Wild. The usual approach treats the task as document classification: bag-of-words features, computed over a set of what are usually thought of as stop words (e.g. as, of, the), are fed to a machine learning classifier. The problem here is that all of this work is on documents and does not take into account the conversational nature of IRC data.
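As a rough illustration of the stop-word idea, the sketch below builds a relative-frequency profile over a small, hand-picked stop-word list for each nick and attributes a message to the nick with the most similar profile. This is an assumption-laden toy: the `STOP_WORDS` list is arbitrary, and cosine similarity to a profile stands in for the machine learning classifier, so it is not the method of any particular paper.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list; a real system would use a longer one.
STOP_WORDS = ("a", "an", "and", "as", "at", "but", "do", "how", "i", "in",
              "is", "it", "of", "on", "the", "to", "what", "you")
STOP_SET = set(STOP_WORDS)


def stopword_vector(messages):
    """Relative frequency of each stop word across the given messages."""
    counts = Counter()
    for message in messages:
        tokens = re.findall(r"[a-z']+", message.lower())
        counts.update(tok for tok in tokens if tok in STOP_SET)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[word] / total for word in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute(message, profiles):
    """Return the nick whose stop-word profile is closest to the message."""
    vector = stopword_vector([message])
    return max(profiles, key=lambda nick: cosine(vector, profiles[nick]))
```

Single IRC lines contain very few stop words, so in practice it may help to pool several adjacent anonymous messages, or to combine this signal with others, before trusting the result.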
Chat disentanglement is the problem of identifying a number of coherent 'conversations' in chat data. This is quite a hard problem, as you often need to use the context of the conversation in order to know who is replying to whom. I imagine this kind of approach would be important to this task as well. For example, if the anonymised message is part of a conversation, then that limits the set of authors to the people in the conversation. I really only know about this from the paper Disentangling Chat by Elsner and Charniak. Their 'related work' section is a good overview of the field.
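The idea of limiting the author set can be sketched with a much cruder heuristic than real disentanglement. The hypothetical `candidate_authors` function below treats a fixed window of neighbouring messages as "the conversation", adds anyone addressed by nick in that window, and excludes nicks addressed in the anonymous message itself, on the assumption that people rarely address themselves. The window size and both rules are assumptions for illustration, not the Elsner and Charniak model.

```python
import re


def mentioned_nicks(text, nicks):
    """Known nicks that appear as whole words in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return {nick for nick in nicks if nick.lower() in words}


def candidate_authors(log, index, window=5):
    """Plausible authors for the anonymous message at log[index].

    log is a list of (nick_or_None, text) pairs; None marks a deleted nick.
    """
    known = {nick for nick, _ in log if nick is not None}
    start = max(0, index - window)
    end = min(len(log), index + window + 1)

    candidates = set()
    for position in range(start, end):
        nick, text = log[position]
        if nick is not None:
            candidates.add(nick)  # spoke recently
        if position != index:
            candidates |= mentioned_nicks(text, known)  # was addressed nearby

    # Assumption: a speaker does not address themselves by nick.
    candidates -= mentioned_nicks(log[index][1], known)
    return candidates
```

On the example log, the message "Its a desktop environment, user2." would have user2 removed from its candidate set, which already narrows the choice before any stylistic model is applied.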
One possible solution would be to take the Naive Bayes classifier 'spam filter' idea and see which words different nicks tend to use. Classify each unknown message by finding the user whose known messages use words 'most like' the ones in that message. The downside is that if the unknown user is using new words you haven't seen before (which is very likely), you'd need to understand higher-level context information.
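A minimal sketch of that idea in plain Python follows: a multinomial Naive Bayes model with one class per nick. The class name `NickNaiveBayes`, the tokenizer, and the add-one (Laplace) smoothing with an extra bucket for unseen words are my own assumptions; smoothing only keeps unseen words from zeroing out a score and does not solve the higher-level context problem.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NickNaiveBayes:
    """Multinomial Naive Bayes over words, one class per nick."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength
        self.word_counts = defaultdict(Counter)  # nick -> word -> count
        self.msg_counts = Counter()              # nick -> number of messages
        self.vocab = set()

    def train(self, nick, message):
        tokens = tokenize(message)
        self.word_counts[nick].update(tokens)
        self.msg_counts[nick] += 1
        self.vocab.update(tokens)

    def log_scores(self, message):
        """Unnormalised log posterior for each known nick."""
        tokens = tokenize(message)
        total_msgs = sum(self.msg_counts.values())
        vocab_size = len(self.vocab) + 1  # +1 bucket for unseen words
        scores = {}
        for nick, n_msgs in self.msg_counts.items():
            counts = self.word_counts[nick]
            total_words = sum(counts.values())
            score = math.log(n_msgs / total_msgs)  # prior from message share
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha)
                                  / (total_words + self.alpha * vocab_size))
            scores[nick] = score
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)
```

The gap between the top two values from `log_scores` gives a rough way to flag ambiguous messages such as "Huh?!", where no nick stands out.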