Where can I find text that describes events on a specific topic?
So, some background: I'm trying to train an ML system to answer questions about events, where both the event descriptions and the questions are posed in natural language; the event descriptions are constrained to single sentences.
So far the main problem with this has been locating a corpus that describes events with a limited enough vocabulary to pose similar questions across all of the events (e.g. if all of the events involved chess, I could reasonably ask 'what piece moved?' and an answer could be drawn from a decent percentage of the event description sentences).
With that in mind, I'm hoping to find a text source that is tightly focused around describing events within some fairly limited topic (more along the lines of chess commentary than a chess forum, for example).
While I've had some luck with a corpus of air-traffic controller dialogs, most of the sentences aren't typical English (they involve a lot of Charlie, Tango, etc.). However, if the format is as I've described, then the actual topic of focus is irrelevant, so long as it has one.
Since I plan on building my own corpus out of this text, no tagging is necessary.
The Reuters corpus has fairly monotonous content (commercial news: CEO appointments, mergers and acquisitions, major deals, etc.). I am more familiar with the multilingual v2, but IIRC the v1 corpus was monolingual English. These will be multiple-sentence news stories, but in keeping with journalistic conventions, you can expect the first sentence to form a reasonable gist of the full story. http://about.reuters.com/researchandstandards/corpus/
You might also look at other TREC and especially MUC competition materials; http://en.wikipedia.org/wiki/Message_Understanding_Conference
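To illustrate the "first sentence as gist" idea, here is a minimal Python sketch that keeps only the lead sentence of each story. It assumes the stories are already available as plain strings; the sample stories and the `lead_sentence` helper are made up for illustration, and the regex splitter is deliberately naive (it will trip on abbreviations such as "Inc." followed by a capitalized word), so a proper sentence tokenizer would be a better fit for real data.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace and then
# an uppercase letter or an opening quote. This is an assumption, not a
# robust tokenizer.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=["\'A-Z])')


def lead_sentence(story: str) -> str:
    """Return the first sentence of a story (hypothetical helper)."""
    text = " ".join(story.split())  # collapse newlines and runs of spaces
    if not text:
        return ""
    return _SENTENCE_END.split(text, maxsplit=1)[0]


# Invented sample stories, standing in for real corpus documents.
stories = [
    "Acme Corp named Jane Doe as chief executive on Monday. "
    "Doe replaces John Roe, who retired.",
    "Foo Ltd agreed to buy Bar Inc for $2 billion.\n"
    "The deal is expected to close next year.",
]

event_sentences = [lead_sentence(s) for s in stories]
for sentence in event_sentences:
    print(sentence)
```

Each resulting string is a single-sentence event description, which matches the constraint stated in the question.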
Have you considered Usenet? It has a bunch of idiosyncratic conventions of its own, but something like rec.food.cooking would seem to broadly fit your description. http://groups.google.com/group/rec.food.cooking/ Have a look at e.g. rec.sport.hockey or rec.games.video.arcade as well. There is also the 20 Newsgroups corpus if you are looking for a canonical, well-known corpus, and it contains at least some sports-related newsgroup material. http://people.csail.mit.edu/jrennie/20Newsgroups/ (Maybe in your country the "general public" is comfortable with baseball. Over here it would be football, you know, the kind where you can't use your hands.)
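As a rough sketch of coping with those Usenet conventions before building a corpus, the Python snippet below drops quoted reply lines (starting with ">"), simple attribution lines, and everything after the conventional "-- " signature delimiter. The `strip_usenet_noise` helper and the sample post are invented for illustration, and the attribution check is only a heuristic assumption; real posts vary a lot.

```python
def strip_usenet_noise(body: str) -> str:
    """Keep only the poster's own text from a message body (hypothetical helper)."""
    kept = []
    for line in body.splitlines():
        # By convention a signature starts after a line that is exactly "-- ".
        if line == "-- ":
            break
        stripped = line.strip()
        # Quoted text from earlier posts is prefixed with ">".
        if stripped.startswith(">"):
            continue
        # Heuristic (assumption): attribution lines end in "wrote:" or "writes:".
        if stripped.endswith(("wrote:", "writes:")):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


# Invented sample post, not taken from a real newsgroup.
post = (
    "Alice wrote:\n"
    "> How long do you boil the eggs?\n"
    "\n"
    "I boiled them for six minutes.\n"
    "Then I cooled them in ice water.\n"
    "-- \n"
    "Bob\n"
)

print(strip_usenet_noise(post))
```

The remaining lines can then be split into sentences and filtered for event descriptions.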