How can NLP techniques be used to pick out idioms and set phrases from other common phrases?
What techniques exist that can tell the difference between plain common phrases such as "to the", "and the" and set phrases and idioms that have their own lexical meanings, such as "pick up", "fall in love", "red herring", "dead end"?
Are there techniques that succeed even without a dictionary, for instance statistical methods such as HMMs trained on large corpora?
Or are there heuristics such as ignoring or weighting down "promiscuous" words which can co-occur with just about any word versus words which occur either alone or in a specific limited set of idiomatic phrases?
If there are such heuristics, how do we take into account set phrases and verbal phrases which do incorporate promiscuous words such as "up" in "beat up", "eat up", "sit up", "think up"?
UPDATE
I've found an interesting paper online: Unsupervised Type and Token Identification of Idiomatic Expressions
Comments (1)
Are you looking for collocation detection?
Take a look at the collocations chapter in the excellent book Foundations of Statistical Natural Language Processing by Manning & Schütze.
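To make the idea concrete, here is a minimal sketch of one common association measure used for collocation detection, pointwise mutual information (PMI). It needs no dictionary, only token counts from a corpus. The function name `pmi_bigrams`, the `min_count` parameter, and the toy sentence are illustrative assumptions, not something taken from the book or the paper mentioned in the question.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    Pairs that co-occur more often than their individual frequencies
    predict get a high score.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus (hypothetical); a real run would use millions of tokens.
tokens = ("the red herring fooled the dog and the cat saw the red herring "
          "and the bird saw the dog and the cat").split()
ranked = pmi_bigrams(tokens)
for pair, score in ranked[:3]:
    print(pair, round(score, 2))
```

Because the score divides by the individual word frequencies, very frequent "promiscuous" words such as "the" are penalized automatically, so a pair like "and the" ranks below "red herring" even though it occurs more often. Particle verbs such as "beat up" are a harder case for plain PMI, since "up" is frequent in its own right; measures such as the log-likelihood ratio or a t-test, also commonly used for collocations, may be worth comparing on real data.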