How to identify ideas and concepts in a given text
I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained:
Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, or even better a photograph?
It'd be great to be able to detect that the person has asked for a photograph of Mr Jones. I could take a really naïve approach and just look for the word "photo" or "photograph", but this would obviously be no good if they wrote something like:
Please, never send me a photo of Mr Jones.
Does anyone know where to start with this? Is it even possible?
I've looked into things like nltk, but I've yet to find an example of someone doing something similar and am still not entirely sure what this kind of analysis is called. Any help that can get me off the ground would be great.
Thanks!
Comments (4)
The best thing out there that might be useful to you is automatic sentiment analysis. This is used, for example, to judge whether, say, a customer review is positive or negative. I cannot give you direct pointers to available tools, but this is what you are looking for.
I must say, though, that this is a current hot topic in natural language processing and I’ve seen a number of papers at conferences. It’s definitely quite a complex matter and if you’re starting from scratch, it might take quite some time before you get the results that you want.
NLTK is not a bad framework for parsing natural language, but beware that this is not a simple matter. Doing stuff like this is really research-level programming.
One thing that makes it much easier is a very limited domain. Say your application focuses on information about famous writers: then you can avoid some complexities of natural language, such as certain types of ambiguity.
Where to start? Good question. I don't know of any tutorials on the topic (and I presume you tried the Google option), but I'd imagine that iTunes U would have a course on it. If not, I can post a link to a course I've done that mentions the subject and wasn't completely horrible: http://www.inf.ed.ac.uk/teaching/courses/inf2a/lecturematerials/index.html#lecture01
The problem you are tackling is very challenging.
I would start by identifying the entities in the text (this problem is known as Named Entity Recognition; google it), and then I would try to identify concepts.
If you want to roughly identify what the text is about, I suggest starting with WordNet and using the words and their positions in its hierarchy to identify the concepts involved.
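To make the hierarchy idea concrete, here is a minimal sketch. It does not call WordNet at all: the `HYPERNYMS` dictionary is a tiny hand-written stand-in whose entries are assumptions made up for illustration, and `ancestors` and `mentions_concept` are hypothetical helper names. With real WordNet data you would look up the parents of each word instead of reading them from this toy table.

```python
# Toy stand-in for a WordNet-style hierarchy: each word maps to its parent
# (hypernym). These entries are illustrative assumptions, not WordNet data.
HYPERNYMS = {
    "photo": "photograph",
    "photograph": "picture",
    "snapshot": "photograph",
    "portrait": "picture",
    "picture": "image",
    "image": "representation",
    "description": "statement",
    "statement": "message",
}


def ancestors(word):
    """Return the chain of hypernyms above `word`, nearest first."""
    chain = []
    seen = {word}
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        if word in seen:  # guard against accidental cycles in the toy data
            break
        seen.add(word)
        chain.append(word)
    return chain


def mentions_concept(tokens, concept):
    """True if any token is the concept itself or sits below it in the hierarchy."""
    return any(tok == concept or concept in ancestors(tok) for tok in tokens)


tokens = "please send me a snapshot of mr jones".split()
print(mentions_concept(tokens, "picture"))  # snapshot -> photograph -> picture
print(mentions_concept(tokens, "message"))  # no token sits under "message"
```

Walking up the hierarchy lets "snapshot", "photo" and "portrait" all count as mentions of the broader concept "picture" without listing every surface form by hand. It says nothing about whether the mention is a request or a refusal.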
If you want to produce a system that shows real intelligence, then you should start researching resources such as CYC (OpenCYC), which will allow you to convert sentences into FOL (first-order logic) sentences.
This is the hardcore AI approach to solving your problem. For a simple chatbot, it would be easier to rely on simple statistical methods.
Good luck.
Unsupervised methods, such as text clustering or topic modeling, can indicate which documents in the corpus reflect a particular topic. But the success of these methods depends on how you define the topic boundary and on what counts as an error for your particular use case.
In your example, the first step IS the naive approach: keep only the documents that contain "photo*" or semantically related tokens (identified, for example, via word embeddings drawn from a language model). You can then test negation detection methods (including language models), which seek to identify concept negation in text and are widely applied in the clinical domain.
The result will be probabilistic, indicating whether a particular document falls into a bucket; in your example, whether document X falls into the "is photo" topic. The better you define the bucket and the method, the better your chances of preventing false positives.
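The two steps described in this answer (keyword filter, then negation check) can be sketched with the standard library alone. This is a rough rule-based illustration in the spirit of NegEx-style negation detection, not a tested method: the cue list `NEGATION_CUES`, the five-token `WINDOW`, and the function names are all assumptions chosen for the example, and it returns hard labels rather than the probabilities a trained model would give.

```python
import re

# Assumed trigger pattern for the "photo" concept ("photo*" in the text above).
CONCEPT = re.compile(r"\bphoto\w*", re.IGNORECASE)
# Assumed, deliberately short list of negation cues.
NEGATION_CUES = {"never", "no", "not", "don't", "dont", "without", "stop"}
WINDOW = 5  # how many tokens before the concept to scan for a cue


def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def classify(text):
    """Label each sentence that mentions the concept as 'asserted' or 'negated'."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if CONCEPT.match(tok):
                preceding = tokens[max(0, i - WINDOW):i]
                negated = any(t in NEGATION_CUES for t in preceding)
                results.append((sentence, "negated" if negated else "asserted"))
                break  # one label per sentence is enough for this sketch
    return results


print(classify("It would also be useful if I could have a description "
               "of his appearance, or even better a photograph?"))
print(classify("Please, never send me a photo of Mr Jones."))
```

On the two example sentences from the question, the first is labelled "asserted" and the second "negated". The fixed window is crude: a sentence like "I do not want anything except a photo" would be wrongly marked as negated, which is the kind of error a trained negation detector or language model is meant to reduce.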