How can I find out whether a document on the web is semantically related to other documents?
My question is: given a document d1 on the web and a document d2, how do I tell whether d1 and d2 are semantically related? Are there APIs that can do some amount of natural language processing and give me a hint that d1 is probably connected to d2?
I need this badly and urgently. Please help!
Comments (3)
You can use special microformats. See more at http://microformats.org/
Simple example:
Rel-License is one of several microformats. By adding rel="license" to a hyperlink, a page indicates that the destination of that hyperlink is a license for the current page.
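To make the rel="license" example concrete, here is a minimal sketch of how a program could read that annotation from a page. It uses only Python's standard `html.parser`; the class name `RelLicenseParser`, the helper `find_license_links`, and the sample markup are made up for illustration and are not part of any microformats library.

```python
from html.parser import HTMLParser


class RelLicenseParser(HTMLParser):
    """Collects href values of <a>/<link> tags whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_license_links(html_text):
    """Return every link target that the page marks as its license."""
    parser = RelLicenseParser()
    parser.feed(html_text)
    return parser.license_urls


# Hypothetical page fragment, used only to exercise the parser.
sample_page = """
<p>Some text. <a href="https://example.org/other">unrelated</a></p>
<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>
"""
print(find_license_links(sample_page))
```

This only tells you what the page author chose to declare; it does not infer any relationship that is not marked up.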
To relate documents semantically, you can use special vocabularies like SKOS and relate the documents in an ontology. Or, as silex mentioned, you can use microformats directly in your documents.
For natural language processing, there are various tools, such as GATE, that can extract information. But this is not a trivial task.
Perhaps you can refine what you want to do? Do you want to define yourself which documents are related? Or do you want software to find out which documents may be related?
You need to look into "named entity extraction", that is, using natural language processing to extract likely entities that are common to both documents. These are generally people, places, events, times, and organisations.
Take a look at OpenCalais http://www.opencalais.com/ for some real-world applications of this type of technology.
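As a rough illustration of the "entities common to both documents" idea, here is a minimal sketch in plain Python. Note the assumptions: it does not call OpenCalais or any real extractor. Instead it swaps in a naive heuristic (runs of capitalized words) as a stand-in for named entity extraction, and scores relatedness with Jaccard overlap, which is my choice for the example rather than anything prescribed above. The function names and sample sentences are hypothetical.

```python
import re

# Naive stand-in for a real named-entity extractor: runs of capitalized words.
# It will also pick up capitalized sentence openers, so treat results as rough.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_candidate_entities(text):
    """Return the set of capitalized word runs found in the text."""
    return {match.group(0) for match in ENTITY_PATTERN.finditer(text)}


def entity_overlap(text_a, text_b):
    """Return (Jaccard score, shared entities) for two texts."""
    entities_a = extract_candidate_entities(text_a)
    entities_b = extract_candidate_entities(text_b)
    union = entities_a | entities_b
    if not union:
        # Neither text produced a candidate entity; report no evidence.
        return 0.0, set()
    shared = entities_a & entities_b
    return len(shared) / len(union), shared


# Made-up sample texts standing in for d1 and d2.
d1 = "Barack Obama met Angela Merkel in Berlin."
d2 = "A speech by Angela Merkel was held in Berlin."

score, shared = entity_overlap(d1, d2)
print(sorted(shared), round(score, 2))
```

A higher score only hints that d1 and d2 may be connected. With a proper extractor in place of the regex, the same overlap step would apply to its output.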