RDF representation of entity references in text
Consider a sentence like:
John Smith travelled to Washington.
A name tagger would identify, on a good day, 'John Smith' as a person, and 'Washington' as a place. However, without other evidence, it can't tell which of all the possible 'John Smith's in the world, or even which of the various 'Washington's, it's got.
Eventually, some resolution process might decide, based on other evidence. Until that point, however, what is a good practice for representing these references in RDF? Assign them made-up unique identifiers in some namespace? Make blank tuples (e.g. 'Some person named John Smith was referenced in Document d'.)? Some other alternative? A book I have gives an example involving anonymous weather stations, but I am not quite following how their example fits in with everything else about RDF being described.
Comments (4)
Assign them unique identifiers in your own namespace. If you later discover that this "Washington" is the same as http://dbpedia.org/resource/Washington,_D.C., or whatever, you can add an owl:sameAs to assert that.
First of all, there are existing good services you can use for entity recognition, such as OpenCalais, Zemanta and Alchemy.
To be more specific though: yes, simply 'mint' your own URIs (identifiers) for each thing and then talk about them, offering up a representation of this information in Turtle.
If you later match them up, you can use owl:sameAs, as glenn mcdonald mentions.
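As a rough illustration of minting URIs and later linking them, here is a minimal Python sketch that builds the Turtle text by hand. The `http://example.org/mentions/` namespace, the URI layout, the helper names, and the choice of `rdfs:label` and `owl:Thing` are all assumptions made for the example; only the DBpedia URI and `owl:sameAs` come from the discussion above.

```python
from urllib.parse import quote

# Hypothetical namespace; substitute one you control.
BASE = "http://example.org/mentions/"

PREFIXES = (
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)


def mint_uri(doc_id: str, index: int, surface: str) -> str:
    """Build a URI for the index-th mention found in document doc_id."""
    slug = quote(surface.replace(" ", "_"))
    return f"{BASE}{quote(doc_id)}/{index}-{slug}"


def escape(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def describe(uri: str, rdf_type: str, surface: str) -> str:
    """One Turtle statement typing the mention and recording its label."""
    return f'<{uri}> a {rdf_type} ;\n    rdfs:label "{escape(surface)}" .'


def same_as(uri: str, resolved: str) -> str:
    """Assert that a minted URI denotes the same thing as a resolved one."""
    return f"<{uri}> owl:sameAs <{resolved}> ."


john = mint_uri("d", 0, "John Smith")
washington = mint_uri("d", 1, "Washington")

turtle = "\n".join([
    PREFIXES,
    describe(john, "foaf:Person", "John Smith"),
    describe(washington, "owl:Thing", "Washington"),
    # Added only once a resolution step has decided which Washington is meant.
    same_as(washington, "http://dbpedia.org/resource/Washington,_D.C."),
])
print(turtle)
```

Until resolution happens, the `owl:sameAs` statement is simply left out; the minted URIs remain valid on their own.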
It may be relevant for you to read how Apache Stanbol does it: http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html
You can either mint your own URI as discussed above, or use a blank-node. There are pros and cons for both approaches:
URIs have an external identity, so you can explicitly refer to your concept in future queries, which can make some queries much simpler. But because they have an external identity, the algorithm you use to construct the URIs becomes a critical part of your infrastructure, and you have to guarantee they are both stable and unique. This may be trivial at first, but when you start dealing with multiple documents being reprocessed at differing times, often in parallel and on distributed systems, it pretty quickly ceases to be straightforward.
Blank-nodes were included specifically to solve this problem: their uniqueness is guaranteed by their scoping. But if you need to refer to a blank-node explicitly in a query, you will have to either use a non-standard extension or find some way to characterize the node.
In both cases, but especially should you use a blank-node, you should include provenance statements to characterize it anyway.
@nathan's example is a good one to get the idea.
So an example using blank-nodes might be:
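A minimal sketch of what such a description could look like, generated here with Python. The `example.org` document URI, the `_:m0`/`_:m1` labels, the helper names, and `dcterms:source` as the provenance property are assumptions made for illustration, not taken from the answer. The fragment uses the RFC 5147 `char=start,end` form, where positions count characters from zero.

```python
TEXT = "John Smith travelled to Washington."
DOC = "http://example.org/docs/d.txt"  # hypothetical document URI


def char_fragment(doc_uri, text, surface):
    """RFC 5147 character range for the first occurrence of surface in text."""
    start = text.index(surface)
    return f"{doc_uri}#char={start},{start + len(surface)}"


def blank_mention(label, rdf_type, surface, source):
    """Turtle for a blank node carrying a name and a provenance pointer."""
    return (
        f"_:{label} a {rdf_type} ;\n"
        f'    foaf:name "{surface}" ;\n'
        f"    dcterms:source <{source}> ."
    )


statements = [
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .",
    "@prefix dcterms: <http://purl.org/dc/terms/> .",
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    blank_mention("m0", "foaf:Person", "John Smith",
                  char_fragment(DOC, TEXT, "John Smith")),
    blank_mention("m1", "owl:Thing", "Washington",
                  char_fragment(DOC, TEXT, "Washington")),
]
print("\n".join(statements))
```

The blank-node labels are only meaningful inside this one graph, which is what lets parallel runs avoid coordinating on identifiers.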
Note the use of RFC 5147 text/plain fragment identifiers to uniquely identify the file being processed; this gives you flexibility in how you identify individual runs. The alternative is to capture all this in the URI for the document root, or to abandon provenance altogether.
You will note that foaf:name has a domain of owl:Thing, so it can be applied to anything. An alternative might be to use skos:Concept and rdfs:label for the proper nouns.
One final consideration for blank-node vs. URI is that any datastore you use will ultimately have to store any URI you use, and this can have implications regarding performance if you are using very large datasets.
Ultimately, if I were going to publish the provenance information in the graph along with the final unified entities, I would be inclined to go with blank-nodes and allocate URIs to the concepts I ultimately unify entities with.
If, however, I were not going to track the provenance of the inferences, and this were just one pass of many in a pipeline that will ultimately discard the intermediate results, I would just mint URIs using some sort of document hash, timestamp, and id and be done with it.
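One way the hash, timestamp, and id scheme could look in Python, as a sketch only. The base namespace, the path layout, the truncation of the digest to 16 hex characters, and the function name are assumptions, not a recommended standard.

```python
import hashlib
from datetime import datetime, timezone


def mint_mention_uri(doc_bytes, run_time, mention_id,
                     base="http://example.org/mention/"):
    """Combine a document hash, a run timestamp, and a per-run id into a URI.

    base is a hypothetical namespace; run_time should be timezone-aware UTC.
    """
    digest = hashlib.sha256(doc_bytes).hexdigest()[:16]
    stamp = run_time.strftime("%Y%m%dT%H%M%SZ")
    return f"{base}{digest}/{stamp}/{mention_id}"


run = datetime(2024, 1, 1, tzinfo=timezone.utc)
doc = "John Smith travelled to Washington.".encode("utf-8")
uri_a = mint_mention_uri(doc, run, 0)
uri_b = mint_mention_uri(doc, run, 1)
print(uri_a)
print(uri_b)
```

Including the timestamp keeps separate runs from colliding, at the cost that reprocessing the same document yields different URIs; dropping it would make the URIs stable across runs instead.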