How do I find duplicate data in XML documents using XQuery?
I have a bunch of documents in a MarkLogic XML database. One document has:
<colors>
<color>red</color>
<color>red</color>
</colors>
Having multiple colors is not a problem. Having multiple colors that are both red is a problem. How do I find the documents that have duplicate data?
Comments (4)
For this XML:
Using this XSD:
I got this output:
So that will at least find them, but it will report each occurrence of a repeated color, not just every repeated color.
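If you want each repeated color reported once per document rather than once per occurrence, one option is to group by distinct value first. This is only a sketch, assuming the `colors`/`color` elements from the question are in no namespace and that `fn:doc()` with no arguments returns every document (as it does in MarkLogic). The `<duplicates>` wrapper element is a made-up name for illustration.

```xquery
xquery version "1.0-ml";

(: For each <colors> element, list every color value that occurs more than once.
   Each repeated value is reported once, however many times it occurs. :)
for $colors in fn:doc()//colors
let $repeated :=
  for $value in fn:distinct-values($colors/color)
  where fn:count($colors/color[. = $value]) gt 1
  return $value
where fn:exists($repeated)
return
  (: <duplicates> is a hypothetical wrapper, not part of the original data :)
  <duplicates uri="{xdmp:node-uri($colors)}">{
    for $value in $repeated
    return <color>{$value}</color>
  }</duplicates>
```

Note that this walks every document, so it is only practical on small data sets.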
Everything MarkLogic returns is just a sequence of nodes, so we can count the whole sequence and compare that with the count of its distinct values. If the two counts differ, some values are duplicated, and the documents where that happens are your subset.
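A minimal sketch of that count comparison, assuming the element names from the question and no namespaces:

```xquery
xquery version "1.0-ml";

(: Return the URI of every document whose <color> values are not all distinct. :)
for $doc in fn:doc()
let $colors := $doc//colors/color
(: fn:distinct-values atomizes the nodes, so this compares their string values :)
where fn:count($colors) ne fn:count(fn:distinct-values($colors))
return xdmp:node-uri($doc)
```

The comparison is exact, so `red` and ` red ` (with surrounding whitespace) would count as different values unless you normalize them first, for example with `fn:normalize-space`.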
This should do the trick. I am not too familiar with MarkLogic, so the first line to get the set of documents may be wrong. This will return all documents which have 2 or more color elements with the same string value.
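The query itself is not shown here. As a rough illustration of the behaviour described (and not necessarily the code this answer had in mind), a quantified expression can test whether any color string occurs at least twice in a document:

```xquery
xquery version "1.0-ml";

(: Return every document that has 2 or more <color> elements
   with the same string value. :)
for $doc in fn:doc()
let $colors := $doc//color/fn:string(.)
where some $c in $colors satisfies fn:count($colors[. eq $c]) gt 1
return $doc
```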
Or you could do it completely out of indexes :)
for $c in doc()//colors
is likely to cause an EXPANDED TREE CACHE error on larger data sets. Here is a slightly more complicated way to attack this when the data is huge: make sure the URI lexicon is turned on, then add an element range index on the element color and compute the distinct color values that are duplicated somewhere. Then loop, one by one, over only the documents that contain such a color and compute the item-frequency counts of the colors of interest in each document. If you get a frequency over 1, that document needs de-duplication.
Hope that helps.
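The steps above could be sketched roughly as follows. Treat this as an untested outline under several assumptions: the URI lexicon is enabled, a string element range index exists on `color` (no namespace, default collation), and each document is a single fragment. The `<duplicate>` result element is a made-up name for illustration.

```xquery
xquery version "1.0-ml";

let $qn := xs:QName("color")

(: Step 1: candidate colors, i.e. values that occur more than once
   across the whole database according to the range index. :)
for $value in cts:element-values($qn, (), "item-frequency")
where cts:frequency($value) gt 1
return
  (: Step 2: visit only the documents that contain this color. :)
  for $uri in cts:uris((), (), cts:element-range-query($qn, "=", $value))
  (: Step 3: read the same value from the index again, restricted to this
     one document, so its item frequency is the per-document count. :)
  let $in-doc := cts:element-values(
                   $qn, $value, ("item-frequency", "limit=1"),
                   cts:document-query($uri))
  where cts:frequency($in-doc) gt 1
  return
    (: <duplicate> is a hypothetical result element :)
    <duplicate uri="{$uri}" color="{$value}" count="{cts:frequency($in-doc)}"/>
```

Because everything is resolved from the lexicons and the range index, no document is loaded into the expanded tree cache.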