How do I use XQuery to find duplicate data in XML documents?

Posted on 2024-07-12 00:48:31

I have a bunch of documents in a MarkLogic XML database. One document has:

<colors>
  <color>red</color>
  <color>red</color>
</colors>

Having multiple colors is not a problem. Having the same color (here, red) listed more than once is a problem. How do I find the documents that have duplicate data?

Comments (4)

东风软 2024-07-19 00:48:32

For this XML:

<?xml version="1.0"?>
<colors>
    <color>Red</color>
    <color>Red</color>
    <color>Blue</color>
</colors>

Using this XSLT stylesheet:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method = "text" />  
    <xsl:strip-space elements="*"/>

    <xsl:template match="colors">

        <xsl:for-each select="color">
            <xsl:variable name="node_color" select="text()"/>
            <xsl:variable name="numEntries" select="count(../color[text()=$node_color])"/>
            <xsl:if test="$numEntries > 1">
                <xsl:text>Color value of </xsl:text><xsl:value-of select="."/><xsl:text> has multiple entries 
</xsl:text>      
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

I got this output:

Color value of Red has multiple entries 
Color value of Red has multiple entries 

So that will at least find them, but it reports every occurrence of a repeated color rather than reporting each repeated color once.
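
Since the question asks for XQuery, here is a minimal sketch of the same check that reports each repeated color only once. It iterates over the distinct values instead of over every element. The inline `<colors>` element stands in for a real document and is only an illustration.

```xquery
(: Inline sample data; in practice this would come from a stored document. :)
let $colors :=
  <colors>
    <color>Red</color>
    <color>Red</color>
    <color>Blue</color>
  </colors>
(: Visit each distinct value once and keep those that occur more than once. :)
for $value in distinct-values($colors/color)
where count($colors/color[. = $value]) gt 1
return concat("Color value of ", $value, " has multiple entries")
```

For the sample above this returns the single string "Color value of Red has multiple entries".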

初见你 2024-07-19 00:48:31

Everything MarkLogic returns is just a sequence, so we can count the color elements under each colors element and compare that with the count of their distinct values. If the two counts differ, at least one value is duplicated, and you have your subset.

for $c in doc()//colors
where fn:count($c/color) != fn:count(fn:distinct-values($c/color))
return $c
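
The query above returns the matching colors elements. If the goal is a list of documents, a possible variation (a sketch, assuming it runs inside MarkLogic, where `xdmp:node-uri` gives the URI of the document that contains a node) is:

```xquery
xquery version "1.0-ml";

(: Return each affected document URI once, even if a document
   contains several <colors> elements with repeated values. :)
fn:distinct-values(
  for $c in fn:doc()//colors
  where fn:count($c/color) ne fn:count(fn:distinct-values($c/color))
  return xdmp:node-uri($c)
)
```

Like the original, this walks every document, so it is best suited to small or medium data sets.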

笑忘罢 2024-07-19 00:48:31

This should do the trick. I am not too familiar with MarkLogic, so the first line, which gets the set of documents, may be wrong. This returns all documents that have two or more color elements with the same string value.

for $doc in doc()
(: String values of every color element in this document. :)
let $colors := $doc//color/string(.)
where some $color in $colors
      satisfies count($colors[. = $color]) > 1
return $doc

尤怨 2024-07-19 00:48:31

Or you could do it entirely out of the indexes :)

for $c in doc()//colors is likely to cause an expanded tree cache full error (XDMP-EXPNTREECACHEFULL) on larger data sets.

Here is a slightly more complicated way to attack this when the data is huge. Make sure the URI lexicon is turned on, then add an element range index on the color element. First, compute the distinct color values that are duplicated somewhere: those whose item frequency is greater than their fragment frequency. Then loop over only the documents that contain one of those colors and compute the item-frequency counts of the colors of interest within each document. If a color has a frequency over 1, that document needs de-duplication.

let $qn := xs:QName("color")
(: Every color value with its total occurrence count (item frequency). :)
let $colorsWithItemFreq := cts:element-values($qn, (), ("ascending", "item-order", "item-frequency"))
(: A color that occurs more often than the number of fragments containing it
   must be repeated inside at least one fragment. :)
let $colorsOfInterest := 
    for $color at $i in cts:element-values($qn, (), ("ascending", "item-order", "fragment-frequency"))
    let $fragFrequency := cts:frequency($color)
    let $itemFrequency := cts:frequency($colorsWithItemFreq[$i])
    where $itemFrequency gt $fragFrequency
    return 
        $color

(: Visit only the documents that contain one of those colors. :)
for $uri in cts:uris( (), ("document"), cts:element-value-query($qn, $colorsOfInterest) )
let $colorsWithDuplicationInThisDoc :=
    for $color in cts:element-values($qn, (), ("item-frequency"), cts:document-query($uri) )
    where $color = $colorsOfInterest and cts:frequency($color) gt 1
    return
        $color
(: One duplicated color is enough to flag the document. :)
where fn:exists( $colorsWithDuplicationInThisDoc )
return
    $uri
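
The two prerequisites for the query above (the URI lexicon and an element range index on color) can be set in the Admin UI. As a rough sketch, they could also be scripted with the Admin API. The database name "Documents", the empty namespace, and the root collation are assumptions for illustration and would need to match your own setup.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "Documents" is a placeholder database name. :)
let $db := xdmp:database("Documents")
(: Enable the URI lexicon so that cts:uris can be used. :)
let $config := admin:database-set-uri-lexicon($config, $db, fn:true())
(: String range index on <color> in no namespace, root collation assumed. :)
let $index := admin:database-range-element-index(
  "string", "", "color", "http://marklogic.com/collation/", fn:false())
let $config := admin:database-add-range-element-index($config, $db, $index)
return admin:save-configuration($config)
```

Adding the index triggers a reindex, and running the script a second time is expected to fail because the index already exists, so treat it as a one-time setup step.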

Hope that helps.
