XPath contains(text(),'some string') doesn't work when used with a node that has more than one Text subnode

Posted 2024-09-17 10:53:49

I have a small problem with XPath contains with dom4j ...

Let's say my XML is

<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
    </Addr>
</Home>

Let's say I want to find all the nodes that have ABC in the text given the root Element...

So the XPath that I would need to write would be

//*[contains(text(),'ABC')]

However, this is not what dom4j returns. Is this a dom4j problem or a gap in my understanding of how XPath works, since that query returns only the Street element and not the Comment element?

The DOM makes the Comment element a composite element with four child nodes, two text nodes and two br elements:

[Text = 'BLAH BLAH BLAH '][BR][BR][Text = 'ABC']

I would assume that the query should still return the element since it should find the element and run contains on it, but it doesn't ...

The following query returns the element, but it returns far more than just the element – it returns the parent elements as well, which is undesirable for this problem.

//*[contains(.,'ABC')]

Does anyone know the XPath query that would return just the elements <Street/> and <Comment/>?

Comments (6)

对岸观火 2024-09-24 10:54:00

Here is an alternate way to match nodes which contain a given text string. First query for the text node itself, then get the parent:

//text()[contains(., "ABC")]/..

For me this is easy to read and understand.
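
To try this expression outside dom4j, here is a minimal sketch that assumes the JDK's built-in javax.xml.xpath engine (XPath 1.0). The class name ParentOfTextNodeDemo and the helper select are made up for illustration, and the XML is the question's document with the indentation whitespace removed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath 1.0 engine, not dom4j.
class ParentOfTextNodeDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Evaluates the expression against XML and returns the names of the selected nodes.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Select matching text nodes first, then step up to their parent elements.
        System.out.println(select("//text()[contains(., \"ABC\")]/.."));
    }
}
```

With this document the parents of the two matching text nodes are Street and Comment; the Home and Addr ancestors are not selected because none of their own child text nodes contain ABC.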

只涨不跌 2024-09-24 10:53:59
//*[text()='ABC'] 

returns

<street>ABC</street>
<comment>BLAH BLAH BLAH <br><br>ABC</comment>
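
Note that = here is an exact match against any one child text node, not a substring test. The sketch below illustrates the difference; it assumes the JDK's javax.xml.xpath engine (XPath 1.0) rather than dom4j, and the class name ExactTextMatchDemo and helper select are hypothetical. Element names keep the capitalization used in the question's XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath 1.0 engine, not dom4j.
class ExactTextMatchDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the names of the nodes selected by the expression.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // node-set = string is true if ANY node in the set has that exact string-value.
        System.out.println(select("//*[text()='ABC']"));
        // A partial string does not match, because = is not a substring test.
        System.out.println(select("//*[text()='AB']"));
        // contains() converts the node-set to its FIRST node only.
        System.out.println(select("//*[contains(text(),'ABC')]"));
    }
}
```

So this form only helps when one of the text nodes is exactly the string being searched for.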
冷清清 2024-09-24 10:53:57

The accepted answer will return all the parent nodes too. To get only the actual text nodes with ABC, even if the string comes after a <br/>:

//*[text()[contains(.,'ABC')]]/text()[contains(.,"ABC")]
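
Keep in mind that this expression selects text nodes, not elements. A minimal sketch, assuming the JDK's javax.xml.xpath engine (XPath 1.0) instead of dom4j, with a hypothetical class TextNodeSelectionDemo and helper describe, prints each selected text node next to its parent element:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath 1.0 engine, not dom4j.
class TextNodeSelectionDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns "ParentName: text" for every text node the expression selects.
    static List<String> describe(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node text = nodes.item(i);
                result.add(text.getParentNode().getNodeName() + ": " + text.getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            describe("//*[text()[contains(.,'ABC')]]/text()[contains(.,\"ABC\")]"));
    }
}
```

If elements are what you need, call getParentNode() on each result as above, or drop the trailing /text()[...] step.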
心凉怎暖 2024-09-24 10:53:56

The XML document:

<Home>
    <Addr>
        <Street>ABC</Street>
        <Number>5</Number>
        <Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
    </Addr>
</Home>

The XPath expression:

//*[contains(text(), 'ABC')]

//* matches any descendant element of the root node, that is, every element in the document. (In XPath the root node is the document node itself, not the Home element.)

[...] is a predicate; it filters the node-set, keeping the nodes for which ... is true:

A predicate filters a node-set [...] to produce a new node-set. For each node in the node-set to be filtered, the PredicateExpr is evaluated [...]; if PredicateExpr evaluates to true for that node, the node is included in the new node-set; otherwise, it is not included.

contains('haystack', 'needle') returns true if haystack contains needle:

Function: boolean contains(string, string)

The contains function returns true if the first argument string contains the second argument string, and otherwise returns false.

But contains() takes a string as its first parameter, and here it is passed a node-set. To deal with that, every node or node-set passed as the first parameter is converted to a string as if by the string() function:

An argument is converted to type string as if by calling the string function.

The string() function returns the string-value of the first node:

A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.

string-value of an element node:

The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.

string-value of a text node:

The string-value of a text node is the character data.

So, basically, the string-value is all the text contained in a node (the concatenation of all descendant text nodes).

text() is a node test that matches any text node:

The node test text() is true for any text node. For example, child::text() will select the text node children of the context node.

That said, //*[contains(text(), 'ABC')] matches any element whose first child text node contains ABC. text() returns a node-set containing all child text nodes of the context node (the node relative to which the expression is evaluated), but contains() uses only the first one. So for the document above, the path matches only the Street element.

The expression //*[text()[contains(., 'ABC')]] matches any element that has at least one child text node containing ABC. Here . represents the context node, which inside the inner predicate is each child text node in turn. So for the document above, the path matches the Street and Comment elements.

Now, //*[contains(., 'ABC')] matches any element whose string-value (the concatenation of its descendant text nodes) contains ABC. For the document above it matches the Home, Addr, Street, and Comment elements. Likewise, //*[contains(., 'BLAH ABC')] matches the Home, Addr, and Comment elements.
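
The four expressions discussed above can be compared side by side. This is a minimal sketch that assumes the JDK's javax.xml.xpath engine (XPath 1.0) rather than dom4j; the class name PredicateComparisonDemo and helper select are hypothetical, and the XML is the document above with indentation whitespace removed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath 1.0 engine, not dom4j.
class PredicateComparisonDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the names of the elements selected by the expression, in document order.
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//*[contains(text(), 'ABC')]",    // first child text node only
            "//*[text()[contains(., 'ABC')]]", // any child text node
            "//*[contains(., 'ABC')]",         // string-value of the element
            "//*[contains(., 'BLAH ABC')]"     // substring spanning two text nodes
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + select(expression));
        }
    }
}
```

The last expression only matches because the string-value joins "BLAH BLAH BLAH " and "ABC" across the two br elements.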

起风了 2024-09-24 10:53:55

Modern answer that covers XPath 1.0 vs XPath 2.0+ behavior ...

This XPath,

//*[contains(text(),'ABC')]

behaves differently with XPath 1.0 and later versions of XPath (2.0+).

Common behavior

  • //* selects all elements within a document.
  • [] filters those elements according to the predicate expressed therein.
  • contains(string, substring) within the predicate will keep those elements for which substring is a substring of string.

XPath 1.0 behavior

  • contains(arg1, substring): If the first argument evaluates to a node set, contains() will convert the node set to a string by taking the string value of the first node in the node set. (If arg1 is text(), only the first of all matching text nodes will be considered by contains().) If you find this odd, you are not alone.
  • For //*[contains(text(),'ABC')], that node set is, for each element in the document in turn, all of its child text nodes.
  • Since only the first text node child is used, the expectation that all child text nodes are tested for 'ABC' substring containment is violated.
  • This leads to counter-intuitive results to anyone unfamiliar with the above conversion rules.

XPath 1.0 online example shows that only one 'ABC' is selected.
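
The first-node conversion can also be observed directly by evaluating string() and contains() on the Comment element's text nodes. This is a minimal sketch that assumes the JDK's javax.xml.xpath engine, which implements XPath 1.0; the class name FirstNodeConversionDemo and the helpers evalString and evalBoolean are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath 1.0 engine.
class FirstNodeConversionDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
    }

    // Evaluates the expression and returns its result converted to a string.
    static String evalString(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, parse());
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    // Evaluates the expression and returns its result converted to a boolean.
    static boolean evalBoolean(String expression) {
        try {
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(expression, parse(), XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Two text nodes are selected, but string() keeps only the first one.
        System.out.println("[" + evalString("string(//Comment/text())") + "]");
        // The element's own string-value concatenates all descendant text nodes.
        System.out.println("[" + evalString("string(//Comment)") + "]");
        // contains() applies the same first-node conversion to its first argument.
        System.out.println(evalBoolean("contains(//Comment/text(), 'ABC')"));
        // Passing the second text node explicitly finds the substring.
        System.out.println(evalBoolean("contains(//Comment/text()[2], 'ABC')"));
    }
}
```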

XPath 2.0+ behavior

  • It is an error to call contains(arg1, substring) with a sequence of more than one item as the first argument.
  • This corrected the counter-intuitive behavior described above in XPath 1.0.

XPath 2.0 online example shows a typical error message due to the conversion error particular to XPath 2.0+.

Common solutions

  1. If you wish to include descendant elements (beyond children), test against the string value of an element as a single string, rather than the individual string values of the child text nodes. This XPath,

    //*[contains(.,'ABC')]
    

    selects your targeted Street and Comment elements and also their Addr and Home ancestor elements because those too have 'ABC' as substrings of their string values.

    Online example shows ancestors being selected too.

  2. If you wish to exclude descendant elements (beyond children), this XPath,

    //*[text()[contains(.,'ABC')]]
    

    selects only your targeted Street and Comment because only those elements have text node children whose string values contain the 'ABC' substring. This will be true for all versions of XPath.

    Online example shows only Street and Comment being selected.

策马西风 2024-09-24 10:53:54

The <Comment> tag contains two text nodes and two <br> nodes as children.

Your XPath expression was

//*[contains(text(),'ABC')]

To break this down,

  1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
  2. The [] are a conditional that operates on each individual node in that node set. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
  3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
  4. contains is a function that operates on a string. If it is passed a node set, the node set is converted into a string by returning the string-value of the node in the node-set that is first in document order. Hence, it can match only the first text node in your <Comment> element -- namely BLAH BLAH BLAH. Since that doesn't match, you don't get a <Comment> in your results.

You need to change this to

//*[text()[contains(.,'ABC')]]
  1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
  2. The outer [] are a conditional that operates on each individual node in that node set -- here it operates on each element in the document.
  3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
  4. The inner [] are a conditional that operates on each node in that node set -- here each individual text node. Each individual text node is the starting point for any path in the brackets, and can also be referred to explicitly as . within the brackets. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
  5. contains is a function that operates on a string. Here it is passed an individual text node (.). Since it is passed the second text node in the <Comment> tag individually, it will see the 'ABC' string and be able to match it.
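
The two breakdowns above can be mirrored by walking the DOM by hand, without any XPath engine. The sketch below is only an emulation of the XPath 1.0 rules for illustration, not how dom4j or any XPath library is implemented; the class PredicateEmulation and its methods are hypothetical, and it uses the JDK's org.w3c.dom API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical hand-written emulation of the two predicates; illustration only.
class PredicateEmulation {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Mirrors contains(text(), needle): only the FIRST child text node is tested.
    static boolean firstTextChildContains(Element element, String needle) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                return child.getNodeValue().contains(needle);
            }
        }
        return "".contains(needle); // an empty node-set converts to the empty string
    }

    // Mirrors text()[contains(., needle)]: EVERY child text node is tested in turn.
    static boolean anyTextChildContains(Element element, String needle) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().contains(needle)) {
                return true;
            }
        }
        return false;
    }

    // Mirrors //*[...]: applies the predicate to every element in document order.
    static List<String> elementsMatching(Predicate<Element> predicate) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList all = doc.getElementsByTagName("*");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (predicate.test(element)) {
                    names.add(element.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementsMatching(e -> firstTextChildContains(e, "ABC")));
        System.out.println(elementsMatching(e -> anyTextChildContains(e, "ABC")));
    }
}
```

The first call stops at "BLAH BLAH BLAH " for the Comment element and so misses it; the second call reaches the text node after the two br elements.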