Filter an XML file to remove lines that contain certain text?

Posted on 2024-11-18 10:28:36

For example, suppose I have:

<div class="info"><p><b>Orange</b>, <b>One</b>, ...
<div class="info"><p><b>Blue</b>, <b>Two</b>, ...
<div class="info"><p><b>Red</b>, <b>Three</b>, ...
<div class="info"><p><b>Yellow</b>, <b>Four</b>, ...

And I'd like to remove all lines that have words from a list so I'll only use xpath on the lines that fit my criteria. For example, I could use the list as ['Orange', 'Red'] to mark the unwanted lines, so in the above example I'd only want to use lines 2 and 4 for further processing.

How can I do this?

温柔一刀 2024-11-25 10:28:36

Use:

//div
  [not(p/b[contains('|Orange|Red|', 
                    concat('|', ., '|')
                   )
          ]
       )
  ]

This selects every div element in the XML document that has no p child with a b child whose string value is one of the strings in the pipe-separated list of filter strings.

This approach allows extensibility by just adding new filter values to the pipe-separated list, without changing anything else in the XPath expression.

Note: When the structure of the XML document is statically known, avoid the // XPath pseudo-operator and write the explicit path instead, because // has to search the whole document and can be significantly slower.
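To show how the pipe-separated list can be built from a Python list, here is a minimal sketch using lxml. The <root> wrapper element, the helper name filter_divs and the XPath variable $words are assumptions made for illustration and are not part of the answer above. Since the wrapper makes the structure known, the sketch uses the explicit path /root/div rather than //div.

```python
from lxml import etree

# Sample data from the question, closed and wrapped in a hypothetical <root>.
content = '''\
<root>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</root>
'''


def filter_divs(root, exclude):
    """Return the div children of root with no <b> equal to an excluded word."""
    # ['Orange', 'Red'] becomes '|Orange|Red|'.
    words = '|' + '|'.join(exclude) + '|'
    # The list is passed as an XPath variable, so no quoting is needed.
    expr = "/root/div[not(p/b[contains($words, concat('|', ., '|'))])]"
    return root.xpath(expr, words=words)


root = etree.fromstring(content)
kept = filter_divs(root, ['Orange', 'Red'])
first_words = [div.xpath('string(p/b[1])') for div in kept]
print(first_words)
```

Note that the expression tests every b inside p, not only the first one, so a div would also be dropped if a later b (such as One) appeared in the list. The words themselves must not contain the | character.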

沫雨熙 2024-11-25 10:28:36

import lxml.html as lh

# http://lxml.de/xpathxslt.html
# http://exslt.org/regexp/functions/match/index.html
content = '''\
<table>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
</table>
'''
# Namespace of the EXSLT regular-expression functions.
NS = 'http://exslt.org/regular-expressions'
tree = lh.fromstring(content)
exclude = ['Orange', 'Red']
# Joining with '|' builds the regex alternation 'Orange|Red'; re:test() then
# searches for it in the text of the first <b> of each div.
xpath = "//div[not(re:test(p/b[1]/text(), '{0}'))]".format('|'.join(exclude))
for elt in tree.xpath(xpath, namespaces={'re': NS}):
    # encoding='unicode' returns str, so Python 3 does not print a bytes repr.
    print(lh.tostring(elt, encoding='unicode'))
    print('-' * 80)

yields

<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>

--------------------------------------------------------------------------------
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>

--------------------------------------------------------------------------------
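
One caveat with the regular-expression approach: re:test() performs an unanchored search, so a pattern such as Orange|Red would also exclude a row whose first word is Redwood. The sketch below is one possible way to require whole-word matches, by escaping each word and anchoring the alternation. The <root> wrapper, the extra Redwood row, the helper name keep_divs and the XPath variable $pattern are assumptions added for illustration.

```python
import re

from lxml import etree

EXSLT_RE = 'http://exslt.org/regular-expressions'

# Sample rows plus a hypothetical 'Redwood' row to show the difference.
content = '''\
<root>
<div class="info"><p><b>Orange</b>, <b>One</b></p></div>
<div class="info"><p><b>Blue</b>, <b>Two</b></p></div>
<div class="info"><p><b>Red</b>, <b>Three</b></p></div>
<div class="info"><p><b>Yellow</b>, <b>Four</b></p></div>
<div class="info"><p><b>Redwood</b>, <b>Five</b></p></div>
</root>
'''


def keep_divs(root, exclude):
    """Return divs whose first <b> is not exactly one of the excluded words."""
    # Escape regex metacharacters and anchor the alternation at both ends.
    pattern = '^(?:' + '|'.join(re.escape(word) for word in exclude) + ')$'
    return root.xpath(
        "/root/div[not(re:test(string(p/b[1]), $pattern))]",
        namespaces={'re': EXSLT_RE},
        pattern=pattern,
    )


root = etree.fromstring(content)
names = [div.xpath('string(p/b[1])') for div in keep_divs(root, ['Orange', 'Red'])]
print(names)
```

Passing the pattern as an XPath variable instead of formatting it into the expression also avoids problems when a word contains a quote character.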
