pyparsing - parsing XML comments

Posted on 2024-12-10 16:43:00

I need to parse a file containing XML comments. Specifically, it's a C# file using the MS /// convention.

From this I need to pull out foobar; /// foobar would be acceptable, too. (Note: this still doesn't work if you put the XML all on one line...)

testStr = """
    ///<summary>
    /// foobar
    ///</summary>
    """

Here is what I have:

import pyparsing as pp

_eol = pp.Literal("\n").suppress()
_cPoundOpenXmlComment = pp.Suppress('///<summary>') + pp.SkipTo(_eol)
_cPoundCloseXmlComment = pp.Suppress('///</summary>') + pp.SkipTo(_eol)
_xmlCommentTxt = ~_cPoundCloseXmlComment + pp.SkipTo(_eol)
xmlComment = _cPoundOpenXmlComment + pp.OneOrMore(_xmlCommentTxt) + _cPoundCloseXmlComment

match = xmlComment.scanString(testStr)

and to output:

for item,start,stop in match:
    for entry in item:
        print(entry)

But I haven't had much success getting the grammar to work across multiple lines.

(Note: I tested the above sample in Python 3.2; it runs without errors but, per my issue, does not print any values.)

Thanks!

Comments (3)

泛滥成性 2024-12-17 16:43:00

I think Literal('\n') is your problem. You don't want to build a Literal with whitespace characters (since Literals by default skip over whitespace before trying to match). Try using LineEnd() instead.
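The whitespace-skipping behaviour described above can be checked with a small experiment. This is only a sketch: the names `literal_eol`, `line_end` and `sample` are made up for illustration, and it assumes pyparsing's default whitespace characters are in effect.

```python
import pyparsing as pp

# Literal skips leading whitespace (including "\n") before matching,
# so by the time it looks for "\n" the newline has already been skipped.
literal_eol = pp.Literal("\n")

# LineEnd is designed to match at a newline.
line_end = pp.LineEnd()

sample = "foobar\nbaz"  # hypothetical input with a single newline

literal_hits = list(literal_eol.scanString(sample))
line_end_hits = list(line_end.scanString(sample))

print("Literal matches:", len(literal_hits))
print("First LineEnd match:", line_end_hits[0] if line_end_hits else None)
```

With default settings the Literal version should report no matches at all, while LineEnd finds the newline after foobar.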

EDIT 1:
Just because you get an infinite loop with LineEnd doesn't mean that Literal('\n') is any better. Try adding .setDebug() on the end of your _eol definition, and you'll see that it never matches anything.

Instead of trying to define the body of your comment as "one or more lines that are not a closing line, but get everything up to the end-of-line", what if you just do:

xmlComment = _cPoundOpenXmlComment + pp.SkipTo(_cPoundCloseXmlComment) + _cPoundCloseXmlComment 

(The reason you were getting an infinite loop with LineEnd() was that you were essentially doing OneOrMore(SkipTo(LineEnd())), but never consuming the LineEnd(), so the OneOrMore just kept matching and matching and matching, parsing and returning an empty string since the parsing position was at the end of line.)
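Put together, the SkipTo suggestion might look like the sketch below. The variable names and the `clean` helper are illustrative additions, not part of the original grammar; the helper strips the leading /// from each captured line, on the assumption that only the comment text is wanted.

```python
import pyparsing as pp

test_str = """
    ///<summary>
    /// foobar
    ///</summary>
    """

open_tag = pp.Suppress("///<summary>")
close_tag = pp.Suppress("///</summary>")

# SkipTo captures everything up to (but not including) the closing tag,
# so no per-line end-of-line handling is needed.
xml_comment = open_tag + pp.SkipTo(close_tag) + close_tag


def clean(raw):
    """Hypothetical helper: drop the leading '///' and blank lines."""
    lines = [line.strip().lstrip("/").strip() for line in raw.splitlines()]
    return " ".join(line for line in lines if line)


results = [clean(tokens[0]) for tokens, start, end in xml_comment.scanString(test_str)]
print(results)
```

Because the body is a single SkipTo, nothing loops on an unconsumed end of line, and the same grammar also handles a comment written on one line.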

我为君王 2024-12-17 16:43:00

How about using nestedExpr:

import pyparsing as pp

text = '''\
///<summary>
/// foobar
///</summary>
blah blah
///<summary> /// bar ///</summary>
///<summary>  ///<summary> /// baz  ///</summary> ///</summary>    
'''

comment = pp.nestedExpr("///<summary>", "///</summary>")
for match in comment.searchString(text):
    print(match)
    # [['///', 'foobar']]
    # [['///', 'bar']]
    # [[['///', 'baz']]]

梦巷 2024-12-17 16:43:00

You could use an XML parser to parse the XML. It should be easy to extract the relevant comment lines:

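import re
from xml.etree import ElementTree as etree  # cElementTree was removed in Python 3.9

# sample input from the question
text = """
    ///<summary>
    /// foobar
    ///</summary>
    """

# extract all /// lines
lines = re.findall(r'^\s*///(.*)', text, re.MULTILINE)

# parse xml
root = etree.fromstring('<root>%s</root>' % ''.join(lines))
print(root.findtext('summary').strip())
# -> foobar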
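
A real C# file usually contains several comment blocks separated by code. One possible extension, sketched here with a hypothetical helper named `extract_summaries`, is to group consecutive /// lines and parse each group separately. It assumes every block is well-formed XML; a malformed block would raise `xml.etree.ElementTree.ParseError`.

```python
import re
from xml.etree import ElementTree


def extract_summaries(source):
    """Return the text of every <summary> found in runs of /// lines."""
    summaries = []
    block = []
    # The extra empty line at the end flushes a block that ends the file.
    for line in source.splitlines() + [""]:
        match = re.match(r"\s*///(.*)", line)
        if match:
            block.append(match.group(1))
            continue
        if block:
            # Wrap the block in a dummy root so it parses as one document.
            root = ElementTree.fromstring("<root>%s</root>" % "\n".join(block))
            summaries.extend((node.text or "").strip() for node in root.iter("summary"))
            block = []
    return summaries


# Hypothetical sample with two documented members.
sample = """
    ///<summary>
    /// foobar
    ///</summary>
    void Foo() {}

    ///<summary>
    /// baz
    ///</summary>
    void Bar() {}
"""
print(extract_summaries(sample))
```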