Converting pyparsing.ParseResults back to an HTML string

Posted on 2024-10-21 00:36:35

I'm brand new to pyparsing.
How can I convert an instance of pyparsing.ParseResults back to an HTML string?

For example:

>>> type(gcdata)
<type 'unicode'>
>>> pat
{<"div"> SkipTo:(</"div">) </"div">}
>>> type(pat)
<class 'pyparsing.And'>
>>> 
>>> l = pat.searchString( gcdata  )
>>> l[0]
(['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False, u'<div class="shoveler-heading">\n    <p>Customers Who Bought This Item Also Bought</p>\n    \n', '</div>'], {'startDiv': [((['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False], {u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]}), 0)], 'endDiv': [('</div>', 5)], u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]})
>>> 
>>> type(l[0])
<class 'pyparsing.ParseResults'>
>>> 
>>> divhtml = foo (l[0])

So I need this function foo.
Any suggestions?

Comments (2)

甜妞爱困 2024-10-28 00:36:35

This is an issue with the expressions returned by makeHTMLTags: they do a lot of extra grouping and naming, which gets in your way if you just want the tag text.

Pyparsing includes the method originalTextFor to help address this. Building on the sample code from @samplebias:

from pyparsing import makeHTMLTags, SkipTo, originalTextFor
start, end = makeHTMLTags('div')
# before: anchor = start + SkipTo(end).setResultsName('body') + end
anchor = originalTextFor(start + SkipTo(end).setResultsName('body') + end)

By wrapping the expression in originalTextFor, all of the breakup of the tag into its component parts gets undone, and you just get back the text from the original string (also including any intervening whitespace). The default behavior is to just give you back this string, which has the unfortunate side effect of losing all of the results names, so getting back the parsed attribute values can be a hassle. When I wrote originalTextFor, I assumed that a string was what was wanted, and I could not attach results names to a string. So I added an optional parameter asString to originalTextFor which defaults to True, but if passed as False, will return a ParseResults containing just a single token of the entire matched string, plus all matched results names. So you could still extract res.id from the results, while res[0] would return you the entire matched HTML.
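Here is a minimal sketch of the asString=False behavior described above. The html_text sample and the div_expr name are made up for illustration, and the camelCase names assume a pyparsing release that still provides them (newer releases also offer snake_case equivalents such as original_text_for).

```python
from pyparsing import makeHTMLTags, SkipTo, originalTextFor

# Hypothetical sample input, loosely modeled on the question's data.
html_text = (
    '<body><div class="shoveler" id="purchaseShvl">'
    '<p>Also bought</p></div></body>'
)

start, end = makeHTMLTags("div")

# asString=False keeps the results names while collapsing the
# matched tokens into one string holding the original text.
div_expr = originalTextFor(start + SkipTo(end)("body") + end, asString=False)

res = div_expr.searchString(html_text)[0]

print(res[0])   # the whole matched HTML, taken from the input string
print(res.id)   # the parsed id attribute, still reachable by name
```

With the default asString=True, the match would come back as a plain string and res.id would not be available.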

Some other comments:

<div> is a very common tag, and one easily matched in error using just the tag returned by makeHTMLTags. It will match any div, and probably many you aren't really interested in. You can cut down the number of mismatches if you can specify some attribute that should also match, using withAttribute. You could do this with:

start.setParseAction(withAttribute(id="purchaseShvl"))

or

start.setParseAction(withAttribute(**{"class":"shoveler"}))

(Using 'class' as a filtering attribute is probably the most common thing you'll want to do, but since 'class' is also a Python keyword, you can't just use the named-argument form as I did with id, which is too bad.)
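To make the filtering concrete, here is a small sketch under assumed input. The html_text sample, with its unrelated "banner" div, is invented for illustration; only the second div carries the class we filter on.

```python
from pyparsing import makeHTMLTags, SkipTo, withAttribute

# Hypothetical input: one div we do not care about, one we do.
html_text = (
    '<div class="banner">ad</div>'
    '<div class="shoveler" id="purchaseShvl">wanted</div>'
)

start, end = makeHTMLTags("div")

# 'class' is a Python keyword, so it is passed through a dict.
# The parse action rejects any opening tag without this attribute value.
start.setParseAction(withAttribute(**{"class": "shoveler"}))

div_expr = start + SkipTo(end)("body") + end

# Collect the body text of every div that passed the attribute filter.
bodies = [match.body for match in div_expr.searchString(html_text)]
print(bodies)
```

Without the parse action, the same expression would also match the "banner" div.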

Lastly, because <div> is so common, it is also likely to be nested. Divs are frequently nested within divs, and plain SkipTo is not smart enough to take this into account. We see this when reconstructing your posted results:

<div class='shoveler' id='purchaseShvl'>
<div class='shoveler-heading'>
<p>Customers Who Bought This Item Also Bought</p>
</div>

The first terminating </div> ends the match for your expression. I suspect that you may need to expand your matching expression to take these additional divs into account, instead of using just plain SkipTo(end).
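One possible way to handle nesting is a recursive expression built with Forward. This is only a sketch under assumptions, not something taken from the question: the names nested_div and filler are hypothetical, the raw sample is invented, and real-world HTML (comments, scripts, unclosed tags) would need more care.

```python
from pyparsing import (
    Forward, Regex, ZeroOrMore, makeHTMLTags, originalTextFor,
)

# Hypothetical sample with one div nested inside another.
raw = """<body><div class="shoveler" id="purchaseShvl">
<p>Customers who bought this item also bought</p>
<div class="foo">
    <span class="bar">Shovel cozy</span>
</div>
</div></body>"""

div_start, div_end = makeHTMLTags("div")

# Filler is anything that is not the start or end of a div:
# either a lone '<' (from some other tag) or a run of non-'<' text.
filler = ~div_start + ~div_end + Regex(r"<|[^<]+")

# A div is an opening tag, any mix of nested divs and filler,
# and then its own closing tag.
nested_div = Forward()
nested_div <<= div_start + ZeroOrMore(nested_div | filler) + div_end

# originalTextFor hands back the matched source text in one piece.
outer = originalTextFor(nested_div).searchString(raw)[0][0]
print(outer)
```

Because the inner div is consumed as part of the outer match, the scan reports the outermost div once, with both closing tags included.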

帅哥哥的热头脑 2024-10-28 00:36:35

You would be much better off using an HTML parser that returns a DOM, like lxml.html, but I suspect you're doing this more to learn pyparsing. Since you didn't post a snippet of source code, I've taken a few guesses and made an example using pyparsing.makeHTMLTags, listed below.

import html  # cgi.escape was removed in Python 3.8; html.escape replaces it
from pyparsing import makeHTMLTags, SkipTo

raw = """<body><div class="shoveler" id="purchaseShvl">
<p>Customers who bought this item also bought</p>
<div class="foo">
    <span class="bar">Shovel cozy</span>
    <span class="bar">Shovel rack</span>
</div>
</div></body>"""

def foo(parseResult):
    # Rebuild each matched div from its parsed attributes, body, and end tag.
    parts = []
    for token in parseResult:
        # 'class' is a Python keyword, so it is read with getattr.
        st = '<div id="%s" class="%s">' % \
             (html.escape(getattr(token, 'id')),
             html.escape(getattr(token, 'class')))
        parts.append(st + token.body + token.endDiv)
    return '\n'.join(parts)

start, end = makeHTMLTags('div')
anchor = start + SkipTo(end).setResultsName('body') + end
res = anchor.searchString(raw)
print(foo(res))