Converting pyparsing.ParseResults back to an HTML string
I'm brand new to pyparsing.
How can I convert an instance of class pyparsing.ParseResults back to an HTML string?
For example:
>>> type(gcdata)
<type 'unicode'>
>>> pat
{<"div"> SkipTo:(</"div">) </"div">}
>>> type(pat)
<class 'pyparsing.And'>
>>>
>>> l = pat.searchString( gcdata )
>>> l[0]
(['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False, u'<div class="shoveler-heading">\n <p>Customers Who Bought This Item Also Bought</p>\n \n', '</div>'], {'startDiv': [((['div', ([u'class', u'shoveler'], {}), ([u'id', u'purchaseShvl'], {}), False], {u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]}), 0)], 'endDiv': [('</div>', 5)], u'class': [(u'shoveler', 1)], 'empty': [(False, 3)], u'id': [(u'purchaseShvl', 2)]})
>>>
>>> type(l[0])
<class 'pyparsing.ParseResults'>
>>>
>>> divhtml = foo (l[0])
So, I need this function foo.
Any suggestions?
Comments (2)
This is an issue with the expressions returned by makeHTMLTags: a lot of extra grouping and naming goes on, which gets in your way if you just want the tag text. Pyparsing includes the method originalTextFor to help address this. Building on the sample code from @samplebias, wrapping the expression in originalTextFor undoes all of the breakup of the tag into its component parts, and you just get back the text from the original string (also including any intervening whitespace).
The default behavior is to give you back just this string, which has the unfortunate side effect of losing all of the results names, so getting back the parsed attribute values can be a hassle. When I wrote originalTextFor, I assumed that a string was what was wanted, and I could not attach results names to a string. So I added an optional parameter asString to originalTextFor, which defaults to True; if passed as False, it returns a ParseResults containing just a single token of the entire matched string, plus all matched results names. So you could still extract res.id from the results, while res[0] would return the entire matched HTML.
Some other comments:
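To make the asString=False behaviour concrete, here is a minimal sketch. The sample HTML and the variable names are made up for illustration; the camelCase function names are the ones used in this thread (pyparsing 3 also offers snake_case equivalents such as original_text_for).

```python
from pyparsing import SkipTo, makeHTMLTags, originalTextFor

# Hypothetical sample input, loosely modelled on the question's data.
html = 'intro <div id="purchaseShvl" class="shoveler"><p>Also Bought</p></div> outro'

div, end_div = makeHTMLTags("div")

# asString=False keeps the results names while collapsing the tokens
# into a single string holding the original matched text.
pattern = originalTextFor(div + SkipTo(end_div) + end_div, asString=False)

res = pattern.searchString(html)[0]
div_html = res[0]   # the matched HTML, copied from the input string
div_id = res.id     # attribute value parsed by makeHTMLTags

print(div_html)
print(div_id)
```

Here div_html plays the role of the foo(l[0]) result asked for in the question, while the parsed attributes stay reachable by name.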
<div> is a very common tag, and one easily matched in error using just the tag returned by makeHTMLTags. It will match any div, and probably many you aren't really interested in. You can cut down the number of mismatches if you can specify some attribute that should also match, using withAttribute, for example by filtering on id or on class.
(Using 'class' as a filtering attribute is probably the most common thing you'll want to do, but since 'class' is also a Python keyword, you can't just use the named arguments form as I did with id, too bad.)
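As a rough illustration of attribute filtering, the sketch below attaches withAttribute as a parse action on the opening tag. The sample HTML and variable names are assumptions; the tuple form is one way to pass 'class', since class=... is not valid Python syntax.

```python
from pyparsing import makeHTMLTags, withAttribute

# Hypothetical input with two divs; only the second has the wanted attributes.
html = '<div id="header">x</div><div id="purchaseShvl" class="shoveler">y</div>'

# Filter on id using the named-argument form.
div_by_id, _ = makeHTMLTags("div")
div_by_id.addParseAction(withAttribute(id="purchaseShvl"))

# Filter on class using a (name, value) tuple, because "class" is a keyword.
div_by_class, _ = makeHTMLTags("div")
div_by_class.addParseAction(withAttribute(("class", "shoveler")))

ids = [tag.id for tag in div_by_id.searchString(html)]
classes = [tag["class"] for tag in div_by_class.searchString(html)]

print(ids)
print(classes)
```

The parse action raises a ParseException for any opening tag that lacks the attribute or carries a different value, so searchString simply skips those tags.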
Lastly, along with the commonness of <div> is the likelihood of nesting. divs are frequently nested within divs, and just plain SkipTo is not smart enough to take this into account. We see this when reconstructing your posted results: the first terminating </div> ends the match for your expression. I suspect that you may need to expand your matching expression to take into account these additional divs, instead of just plain SkipTo(end).
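The nesting problem can be reproduced with a small made-up input. This sketch only demonstrates the truncation; it does not attempt a nesting-aware expression, and the HTML and names are illustrative assumptions.

```python
from pyparsing import SkipTo, makeHTMLTags, originalTextFor

# Hypothetical nested input: an inner div inside an outer div.
html = '<div id="outer"><div id="inner">text</div> tail</div>'

div, end_div = makeHTMLTags("div")
naive = originalTextFor(div + SkipTo(end_div) + end_div)

results = naive.searchString(html)
first = results[0][0]

# SkipTo stops at the first closing tag it finds, which belongs to the
# inner div, so the outer match is cut short and " tail</div>" is left out.
print(first)
```

The match starts at the outer opening tag but ends at the inner closing tag, which is the behaviour described above.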
You would be much better off using an HTML parser which returns a DOM, like lxml.html, but I suspect you're doing this more to learn Pyparsing. Since you didn't post a snippet of source code, I've taken a few guesses and made an example using pyparsing.makeHTMLTags, listed below.
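A minimal sketch of what such an example might look like (not the answerer's original listing; the sample HTML, the "content" results name, and the variable names are made up for illustration). It uses scanString, which reports the start and end offsets of each match, so the matched HTML can be sliced straight out of the input.

```python
from pyparsing import SkipTo, makeHTMLTags

# Hypothetical page fragment standing in for the question's gcdata.
html = """<html><body>
<div id="purchaseShvl" class="shoveler"><p>Customers Also Bought</p></div>
</body></html>"""

div, end_div = makeHTMLTags("div")
pat = div + SkipTo(end_div)("content") + end_div

matches = []
for tokens, start, end in pat.scanString(html):
    matches.append({
        "id": tokens.id,             # attribute parsed from the opening tag
        "content": tokens.content,   # text skipped over by SkipTo
        "html": html[start:end],     # original text of the whole match
    })

for match in matches:
    print(match["id"], match["html"])
```

Slicing with the offsets avoids having to rebuild the markup from the parsed tokens.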