Using pyparsing to find a following tag
I'm using pyparsing to parse HTML. I'm grabbing all <embed> tags, but in some cases there's an <a> tag directly following that I also want to grab if it's available.

Example:
import pyparsing

# match <embed> start tags that carry a src attribute (any value),
# ignoring anything inside HTML comments
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br /><a href="blah">blah</a>
""")
I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.
EDIT:
Someone asked why I don't use BeautifulSoup. That's a good question; let me show you why I chose not to use it, with a code sample:
import BeautifulSoup
import urllib
import re
import socket

socket.setdefaulttimeout(3)

# get some random blogs and count how often BeautifulSoup chokes on them
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()
success, failure = 0.0, 0.0
for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass  # network trouble, not a parser failure
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1
print failure / (failure + success)
When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.
Comments (4)
If there is an optional <a> tag that would be interesting if it follows an <embed> tag, then add it to your search pattern:
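A minimal sketch of such a combined pattern, reusing makeHTMLTags from the question (embedTag, aTag, and target are illustrative names):

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]

# an <embed> tag optionally followed by an <a> tag; intervening markup
# (e.g. the </embed></object><br /> in the example) could be skipped
# with pyparsing.SkipTo(aTag) if needed
target = embedTag + pyparsing.Optional(aTag)

If you want to capture the character location of an expression within your parser, insert one of these, with a results name:

A sketch of the usual pyparsing idiom, a zero-width Empty element whose parse action returns the current location (locator is an illustrative name):

# a parse action receives (string, location, tokens); returning loc makes
# the otherwise-empty match carry the character offset as its token
locator = pyparsing.Empty().setParseAction(lambda s, loc, toks: loc)
target = locator("locStart") + embedTag + pyparsing.Optional(aTag) + locator("locEnd")

for tokens in target.searchString(html):
    print tokens.locStart, tokens.locEnd  # offsets into the input string

Alternatively, scanString yields (tokens, start, end) tuples directly, so the offsets come for free there.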
Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.
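For comparison, a minimal sketch of the HTMLParser route for this task, pairing each <embed> start tag with the next <a> start tag (EmbedGrabber and found are illustrative names):

from HTMLParser import HTMLParser

class EmbedGrabber(HTMLParser):
    # collect each <embed> tag and, for each, the next <a> tag if any
    def __init__(self):
        HTMLParser.__init__(self)
        self.found = []
    def handle_starttag(self, tag, attrs):
        if tag == 'embed':
            self.found.append({'embed': dict(attrs), 'a': None})
        elif tag == 'a' and self.found and self.found[-1]['a'] is None:
            self.found[-1]['a'] = dict(attrs)

parser = EmbedGrabber()
parser.feed('<object><embed src="movie.swf"></embed></object>'
            '<br /><a href="blah">blah</a>')
print parser.found

Note that HTMLParser raises HTMLParseError on sufficiently malformed markup, so it can hit the same robustness wall on messy real-world pages.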
You don't prefer using a normal regex? Or is it because it's a bad habit to parse HTML that way? :D
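For what it's worth, a rough sketch of that regex approach (the patterns are illustrative and fragile on real-world HTML, though match objects do expose the character offsets the question was after):

import re

html = '<object><embed src="movie.swf"></embed></object><br /><a href="blah">blah</a>'

embed_re = re.compile(r'<embed\b[^>]*>', re.IGNORECASE)
a_re = re.compile(r'<a\b[^>]*>', re.IGNORECASE)

embeds = list(embed_re.finditer(html))
for i, m in enumerate(embeds):
    # only pair an <a> that appears before the next <embed>
    window_end = embeds[i + 1].start() if i + 1 < len(embeds) else len(html)
    a_match = a_re.search(html, m.end(), window_end)
    print m.start(), a_match.group() if a_match else None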
I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a.
Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).