Using pyparsing to find a following tag
I'm using pyparsing to parse HTML. I'm grabbing all <embed> tags, but in some cases there's an <a> tag directly following that I also want to grab if it's available.

Example:
import pyparsing

# match <embed> start tags that carry a src attribute (any value),
# ignoring anything inside HTML comments
target = pyparsing.makeHTMLTags("embed")[0]
target.setParseAction(pyparsing.withAttribute(src=pyparsing.withAttribute.ANY_VALUE))
target.ignore(pyparsing.htmlComment)
result = target.searchString(""".....
<object....><embed>.....</embed></object><br /><a href="blah">blah</a>
""")
I haven't been able to find any character offset in the result objects, otherwise I could just grab a slice of the original input string and work from there.
EDIT:
Someone asked why I don't use BeautifulSoup. That's a good question; let me show you why I chose not to use it, with a code sample:
import BeautifulSoup
import urllib
import re
import socket

socket.setdefaulttimeout(3)

# get some random blogs and count how often BeautifulSoup chokes on them
xml = urllib.urlopen('http://rpc.weblogs.com/shortChanges.xml').read()
success, failure = 0.0, 0.0
for url in re.compile(r'\burl="([^"]+)"').findall(xml)[:30]:
    print url
    try:
        BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
    except IOError:
        pass  # network trouble, not a parser failure
    except Exception, e:
        print e
        failure += 1
    else:
        success += 1
print failure / (failure + success)
When I try this, BeautifulSoup fails with parse errors 20-30% of the time. These aren't rare edge cases. pyparsing is slow and cumbersome but it hasn't blown up no matter what I throw at it. If I can be enlightened as to a better way to use BeautifulSoup then I would be really interested in knowing that.
Comments (4)
If there is an optional <a> tag that would be interesting if it follows an <embed> tag, then add it to your search pattern:
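A minimal sketch of such a combined pattern, reusing makeHTMLTags from the question (embedTag, aTag, and target are illustrative names):

embedTag = pyparsing.makeHTMLTags("embed")[0]
aTag = pyparsing.makeHTMLTags("a")[0]

# an <embed> tag optionally followed by an <a> tag; intervening markup
# (e.g. the </embed></object><br /> in the example) could be skipped
# with pyparsing.SkipTo(aTag) if needed
target = embedTag + pyparsing.Optional(aTag)

If you want to capture the character location of an expression within your parser, insert one of these, with a results name:

A sketch of the usual pyparsing idiom, a zero-width Empty element whose parse action returns the current location (locator is an illustrative name):

# a parse action receives (string, location, tokens); returning loc makes
# the otherwise-empty match carry the character offset as its token
locator = pyparsing.Empty().setParseAction(lambda s, loc, toks: loc)
target = locator("locStart") + embedTag + pyparsing.Optional(aTag) + locator("locEnd")

for tokens in target.searchString(html):
    print tokens.locStart, tokens.locEnd  # offsets into the input string

Alternatively, scanString yields (tokens, start, end) tuples directly, so the offsets come for free there.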
Why would you write your own HTML parser? The standard library includes HTMLParser, and BeautifulSoup can handle any job HTMLParser can't.
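For comparison, a minimal sketch of the HTMLParser route for this task, pairing each <embed> start tag with the next <a> start tag (EmbedGrabber and found are illustrative names):

from HTMLParser import HTMLParser

class EmbedGrabber(HTMLParser):
    # collect each <embed> tag and, for each, the next <a> tag if any
    def __init__(self):
        HTMLParser.__init__(self)
        self.found = []
    def handle_starttag(self, tag, attrs):
        if tag == 'embed':
            self.found.append({'embed': dict(attrs), 'a': None})
        elif tag == 'a' and self.found and self.found[-1]['a'] is None:
            self.found[-1]['a'] = dict(attrs)

parser = EmbedGrabber()
parser.feed('<object><embed src="movie.swf"></embed></object>'
            '<br /><a href="blah">blah</a>')
print parser.found

Note that HTMLParser raises HTMLParseError on sufficiently malformed markup, so it can hit the same robustness wall on messy real-world pages.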
You don't prefer using a normal regex? Or is it because it's a bad habit to parse HTML that way? :D
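For what it's worth, a rough sketch of that regex approach (the patterns are illustrative and fragile on real-world HTML, though match objects do expose the character offsets the question was after):

import re

html = '<object><embed src="movie.swf"></embed></object><br /><a href="blah">blah</a>'

embed_re = re.compile(r'<embed\b[^>]*>', re.IGNORECASE)
a_re = re.compile(r'<a\b[^>]*>', re.IGNORECASE)

embeds = list(embed_re.finditer(html))
for i, m in enumerate(embeds):
    # only pair an <a> that appears before the next <embed>
    window_end = embeds[i + 1].start() if i + 1 < len(embeds) else len(html)
    a_match = a_re.search(html, m.end(), window_end)
    print m.start(), a_match.group() if a_match else None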
I was able to run your BeautifulSoup code and received no errors. I'm running BeautifulSoup 3.0.7a.
Please use BeautifulSoup 3.0.7a; 3.1.0.1 has bugs that prevent it from working at all in some cases (such as yours).