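Replacing BeautifulSoup with another (standard) HTML parsing module in this Python script
I have made a script with BeautifulSoup which works fine and is very readable, but I want to redistribute it some day, and BeautifulSoup is an external dependency I would like to avoid, especially considering Windows use.
Here is the code. It gets every user map link from a given Google Maps user. The lines marked with #### are the ones using BeautifulSoup: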
# coding: utf-8
import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source) ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$')) #################
    for table in maptables:
        for line in table.findAll('a', 'maptitle'): ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1
            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')
    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
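As you can see, there are just three lines of code using BSoup, but I am not a programmer and I had a lot of difficulty trying to use other standard HTML and XML parsing tools, probably because I approached them the wrong way.
EDIT: This question is more about replacing the three lines of code in this script than about finding a way to solve generic HTML parsing problems.
Any help will be much appreciated. Thanks for reading!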
Comments (3)
Unfortunately, Python does not have useful HTML parsing in the standard library, so the only reasonable way to parse HTML is to use a third-party module such as lxml.html or BeautifulSoup. This does not mean that you have to have a separate dependency: these modules are free software, and if you do not want an external dependency, you are welcome to bundle them with your code, which then makes them no more of a dependency than the code you write yourself.

To parse HTML code, I see three solutions:

I have tried this code (see below) and it prints a list of links. As I do not have Beautiful Soup installed and do not want to install it, it is very difficult for me to check the results against what your code gives.
The "pure" Python code without any "soup" is even shorter and more readable.
Anyway, here it is. Tell me what you think! Friendly, Louis.
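
The code posted with this comment is not preserved here. As a separate illustration, here is a minimal sketch of how the three BeautifulSoup lines might be replaced using only the standard library's html.parser module (Python 3; in Python 2 the module is named HTMLParser). The names MapLinkParser and find_map_links and the sample markup are hypothetical, and the sketch assumes the map links are <a class="maptitle"> tags nested inside elements whose id matches map[0-9]+, as the question's script suggests. It has not been checked against the real page.

```python
import re
from html.parser import HTMLParser  # Python 2 names this module HTMLParser

MAP_ID = re.compile(r'^map[0-9]+$')


class MapLinkParser(HTMLParser):
    """Collect (href, text) for <a class="maptitle"> inside id="mapN" elements.

    Hypothetical helper, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.links = []         # finished (href, text) pairs
        self._container = None  # tag name of the enclosing id="mapN" element
        self._depth = 0         # how many tags of that name are currently open
        self._href = None       # href of the maptitle link being read, if any
        self._text = []         # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._container is None:
            # Outside any map container: only look for a matching id.
            if MAP_ID.match(attrs.get('id') or ''):
                self._container, self._depth = tag, 1
            return
        if tag == self._container:
            self._depth += 1  # e.g. a table nested inside the map table
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'maptitle' in classes:
            self._href = attrs.get('href') or ''
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None
        if tag == self._container:
            self._depth -= 1
            if self._depth == 0:
                self._container = None  # left the map container


def find_map_links(source):
    """Return the (href, text) pairs found in an HTML string."""
    parser = MapLinkParser()
    parser.feed(source)
    parser.close()
    return parser.links


# Made-up markup for illustration; the real page layout is an assumption.
sample = '''
<table id="map1"><tr><td>
  <a class="maptitle" href="/maps/ms?msid=123.abc">First map</a>
  <a class="other" href="/x">skip me</a>
</td></tr></table>
<div id="mapfoo"><a class="maptitle" href="/y">not a map table</a></div>
'''
print(find_map_links(sample))  # prints [('/maps/ms?msid=123.abc', 'First map')]
```

In the original loop, each (href, text) pair would stand in for str(line): the map id could be pulled from the href and the map name taken from the text, assuming the id really does appear in the href as the script's regular expression implies.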