Replace BeautifulSoup with another (standard) HTML parsing module in this Python script

Posted 2024-12-02 23:54:32

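I wrote a script with BeautifulSoup that works fine and is very readable, but I would like to redistribute it some day, and BeautifulSoup is an external dependency I would rather avoid, especially considering Windows use.

Here is the code. It gets every user map link from a given Google Maps user. The lines marked with #### are the ones that use BeautifulSoup: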

# coding: utf-8

import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)  ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))  #################
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):  ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1

            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')


    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break

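As you can see, only three lines of code use BeautifulSoup, but I am not a programmer, and I had a lot of difficulty trying to use other standard HTML and XML parsing tools, probably because I went about it the wrong way.

EDIT: This question is more about replacing those three lines in this script than about solving HTML parsing problems in general.

Any help will be much appreciated. Thanks for reading!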



Comments (3)

尝蛊 2024-12-09 23:54:32

Unfortunately, Python does not have useful HTML parsing in the standard library, so the only reasonable way to parse HTML is to use a third-party module such as lxml.html or BeautifulSoup. That does not mean you need a separate dependency: these modules are free software, and if you do not want an external dependency, you are welcome to bundle them with your code, which makes them no more of a dependency than the code you write yourself.
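One way to bundle a module is sketched below, under the assumption that it is pure Python (a single file or package that can simply be copied next to the script; a module with compiled extensions would not work this way). The folder name `vendor` and the helper `add_vendor_dir` are hypothetical names chosen for illustration.

```python
import os
import sys


def add_vendor_dir(script_file, dirname="vendor"):
    """Put <script folder>/<dirname> at the front of sys.path and return it."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    vendor_dir = os.path.join(script_dir, dirname)
    # Insert only once so repeated calls do not grow sys.path
    if vendor_dir not in sys.path:
        sys.path.insert(0, vendor_dir)
    return vendor_dir


# Typical use at the top of the script, before the third-party import
# (assumes a copy of the module was placed in the "vendor" folder):
#   add_vendor_dir(__file__)
#   from BeautifulSoup import BeautifulSoup as bs
```

With this layout the script and the `vendor` folder are shipped together, and users do not have to install anything separately.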

迷离° 2024-12-09 23:54:32

To parse HTML, I see three options:

  • use simple string search (.find(), ...), which is fast
  • use regular expressions (regex)
  • use HTMLParser
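
The third option could replace the three BeautifulSoup lines roughly as sketched here. This is only a sketch: the page structure (links with class `maptitle` inside elements whose id looks like `map<number>`) is inferred from the `findAll` calls in the question, and `MapLinkParser` and `find_map_links` are hypothetical names. The import is written so that it works with both the Python 2 module name and the Python 3 one.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, which the question's script targets


class MapLinkParser(HTMLParser):
    """Collect (href, text) for <a class="maptitle"> links inside id="mapN" elements."""

    MAP_ID = re.compile(r'^map[0-9]+$')

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []        # finished (href, text) pairs
        self._box_tag = None   # tag name of the enclosing id="mapN" element
        self._box_depth = 0    # nesting depth of that tag name
        self._href = None      # href of the link being read, if any
        self._text = []        # text chunks of the link being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._box_tag is None:
            # Outside any map element: only look for one that opens here
            if self.MAP_ID.match(attrs.get('id') or ''):
                self._box_tag, self._box_depth = tag, 1
            return
        if tag == self._box_tag:
            self._box_depth += 1  # same tag nested inside the map element
        elif tag == 'a' and 'maptitle' in (attrs.get('class') or '').split():
            self._href = attrs.get('href') or ''
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._box_tag is None:
            return
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None
        elif tag == self._box_tag:
            self._box_depth -= 1
            if self._box_depth == 0:
                self._box_tag = None  # left the map element


def find_map_links(source):
    """Return a list of (href, link text) pairs found in the HTML string."""
    parser = MapLinkParser()
    parser.feed(source)
    parser.close()
    return parser.links
```

In the question's loop, `find_map_links(source)` would stand in for the `soup.findAll` calls; the map id could then be taken from each href with the same `uid + '\.'` regular expression, and the link text used as the map name. Whether the trailing `[:-3]` trim is still needed depends on the real page markup, which is not shown here.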

澜川若宁 2024-12-09 23:54:32

I tried the code below, and it prints a list of links. Since I do not have BeautifulSoup installed and do not want to install it, it is hard for me to check the results against what your code gives.
The "pure" Python code without any "soup" is even shorter and more readable.
Anyway, here it is. Tell me what you think! Regards, Louis.

#coding: utf-8

import urllib, re

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    # Visit each <a ...maptitle...>...</a> element once and extract its id and title.
    # Unlike the BeautifulSoup version, this does not check the enclosing id="mapN" element.
    for line in re.findall(r'<a[^>]*maptitle[^>]*>.*?</a>', source, re.S):
        mapid = re.search(uid+r'\.([^"]*)', line).group(1)
        mapname = re.search('>(.*)</a>', line).group(1).strip()[:-3]
        print shown, mapid, '\t', mapname
        shown += 1
        urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) + '&msa=0&output=kml', mapname + '.kml')

    # Move on to the next page of results while a "Next" link is present
    if '<span>Next</span>' in source:
        start += 5
    else:
        break