Malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

Posted 2024-09-08 13:00:25

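I just installed python, mplayer, beautifulsoup, and sipie so that I can run Sirius on my Ubuntu 10.04 machine. I followed some seemingly straightforward documentation but ran into a problem. I'm not very familiar with Python, so this may be beyond me.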

I was able to install everything, but running sipie gives the following:

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

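I looked at the files and line numbers mentioned, but since I'm not familiar with Python, they didn't make much sense to me. Any suggestions on what to do next?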


绿阴红影里的.如风往事 2024-09-15 13:00:25

Assuming you are using BeautifulSoup4, I found something about this in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you’re using a version of Python 2 earlier than 2.7.3, or a version
of Python 3 earlier than 3.2.2, it’s essential that you install lxml
or html5lib–Python’s built-in HTML parser is just not very good in
older versions.

I tried this, as @Joshua did, and it works well:

soup = BeautifulSoup(r.text, 'html5lib')
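
As a rough sketch of the parser choice described in the quoted documentation, the snippet below picks the first lenient parser that happens to be installed and otherwise falls back to the built-in one. The helper name pick_parser is made up for illustration, and the snippet assumes Python 3; the strings "html5lib", "lxml", and "html.parser" are the parser names bs4 accepts as its second argument.

```python
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first installed parser name, else bs4's built-in 'html.parser'.

    pick_parser is a hypothetical helper, not part of bs4.
    """
    for name in preferred:
        # find_spec returns None when the top-level module is not installed
        if find_spec(name) is not None:
            return name
    return "html.parser"


print(pick_parser())

# Usage, assuming bs4 is installed and `markup` holds the page source:
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(markup, pick_parser())
```

Passing the parser name explicitly also keeps the behaviour the same from one machine to the next, since bs4 otherwise picks whichever parser it finds.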


决绝 2024-09-15 13:00:25

The issues you are encountering are pretty common, and they stem specifically from malformed HTML. In my case, an HTML element had double-quoted an attribute's value. I actually ran into this issue today and came across your post in doing so. I was finally able to resolve it by parsing the HTML through html5lib before handing it off to BeautifulSoup 4.

First off, you'll need to:

sudo easy_install bs4
sudo apt-get install python-html5lib

Then, run this example code:

from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib

url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)

# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)

# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)

for content in soup.findAll('div'):
    print content

If you have any questions about this code or need a little more specific guidance, just let me know. :)
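
To find the markup that trips the parser in the first place, the position at the end of the error message ("at line 100, column 3" in the question) can be used to pull out the offending line. This is only a sketch: locate_error is a hypothetical helper, the sample markup is invented to mimic a doubled-quote attribute, and the column is assumed to be 1-based.

```python
import re


def locate_error(markup, message):
    """Return (line_text, caret_line) for a message like '..., at line N, column M'.

    Returns None when the message has no position or the line is out of range.
    """
    match = re.search(r"at line (\d+), column (\d+)", message)
    if match is None:
        return None
    line_no, column = int(match.group(1)), int(match.group(2))
    lines = markup.splitlines()
    if not 1 <= line_no <= len(lines):
        return None
    # Assumption: the reported column is 1-based, so the caret sits at column - 1
    return lines[line_no - 1], " " * max(column - 1, 0) + "^"


# Invented sample input; in practice `markup` would be the downloaded page source.
sample = '<html>\n  <a href=""x"">\n</html>'
found = locate_error(sample, "malformed start tag, at line 2, column 3")
if found is not None:
    print(found[0])
    print(found[1])
```

Running this against the real page source and the real error message shows which tag needs attention.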


公布 2024-09-15 13:00:25

Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (because SGMLParser was removed from the standard library in Python 3.0). As a result, BeautifulSoup can no longer process many malformed HTML documents correctly, which I believe is what you are encountering here.

A likely solution to your problem is to uninstall BeautifulSoup and install an older version (which still works with Python 2.6 on Ubuntu 10.04 LTS):

sudo apt-get remove python-beautifulsoup
sudo easy_install -U "BeautifulSoup==3.0.7a"

Just be aware that this temporary solution will not work with Python 3.0 (which may become the default in future versions of Ubuntu).
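
To check which side of that change an install is on, the version can be read straight from the egg path shown in the traceback. The sketch below uses two hypothetical helper names and assumes the switch to HTMLParser happened with the 3.1 series; the second path is an invented example of what the 3.0.7a egg might look like.

```python
import re


def bs3_version_from_path(path):
    """Return the BeautifulSoup 3 version embedded in an egg path as a tuple of ints."""
    match = re.search(r"BeautifulSoup-(\d+(?:\.\d+)*)", path)
    if match is None:
        return None
    # A trailing letter such as the "a" in 3.0.7a is ignored by the pattern
    return tuple(int(part) for part in match.group(1).split("."))


def uses_htmlparser(version):
    # Assumption: 3.1 and later are built on HTMLParser, 3.0.x on SGMLParser
    return version >= (3, 1)


traceback_path = (
    "/usr/local/lib/python2.6/dist-packages/"
    "BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py"
)
downgraded_path = "BeautifulSoup-3.0.7a-py2.6.egg"  # hypothetical path after the downgrade

for path in (traceback_path, downgraded_path):
    version = bs3_version_from_path(path)
    print(version, uses_htmlparser(version))
```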


静若繁花 2024-09-15 13:00:25

Command Line:

$ pip install beautifulsoup4
$ pip install html5lib

Python 3:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'http://www.example.com'
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html5lib')
links = soup.find_all('a')

for link in links:
    # .get() returns None instead of raising KeyError when an <a> has no href
    print(link.string, link.get('href'))
