Malformed start tag error - Python, BeautifulSoup, and Sipie - Ubuntu 10.04

Posted on 2024-09-08 13:00:25

I just installed Python, mplayer, BeautifulSoup, and Sipie to run Sirius on my Ubuntu 10.04 machine. I followed some docs that seemed straightforward, but I'm running into problems. I'm not very familiar with Python, so this may be out of my league.

I was able to get everything installed, but running sipie gives this:

/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3

I looked through these files at the line numbers shown, but since I'm unfamiliar with Python, they don't make much sense to me. Any advice on what to do next?

Comments (5)

Assuming you are using BeautifulSoup 4, the official documentation has something to say about this: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you’re using a version of Python 2 earlier than 2.7.3, or a version
of Python 3 earlier than 3.2.2, it’s essential that you install lxml
or html5lib–Python’s built-in HTML parser is just not very good in
older versions.

I tried this and it works well, just as @Joshua suggested:

soup = BeautifulSoup(r.text, 'html5lib')
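If you are not sure which of the parsers named in that quote are installed, a small check like the following can choose one before calling BeautifulSoup. This is only a sketch for Python 3: `pick_parser` is a hypothetical helper name (not part of bs4), and it assumes the parser names "html5lib", "lxml", and "html.parser" are the ones you would pass as the second argument to BeautifulSoup.

```python
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml")):
    """Return the first installed parser from `preferred`.

    Falls back to "html.parser", the name bs4 uses for Python's built-in parser.
    `pick_parser` is a hypothetical helper, not a bs4 function.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not installed
        if find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
print(parser_name)
# With bs4 installed, the result would be used as:
#   soup = BeautifulSoup(markup, parser_name)
```

Putting html5lib first mirrors the answer above; swap the order if you would rather prefer lxml.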
决绝 2024-09-15 13:00:25

The issues you are encountering are pretty common, and they stem specifically from malformed HTML. In my case, an HTML element had an attribute value wrapped in doubled quotes. I actually ran into this issue today, and in doing so came across your post. I was FINALLY able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.

First off, you'll need to:

sudo easy_install bs4
sudo apt-get install python-html5lib

Then, run this example code:

# Python 2 code, written against an older html5lib release; the
# "beautifulsoup" tree builder and the sanitizer tokenizer used below
# may not be available in newer html5lib versions.
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib

url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)

# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib (fp is the handle opened above)
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)

# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)

# Print every <div> found in the cleaned-up document.
for content in soup.findAll('div'):
    print content

If you have any questions about this code or need a little more specific guidance, just let me know. :)

公布 2024-09-15 13:00:25

Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (because SGMLParser was removed from the Python 3.0 standard library). As a result, BeautifulSoup can no longer process many malformed HTML documents correctly, which is what I believe you are encountering here.

A likely solution to your problem is to uninstall BeautifulSoup and install an older version (which will still work with Python 2.6 on Ubuntu 10.04 LTS):

sudo apt-get remove python-beautifulsoup
sudo easy_install -U "BeautifulSoup==3.0.7a"

Just be aware that this is a temporary solution: it will not work with Python 3.0 (which may become the default in future versions of Ubuntu).
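After the downgrade it can help to confirm which BeautifulSoup 3 release Python is actually importing, since the traceback in the question shows 3.1.0.1 installed as an egg under /usr/local. The following is a minimal sketch: `beautifulsoup3_version` is a hypothetical helper name, and it assumes the BeautifulSoup 3 module exposes a `__version__` attribute.

```python
def beautifulsoup3_version():
    """Return the installed BeautifulSoup 3 version string, or None.

    Hypothetical helper. BeautifulSoup 3 is imported as `BeautifulSoup`;
    BeautifulSoup 4 lives in the separate `bs4` package and is not checked here.
    """
    try:
        import BeautifulSoup
    except (ImportError, SyntaxError):
        # Not installed, or a Python 2-only release that this interpreter cannot load
        return None
    # Assumption: the module defines __version__; fall back to None if it does not
    return getattr(BeautifulSoup, "__version__", None)


version = beautifulsoup3_version()
if version is None:
    print("BeautifulSoup 3 is not importable from this interpreter")
else:
    print("BeautifulSoup version: %s" % version)
```

If this still reports a 3.1.x release after reinstalling, the old egg is probably still ahead of the new one on the import path.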

静若繁花 2024-09-15 13:00:25

Command Line:

$ pip install beautifulsoup4
$ pip install html5lib

Python 3:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'http://www.example.com'
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html5lib')
links = soup.find_all('a')

for link in links:
    # .get() returns None instead of raising KeyError when an <a> has no href
    print(link.string, link.get('href'))
如果没有 2024-09-15 13:00:25

Look at line 100, column 3 of the "data" that is passed to BeautifulSoup in File "/usr/bin/Sipie/Sipie/Factory.py", line 298. That is the spot in the downloaded HTML where the parser gave up.
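One way to do that is to dump the text around the reported position. The sketch below assumes you have captured the contents of "data" as a string (for example by temporarily saving it to a file just before the BeautifulSoup call). `show_position` is a hypothetical helper, and it assumes the column in the error message is 1-based, as the line number is.

```python
def show_position(text, line_no, col_no, context=2):
    """Return the lines around (line_no, col_no) with a caret under the column.

    Both line_no and col_no are treated as 1-based (an assumption about
    how the error message counts columns).
    """
    lines = text.splitlines()
    start = max(line_no - 1 - context, 0)
    end = min(line_no + context, len(lines))
    out = []
    for idx in range(start, end):
        # The "%4d: " prefix is 6 characters wide
        out.append("%4d: %s" % (idx + 1, lines[idx]))
        if idx + 1 == line_no:
            out.append(" " * (6 + col_no - 1) + "^")
    return "\n".join(out)


# Made-up sample markup with a doubled-quote attribute on line 3
sample = '<html>\n<body>\n<a href=""x"">link</a>\n</body>\n</html>'
print(show_position(sample, 3, 4))
```

For the error in the question you would call show_position(data, 100, 3) and inspect the tag that starts at or just before the caret.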
