BeautifulSoup parser appending a semicolon to naked ampersands, breaking URLs?

Posted 2024-12-01 11:30:21


I am trying to parse a site in Python that contains links to other sites, but in plain text rather than in "a" tags. Using BeautifulSoup I get the wrong answer. Consider this code:

import BeautifulSoup  # BeautifulSoup 3 (Python 2; predates the bs4 package)

html = """<html>
            <head>
              <title>Test html</title>
            </head>
            <body>
              <div>
                example.com/a.php?b=2&c=15
              </div>
            </body>
          </html>"""

parsed = BeautifulSoup.BeautifulSoup(html)
print parsed

When I run the above code I get the following output:

<html>
  <head>
    <title>Test html</title>
  </head>
  <body>
    <div>
      example.com/a.php?b=2&c;=15
    </div>
  </body>
</html>

Notice the link in the "div" and the part b=2&c;=15. It's different from the original HTML. Why is BeautifulSoup messing with the links in such a way? Is it trying to automagically create HTML entities? How do I prevent this?


Comments (1)

星星的轨迹 2024-12-08 11:30:21


Apparently BS has an underdocumented issue parsing ampersands inside URLs; I just searched their discussion forum for 'semicolon'. According to that discussion from 2009, a naked & is strictly not valid and must be replaced by &amp;, although browsers accept it, so this seems way pedantic.

I agree this parsing behavior is bogus, and you should contact their list to ask them to at least document this better as a known issue, and fix it in future.

Workaround: Anyway, your workaround will most likely be re.sub(...) to capture and expand & -> &amp; only inside URLs. Possibly you need a reverse function to compress them in the output. You'll need a fancier regex to capture only the ampersands inside URLs (see the sketch after the code below), but anyway:

import re
import BeautifulSoup

# Minimal string to tickle this
#html = "<html>example.com/a.php?b=2&c=15&d=42</html>"
html = "<html>example.com/a.php?b=2&c=15&d=29&e=42</html>"

# Expand naked ampersands before handing the markup to BeautifulSoup
html = re.sub(r'&(?!amp;)', r'&amp;', html)

parsed = BeautifulSoup.BeautifulSoup(html)
>>> print parsed.text.encode('utf-8')
example.com/a.php?b=2&amp;c=15&amp;d=29&amp;e=42

# Reverse step: compress the entities back in the output
>>> re.sub(r'&amp;', r'&', parsed.text.encode('utf-8'))
'example.com/a.php?b=2&c=15&d=29&e=42'
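
One step toward that fancier regex: broaden the lookahead so that anything already shaped like a named entity or a numeric character reference is left alone, which avoids double-escaping markup that mixes bare ampersands with real entities. A minimal sketch of the pre/post-processing pair under the question's Python 2 + BeautifulSoup 3 setup; the helper names and the lookahead pattern are illustrative additions, not part of the original answer:

import re
import BeautifulSoup

def escape_naked_amps(markup):
    # Expand only ampersands that do not already start a named entity
    # (e.g. &amp;) or a numeric character reference (e.g. &#160;).
    # A heuristic, not a full HTML entity grammar.
    return re.sub(r'&(?![a-zA-Z]+;|#[0-9]+;)', '&amp;', markup)

def unescape_amps(text):
    # Reverse step: compress the entities back to naked ampersands.
    return text.replace('&amp;', '&')

html = "<html>example.com/a.php?b=2&c=15&d=29&e=42</html>"
parsed = BeautifulSoup.BeautifulSoup(escape_naked_amps(html))
print unescape_amps(parsed.text)
# example.com/a.php?b=2&c=15&d=29&e=42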

There may be other more BS-thonic approaches.
You may want to help test the 4.0 beta.
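
Following up on that suggestion: the 4.x rewrite now ships as the beautifulsoup4 package (imported as bs4) and delegates tokenizing to a real HTML parser instead of SGMLParser. With Python's built-in html.parser backend, an unrecognized reference like &c is left alone, so the URL survives with no pre-escaping. A minimal sketch, assuming beautifulsoup4 is installed; output may differ slightly with other backends such as lxml or html5lib:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><div>example.com/a.php?b=2&c=15</div></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.div.get_text(strip=True))
# example.com/a.php?b=2&c=15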
