BeautifulSoup parser appending a semicolon to naked ampersands, breaking URLs?

Posted 2024-12-01 11:30:21


I am trying to parse a site in Python that contains links to other sites, but in plain text rather than in "a" tags. Using BeautifulSoup I get the wrong answer. Consider this code:

import BeautifulSoup  # BeautifulSoup 3 (Python 2; predates the bs4 package)

html = """<html>
            <head>
              <title>Test html</title>
            </head>
            <body>
              <div>
                example.com/a.php?b=2&c=15
              </div>
            </body>
          </html>"""

parsed = BeautifulSoup.BeautifulSoup(html)
print parsed

When I run the above code I get the following output:

<html>
  <head>
    <title>Test html</title>
  </head>
  <body>
    <div>
      example.com/a.php?b=2&c;=15
    </div>
  </body>
</html>

Notice the link in the "div" and the part b=2&c;=15. It's different from the original HTML. Why is BeautifulSoup messing with the links in such a way? Is it trying to automagically create HTML entities? How do I prevent this?


Comments (1)

星星的轨迹 2024-12-08 11:30:21


Apparently BS has an underdocumented issue parsing ampersands inside URLs; I just searched their discussion forum for 'semicolon'. According to that discussion from 2009, a naked & is strictly not valid and must be replaced by &amp;, although browsers accept it, so this seems way pedantic.

I agree this parsing behavior is bogus, and you should contact their list to ask them to at least document this better as a known issue, and fix it in future.

Workaround: Anyway, your workaround will most likely be re.sub(...) to capture and expand & -> &amp; only inside URLs. Possibly you need a reverse function to compress them in the output. You'll need a fancier regex to capture only the ampersands inside URLs (see the sketch after the code below), but anyway:

import re
import BeautifulSoup

# Minimal string to tickle this
#html = "<html>example.com/a.php?b=2&c=15&d=42</html>"
html = "<html>example.com/a.php?b=2&c=15&d=29&e=42</html>"

# Expand naked ampersands before handing the markup to BeautifulSoup
html = re.sub(r'&(?!amp;)', r'&amp;', html)

parsed = BeautifulSoup.BeautifulSoup(html)
>>> print parsed.text.encode('utf-8')
example.com/a.php?b=2&amp;c=15&amp;d=29&amp;e=42

# Reverse step: compress the entities back in the output
>>> re.sub(r'&amp;', r'&', parsed.text.encode('utf-8'))
'example.com/a.php?b=2&c=15&d=29&e=42'
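
One step toward that fancier regex: broaden the lookahead so that anything already shaped like a named entity or a numeric character reference is left alone, which avoids double-escaping markup that mixes bare ampersands with real entities. A minimal sketch of the pre/post-processing pair under the question's Python 2 + BeautifulSoup 3 setup; the helper names and the lookahead pattern are illustrative additions, not part of the original answer:

import re
import BeautifulSoup

def escape_naked_amps(markup):
    # Expand only ampersands that do not already start a named entity
    # (e.g. &amp;) or a numeric character reference (e.g. &#160;).
    # A heuristic, not a full HTML entity grammar.
    return re.sub(r'&(?![a-zA-Z]+;|#[0-9]+;)', '&amp;', markup)

def unescape_amps(text):
    # Reverse step: compress the entities back to naked ampersands.
    return text.replace('&amp;', '&')

html = "<html>example.com/a.php?b=2&c=15&d=29&e=42</html>"
parsed = BeautifulSoup.BeautifulSoup(escape_naked_amps(html))
print unescape_amps(parsed.text)
# example.com/a.php?b=2&c=15&d=29&e=42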

There may be other more BS-thonic approaches.
You may want to help test the 4.0 beta.
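
Following up on that suggestion: the 4.x rewrite now ships as the beautifulsoup4 package (imported as bs4) and delegates tokenizing to a real HTML parser instead of SGMLParser. With Python's built-in html.parser backend, an unrecognized reference like &c is left alone, so the URL survives with no pre-escaping. A minimal sketch, assuming beautifulsoup4 is installed; output may differ slightly with other backends such as lxml or html5lib:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><div>example.com/a.php?b=2&c=15</div></html>"
soup = BeautifulSoup(html, "html.parser")
print(soup.div.get_text(strip=True))
# example.com/a.php?b=2&c=15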
