BeautifulSoup bs.find_all('a') not working on a webpage

Could someone please explain whether there is a way to scrape links from this webpage https://hackmd.io/@nearly-learning/near-201 using BeautifulSoup, or is it only possible with Selenium?

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://hackmd.io/@nearly-learning/near-201'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'lxml')  # also tried all other parsers
links = bs.find_all('a')  # only obtains 23 links, when there are actually loads more
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

This only obtains a few links, and none of them are in the actual body of the article.

I am however able to do it with Selenium:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://hackmd.io/@nearly-learning/near-201")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

But I would like to use BeautifulSoup if possible! Does anyone know if it is?
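(Side note: Selenium 4 removed the find_elements_by_* helpers used in the snippet above, so on a recent install the equivalent lookup would look like this:)

from selenium.webdriver.common.by import By

# Selenium 4 style: pass the locator strategy explicitly via the By class
elems = driver.find_elements(By.XPATH, "//a[@href]")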

Comments (2)

从﹋此江山别 2025-01-25 04:15:52

If you don't want to use Selenium, you can use the Markdown package to render the raw Markdown text to HTML and parse it with BeautifulSoup:

import markdown  # pip install Markdown
import requests
from bs4 import BeautifulSoup

# 1. get raw markdown text
url = "https://hackmd.io/@nearly-learning/near-201"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
md_txt = soup.select_one("#doc").text

# 2. render the markdown to HTML
html = markdown.markdown(md_txt)

# 3. parse it again and find all <a> links
soup = BeautifulSoup(html, "html.parser")

for a in soup.select("a[href]"):
    print(a["href"])

Prints:

https://cdixon.org/2018/02/18/why-decentralization-matters
https://docs.near.org/docs/concepts/gas#ballpark-comparisons-to-ethereum
https://docs.near.org/docs/roles/integrator/exchange-integration#blocks-and-finality
https://docs.near.org/docs/concepts/architecture/papers
https://explorer.near.org/nodes/validators
https://explorer.near.org/stats
https://docs.near.org/docs/develop/contracts/rust/intro
https://docs.near.org/docs/develop/contracts/as/intro
https://docs.near.org/docs/api/rpc

...and so on.
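
If you only need the link targets, a regex over the raw Markdown source would also do the job without the HTML round-trip. A rough sketch reusing md_txt from the snippet above; it only catches inline [text](url) links, not reference-style links or bare URLs:

import re

# Naive fallback: pull inline Markdown link targets straight from the source.
# Misses reference-style links and autolinked bare URLs.
for href in re.findall(r'\[[^\]]*\]\(([^)\s]+)', md_txt):
    print(href)
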
寄与心 2025-01-25 04:15:52

As mentioned, the page needs Selenium or something similar to render all of its content, and you can certainly mix Selenium with BeautifulSoup if you prefer to select your elements that way.

Just pass driver.page_source to BeautifulSoup():

soup = BeautifulSoup(driver.page_source, "html.parser")

Example

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://hackmd.io/@nearly-learning/near-201")

soup = BeautifulSoup(driver.page_source, "html.parser")

for link in soup.select('a[href]'):
    print(link['href'])
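
If you go this route, it may also be worth running Chrome headless and quitting the browser when done. A minimal sketch, assuming the same Selenium 3 style setup as above (the exact headless flag depends on your Chrome version):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless")  # render without opening a visible browser window
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
try:
    driver.get("https://hackmd.io/@nearly-learning/near-201")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for link in soup.select("a[href]"):
        print(link["href"])
finally:
    driver.quit()  # always release the browser process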