How to continuously crawl web pages with Selenium in Python
I am trying to crawl Bloomberg and find the links to all English news articles. The problem with the code below is that it does find a lot of articles from the first page, but then it just loops without returning anything, only making progress from time to time.
from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
visited = set()
to_crawl = deque()
to_crawl.append("https://www.bloomberg.com")
def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    for elem in elems:
        # retrieve each href link and save it to the url_element variable
        url_element = elem.get_attribute("href")
        if url_element not in visited:
            to_crawl.append(url_element)
            visited.add(url_element)
            # save news articles
            if 'www.bloomberg.com/news/articles' in url_element:
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    browser.close()

while len(to_crawl):
    url_to_crawl = to_crawl.pop()
    crawl_link(url_to_crawl)
I have tried using it as a queue and then as a stack, but the behavior is the same. I can't seem to accomplish what I want.
How do you crawl a website like this to scrape the news URLs?
The approach you are using should work fine, however after running it myself there are a few things that I noticed are causing it to hang or throw errors.
I made some adjustments and included some in-line comments to explain my reasons.
After running for 60+ seconds this was the output of result.txt:
https://gist.github.com/alexpdev/b7545970c4e3002b1372e26651301a23
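The gist contains the answerer's full adjusted script. As a rough illustration of the kinds of adjustments that typically fix this sort of hang (the gist itself is not reproduced here, and the helper names below are my own, not from the gist): reuse a single driver instead of launching a new Firefox per page with `browser.close()`, pop from the left of the deque for breadth-first order, stay on the Bloomberg domain, drop `#fragment` duplicates, and wrap page loads in a try/except so one bad URL doesn't stall the loop.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

def is_article(url: str) -> bool:
    # Bloomberg news articles all live under /news/articles/
    return "www.bloomberg.com/news/articles" in url

def normalize(url: str) -> str:
    # Drop the "#fragment" part so the same page isn't queued twice
    return urldefrag(url)[0]

def should_visit(url, visited) -> bool:
    # Skip empty hrefs, non-HTTP schemes, off-site links, and seen pages
    if not url:
        return False
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if not parsed.netloc.endswith("bloomberg.com"):
        return False
    return normalize(url) not in visited

def crawl(start_url: str, max_pages: int = 50) -> None:
    # Imported here so the pure helpers above work without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("--headless")
    browser = webdriver.Firefox(options=options)  # one driver, reused for every page

    visited = set()
    to_crawl = deque([start_url])
    pages = 0
    try:
        while to_crawl and pages < max_pages:
            url = normalize(to_crawl.popleft())  # popleft() = FIFO, breadth-first
            if url in visited:
                continue
            visited.add(url)
            pages += 1
            try:
                browser.get(url)
                hrefs = [e.get_attribute("href")
                         for e in browser.find_elements(By.XPATH, "//a[@href]")]
            except Exception:
                continue  # timeout, dead link, etc. -- move on to the next URL
            for href in hrefs:
                if should_visit(href, visited):
                    to_crawl.append(normalize(href))
                    if is_article(href):
                        with open("result.txt", "a") as outf:
                            outf.write(href + "\n")
    finally:
        browser.quit()  # quit(), not close(): ends the whole driver session
```

Calling `crawl("https://www.bloomberg.com")` requires Firefox and geckodriver on your PATH; the `max_pages` cap is just a safety valve for experimenting.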