How to continuously crawl web pages with Selenium in Python

Posted on 2025-01-25 06:25:47


I'm trying to crawl bloomberg.com and find the links to all of its English news articles. The problem with the code below is that it does find a lot of articles from the first page, but then it just goes into a loop where it doesn't return anything, only producing output once in a while.

from collections import deque
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

visited = set()
to_crawl = deque()
to_crawl.append("https://www.bloomberg.com")

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")
    for elem in elems:
        #retrieve all href links and save it to url_element variable
        url_element = elem.get_attribute("href")
        if url_element not in visited:
            to_crawl.append(url_element)
            visited.add(url_element)
            #save news articles
            if 'www.bloomberg.com/news/articles' in url_element:
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    browser.close()

while len(to_crawl):
    url_to_crawl = to_crawl.pop()
    crawl_link(url_to_crawl)

I've tried using a queue and then a stack, but the behavior is the same. I can't seem to accomplish what I'm looking for.

How do you crawl a website like this to collect news URLs?


Comments (1)

巴黎盛开的樱花 2025-02-01 06:25:47


The approach you are using should work fine; however, after running it myself I noticed a few things that were causing it to hang or throw errors.

I made some adjustments and included some in-line comments to explain my reasoning.

from collections import deque
from selenium.common.exceptions import StaleElementReferenceException
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

base = "https://www.bloomberg.com"
article = base + "/news/articles"
visited = set()


# A set discards duplicates automatically and is more efficient for lookups
articles = set()

to_crawl = deque()
to_crawl.append(base)

def crawl_link(input_url):
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Firefox(options=options)
    print(input_url)
    browser.get(input_url)
    elems = browser.find_elements(by=By.XPATH, value="//a[@href]")

    # this part was the issue: in the original code, visited.add()
    # was called alongside to_crawl.append() inside the loop, so links
    # were marked as visited as soon as they were discovered rather
    # than when the page was actually crawled
    visited.add(input_url)

    for elem in elems:

        # checks for errors
        try:
            url_element = elem.get_attribute("href")
        except StaleElementReferenceException as err:
            print(err)
            continue

        # checks to make sure links aren't being crawled more than once
        # and that all the links are in the proper domain
        if base in url_element and all(url_element not in i for i in [visited, to_crawl]):

            to_crawl.append(url_element)

            # this checks if the link matches the correct url pattern
            if article in url_element and url_element not in articles:

                articles.add(url_element)
                print(str(url_element))
                with open("result.txt", "a") as outf:
                    outf.write(str(url_element) + "\n")
    
    browser.quit() # guarantees the browser closes completely


while len(to_crawl):
    # popleft makes the deque a FIFO instead of LIFO.
    # A queue would achieve the same thing.
    url_to_crawl = to_crawl.popleft()

    crawl_link(url_to_crawl)

After running for 60+ seconds, this was the output in result.txt: https://gist.github.com/alexpdev/b7545970c4e3002b1372e26651301a23
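
One further tweak worth trying, shown below as a rough sketch rather than a drop-in replacement: the version above still launches a fresh headless Firefox for every URL, which is by far the slowest part of the crawl. Reusing a single driver for the whole run (and only quitting it at the end) keeps the same logic while avoiding that per-page startup cost. The names, file path, and URL patterns simply mirror the answer's code; adjust them as needed.

from collections import deque

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, WebDriverException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

base = "https://www.bloomberg.com"
article = base + "/news/articles"

options = Options()
options.add_argument('--headless')
browser = webdriver.Firefox(options=options)  # one shared driver for the whole crawl

visited = set()
articles = set()
to_crawl = deque([base])

try:
    while to_crawl:
        url = to_crawl.popleft()
        visited.add(url)
        try:
            browser.get(url)
        except WebDriverException as err:
            # a page that fails to load shouldn't kill the whole crawl
            print(err)
            continue
        for elem in browser.find_elements(by=By.XPATH, value="//a[@href]"):
            try:
                href = elem.get_attribute("href")
            except StaleElementReferenceException:
                continue
            if href and base in href and href not in visited and href not in to_crawl:
                to_crawl.append(href)
                if article in href and href not in articles:
                    articles.add(href)
                    print(href)
                    with open("result.txt", "a") as outf:
                        outf.write(href + "\n")
finally:
    browser.quit()  # quit once, after the queue is exhausted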
