Web scraping of hyperlinks is so slow

Posted on 2025-02-08 20:00:57


I am using the following function to scrape the Twitter URLs from a list of websites.

import httplib2
import bs4 as bs
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urlparse
import pandas as pd
import swifter


def twitter_url(website): # website address is given to the function in a string format

    try:
        http = httplib2.Http()
        status, response = http.request('https://' + website)

        url = 'https://twitter.com'
        search_domain = urlparse(url).hostname

        l = []

        for link in bs.BeautifulSoup(response, 'html.parser',
                                     parse_only=SoupStrainer('a')):
            if link.has_attr('href'):
                if search_domain in link['href']:
                    l.append(link['href'])
    
        return list(set(l))
    
    except Exception:
        return None

I then apply the function to the dataframe that contains the website addresses:

df['twitter_id'] = df.swifter.apply(lambda x: twitter_url(x['Website address']), axis=1)

The dataframe has about 100,000 website addresses. Even when I run the code on a 10,000-row sample, it is very slow. Is there any way to make this run faster?


Comments (1)

淡写薰衣草的香 2025-02-15 20:00:57


The issue must be a result of the time taken to retrieve the HTML code for each of the websites.

Since the URLs are processed one after the other, even if each request took only 100 ms, 10,000 sites would still take about 1,000 seconds (~16 minutes) to finish.

If, however, you process each URL in a separate thread, that should cut the total time down significantly.

You can check out the threading library to accomplish that.
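A minimal sketch of that idea, using the standard-library concurrent.futures wrapper around threads and reusing the twitter_url function from the question (the scrape_all helper and the MAX_WORKERS value are illustrative, not part of the original answer):

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 32  # illustrative value; tune to your bandwidth and what the target sites tolerate

def scrape_all(websites):
    # Each call to twitter_url creates its own httplib2.Http object,
    # so the worker threads do not share any connection state.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # executor.map preserves input order, so results line up with the rows
        return list(executor.map(twitter_url, websites))

df['twitter_id'] = scrape_all(df['Website address'])

Because each request spends most of its time waiting on the network, an I/O-bound thread pool like this usually scales close to linearly with the number of workers, without any change to twitter_url itself.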
