Web scraping of hyperlinks is so slow

Posted on 2025-02-08 20:00:57


I am using the following function to scrape the Twitter URLs from a list of websites.

import httplib2
import bs4 as bs
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urlparse
import pandas as pd
import swifter


def twitter_url(website): # website address is given to the function in a string format

    try:
        http = httplib2.Http()
        status, response = http.request('https://' + website)

        url = 'https://twitter.com'
        search_domain = urlparse(url).hostname

        l = []

        for link in bs.BeautifulSoup(response, 'html.parser',
                                     parse_only=SoupStrainer('a')):
            if link.has_attr('href'):
                if search_domain in link['href']:
                    l.append(link['href'])
    
        return list(set(l))
    
    except Exception:
        return None

I then apply the function to the dataframe that contains the website addresses:

df['twitter_id'] = df.swifter.apply(lambda x: twitter_url(x['Website address']), axis=1)

The dataframe has about 100,000 website addresses. Even when I run the code on a 10,000-row sample, it is very slow. Is there any way to make this run faster?


Comments (1)

淡写薰衣草的香 2025-02-15 20:00:57


The issue must be a result of the time taken to retrieve the HTML code for each of the websites.

Since the URLs are processed one after the other, even if each request took only 100 ms, 10,000 sites would still take about 1,000 seconds (~16 minutes) to finish.

If, however, you process each URL in a separate thread, that should cut the total time down significantly.

You can check out the threading library to accomplish that.
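A minimal sketch of that idea, using the standard-library concurrent.futures wrapper around threads and reusing the twitter_url function from the question (the scrape_all helper and the MAX_WORKERS value are illustrative, not part of the original answer):

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 32  # illustrative value; tune to your bandwidth and what the target sites tolerate

def scrape_all(websites):
    # Each call to twitter_url creates its own httplib2.Http object,
    # so the worker threads do not share any connection state.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # executor.map preserves input order, so results line up with the rows
        return list(executor.map(twitter_url, websites))

df['twitter_id'] = scrape_all(df['Website address'])

Because each request spends most of its time waiting on the network, an I/O-bound thread pool like this usually scales close to linearly with the number of workers, without any change to twitter_url itself.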
