How to apply multithreading to speed up data scraping with BeautifulSoup

Posted on 2025-02-07 13:47:46

I am not familiar with multithreading or how to apply it to scrape data faster; BeautifulSoup scrapes the data slowly. Can someone show me how to apply multithreading to my code? This is the page link: https://baroul-timis.ro/tabloul-avocatilor/

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Collect the profile links from the JSON endpoint
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "html.parser").a["href"]
    productlink.append(base_url + link)

# Visit each profile page one by one and extract the details
test = []
for link in productlink:
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text
        try:
            wev['phone'] = pip.select('span', class_='font-weight-bolder')[2].text.split('\xa0')
        except:
            pass
        try:
            wev['email'] = pip.select('span', class_='font-weight-bolder')[3].text.split('\xa0')
        except:
            pass
        test.append(wev)

df = pd.DataFrame(test)
print(df)



2 Answers

风追烟花雨 2025-02-14 13:47:46

Multithreading is ideal for this kind of thing because there will be lots of I/O waits while the URLs are accessed and their data acquired. Here's how you could re-work it:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

test = []

def process(link):
    # Fetch and parse one profile page, appending its details to the shared list
    wev = {}
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text
        try:
            wev['phone'] = pip.select('span', class_='font-weight-bolder')[2].text.split('\xa0')
        except:
            pass
        try:
            wev['email'] = pip.select('span', class_='font-weight-bolder')[3].text.split('\xa0')
        except:
            pass
        test.append(wev)

# Collect the profile links from the JSON endpoint
productlink = []
data = requests.get(url).json()
for d in data["data"]:
    link = BeautifulSoup(d["actions"], "lxml").a["href"]
    productlink.append(base_url + link)

# Fetch all pages concurrently; each thread runs process() on one link
with ThreadPoolExecutor() as executor:
    executor.map(process, productlink)

df = pd.DataFrame(test)
print(df)

This generates a 941 row dataframe in <44 seconds on my system (24 threads) - i.e., ~20 URLs/second.
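For the record, you can cap or raise the number of threads by passing max_workers to ThreadPoolExecutor (in recent Python versions the default is min(32, os.cpu_count() + 4)). The value 20 below is only an illustrative choice, reusing the process function and productlink list from the block above:

with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(process, productlink)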

Note: If you don't already have lxml installed, you'll need it. It's generally faster than html.parser.
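For reference, lxml is a separate package; one common way to install it is:

python -m pip install lxml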

EDIT:

Multiprocessing version

import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

url = "https://baroul-timis.ro/get-av-data?param=toti-avocatii"
base_url = 'https://baroul-timis.ro'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}


def process(link):
    # Each worker process handles one profile page and returns its rows
    wev = {}
    test = []
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    prod = soup.find_all('div', class_='user-info text-left mb-50')
    for pip in prod:
        wev['title'] = pip.find('h4').text
        try:
            wev['phone'] = pip.select('span', class_='font-weight-bolder')[2].text.split('\xa0')
        except:
            pass
        try:
            wev['email'] = pip.select('span', class_='font-weight-bolder')[3].text.split('\xa0')
        except:
            pass
        test.append(wev)
    return test


def main():
    productlink = []
    for d in requests.get(url).json()["data"]:
        link = BeautifulSoup(d["actions"], "lxml").a["href"]
        productlink.append(base_url + link)
    test = []
    # Flatten the per-process results into a single list as they come back
    with ProcessPoolExecutor() as executor:
        for r in executor.map(process, productlink):
            test.extend(r)

    df = pd.DataFrame(test)
    print(df)


# The __main__ guard is required so worker processes can import this module safely
if __name__ == '__main__':
    main()

始终不够爱げ你 2025-02-14 13:47:46

You can use ThreadPoolExecutor if you want to use threading.

from concurrent.futures import ThreadPoolExecutor

links = [...]  # All your product urls go here.
def do_work(link):
    ...
    # Write code to process 1 url here.
# Run 8 threads at a time
executor = ThreadPoolExecutor(8)
results = executor.map(do_work, links)

I recommend using ProcessPoolExecutor instead. It works not only for I/O-bound but also for CPU-bound tasks, and it will use all of your CPUs.

from concurrent.futures import ProcessPoolExecutor

links = [...]  # All your product urls go here.
def do_work(link):
    ...
    # Write code to process 1 url here.
# Run 8 processes in parallel.
executor = ProcessPoolExecutor(8)
results = executor.map(do_work, links, chunksize=40)

Here, results will be the list of return values from the do_work function. It's better not to return huge data, because serializing it will make the process very slow; instead, save it to a DB or a file.
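As a rough sketch of that advice (the do_work body, the output filename lawyers.csv, and the column names are assumptions for illustration, not part of the original code), one simple variant is to have the parent process stream each worker's rows straight into a CSV as results arrive, so the full dataset is never returned or held in memory at once:

import csv
from concurrent.futures import ProcessPoolExecutor

def do_work(link):
    # Hypothetical worker: fetch and parse one url, return a small list of row dicts.
    return [{'title': '...', 'phone': '...', 'email': '...'}]

def main():
    links = [...]  # All your product urls go here.
    with open('lawyers.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'phone', 'email'])
        writer.writeheader()
        with ProcessPoolExecutor(8) as executor:
            # Write each batch of rows as soon as its worker finishes
            for rows in executor.map(do_work, links, chunksize=40):
                writer.writerows(rows)

if __name__ == '__main__':
    main()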

Read more about concurrent.futures in the Python documentation.

