How do I save scraped data to a CSV file when using Selenium with multithreading?

Posted on 2025-02-01 20:35:04

I am trying to scrape the product name, price, and current time for many URLs with Selenium. I append the name, price, and time to three separate lists, put them into a dictionary, and then save it to a CSV file. To increase scraping speed, I also use threading to scrape multiple pages at the same time.

However, after switching to ThreadPoolExecutor, the scraped product names no longer match their prices in the output file; a name lines up with some other product's price. I suspect this happens because the scraping order changes when pages take different amounts of time to load, so the entries in the two lists fall out of sync. What should I do to keep each name matched with its own price? Any help is appreciated. Thanks.

        from datetime import datetime
        import concurrent.futures

        import pandas as pd
        import pytz
        from selenium import webdriver
        from selenium.webdriver.common.by import By

        urlList = ["https://www.target.com/p/systane-gel-drops-lubricant-eye-gel-0-33-fl-oz/-/A-14523072#lnk=sametab",
        ...
        ...]
        priceArray = []
        nameArray = []
        GMTArray = []
        
        def ScrapingTarget(url):
            wait_imp = 10
            CO = webdriver.ChromeOptions()
            CO.add_experimental_option('useAutomationExtension', False)
            CO.add_argument('--ignore-certificate-errors')
            CO.add_argument('--start-maximized')
            wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe',options=CO)
            wd.get(url)
            wd.implicitly_wait(wait_imp)
            #start scraping
            name = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[1]/h1/span").text
            nameArray.append(name)
            price = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[1]/span").text
            priceArray.append(price)
            tz = pytz.timezone('Europe/London')
            GMT = datetime.now(tz).strftime("%Y-%m-%d %H:%M:%S")
            GMTArray.append(GMT)
        
        with concurrent.futures.ThreadPoolExecutor(2) as executor:
            executor.map(ScrapingTarget, urlList)
        
        data = {'prod-name': nameArray,
                'Price': priceArray,
                "GMT": GMTArray
                }
        #df = pd.DataFrame(data, columns= ['prod-name', 'Price','currentZipCode',"Tcin","UPC","GMT"])
        df = pd.DataFrame.from_dict(data, orient='index')
        df = df.transpose()
        df.to_csv(r'C:\Users\12987\PycharmProjects\python\Network\priceingAlgoriCoding\export_Target_dataframe.csv', mode='a', index = False, header=True)
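The mismatch comes from the worker threads appending to three shared lists in whatever order their pages finish loading. One way to avoid this is to have the worker return all fields for one URL as a single row and collect the rows with `executor.map`, which yields results in the order of the input iterable regardless of which thread finishes first. The sketch below demonstrates the pattern without Selenium, using a simulated delay and hypothetical stand-in URLs in place of `wd.get(url)` and the XPath lookups:

```python
import csv
import io
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in URLs; the real list would hold the Target product pages.
urls = [f"https://example.com/item/{i}" for i in range(5)]

def scrape(url):
    # Simulate the variable page-load delay that reorders results in the
    # original code; in the real scraper this would be wd.get(url) plus
    # the find_element calls.
    time.sleep(random.uniform(0, 0.05))
    item = url.rsplit("/", 1)[-1]
    name = f"product-{item}"
    price = f"${item}.99"
    return name, price  # return one row so the fields stay paired

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map yields results in input order, not completion order,
    # so rows[i] always corresponds to urls[i].
    rows = list(executor.map(scrape, urls))

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["prod-name", "Price"])
writer.writerows(rows)
print(buf.getvalue())
```

In the real scraper, `scrape` would also return the timestamp as a third field, and the rows could be passed straight to `pd.DataFrame(rows, columns=["prod-name", "Price", "GMT"])` before `to_csv`, replacing the three module-level lists entirely.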

