How do I save scraped data to a CSV file when using Selenium with multithreading?

Posted on 2025-02-01 20:35:04

I am trying to scrape the product name, price, and current time for many URLs with Selenium. I append the name, price, and time to three separate lists, put them into a dictionary, and then save it to a CSV file. To increase scraping speed, I also use threading to scrape multiple pages at the same time.

However, after switching to ThreadPoolExecutor, the scraped product names no longer match their prices in the output file; a name lines up with some other product's price. I suspect this happens because the scraping order changes when pages take different amounts of time to load, so the entries in the two lists fall out of sync. What should I do to keep each name matched with its own price? Any help is appreciated. Thanks.

        from datetime import datetime
        import concurrent.futures

        import pandas as pd
        import pytz
        from selenium import webdriver
        from selenium.webdriver.common.by import By

        urlList = ["https://www.target.com/p/systane-gel-drops-lubricant-eye-gel-0-33-fl-oz/-/A-14523072#lnk=sametab",
        ...
        ...]
        priceArray = []
        nameArray = []
        GMTArray = []
        
        def ScrapingTarget(url):
            wait_imp = 10
            CO = webdriver.ChromeOptions()
            CO.add_experimental_option('useAutomationExtension', False)
            CO.add_argument('--ignore-certificate-errors')
            CO.add_argument('--start-maximized')
            wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe',options=CO)
            wd.get(url)
            wd.implicitly_wait(wait_imp)
            #start scraping
            name = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[1]/h1/span").text
            nameArray.append(name)
            price = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[1]/span").text
            priceArray.append(price)
            tz = pytz.timezone('Europe/London')
            GMT = datetime.now(tz).strftime("%Y-%m-%d %H:%M:%S")
            GMTArray.append(GMT)
        
        with concurrent.futures.ThreadPoolExecutor(2) as executor:
            executor.map(ScrapingTarget, urlList)
        
        data = {'prod-name': nameArray,
                'Price': priceArray,
                "GMT": GMTArray
                }
        #df = pd.DataFrame(data, columns= ['prod-name', 'Price','currentZipCode',"Tcin","UPC","GMT"])
        df = pd.DataFrame.from_dict(data, orient='index')
        df = df.transpose()
        df.to_csv(r'C:\Users\12987\PycharmProjects\python\Network\priceingAlgoriCoding\export_Target_dataframe.csv', mode='a', index = False, header=True)
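The mismatch comes from the worker threads appending to three shared lists in whatever order their pages finish loading. One way to avoid this is to have the worker return all fields for one URL as a single row and collect the rows with `executor.map`, which yields results in the order of the input iterable regardless of which thread finishes first. The sketch below demonstrates the pattern without Selenium, using a simulated delay and hypothetical stand-in URLs in place of `wd.get(url)` and the XPath lookups:

```python
import csv
import io
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in URLs; the real list would hold the Target product pages.
urls = [f"https://example.com/item/{i}" for i in range(5)]

def scrape(url):
    # Simulate the variable page-load delay that reorders results in the
    # original code; in the real scraper this would be wd.get(url) plus
    # the find_element calls.
    time.sleep(random.uniform(0, 0.05))
    item = url.rsplit("/", 1)[-1]
    name = f"product-{item}"
    price = f"${item}.99"
    return name, price  # return one row so the fields stay paired

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map yields results in input order, not completion order,
    # so rows[i] always corresponds to urls[i].
    rows = list(executor.map(scrape, urls))

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["prod-name", "Price"])
writer.writerows(rows)
print(buf.getvalue())
```

In the real scraper, `scrape` would also return the timestamp as a third field, and the rows could be passed straight to `pd.DataFrame(rows, columns=["prod-name", "Price", "GMT"])` before `to_csv`, replacing the three module-level lists entirely.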

