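How to save scraped data to a CSV file when using Selenium and multithreading?
I am trying to scrape the product name, price, and current time for many URLs with Selenium. I append the name, price, and time to three separate lists, put them into a dictionary, and then save that to a CSV file. To speed up the scraping, I also use threading to scrape multiple pages at the same time.
However, after switching to ThreadPoolExecutor, the product names and prices in the output file no longer match: a name ends up paired with another product's price. I guess this is because pages load at different speeds, so the order in which results are appended changes and the entries in the name and price lists no longer line up. What should I do to match each name with its price? Any help is appreciated. Thanks.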
import concurrent.futures
from datetime import datetime

import pandas as pd
import pytz
from selenium import webdriver
from selenium.webdriver.common.by import By

urlList = [
    'https://www.target.com/p/systane-gel-drops-lubricant-eye-gel-0-33-fl-oz/-/A-14523072#lnk=sametab',
    # ...
    # ...
]

# Shared lists that every worker thread appends to
priceArray = []
nameArray = []
GMTArray = []

def ScrapingTarget(url):
    wait_imp = 10
    CO = webdriver.ChromeOptions()
    CO.add_experimental_option('useAutomationExtension', False)
    CO.add_argument('--ignore-certificate-errors')
    CO.add_argument('--start-maximized')
    wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe', options=CO)
    wd.get(url)
    wd.implicitly_wait(wait_imp)
    # start scraping
    name = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[1]/h1/span").text
    nameArray.append(name)
    price = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[1]/span").text
    priceArray.append(price)
    tz = pytz.timezone('Europe/London')
    GMT = datetime.now(tz).strftime("%Y-%m-%d %H:%M:%S")
    GMTArray.append(GMT)

# Scrape two pages at a time
with concurrent.futures.ThreadPoolExecutor(2) as executor:
    executor.map(ScrapingTarget, urlList)

data = {'prod-name': nameArray,
        'Price': priceArray,
        "GMT": GMTArray
        }
#df = pd.DataFrame(data, columns= ['prod-name', 'Price','currentZipCode',"Tcin","UPC","GMT"])
df = pd.DataFrame.from_dict(data, orient='index')
df = df.transpose()
df.to_csv(r'C:\Users\12987\PycharmProjects\python\Network\priceingAlgoriCoding\export_Target_dataframe.csv', mode='a', index=False, header=True)
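One way to keep each name with its price is to have the worker return one complete row instead of appending to three shared lists. With shared lists, appends from two threads can interleave, and an exception raised after the name has been appended leaves the lists with different lengths. The sketch below illustrates the idea without a browser: `FAKE_PAGES`, `scrape_one`, `scrape_all`, and `rows_to_csv` are hypothetical names, and the sleep stands in for pages that load at different speeds. It also uses UTC and the standard `csv` module to stay self-contained, which is an assumption rather than what the code above does.

```python
import concurrent.futures
import csv
import io
import time
from datetime import datetime, timezone

# Hypothetical stand-in data: url -> (name, price, simulated load time in seconds)
FAKE_PAGES = {
    "https://example.com/a": ("Product A", "$1.00", 0.03),
    "https://example.com/b": ("Product B", "$2.00", 0.01),
    "https://example.com/c": ("Product C", "$3.00", 0.02),
}

def scrape_one(url):
    """Return one complete row for a URL instead of appending to shared lists.

    A real worker would open the Selenium driver here, read the name and
    price, and return them in the same shape.
    """
    name, price, delay = FAKE_PAGES[url]
    time.sleep(delay)  # simulate pages that load at different speeds
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return {"url": url, "prod-name": name, "Price": price, "GMT": stamp}

def scrape_all(urls, worker=scrape_one, max_workers=2):
    """Run the worker over all URLs and collect the returned rows."""
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        # map() yields results in the order of `urls`, not in completion order.
        # Consuming the iterator also re-raises any exception from a worker.
        return list(executor.map(worker, urls))

def rows_to_csv(rows, fieldnames=("url", "prod-name", "Price", "GMT")):
    """Serialize the rows to CSV text, one line per scraped page."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = scrape_all(list(FAKE_PAGES))
csv_text = rows_to_csv(rows)
```

Because every row is built inside a single call, the name, price, and timestamp in it always belong to the same page, whatever order the pages finish in. The same list of dicts can also be passed to `pd.DataFrame(rows)` and written with `to_csv` as in the code above.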