Python web scraping / data extraction
For my master's thesis, I am exploring the possibility of extracting data from a website via web automation. The steps are as follows:
1. Sign in to the website (https://www.metal.com/Copper/201102250376)
2. Input username and password
3. Click sign-in
4. Change the date to 01/01/2020
5. Scrape the generated table data and save it to a CSV file
6. Save the file to a specific folder with a specific name on my PC
7. Run the same sequence to download additional historical price data for other materials in a new tab in the same browser window
I am stuck on steps 5, 6 and 7. My code so far:
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Raw string so the backslashes in the Windows path are taken literally
DRIVER_PATH = r'C:\webdriver\chromedriver.exe'
# Selenium 4 style: the driver path goes into a Service object
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=Options())
driver.maximize_window()
driver.get('https://www.metal.com/Copper/201102250376')
# Login steps
login_button = driver.find_element(
    By.CSS_SELECTOR,
    '#__next > div > div.smm-component-header-en > div.main > div.right > button.button.sign-in')
login_button.click()
user_input = driver.find_element(By.ID, 'user_name')
user_input.send_keys('#####')
password_input = driver.find_element(By.ID, 'password')
password_input.send_keys('####')
submit_button = driver.find_element(
    By.CSS_SELECTOR,
    'body > div:nth-child(17) > div > div.ant-modal-wrap.ant-modal-centered.smm-component-sign-en > div > div.ant-modal-content > div > div > div > div.smm-component-sign-en-content > form > div:nth-child(3) > div > div > span > button')
submit_button.click()
time.sleep(2)
# Scroll down to the point of interest on the page
driver.execute_script("window.scrollBy(0, 1000)")
# Change currency
driver.find_element(By.XPATH, "//img[contains(@class,'icon___BUqam')]").click()
time.sleep(1)
# Change date from the datepicker
date_input = driver.find_element(
    By.XPATH,
    '//*[@id="__next"]/div/div[5]/div[1]/div[7]/div[1]/div[2]/div[1]/span[1]/div/i')
date_input.click()
# Clear the existing date (10 characters), type the new one and confirm
ActionChains(driver).move_to_element(date_input).send_keys(
    Keys.BACKSPACE * 10).send_keys("01/01/2020").send_keys(Keys.ENTER).perform()
time.sleep(2)
I am stuck trying to scrape the data from the generated table and save it to a CSV file using Selenium. See the HTML code below.
table generated
**May 27, 2022**
**10,758.75-10,788.43**
**10,773.59**
**+97.94**
**USD/mt**
Any help would be massively appreciated.
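Once the cell texts of a row are available as strings, they can be converted into typed values before being written out. The following is only a sketch under assumptions: the function name `parse_price_row` is hypothetical, and the meaning of the five cells (date, low-high range, average, change, unit) is inferred from the sample row above rather than confirmed by the site.

```python
from datetime import date, datetime


def parse_price_row(cells):
    """Convert the five text cells of one price row into typed values.

    Assumed cell order: date, "low-high" range, average, change, unit.
    """
    day_text, range_text, avg_text, change_text, unit = cells
    # Assumes both ends of the range are non-negative, so "-" only separates them.
    low_text, high_text = range_text.split("-")
    return {
        "date": datetime.strptime(day_text, "%b %d, %Y").date(),
        "low": float(low_text.replace(",", "")),
        "high": float(high_text.replace(",", "")),
        "average": float(avg_text.replace(",", "")),
        "change": float(change_text.replace(",", "")),
        "unit": unit,
    }


row = parse_price_row(
    ["May 27, 2022", "10,758.75-10,788.43", "10,773.59", "+97.94", "USD/mt"]
)
print(row)
```

Stripping the thousands separators before calling `float` matters here, because `float("10,758.75")` raises a `ValueError`.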
Download file using button press
Download button
driver.find_element(By.XPATH,"//img[contains(@src,'https://static.metal.com/www.metal.com/4.1.161/static/images/price/download.png')]").click()
time.sleep(1)
driver.find_element(By.XPATH,"//img[contains(@src,'https://static.metal.com/www.metal.com/4.1.161/static/images/price/download_excel.png')]").click()
To save time, since I have multiple files to download, I am also exploring the possibility of saving the file directly via the download button provided.
- The problem I encounter is that I cannot directly specify the filename I want it to be saved as.
- When clicked, the download button opens a new tab, which closes within seconds to start the file download.
- The file is then downloaded with a "materialcode-today's date" file naming format.
Do you have any idea how to go about this?
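One possible way around the filename problem is to let the browser save the file under its default name and rename it afterwards. The sketch below assumes the files land in a known download folder and that unfinished downloads carry a `.crdownload` or `.tmp` suffix; the helper names `wait_for_new_file` and `rename_download` are hypothetical and use only the standard library.

```python
import time
from pathlib import Path

# Suffixes assumed to mark downloads that are still in progress.
PARTIAL_SUFFIXES = {".crdownload", ".tmp"}


def wait_for_new_file(folder, known_files, timeout=30.0, poll=0.5):
    """Return the first finished file in `folder` that is not in `known_files`."""
    folder = Path(folder)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for path in folder.iterdir():
            if (path.is_file()
                    and path.name not in known_files
                    and path.suffix not in PARTIAL_SUFFIXES):
                return path
        time.sleep(poll)
    raise TimeoutError(f"no new file appeared in {folder} within {timeout} s")


def rename_download(folder, known_files, new_name, timeout=30.0):
    """Wait for a freshly downloaded file and rename it, keeping its extension."""
    downloaded = wait_for_new_file(folder, known_files, timeout)
    target = downloaded.with_name(new_name + downloaded.suffix)
    downloaded.replace(target)
    return target
```

In use, the set of existing file names would be recorded before clicking the download button, for example `known = {p.name for p in Path(folder).iterdir()}`, and `rename_download(folder, known, "copper_2020")` called right after the click. With Chrome, the download folder itself can usually be set through the `download.default_directory` preference passed via `options.add_experimental_option("prefs", {...})`.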
Comments (1)
The reason the sign in button is not getting clicked is that the XPath
//*[@id="__next"]/div/div[3]/div[2]/div[2]/button[2]
is incorrect. The id __next belongs to the main container div, through which we navigate to the sign in button by spelling out the remaining HTML node structure. Instead, you can select the sign in button directly with
//button[@class='button sign-in']
based on its class value. Your solution for sign in would look like:
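The original code for this step did not survive on this page, so what follows is a reconstruction sketch rather than the answerer's exact code. It uses the class-based XPath given above, the `user_name` and `password` ids from the question, and a shortened form of the question's submit selector. The locator strategies are written as plain strings ("xpath", "id", "css selector"), which are the values of Selenium's `By.XPATH`, `By.ID` and `By.CSS_SELECTOR`; the function name `sign_in` is hypothetical.

```python
# Locator tuples: (strategy, value). The strategy strings equal
# selenium's By.XPATH, By.ID and By.CSS_SELECTOR constants.
SIGN_IN_BUTTON = ("xpath", "//button[@class='button sign-in']")
USER_FIELD = ("id", "user_name")
PASSWORD_FIELD = ("id", "password")
# Assumption: the sign-in form contains a single button, the submit button.
SUBMIT_BUTTON = ("css selector", "div.smm-component-sign-en-content form button")


def sign_in(driver, username, password):
    """Open the sign-in dialog, fill in the credentials and submit."""
    driver.find_element(*SIGN_IN_BUTTON).click()
    driver.find_element(*USER_FIELD).send_keys(username)
    driver.find_element(*PASSWORD_FIELD).send_keys(password)
    driver.find_element(*SUBMIT_BUTTON).click()
```

With a real browser session this would be called as `sign_in(driver, "my_user", "my_password")` after `driver.get(...)`. If the dialog renders slowly, an explicit wait before each lookup is likely to be more reliable than a fixed `time.sleep`.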
To scrape data
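The scraping code is also missing from this page. As a hedged sketch: since the table's HTML is not shown, the row locator is left as a parameter, and any concrete XPath passed to it is an assumption about the page. The helper name `scrape_rows` is hypothetical; it relies only on `find_elements` and the `.text` property, which Selenium drivers and elements provide.

```python
def scrape_rows(driver, row_xpath):
    """Return the visible text of every direct child cell of each matched row."""
    rows = []
    for row in driver.find_elements("xpath", row_xpath):  # "xpath" == By.XPATH
        cells = [cell.text.strip() for cell in row.find_elements("xpath", "./*")]
        if any(cells):  # skip rows whose cells are all empty
            rows.append(cells)
    return rows
```

A call would look like `scrape_rows(driver, "//div[contains(@class, 'row')]")`, where the XPath is purely illustrative and has to be replaced by one that matches the actual row elements of the price table.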
Add To CSV
f.close()
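For the CSV step, a minimal sketch using the standard `csv` module is shown here; it is not the answerer's original code. The function name `save_rows_to_csv` and the header labels in the usage note are hypothetical. Opening the file in a `with` block closes it automatically, so no separate close call is needed.

```python
import csv


def save_rows_to_csv(rows, path, header=None):
    """Write `rows` (a list of lists of strings) to `path` as CSV."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if header:
            writer.writerow(header)
        writer.writerows(rows)
```

Because `path` can be any folder and file name, this also covers saving to a specific location, for example `save_rows_to_csv(rows, r"C:\data\copper_2020.csv", header=["date", "range", "average", "change", "unit"])`. Values that contain commas, such as `May 27, 2022`, are quoted by the writer.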