Python web scraping / data extraction

Posted on 2025-02-02 10:54:53

For my master's thesis, I am exploring the possibility of extracting data from a website via web automation. The steps are as follows:

1. Sign in to the website ( https://www.metal.com/Copper/201102250376 )
2. Input username and password
3. Click sign-in
4. Change date to 01/01/2020
5. Scrape the generated table data and save it to a CSV file
6. Save it to a specific folder with a specific name on my PC
7. Run the same sequence to download historical price data for other materials in a new tab in the same browser window

I am stuck on steps 5, 6 and 7.

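import time

from selenium import webdriver
from selenium.webdriver import ActionChains, ChromeOptions
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Raw string so the backslashes in the Windows path are kept literally
DRIVER_PATH = r'C:\webdriver\chromedriver.exe'
# Selenium 4 style: driver path via Service, and an options instance (not the class)
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=ChromeOptions())

driver.maximize_window()

driver.get('https://www.metal.com/Copper/201102250376')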

# Login steps
LoginClick1 = driver.find_element(
    By.CSS_SELECTOR,
    '#__next > div > div.smm-component-header-en > div.main > div.right > button.button.sign-in')

LoginClick1.click()

user_input = driver.find_element(By.ID, 'user_name')
user_input.send_keys('#####')

password_input = driver.find_element(By.ID, 'password')
password_input.send_keys('####')

Submit = driver.find_element(
    By.CSS_SELECTOR,
    'body > div:nth-child(17) > div > div.ant-modal-wrap.ant-modal-centered.smm-component-sign-en > div > div.ant-modal-content > div > div > div > div.smm-component-sign-en-content > form > div:nth-child(3) > div > div > span > button')

Submit.click()

time.sleep(2)

# Scroll down to the point of interest on the page
driver.execute_script("window.scrollBy(0,1000)", "")

# Change currency
driver.find_element(By.XPATH, "//img[contains(@class,'icon___BUqam')]").click()

time.sleep(1)

# Change date in the date picker
date_input = driver.find_element(
    By.XPATH,
    '//*[@id="__next"]/div/div[5]/div[1]/div[7]/div[1]/div[2]/div[1]/span[1]/div/i')

date_input.click()

# Delete the existing 10-character date with ten backspaces.
# A fresh ActionChains is used per step so earlier queued actions cannot be replayed.
ActionChains(driver).move_to_element(date_input).send_keys(Keys.BACKSPACE * 10).perform()

# Type the new date and confirm it
ActionChains(driver).move_to_element(date_input).send_keys("01/01/2020").perform()
ActionChains(driver).move_to_element(date_input).send_keys(Keys.ENTER).perform()

time.sleep(2)

I am stuck trying to scrape the data from the generated table and save it to a CSV file using Selenium. A row of the generated table contains the following values:

**May 27, 2022**
**10,758.75-10,788.43**
**10,773.59**
**+97.94**
**USD/mt**

Any help would be massively appreciated.

Download file using button press

driver.find_element(By.XPATH,"//img[contains(@src,'https://static.metal.com/www.metal.com/4.1.161/static/images/price/download.png')]").click()

time.sleep(1)

driver.find_element(By.XPATH,"//img[contains(@src,'https://static.metal.com/www.metal.com/4.1.161/static/images/price/download_excel.png')]").click()

To save time, since I have multiple files to download, I am also exploring the possibility of saving the file directly via the download button provided.

The problem I encounter is that I cannot directly specify the filename under which the file is saved. When clicked, the download button opens a new tab, which closes within seconds once the file download starts. The file is then saved using a "material code - today's date" naming format.

Do you have any idea how to go about this?
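
One possible workaround for the fixed file name is to let the browser save the file under its default name and rename it once the download has finished. The following is only a minimal sketch under assumptions: the browser is Chrome (which keeps a .crdownload suffix on unfinished downloads), the download folder is known in advance, and the helper names snapshot, wait_for_download and rename_download are hypothetical, not part of Selenium.

```python
import time
from pathlib import Path


def snapshot(download_dir):
    """Return the names of the files present before the download is triggered."""
    return {p.name for p in Path(download_dir).iterdir()}


def wait_for_download(download_dir, before, timeout=30.0, poll=0.5):
    """Poll the folder until a new, finished file appears and return its path."""
    folder = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        new_files = [
            p for p in folder.iterdir()
            if p.is_file()
            and p.name not in before
            # Chrome marks downloads that are still in progress with ".crdownload"
            and p.suffix != ".crdownload"
        ]
        if new_files:
            # If several new files appeared, take the most recently modified one
            return max(new_files, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download appeared in {folder}")
        time.sleep(poll)


def rename_download(path, new_stem):
    """Rename the downloaded file, keeping its original extension."""
    target = path.with_name(new_stem + path.suffix)
    path.replace(target)
    return target


# Intended use (the click itself is the asker's existing Selenium code):
#   before = snapshot(download_dir)
#   ... click the download button ...
#   saved = wait_for_download(download_dir, before)
#   rename_download(saved, "copper_from_2020-01-01")
```

The download folder itself can usually be set when the driver is created, for example with options.add_experimental_option("prefs", {"download.default_directory": folder}) on the ChromeOptions instance. Taking the snapshot before the click avoids renaming an older file that was already in the folder.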

挽手叙旧 2025-02-09 10:54:53

The sign-in button is not being clicked because the XPath //*[@id="__next"]/div/div[3]/div[2]/div[2]/button[2] is incorrect. The element with id __next is the main container div, and this path tries to navigate from it to the sign-in button by spelling out the rest of the HTML node structure.

Instead, you can select the sign-in button directly by its class value: //button[@class='button sign-in']
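
To illustrate why a long positional path is fragile, here is a small self-contained sketch. It uses made-up, simplified markup (not the real metal.com HTML) and Python's xml.etree.ElementTree, whose XPath support is only a subset of what a browser offers, so treat it as an illustration of the idea rather than of Selenium itself. The names PAGE_V1, PAGE_V2 and find_text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page structure
PAGE_V1 = """
<div id="__next">
  <div>
    <div class="main"/>
    <div class="right">
      <button class="button register">Register</button>
      <button class="button sign-in">Sign In</button>
    </div>
  </div>
</div>
"""

# The same page after one extra div has been added near the top
PAGE_V2 = """
<div id="__next">
  <div class="banner"/>
  <div>
    <div class="main"/>
    <div class="right">
      <button class="button register">Register</button>
      <button class="button sign-in">Sign In</button>
    </div>
  </div>
</div>
"""

# Positional path: depends on the exact order and nesting of every ancestor
POSITIONAL = "./div[1]/div[2]/button[2]"
# Attribute-based path: depends only on the button's own class value
BY_CLASS = ".//button[@class='button sign-in']"


def find_text(markup, path):
    """Return the text of the first element matching path, or None."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "positional:", find_text(page, POSITIONAL))
    print(name, "by class:  ", find_text(page, BY_CLASS))
```

In this sketch the positional path stops matching as soon as a single div is inserted, while the class-based path keeps finding the button.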

Your sign-in solution would look like this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Raw string keeps the backslashes in the Windows path literal
driver = webdriver.Chrome(service=Service(r'C:\webdrivers\chromedriver.exe'))
driver.maximize_window()
driver.get('https://www.metal.com/Nickel/201102250239')
# Click on Sign In
driver.find_element(By.XPATH, "//button[@class='button sign-in']").click()
# Enter username
driver.find_element(By.ID, "user_name").send_keys("your username")
# Enter password
driver.find_element(By.ID, "password").send_keys("your password")
# Click Sign In
driver.find_element(By.XPATH, "//button[@type='submit']").click()

To scrape the data:

# Each table row is an element with this class; its child divs hold the five cells
for element in driver.find_elements(By.CLASS_NAME, "historyBodyRow___1Bk9u"):
    cells = element.find_elements(By.TAG_NAME, "div")
    print("Date=" + cells[0].text)
    print("Price Range=" + cells[1].text)
    print("Avg=" + cells[2].text)
    print("Change=" + cells[3].text)
    print("Unit=" + cells[4].text)
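
If the scraped cell texts need to be analysed rather than just printed, they can be converted to typed values first. The sketch below is one possible way to do that, assuming each row yields five strings in the order Date, Price Range, Avg, Change, Unit, formatted like the sample row in the question. PriceRow, to_number and parse_row are hypothetical names, and the date parsing assumes English month abbreviations.

```python
from datetime import date, datetime
from typing import NamedTuple


class PriceRow(NamedTuple):
    day: date
    low: float
    high: float
    avg: float
    change: float
    unit: str


def to_number(text):
    """Convert '10,758.75' or '+97.94' to a float."""
    return float(text.replace(",", ""))


def parse_row(cells):
    """Turn the five cell texts of one table row into a PriceRow."""
    day_text, range_text, avg_text, change_text, unit = (c.strip() for c in cells[:5])
    # The price range is written as 'low-high'
    low_text, _, high_text = range_text.partition("-")
    return PriceRow(
        day=datetime.strptime(day_text, "%b %d, %Y").date(),
        low=to_number(low_text),
        high=to_number(high_text),
        avg=to_number(avg_text),
        change=to_number(change_text),
        unit=unit,
    )


sample = ["May 27, 2022", "10,758.75-10,788.43", "10,773.59", "+97.94", "USD/mt"]
print(parse_row(sample))
```

With Selenium, the input list would simply be the .text values of the row's cells.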

Add to CSV:

import csv

# newline='' prevents blank lines between rows on Windows;
# the with block closes the file automatically
with open('Path where you want to store the file', 'w', newline='') as f:
    writer = csv.writer(f)
    for element in driver.find_elements(By.CLASS_NAME, "historyBodyRow___1Bk9u"):
        cells = element.find_elements(By.TAG_NAME, "div")
        writer.writerow([cell.text for cell in cells[:5]])
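
For steps 6 and 7 of the question (a specific folder and file name, repeated for other materials), the CSV writing could be wrapped in a small helper. This is only a sketch under assumptions: save_rows and MATERIALS are hypothetical names, the header labels are taken from the print statements above, and scrape_rows in the comment stands for whatever function collects the row texts with Selenium.

```python
import csv
from pathlib import Path

HEADER = ["Date", "Price Range", "Avg", "Change", "Unit"]

# The two material pages that appear in this thread
MATERIALS = {
    "Copper": "https://www.metal.com/Copper/201102250376",
    "Nickel": "https://www.metal.com/Nickel/201102250239",
}


def save_rows(rows, folder, file_stem):
    """Write rows (lists of cell texts) to <folder>/<file_stem>.csv and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    out_path = out_dir / f"{file_stem}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
    return out_path


# Intended use, one file per material:
#   for name, url in MATERIALS.items():
#       driver.get(url)
#       rows = scrape_rows(driver)   # hypothetical: returns a list of 5-item lists
#       save_rows(rows, "price_history", name)
```

Values that contain commas, such as the price range, are quoted automatically by csv.writer, so they stay in one column.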
