将.xls文件转换为Python中的最新版本

发布于 2025-01-22 02:37:40 字数 3023 浏览 2 评论 0 原文

我在网站上进行了一个Web_scrap，该网站通过该网站将.xls文件的下载引向下载，该文件替换了目标文件夹中的旧文件，根据Python脚本，下面：

我的脚本：：

import time 
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support.select import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import shutil
import os
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import matplotlib



driver = webdriver.Chrome(ChromeDriverManager().install())

driver = webdriver.Chrome()

driver.get('http://estatisticas.cetip.com.br/astec/series_v05/paginas/lum_web_v05_series_introducao.asp?str_Modulo=Ativo&int_Idioma=1&int_Titulo=6&int_NivelBD=2/')
driver.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div/dl/dd[2]/a').click()
time.sleep(3)
driver.switch_to.frame(driver.find_element(By.XPATH, '//iframe[@name="dados_corpo"]'))
driver.switch_to.frame(driver.find_element(By.XPATH, '//frame[@name="ativo"]'))
find_dp1 = driver.find_element(By.XPATH, '//select[@name="ativo"]')
select_find_dp1 = Select(find_dp1)
select_find_dp1.select_by_visible_text("CBIO - Crédito de descarbonização")
time.sleep(3)

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[1])

time.sleep(1)
informacoes = Select(driver.find_element(By.NAME, 'selectopcoes'))
informacoes.select_by_visible_text('Estoque')
    
driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[2])

time.sleep(1)
# Data Inicial 
driver.find_element(By.NAME, 'DT_DIA_DE').send_keys('16')
driver.find_element(By.NAME, 'DT_MES_DE').send_keys('10')
driver.find_element(By.NAME, 'DT_ANO_DE').send_keys('2020')

# Data Final
driver.find_element(By.NAME, 'DT_DIA_ATE').send_keys('10')
driver.find_element(By.NAME, 'DT_MES_ATE').send_keys('02')
driver.find_element(By.NAME, 'DT_ANO_ATE').send_keys('2022')

driver.find_elements(By.CLASS_NAME, 'button')[1].click()

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'iframe'))
time.sleep(1)
driver.find_element(By.CLASS_NAME, 'primary-text').find_element(By.TAG_NAME,'a').click()

time.sleep(4)

origem = 'C:\\Users\\prmatteo\\Downloads\\'
destino = os.path.join(origem, 'C:\\Users\\prmatteo\\xxx\\Área de Trabalho\\Arquivos Python\\renovabio2.xls')
extensao = '.xls'

for file in os.listdir(origem):
    if file.endswith(extensao):
        shutil.move(os.path.join(origem,file), destino)

它总是以旧的Excel格式下载.xls文件。我想当我去替换它时，将其转换为最新的Excel版本目的地，以免以兼容模式格式打开个人电脑。

原文

I did a web_scraping on a site that, through it, directs to a download of an .xls file, which replaces the old file in the destination folder, according to the Python script, below:

My script:

import time 
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support.select import Select
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import shutil
import os
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import matplotlib



driver = webdriver.Chrome(ChromeDriverManager().install())

driver = webdriver.Chrome()

driver.get('http://estatisticas.cetip.com.br/astec/series_v05/paginas/lum_web_v05_series_introducao.asp?str_Modulo=Ativo&int_Idioma=1&int_Titulo=6&int_NivelBD=2/')
driver.find_element_by_xpath('//*[@id="divContainerIframeBmf"]/div/dl/dd[2]/a').click()
time.sleep(3)
driver.switch_to.frame(driver.find_element(By.XPATH, '//iframe[@name="dados_corpo"]'))
driver.switch_to.frame(driver.find_element(By.XPATH, '//frame[@name="ativo"]'))
find_dp1 = driver.find_element(By.XPATH, '//select[@name="ativo"]')
select_find_dp1 = Select(find_dp1)
select_find_dp1.select_by_visible_text("CBIO - Crédito de descarbonização")
time.sleep(3)

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[1])

time.sleep(1)
informacoes = Select(driver.find_element(By.NAME, 'selectopcoes'))
informacoes.select_by_visible_text('Estoque')
    
driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.ID, 'dados_corpo'))
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'frameset').find_elements(By.TAG_NAME, 'frame')[2])

time.sleep(1)
# Data Inicial 
driver.find_element(By.NAME, 'DT_DIA_DE').send_keys('16')
driver.find_element(By.NAME, 'DT_MES_DE').send_keys('10')
driver.find_element(By.NAME, 'DT_ANO_DE').send_keys('2020')

# Data Final
driver.find_element(By.NAME, 'DT_DIA_ATE').send_keys('10')
driver.find_element(By.NAME, 'DT_MES_ATE').send_keys('02')
driver.find_element(By.NAME, 'DT_ANO_ATE').send_keys('2022')

driver.find_elements(By.CLASS_NAME, 'button')[1].click()

driver.switch_to.default_content()
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'iframe'))
time.sleep(1)
driver.find_element(By.CLASS_NAME, 'primary-text').find_element(By.TAG_NAME,'a').click()

time.sleep(4)

origem = 'C:\\Users\\prmatteo\\Downloads\\'
destino = os.path.join(origem, 'C:\\Users\\prmatteo\\xxx\\Área de Trabalho\\Arquivos Python\\renovabio2.xls')
extensao = '.xls'

for file in os.listdir(origem):
    if file.endswith(extensao):
        shutil.move(os.path.join(origem,file), destino)

It always downloads .xls file in old excel format. I would like to
convert it to the latest excel version when I go to replace it in the
destination so that it doesn't open in compatibility mode format on
pc.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

自我难过 2025-01-29 02:37:40

我尝试从该站点下载文件，不幸的是，它根本不会产生Excel文件。许多站点通过生成带有表和伪造 xls 扩展名的html文件来假冒Excel导出。 Excel认识到这一点，并试图将文件导入，就好像是文本或HTML一样，并向您显示警告。

不幸的是，在这种情况下，文件甚至不是CSV。这是保存为文本的结果页面，包括标题。它甚至都不使用UTF8，因此非US角色会被弄脏。表数据只是一些行，带有选项卡作为定界符：

B3
ADA - Alongamento da Dívida Agrícola - Estoque

De: 20/03/2022 -->  Até: 18/04/2022

Valores Financeiros em Reais.
Estoque Valorizado.
Metologia de cálculo: Preço Unitário da Curva x Quantidade Depositada na data.

Data    Volume
21/03/2022  0
22/03/2022  0
23/03/2022  0
24/03/2022  0
25/03/2022  0
28/03/2022  0
29/03/2022  0
30/03/2022  0a
31/03/2022  0
01/04/2022  0
04/04/2022  0
05/04/2022  0
06/04/2022  0
07/04/2022  0
08/04/2022  0
11/04/2022  0
12/04/2022  0
13/04/2022  0
14/04/2022  0
18/04/2022  0

您必须解析此文本文件并自己创建一个Excel文件。您可以这样做的一种方法是使用pandas用 read_csv 跳过前9行，然后用 to_excel ：

import pandas as pd
filename="fake_excel.xls"
df=pd.read_csv(filename, sep='\t',skiprows=9)
# Display it, to see what we got
df

df.to_excel("real.xlsx")

方法允许您指定不同的定分列器，跳过标题和页脚行，更改编码等。

一个可能的问题是日期格式。除非您另行指定，否则 read_csv 将导入日期为文本。您可以将其告知似乎是日期的单元，甚至尝试使用适当的参数推断日期格式。

您可以通过几种方式检查加载数据。如果键入 DF ，您将看到数据框架的第一行和最后几行。您可以使用 df.info（）获取列数及其类型等。

I tried to download a file from that site and unfortunately for you, it doesn't produce Excel files at all. A lot of sites fake Excel exports by generating a CSV file or an HTML file with a table and a fake xls extension. Excel recognizes this and tries to import the file as if it was text or HTML and shows you a warning.

Unfortunately, in this case the file isn't even a CSV. It's the result page saved as text, including the headers. It doesn't even use UTF8 so non-US characters get mangled. The table data is just some rows with tabs as delimiters :

B3
ADA - Alongamento da Dívida Agrícola - Estoque

De: 20/03/2022 -->  Até: 18/04/2022

Valores Financeiros em Reais.
Estoque Valorizado.
Metologia de cálculo: Preço Unitário da Curva x Quantidade Depositada na data.

Data    Volume
21/03/2022  0
22/03/2022  0
23/03/2022  0
24/03/2022  0
25/03/2022  0
28/03/2022  0
29/03/2022  0
30/03/2022  0a
31/03/2022  0
01/04/2022  0
04/04/2022  0
05/04/2022  0
06/04/2022  0
07/04/2022  0
08/04/2022  0
11/04/2022  0
12/04/2022  0
13/04/2022  0
14/04/2022  0
18/04/2022  0

You'll have to parse this text file and create an Excel file yourself. One way you could do this is to use Pandas to read the file as a CSV with read_csv skipping the first 9 rows, and then save it as Excel with to_excel:

import pandas as pd
filename="fake_excel.xls"
df=pd.read_csv(filename, sep='\t',skiprows=9)
# Display it, to see what we got
df

df.to_excel("real.xlsx")

The read_csv method allows you to specify different delimiters, skip header and footer rows, change the encoding etc.

One possible problem is the date format. Unless you specify otherwise, read_csv will import dates as text. You can tell it to parse cells that appear to be dates and even try to infer the date format with the proper parameters.

You can inspect the loaded data in several ways. If you type df you'll see the first and last few rows of the DataFrame. You can use df.info() to get the number of columns and their types etc.