Getting a file's modified date - web scraping with BeautifulSoup in Python
I am trying to download all csv files from the following website: https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices. I have managed to do that with the following code:
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://emi.ea.govt.nz/Wholesale/Datasets/FinalPricing/EnergyPrices'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
csv_links = ['https://emi.ea.govt.nz' + a['href'] for a in soup.select('td.csv a')]

contents = []
for link in csv_links:
    req = requests.get(link)
    s = str(req.content, 'utf-8')
    data = StringIO(s)
    df = pd.read_csv(data)
    contents.append(df)
final_price = pd.concat(contents)
If at all feasible, I'd like to streamline this process. The files on the website are updated every day, and I don't want to rerun the script daily to re-download everything; instead, I only want to pull yesterday's files and append them to the ones already in my folder. To achieve this, I need to scrape the Date Modified column along with the file URLs. I'd be grateful if someone could show me how to get the dates the files were updated.
2 Answers
You can use an nth-child range to filter for columns 1 and 2 of the table (matched initially by class), with the appropriate row offset. The returned cells alternate column 1, column 2, column 1, and so on, so split that list and extract the URL or the date text in list comprehensions: complete the URLs and convert the date text to actual dates in their respective comprehensions, then zip the resulting lists and convert to a DataFrame.
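The answer's code was not quoted above, so here is a minimal sketch of the nth-child approach it describes, run against a small inline HTML sample that mimics the page's table. The class name, column order, and date format in the sample are assumptions; on the live page you would fetch the HTML with requests as in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline sample mimicking the assumed table layout of the EMI page:
# column 1 = file link, column 2 = date modified, column 3 = size.
html = """
<table class="table">
  <tr><th>File name</th><th>Date modified</th><th>Size</th></tr>
  <tr><td class="csv"><a href="/f/202112.csv">202112.csv</a></td>
      <td>2021-12-15</td><td>1 MB</td></tr>
  <tr><td class="csv"><a href="/f/202111.csv">202111.csv</a></td>
      <td>2021-12-01</td><td>1 MB</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# :nth-child(-n+2) keeps only cells in columns 1 and 2; the matched
# list then alternates column 1, column 2, column 1, column 2, ...
cells = soup.select('table.table tr td:nth-child(-n+2)')
links = ['https://emi.ea.govt.nz' + td.a['href'] for td in cells[0::2]]
dates = [pd.to_datetime(td.get_text(strip=True)) for td in cells[1::2]]

# Zip the two lists and convert to a DataFrame.
files = pd.DataFrame(list(zip(links, dates)), columns=['url', 'date_modified'])
print(files)
```

With the URLs and dates in one frame, you can filter `files` to yesterday's date and download only those links.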
You can also apply a list comprehension technique:
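A sketch of such a list comprehension, again against an inline sample. It pairs each csv link with the text of the cell that follows it in the same row; the sibling relationship and date format are assumptions about the page layout.

```python
from bs4 import BeautifulSoup

# Inline sample: each row holds a csv link cell followed by a date cell.
html = """
<table>
  <tr><td class="csv"><a href="/f/202112.csv">202112.csv</a></td>
      <td>15 Dec 2021</td></tr>
  <tr><td class="csv"><a href="/f/202111.csv">202111.csv</a></td>
      <td>01 Dec 2021</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# One comprehension: for every csv cell, build the full URL and grab
# the date text from the next <td> sibling in the same row.
rows = [
    ('https://emi.ea.govt.nz' + td.a['href'],
     td.find_next_sibling('td').get_text(strip=True))
    for td in soup.select('td.csv')
]
print(rows)
```

Each tuple in `rows` holds a (url, date text) pair, ready to filter on yesterday's date before downloading.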