How to convert soup to a DataFrame

Posted on 2025-02-10 21:17:12


I am new to BeautifulSoup. I am working to scrape some Excel files from a source.
URL to source: https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=
Original data source: https://droughtmonitor.unl.edu/DmData/GISData.aspx/

My main objective is to scrape the data from this URL and convert it into a data frame covering all the files in the original data source URL, and to have any newly added files downloaded automatically and appended.

from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

url2 = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'
r2 = requests.get(url2)
soup = BeautifulSoup(r2.text,'html.parser')
raw_data = [data.text for data in soup]

The above code gives me the following output:

["MapDate,FIPS,County,State,Nothing,D0,D1,D2,D3,D4,ValidStart,ValidEnd\r\n20220621,01001,Autauga County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01003,Baldwin County,AL,81.22,18.78,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01005,Barbour County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01007,Bibb County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01009,Blount County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01011,Bullock County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01013,Butler County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01015,Calhoun County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01017,Chambers County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01019,Cherokee County,AL,69.27,30.73,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n20220621,01021,Chilton County,AL,100.00,0.00,0.00,0.00,0.00,0.00,2022-06-21,2022-06-27\r\n

I want the initial 12 values (MapDate, FIPS, County, State, Nothing, D0, D1, D2, D3, D4, ValidStart, ValidEnd)
to be my column names, and the rest to be the values associated with them.

Also, the original data source URL has values from the year 2000 to 2022. I need all of that data in the same format, in a single CSV.

I also need the code to automatically extract any new data added to the source.

Can someone guide me on this?


1 Answer

橙味迷妹 2025-02-17 21:17:12


The server sends a CSV file, so you don't need BeautifulSoup.

You can use io.StringIO with pandas.read_csv():

import requests
import pandas as pd
import io

url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'

response = requests.get(url)

fh = io.StringIO(response.text)  # create file in memory
df = pd.read_csv(fh)

print(df)
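
Because pandas.read_csv() treats the first row as the header by default, the initial 12 values (MapDate, FIPS, County, and so on) automatically become the column names and the remaining rows become the data, which is exactly what the question asks for. A quick check, using the df from above:

print(df.columns.tolist())
# ['MapDate', 'FIPS', 'County', 'State', 'Nothing', 'D0', 'D1', 'D2', 'D3', 'D4', 'ValidStart', 'ValidEnd']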

Or you can use io with the csv module:

import requests
import csv
import io

url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'

response = requests.get(url)

fh = io.StringIO(response.text)  # create file in memory
data = list(csv.reader(fh))

print(data)
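
With the csv module, the first element of data is the header row and the rest are the data rows, so you can split them yourself. A minimal sketch (it assumes pandas is imported as pd if you want a DataFrame at the end):

header = data[0]   # ['MapDate', 'FIPS', 'County', ...]
rows = data[1:]    # the remaining rows, as lists of strings
df = pd.DataFrame(rows, columns=header)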

EDIT:

You can even use the URL directly with pandas:

import pandas as pd

url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date=2022-06-21'

df = pd.read_csv(url)

print(df)

EDIT:

Now you only need a list of dates: run a for-loop to read every CSV and keep the frames in a list, then use pandas.concat() to combine that list into a single DataFrame.

Pandas doc: Merge, join, concatenate and compare

Minimal working example:

import pandas as pd

# --- before loop ---

all_dates = ["2022-06-21", "2022-06-14", "2022-06-07"]
all_dfs = []

# url without `2022-06-21` at the end
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date='

# --- loop ---

for date in all_dates:
    print('date:', date)
    df = pd.read_csv( url + date )
    all_dfs.append( df )

# --- after loop --- 

full_df = pd.concat(all_dfs)
print(full_df)
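
The question also asks for everything in a single CSV; after the loop you can simply write the combined frame to disk (the filename here is just an example):

full_df.to_csv('drought_2000_2022.csv', index=False)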

To get the list of dates you could scrape them from the webpage, but that may require Selenium instead of BeautifulSoup, because the page uses JavaScript to add the dates to the page.

Alternatively, use DevTools (tab: Network, filter: XHR) to see which URL the JavaScript calls to get the dates, and fetch it with requests:

import requests

# without header `Content-Type` it sends `HTML` instead of `JSON`
headers = {
#    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
#    'X-Requested-With': 'XMLHttpRequest',
#    'Referer': 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/',
    'Content-Type': 'application/json; charset=utf-8',
}

url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/ReturnDMWeeks'

response = requests.get(url, headers=headers)
#print(response.text)

data = response.json()

all_dates = data['d']
all_dates = [f"{d[:4]}-{d[4:6]}-{d[6:]}" for d in all_dates]

print(all_dates)

Result:

['2022-06-21', '2022-06-14', '2022-06-07', ..., '2000-01-18', '2000-01-11', '2000-01-04']
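
Putting the pieces together, here is a sketch of an incremental updater that is safe to re-run: it pulls the current date list from ReturnDMWeeks, downloads only the dates that are not cached locally yet, and rebuilds the combined CSV. The dm_cache folder and the output filename are my own assumptions, not anything the site prescribes:

import os
import requests
import pandas as pd

CACHE_DIR = 'dm_cache'  # assumption: local folder holding one CSV per date
os.makedirs(CACHE_DIR, exist_ok=True)

# fetch the full list of available dates (same endpoint as above)
headers = {'Content-Type': 'application/json; charset=utf-8'}
url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/ReturnDMWeeks'
all_dates = [f"{d[:4]}-{d[4:6]}-{d[6:]}"
             for d in requests.get(url, headers=headers).json()['d']]

base_url = 'https://droughtmonitor.unl.edu/DmData/GISData.aspx/?mode=table&aoi=county&date='

for date in all_dates:
    path = os.path.join(CACHE_DIR, f'{date}.csv')
    if not os.path.exists(path):  # download only dates we don't have yet
        pd.read_csv(base_url + date).to_csv(path, index=False)

# rebuild the single combined CSV from the cached files
full_df = pd.concat(pd.read_csv(os.path.join(CACHE_DIR, f'{d}.csv')) for d in all_dates)
full_df.to_csv('drought_full.csv', index=False)

Because already-downloaded weeks are skipped, running this script on a schedule picks up any newly published dates automatically, which covers the "new data added to the source" requirement.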