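How do I filter and clean multiple Dask dataframes in Python?
After reading and appending multiple .csv files as a Dask dataframe, I am trying to clean the frame by excluding unnecessary rows. This raises a dtype mismatch error, even though the code below reports the expected dtypes. I can neither show the first 5 rows with dfb.head() nor convert the result to a pandas dataframe with dfb = dfb.compute().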
#########Reading and appending multiple .csv files###############
import pandas as pd
import numpy as np
import glob
import os
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
from dask import dataframe as dd
path = 'C:\\Nitin Folder\\PYTHON\\Py4\\1800\\AS32\\300\\Input'
files = glob.glob(os.path.join(path, "*.csv"))
data = []
for csv in files:
    frame = dd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)
dfb = dd.concat(data, ignore_index=True)
dfb = dfb.repartition(npartitions=1)
dfb.dtypes
output:
id                 int64
Timestamp         object
student_name      object
country           object
Distance(mts)      int64
cellpower        float64
filename          object
dtype: object
############# filtering daskframe to remove non integer rows ###################
dfb = dfb[~(dfb.id == 'id')]
dfb.head()
dfb['Distance(mts)'] = dfb['Distance(mts)'].astype(int)
dfb['cellpower'] = dfb['cellpower'].astype(int)
dfb['id'] = dfb['id'].astype(int)
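As I understand it, this filter is not being applied to the entire Dask dataframe. I have even tried converting the dtypes manually, but the same error persists. Any help with a solution would be appreciated. :)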
output error message:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+---------------+--------+----------+
| Column        | Found  | Expected |
+---------------+--------+----------+
| Distance(mts) | object | int64    |
| cellpower     | object | float64  |
| id            | object | int64    |
+---------------+--------+----------+
The following columns also raised exceptions on conversion:
- Distance(mts)
  ValueError("invalid literal for int() with base 10: 'Distance(mts)'")
- cellpower
  ValueError("could not convert string to float: 'cellpower'")
- id
  ValueError("invalid literal for int() with base 10: 'id'")
Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:
dtype={'Distance(mts)': 'object',
       'cellpower': 'object',
       'id': 'object'}
to the call to `read_csv`/`read_table`.