How do I filter and clean multiple Dask dataframes in Python?

Posted on 2025-01-17 16:50:53


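After reading and appending multiple .csv files as a Dask dataframe, I am trying to clean the frame by excluding unnecessary rows. This raises a mismatched-dtypes error, even though the code below reports the dtypes correctly. I can neither display the first 5 rows with dfb.head() nor convert the frame to a pandas dataframe via dfb = dfb.compute().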

#########Reading and appending multiple .csv files###############
import pandas as pd
import numpy as np
import glob
import os
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
from dask import dataframe as dd

path = 'C:\\Nitin Folder\\PYTHON\\Py4\\1800\\AS32\\300\\Input'
files = glob.glob(os.path.join(path +"/*.csv"))

data = [] 
for csv in files:
    frame = dd.read_csv(csv)
    frame['filename'] = os.path.basename(csv)
    data.append(frame)

dfb = dd.concat(data, ignore_index=True)
dfb = dfb.repartition(npartitions=1) 
dfb.dtypes

output:
id                 int64
Timestamp         object
student_name      object
country           object
Distance(mts)      int64
cellpower        float64
filename          object
dtype: object

############# filtering daskframe to remove non integer rows ###################
dfb = dfb[~(dfb.id == 'id')]
dfb.head()

dfb['Distance(mts)'] = dfb['Distance(mts)'].astype(int)
dfb['cellpower'] = dfb['cellpower'].astype(int)
dfb['id'] = dfb['id'].astype(int)

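As I understand it, this filter is not being applied to the entire Dask dataframe. I have even tried converting the dtypes manually, but the same error persists. Any help finding a solution would be appreciated. :)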
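One possible reading of this symptom, offered as an assumption rather than a confirmed diagnosis: the error text quoted in this post names 'id', 'cellpower' and 'Distance(mts)' as unparseable values, which suggests that some files contain the header line repeated among the data rows. The minimal pandas sketch below uses a made-up two-column extract (CSV_TEXT is hypothetical, not the real data) to show how such a row affects dtype handling.

```python
import io

import pandas as pd

# Hypothetical two-column extract with the header line repeated mid-file.
CSV_TEXT = "id,cellpower\n1,1.5\nid,cellpower\n2,2.5\n"

# With inferred dtypes, the stray header row makes both columns non-numeric.
inferred = pd.read_csv(io.StringIO(CSV_TEXT))
print(inferred.dtypes)

# Forcing the dtype that a clean sample would suggest fails on the stray row.
try:
    pd.read_csv(io.StringIO(CSV_TEXT), dtype={"id": "int64"})
    forced_error = None
except ValueError as exc:
    forced_error = str(exc)
print(forced_error)
```

If this is indeed the cause, the filter on dfb.id never gets a chance to run: the read step fails first, so the stray rows have to be read as text before they can be removed.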

output error message:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+---------------+--------+----------+
| Column        | Found  | Expected |
+---------------+--------+----------+
| Distance(mts) | object | int64    |
| cellpower     | object | float64  |
| id            | object | int64    |
+---------------+--------+----------+

The following columns also raised exceptions on conversion:

- Distance(mts)
  ValueError("invalid literal for int() with base 10: 'Distance(mts)'")
- cellpower
  ValueError("could not convert string to float: 'cellpower'")
- id
  ValueError("invalid literal for int() with base 10: 'id'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Distance(mts)': 'object',
   'cellpower': 'object',
   'id': 'object'}

to the call to `read_csv`/`read_table`.

