dataprep.eda TypeError: Please provide npartitions as an int, or possibly as None if you specify chunksize

Posted on 2025-02-03 06:03:06

I'm struggling to understand this TypeError raised by the dataprep package. My setup is very simple:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "phone": [
            "555-234-5678",
            "(555) 234-5678",
            "555.234.5678",
            "555/234/5678",
            15551234567,
            "(1) 555-234-5678",
            "+1 (234) 567-8901 x. 1234",
            "2345678901 extension 1234",
            "2345678",
            "800-299-JUNK",
            "1-866-4ZIPCAR",
            "123 ABC COMPANY",
            "+66 91 889 8948",
            "hello",
            np.nan,
            "NULL",
        ]
    }
)

from dataprep.clean import clean_phone
clean_phone(df, "phone")

The resulting error message is printed in the terminal (I've omitted file paths and replaced sensitive values with x for security purposes):

Traceback (most recent call last):
  File "c:\Users\x\x\Documents\Repositories\test.py", line 14, in <module>
    clean_phone(df, "phone")
  File "C:\Users\x\Anaconda3\envs\myenv\lib\site-packages\dataprep\clean\clean_phone.py", line 150, in clean_phone
    df = to_dask(df)
  File "C:\Users\x\Anaconda3\envs\myenv\lib\site-packages\dataprep\clean\utils.py", line 73, in to_dask
    return dd.from_pandas(df, npartitions=npartitions)
  File "C:\Users\x\Anaconda3\envs\myenv\lib\site-packages\dask\dataframe\io\io.py", line 236, in from_pandas
    raise TypeError(
TypeError: Please provide npartitions as an int, or possibly as None if you specify chunksize.

This is a direct attempt to replicate the tutorial shown by the dataprep package team found at: https://docs.dataprep.ai/user_guide/clean/clean_phone.html

The expected output is below, as per the tutorial:

[Image: expected output from the tutorial]

I'm posting this because Googling the TypeError returns only one semi-relevant result.

风吹雪碎 2025-02-10 06:03:06

There is a small bug in the dataprep package; you can track it in this PR.

In the meantime, one option to avoid the bug is to explicitly convert data to a dask dataframe and pass that into the function:

import numpy as np
import pandas as pd
from dask.dataframe import from_pandas
from dataprep.clean import clean_phone

df = pd.DataFrame(
    {
        "phone": [
            "555-234-5678",
            "(555) 234-5678",
            "555.234.5678",
            "555/234/5678",
            15551234567,
            "(1) 555-234-5678",
            "+1 (234) 567-8901 x. 1234",
            "2345678901 extension 1234",
            "2345678",
            "800-299-JUNK",
            "1-866-4ZIPCAR",
            "123 ABC COMPANY",
            "+66 91 889 8948",
            "hello",
            np.nan,
            "NULL",
        ]
    }
)

# to avoid the bug we are passing ddf, not df
ddf = from_pandas(df, npartitions=2)
clean_phone(ddf, "phone")
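
The `npartitions=2` above is an arbitrary choice. The error message says dask wants `npartitions` as an int, so if you would rather derive the value from the size of the data, the sketch below shows one way to get a built-in `int`. Note that `choose_npartitions` is a hypothetical helper (not part of dataprep or dask), and the 128 MiB target per partition is my own assumption rather than a documented default.

```python
import math


# Hypothetical helper, not part of dataprep or dask: pick an integer
# partition count so that each partition holds roughly `target_bytes`.
# The 128 MiB default is an assumption chosen for illustration.
def choose_npartitions(total_bytes, target_bytes=128 * 1024 * 1024):
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # int(...) normalises NumPy integer scalars to a built-in int, and
    # math.ceil returns a built-in int; max(1, ...) avoids returning 0
    # for an empty frame.
    return max(1, math.ceil(int(total_bytes) / target_bytes))


print(choose_npartitions(1_000))              # 1
print(choose_npartitions(300 * 1024 * 1024))  # 3
```

With a pandas DataFrame you could pass `df.memory_usage(deep=True).sum()` as `total_bytes` and then call `from_pandas(df, npartitions=choose_npartitions(...))` in place of the hard-coded 2.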
