Validate the prices in "prod_price". Drop the rows that are obviously wrong

Posted 2025-02-10 08:57:18

I am working with a data frame. The dataset is large and I have to remove the inconsistent rows, but there is so much data that whenever I try to check for inconsistencies I end up with the wrong answer.

import pandas as pd
import numpy as np

from checker.binder import binder; binder.bind(globals())
from intro_data_analytics.check_scrubbing import *

df = pd.read_csv('data/inu_neko_orderline.csv')
df

trans_id    prod_upc    cust_id trans_timestamp trans_year  trans_month trans_day   trans_hour  trans_quantity  cust_age    cust_state  prod_price  prod_title  prod_category   prod_animal_type    prod_size   total_sales
0   10300097    719638485153    1001019 2021-01-01 07:35:21.439873  2021    1   1   1   1   20  NY  72.99   Cat Cave    bedding cat NaN 0
1   10300093    73201504044 1001015 2021-01-01 09:33:37.499660  2021    1   1   1   1   34  NY  18.95   Purrfect Puree  treat   cat NaN 0
2   10300093    719638485153    1001015 2021-01-01 09:33:37.499660  2021    1   1   1   1   34  NY  72.99   Cat Cave    bedding cat NaN 0
3   10300093    441530839394    1001015 2021-01-01 09:33:37.499660  2021    1   1   1   2   34  NY  28.45   Ball and String toy cat NaN 0
4   10300093    733426809698    1001015 2021-01-01 09:33:37.499660  2021    1   1   1   1   34  NY  18.95   Yum Fish-Dish   food    cat NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38619   10327860    287663658863    1022098 2021-06-30 15:37:12.821020  2021    6   30  30  1   25  New York    9.95    All Veggie Yummies  treat   dog NaN 0
38620   10327960    140160459467    1022157 2021-06-30 15:45:09.872732  2021    6   30  30  2   31  Pennsylvania    48.95   Snoozer Essentails  bedding dog NaN 0
38621   10328009    425361189561    1022189 2021-06-30 15:57:44.295104  2021    6   30  30  2   53  New Jersey  15.99   Snack-em Fish   treat   cat NaN 0
38622   10328089    733426809698    1022236 2021-06-30 15:59:29.801593  2021    6   30  30  1   23  Tennessee   18.95   Yum Fish-Dish   food    cat NaN 0
38623   10328109    717036112695    1011924 2021-06-30 17:30:52.205912  2021    6   30  30  1   24  Pennsylvania    60.99   Reddy Beddy bedding dog medium  0
38624 rows × 17 columns
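
With 38624 rows, reading the printed frame is unlikely to reveal bad values, so one option is to summarise each column instead. The following is only a sketch on a small made-up sample that reuses the column names `prod_price` and `cust_state` from the table above; the sample values are assumptions chosen for illustration, not the real file.

```python
import pandas as pd

# Hypothetical sample with the same column names as the table above.
df = pd.DataFrame({
    "cust_state": ["NY", "NY", "New York", "Pennsylvania"],
    "prod_price": [72.99, 18.95, 9.95, 48.95],
})

# Summary statistics expose out-of-range values without scrolling through rows.
price_summary = df["prod_price"].describe()
print(price_summary[["min", "max"]])

# Distinct values and their counts expose spelling variants
# such as "NY" and "New York" for the same state.
state_counts = df["cust_state"].value_counts(dropna=False)
print(state_counts)
```

On the real data the same two calls can be run directly on the `df` loaded with `pd.read_csv`.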

Comments (2)

秋风の叶未落 2025-02-17 08:57:18

There is a test row in the table whose price is far too large (6 digits), while the maximum real price is $72.
You need to delete this test row; after that the data will be clean.

I found this row by downloading the Coursera files and checking them in Google Sheets.
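
The same check can be done in pandas instead of a spreadsheet. This is a minimal sketch on made-up data: the "Test Product" row and the cutoff `MAX_PLAUSIBLE_PRICE` are assumptions for illustration, picked only because the answer says real prices top out around $72 and the bad value has 6 digits.

```python
import pandas as pd

# Hypothetical miniature of the order-line table; the 6-digit price
# stands in for the test row described in the answer.
df = pd.DataFrame({
    "prod_title": ["Cat Cave", "Purrfect Puree", "Test Product", "Reddy Beddy"],
    "prod_price": [72.99, 18.95, 123456.0, 60.99],
})

# Assumed cutoff: far above the ~$72 maximum, far below a 6-digit value.
MAX_PLAUSIBLE_PRICE = 1000

# Look at the suspicious rows before dropping anything.
suspicious = df[df["prod_price"] > MAX_PLAUSIBLE_PRICE]
print(suspicious)

# Keep only the rows with a plausible price.
clean = df[df["prod_price"] <= MAX_PLAUSIBLE_PRICE].reset_index(drop=True)
print(clean["prod_price"].max())
```

Note that the `<=` comparison is false for NaN, so rows with a missing price would also be left out of `clean`.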

吻泪 2025-02-17 08:57:18

I want to add to the answer Jalal already gave.

That answer is correct, but based on what I tried today, the checker still reports an error if the total number of columns is wrong. Even after I filtered out the data that was not of float type, it was still marked wrong. Only when I dropped all rows with NaN values did it give me a pass.

Weird, I know.
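
For reference, the two steps described in this comment might look roughly like the sketch below. The sample data is invented, and using `pd.to_numeric` with `errors="coerce"` is my assumption about how to filter out non-float values; the comment does not say how it was actually done.

```python
import pandas as pd

# Hypothetical sample: one non-numeric price and one missing price.
df = pd.DataFrame({
    "prod_title": ["Cat Cave", "Yum Fish-Dish", "Snack-em Fish", "Reddy Beddy"],
    "prod_price": ["72.99", "not a price", "15.99", None],
    "prod_size": ["small", "large", "medium", "medium"],
})

# Turn anything that cannot be parsed as a number into NaN instead of raising.
df["prod_price"] = pd.to_numeric(df["prod_price"], errors="coerce")

# Count missing values per column before deciding what to drop.
print(df.isna().sum())

# Drop every row that has a NaN in any column.
clean = df.dropna().reset_index(drop=True)
print(clean)
```

One caution: in the excerpt shown in the question, `prod_size` is NaN for most rows, so a bare `dropna()` on that frame would discard them too. Passing `subset=["prod_price"]` limits the check to the price column, if that is what the checker expects.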
