Validate the prices in "prod_price" and remove the obviously wrong rows
I have been trying to work on this DataFrame. It is a large dataset and I have to remove the inconsistent rows, but whenever I try to check for the inconsistency, the data is so big that I keep getting the wrong answer.
import pandas as pd
import numpy as np
from checker.binder import binder; binder.bind(globals())
from intro_data_analytics.check_scrubbing import *
df = pd.read_csv('data/inu_neko_orderline.csv')
df
trans_id prod_upc cust_id trans_timestamp trans_year trans_month trans_day trans_hour trans_quantity cust_age cust_state prod_price prod_title prod_category prod_animal_type prod_size total_sales
0 10300097 719638485153 1001019 2021-01-01 07:35:21.439873 2021 1 1 1 1 20 NY 72.99 Cat Cave bedding cat NaN 0
1 10300093 73201504044 1001015 2021-01-01 09:33:37.499660 2021 1 1 1 1 34 NY 18.95 Purrfect Puree treat cat NaN 0
2 10300093 719638485153 1001015 2021-01-01 09:33:37.499660 2021 1 1 1 1 34 NY 72.99 Cat Cave bedding cat NaN 0
3 10300093 441530839394 1001015 2021-01-01 09:33:37.499660 2021 1 1 1 2 34 NY 28.45 Ball and String toy cat NaN 0
4 10300093 733426809698 1001015 2021-01-01 09:33:37.499660 2021 1 1 1 1 34 NY 18.95 Yum Fish-Dish food cat NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38619 10327860 287663658863 1022098 2021-06-30 15:37:12.821020 2021 6 30 30 1 25 New York 9.95 All Veggie Yummies treat dog NaN 0
38620 10327960 140160459467 1022157 2021-06-30 15:45:09.872732 2021 6 30 30 2 31 Pennsylvania 48.95 Snoozer Essentails bedding dog NaN 0
38621 10328009 425361189561 1022189 2021-06-30 15:57:44.295104 2021 6 30 30 2 53 New Jersey 15.99 Snack-em Fish treat cat NaN 0
38622 10328089 733426809698 1022236 2021-06-30 15:59:29.801593 2021 6 30 30 1 23 Tennessee 18.95 Yum Fish-Dish food cat NaN 0
38623 10328109 717036112695 1011924 2021-06-30 17:30:52.205912 2021 6 30 30 1 24 Pennsylvania 60.99 Reddy Beddy bedding dog medium 0
38624 rows × 17 columns
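
For reference, one quick way to surface the suspicious prices before deciding which rows to drop (a minimal sketch, assuming the column names shown in the output above) is to look at the summary statistics and list the rows that fall far outside the normal range:

import pandas as pd

df = pd.read_csv('data/inu_neko_orderline.csv')

# Summary statistics for the price column: a max far above the
# typical product prices points to bad rows.
print(df['prod_price'].describe())

# List the rows whose price looks implausible; the 100-dollar cutoff
# is an assumption based on the prices visible in the sample output.
suspicious = df[df['prod_price'] > 100]
print(suspicious[['trans_id', 'prod_title', 'prod_price']])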
2 Answers
There is a row in the table that is a test row: its price value is far too big (6 digits), while the maximum legitimate price is $72.
You need to delete this test row, and then the data will be clean.
I found this row by downloading the Coursera files and checking them in Google Sheets.
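
A minimal sketch of that clean-up step (assuming, as described above, that $72.99 is the highest legitimate price, so anything above it is treated as a test entry):

# Keep only the rows whose price is within the legitimate range.
# The 72.99 ceiling comes from the highest real price mentioned above;
# adjust it if the catalogue contains more expensive items.
df = df[df['prod_price'] <= 72.99]

# Confirm the 6-digit test price is gone.
print(df['prod_price'].max())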
I want to add to the answer Jalal already gave.
Although it is correct, based on what I tried today, the check will still fail if the total number of columns ends up wrong. Even after I filtered out the data that was not of float type, it was still wrong. It was only when I dropped all the rows with NaN values that it gave me a pass.
Weird, I know.
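
For what it's worth, a sketch of the NaN clean-up described above (whether to drop on every column or only on a subset is an assumption here; it depends on what the checker expects):

# dropna() with no arguments removes any row containing a NaN in any
# column, which in this dataset also removes every row where prod_size
# is empty.
df_all_clean = df.dropna()

# Alternatively, restrict the check to specific columns (a hypothetical
# choice; adapt it to whatever the checker validates).
df_subset_clean = df.dropna(subset=['prod_price', 'trans_quantity'])

print(len(df_all_clean), len(df_subset_clean))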