pandas numpy crosstab

Combine the non-null values from two columns into a single column

Posted on 2025-01-21 03:55:55

I am following: https://medium.com/@anateresa.mdneto/starbucks-capstone-project-79f84b2a1558
Data is from the links in the article.

Data is initially loaded with this code:

import numpy as np
import pandas as pd

file1 = "e:\\python\\pandas\\datasets\\Starbucks\\portfolio.json"
portfolio = pd.read_json(file1, orient='records', lines=True)

file2 = "e:\\python\\pandas\\datasets\\Starbucks\\profile.json"
profile = pd.read_json(file2, orient='records', lines=True)

file3 = "e:\\python\\pandas\\datasets\\Starbucks\\transcript.json"
transcript = pd.read_json(file3, orient='records', lines=True)

I have a DataFrame (transcript) in which one column's values are dicts. Most dicts hold a single key:value pair, but some hold two.

I start by extracting/exploding the column of dicts so that I get a new column for each of the keys.

There are four unique keys and I get four new columns.

I also print the head() and check the notnull() counts:

transcript_cp = transcript.copy(deep=True)
transcript_cp = transcript_cp.join(pd.DataFrame(transcript_cp.pop('value').values.tolist()))
print(transcript_cp.head(), '\n')
print(transcript_cp['offer id'].notnull().sum())
print(transcript_cp['amount'].notnull().sum())
print(transcript_cp['offer_id'].notnull().sum())
print(transcript_cp['reward'].notnull().sum(), '\n')

Output:

                             person           event  time                          offer id  amount offer_id  reward
0  78afa995795e4d85b5d9ceeca43f5fef  offer received     0  9b98b8c7a33c4b65b9aebfe6a799e6d9     NaN      NaN     NaN
1  a03223e636434f42ac4c3df47e8bac43  offer received     0  0b1e1539f2cc45b7b9fa7c272da2e1d7     NaN      NaN     NaN
2  e2127556f4f64592b11af22de27a7932  offer received     0  2906b810c7d4411798c6938adc9daaa5     NaN      NaN     NaN
3  8ec6ce2a7e7949b1bf142def7d0e0586  offer received     0  fafdcd668e3743c1bb461111dcafc2a4     NaN      NaN     NaN
4  68617ca6246f4fbc85e91a2a49552598  offer received     0  4d5c57ea9a6940dd891ad53e9dbe8da0     NaN      NaN     NaN 

134002
138953
33579
33579
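
The expansion step above can be tried on a small frame first. This is only a sketch: the `toy` frame and its values are made up for illustration and are not taken from the Starbucks dataset.

```python
import pandas as pd

# Made-up stand-in for the transcript data (hypothetical values).
toy = pd.DataFrame({
    "person": ["p1", "p2", "p3"],
    "value": [
        {"offer id": "A"},               # key spelled with a space
        {"amount": 1.5},
        {"offer_id": "B", "reward": 2},  # key spelled with an underscore
    ],
})

# pop() removes the dict column; building a DataFrame from the list of dicts
# gives one column per distinct key, with NaN where a row lacks that key.
expanded = toy.join(pd.DataFrame(toy.pop("value").tolist()))

print(expanded)
print(expanded.notna().sum())
```

Each row only fills the columns whose keys appeared in its dict, which is why the two spellings of the offer key end up in separate, mostly empty columns.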

'offer id' and 'offer_id' are really the same thing. There is a typo in the name, so I want/need to combine these two columns into a single column.

For this to work correctly, the following assumptions must hold:

I CANNOT have non-null values in both columns for the same row, or I will just overwrite one of the values.
I can have null values in both columns.
Where I have a non-null value in one column and a null in the other, I want a new column with just the non-null value.

Here is how I prove my assumptions are valid:

df1 = transcript_cp.isna()
df2 = pd.crosstab(df1['offer id'], df1['offer_id'])
print(df2)

offer_id  False    True
offer id               
False         0  134002
True      33579  138953

False/False (not null/not null): there are zero rows where both columns are non-null.
False/True (not null/null): there are 134002 rows where 'offer id' is not null but 'offer_id' is null.
True/False (null/not null): there are 33579 rows where 'offer id' is null but 'offer_id' is not.
True/True (null/null): there are 138953 rows where both are null.
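
The same three assumptions can also be checked with boolean masks instead of a crosstab. A minimal sketch on made-up data (the `toy` frame is hypothetical, chosen so that the two columns never overlap):

```python
import numpy as np
import pandas as pd

# Hypothetical data: the two columns are never both filled in the same row.
toy = pd.DataFrame({
    "offer id": ["A", np.nan, np.nan, "C"],
    "offer_id": [np.nan, "B", np.nan, np.nan],
}, dtype=object)

has_space = toy["offer id"].notna()
has_underscore = toy["offer_id"].notna()

both = int((has_space & has_underscore).sum())         # rows that would be overwritten
exactly_one = int((has_space ^ has_underscore).sum())  # rows the merged column should keep
neither = int((~has_space & ~has_underscore).sum())    # rows that stay null

print(both, exactly_one, neither)  # 0 3 1
```

If `both` is zero, merging is safe, and `exactly_one` is the number of non-null values the merged column should contain.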

To make a new column that has the values from 'offer id' and 'offer_id' combined, I am using np.where.

transcript_cp['TEMP'] = np.where(transcript_cp['offer_id'] != np.nan, transcript_cp['offer_id'], transcript_cp['offer id'])

However, my total for non null values never adds up to 134002 + 33579 = 167581.

Using the above np.where code I get 33579.

If I flip my 'offer id' and 'offer_id' (see below), which to me should make no difference whatsoever, I get 134002.

transcript_cp['TEMP'] = np.where(transcript_cp['offer id'] != np.nan, transcript_cp['offer id'], transcript_cp['offer_id'])

Am I not using np.where correctly? I thought it reads: if condition is True, result1, else result2.

So I am saying if the checked column is not null, return that value, otherwise return the value in the other column.

Based upon my crosstab results I believe I should be getting 167581 non null values, and it should not make any difference in which order I perform the np.where.

Comment by 许久, 2025-01-28 03:55:55

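np.nan != np.nan evaluates to True, and so does any other value compared against np.nan with !=. The condition in both np.where calls is therefore True for every row, so each call simply returns its first column: 33579 non-null values when 'offer_id' comes first and 134002 when 'offer id' comes first.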
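
The behaviour of comparisons against np.nan can be confirmed on a tiny Series. This is a sketch with made-up values; the Series `s` is hypothetical and uses object dtype, as a column of ID strings mixed with NaN typically would.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one real value and two missing ones.
s = pd.Series(["B", np.nan, np.nan], dtype=object)

# NaN is not equal to anything, including itself, so "!=" is True everywhere.
always_true = s != np.nan

# notna() is the test that actually distinguishes missing from present values.
real_mask = s.notna()

print(np.nan != np.nan)      # True
print(always_true.tolist())  # [True, True, True]
print(real_mask.tolist())    # [True, False, False]
```

Swapping `!= np.nan` for `.notna()` in the np.where condition is therefore enough to make the original approach behave as intended.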

Why don't you just use fillna:

transcript_cp['offer id'].fillna(transcript_cp['offer_id'])
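
A minimal sketch of the fillna suggestion on made-up data, next to an np.where variant that uses notna() as its condition. The `toy` frame and the column names `merged` and `merged_where` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same shape of problem: values split across two columns.
toy = pd.DataFrame({
    "offer id": ["A", np.nan, np.nan, "C"],
    "offer_id": [np.nan, "B", np.nan, np.nan],
}, dtype=object)

# Option 1: take 'offer id' and fill its gaps from 'offer_id'.
toy["merged"] = toy["offer id"].fillna(toy["offer_id"])

# Option 2: np.where with a real null test as the condition.
toy["merged_where"] = np.where(
    toy["offer id"].notna(), toy["offer id"], toy["offer_id"]
)

print(toy)
print(toy["merged"].notna().sum())  # 3
```

Because the two source columns never overlap here, both options give the same result regardless of which column is checked first; rows that are null in both columns stay null.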
