How to count the number of instances of a word in a pandas column?

Posted on 2025-01-29 19:41:55 | Word count: 973 | Views: 1 | Comments: 0

I have a Pandas dataframe that contains genres of rated movies. Some movies fall under multiple genres, each genre separated by a "|". You can see examples of this in the code below.


import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')

df = pd.merge(pd.merge(ratings, users), movies)
df["genres"].value_counts()

As you can see, value_counts() counts each combined genre string as its own value, so it isn't an effective way of counting how many times each unique genre is rated. Is there a pandas method that would let me count how many times each unique genre "word" appears, or do I need to use loops to separate out all the combined genres?

Comments (1)

染墨丶若流云 2025-02-05 19:41:55

You could use the regex r'\s*\|\s*', or even ' *[|] *', to split your genre column, then explode the result and count the values. Note that \s matches any whitespace character. Since | is a regex metacharacter, you need to escape it with a backslash or place it inside a character class, i.e. [|].

df['genres'].str.split(' *[|] *').explode().value_counts()

Drama                                      4
Comedy                                     3
Romance                                    3
Western                                    1
Children's                                 1
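
To make the split, explode, and count steps easier to try out, here is a minimal self-contained sketch. The `sample` frame and its titles and genre strings are made up for illustration and do not come from the MovieLens files; the only assumption carried over from the question is a string column named `genres` with values separated by "|".

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Drama|Romance", "Comedy | Drama", "Western", "Comedy|Romance|Drama"],
})

# Split on "|" with optional surrounding whitespace. A pattern longer than
# one character is treated as a regular expression by str.split by default.
# explode() then puts each genre on its own row, and value_counts() tallies them.
genre_counts = (
    sample["genres"]
    .str.split(r"\s*\|\s*")
    .explode()
    .value_counts()
)

# The character-class spelling of the pattern should give the same tallies.
alt_counts = (
    sample["genres"]
    .str.split(" *[|] *")
    .explode()
    .value_counts()
)

print(genre_counts)
```

If the column never has spaces around the separator, `sample["genres"].str.get_dummies(sep="|").sum()` is another way to get per-genre totals, though it counts a genre at most once per row.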