How to efficiently create a term-document matrix from a bag-of-words dataset
I am experimenting with the UCI Bag of Words dataset. I have read the document IDs, words (word IDs), and word counts into three separate lists. The first 10 items of those lists look like this:
['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count
I can't figure out how to make a term-document matrix from these lists without any redundancy. I'd like the rows to be docIDs, the columns to be wordIDs, and the corresponding cell values to be the word counts. What is an efficient way to do this with Python (pandas)?
1 Answer
I think this answers your question: put the three lists into a DataFrame with each list in a separate column, then pivot it with "docIDs" as the index, "wordIDs" as the columns, and "count" as the values.
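The code blocks from this answer did not survive extraction; the steps above can be sketched roughly as follows, using the sample lists from the question (conversion of the string counts to integers is an added assumption):

```python
import pandas as pd

# The three lists from the question.
doc_ids = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
word_ids = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

# DataFrame with each list in a separate column; counts as integers.
df = pd.DataFrame({'docIDs': doc_ids,
                   'wordIDs': word_ids,
                   'count': pd.to_numeric(counts)})

# Pivot: rows = docIDs, columns = wordIDs, cells = counts.
# (doc, word) pairs that never occur become NaN; fill them with 0.
tdm = df.pivot(index='docIDs', columns='wordIDs', values='count')
tdm = tdm.fillna(0).astype(int)
print(tdm)
```

Each row of `tdm` is a document and each column a word ID; for example, word '20' appears twice in document '1'.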
Alternatively, you can use unstack() by setting the desired index and column keys as a MultiIndex and then unstacking the column level, which produces the same result. This should use less memory.
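The unstack() variant described above might look like this (a sketch against the same sample lists; `fill_value=0` replaces the NaNs for missing pairs):

```python
import pandas as pd

doc_ids = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
word_ids = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

df = pd.DataFrame({'docIDs': doc_ids,
                   'wordIDs': word_ids,
                   'count': pd.to_numeric(counts)})

# Make (docIDs, wordIDs) a MultiIndex, then unstack the wordIDs
# level into columns. Missing cells are filled with 0.
tdm = df.set_index(['docIDs', 'wordIDs'])['count'].unstack(fill_value=0)
print(tdm)
```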
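One caveat not covered in the answer: for the full UCI dataset (word IDs in the thousands, documents in the hundreds of thousands) a dense pandas matrix may not fit in memory, since most cells are zero. A sparse matrix stores only the nonzero entries. This is an alternative approach using scipy.sparse, not part of the original answer, and it assumes the 1-based integer IDs the UCI files use:

```python
import numpy as np
from scipy.sparse import coo_matrix

doc_ids = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
word_ids = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

rows = np.asarray(doc_ids, dtype=int) - 1   # docIDs are 1-based
cols = np.asarray(word_ids, dtype=int) - 1  # wordIDs are 1-based
vals = np.asarray(counts, dtype=int)

# Only the nonzero (doc, word) cells are stored; convert to CSR
# for fast row slicing and cell lookup.
tdm = coo_matrix((vals, (rows, cols))).tocsr()

count = tdm[1, 6]  # count of wordID 7 in docID 2
```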