How to efficiently create a term-document matrix from a bag-of-words dataset
I am experimenting with the UCI Bag of Words dataset. I have read the document IDs, words (word IDs), and word counts into three separate lists. The first 10 items of those lists look like this:
['1', '1', '1', '1', '1', '2', '2', '2', '3', '3'] #docIDs
['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285'] #wordIDs
['1', '1', '1', '1', '2', '1', '1', '5', '1', '1'] #count
I can't figure out how to make a term-document matrix from these lists without any redundancy. I'd like the rows to be docIDs, the columns to be wordIDs, and the corresponding cell values to be the word counts. What is an efficient way to do this with Python (pandas)?
1 Answer
I think this answers your question: put the three lists into a DataFrame with each list in a separate column, then pivot it with "docIDs" as the index, "wordIDs" as the columns, and "count" as the values.
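The code blocks from this answer did not survive extraction; the steps above can be sketched roughly as follows, using the sample lists from the question (conversion of the string counts to integers is an added assumption):

```python
import pandas as pd

# The three lists from the question.
doc_ids = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
word_ids = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

# DataFrame with each list in a separate column; counts as integers.
df = pd.DataFrame({'docIDs': doc_ids,
                   'wordIDs': word_ids,
                   'count': pd.to_numeric(counts)})

# Pivot: rows = docIDs, columns = wordIDs, cells = counts.
# (doc, word) pairs that never occur become NaN; fill them with 0.
tdm = df.pivot(index='docIDs', columns='wordIDs', values='count')
tdm = tdm.fillna(0).astype(int)
print(tdm)
```

Each row of `tdm` is a document and each column a word ID; for example, word '20' appears twice in document '1'.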
Alternatively, you can use unstack() by setting the desired index and column keys as a MultiIndex and then unstacking the column level, which produces the same result. This should use less memory.
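The unstack() variant described above might look like this (a sketch against the same sample lists; `fill_value=0` replaces the NaNs for missing pairs):

```python
import pandas as pd

doc_ids = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
word_ids = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

df = pd.DataFrame({'docIDs': doc_ids,
                   'wordIDs': word_ids,
                   'count': pd.to_numeric(counts)})

# Make (docIDs, wordIDs) a MultiIndex, then unstack the wordIDs
# level into columns. Missing cells are filled with 0.
tdm = df.set_index(['docIDs', 'wordIDs'])['count'].unstack(fill_value=0)
print(tdm)
```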
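One caveat not covered in the answer: for the full UCI dataset (word IDs in the thousands, documents in the hundreds of thousands) a dense pandas matrix may not fit in memory, since most cells are zero. A sparse matrix stores only the nonzero entries. This is an alternative approach using scipy.sparse, not part of the original answer, and it assumes the 1-based integer IDs the UCI files use:

```python
import numpy as np
from scipy.sparse import coo_matrix

doc_ids = ['1', '1', '1', '1', '1', '2', '2', '2', '3', '3']
word_ids = ['118', '285', '129', '168', '20', '529', '6941', '7', '890', '285']
counts = ['1', '1', '1', '1', '2', '1', '1', '5', '1', '1']

rows = np.asarray(doc_ids, dtype=int) - 1   # docIDs are 1-based
cols = np.asarray(word_ids, dtype=int) - 1  # wordIDs are 1-based
vals = np.asarray(counts, dtype=int)

# Only the nonzero (doc, word) cells are stored; convert to CSR
# for fast row slicing and cell lookup.
tdm = coo_matrix((vals, (rows, cols))).tocsr()

count = tdm[1, 6]  # count of wordID 7 in docID 2
```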