I need a suggestion for a neat data structure to store a very large dataset (training Naive Bayes in Python)
I am going to implement a Naive Bayes classifier in Python and classify e-mails as spam or not spam. I have a very sparse and long dataset with many entries. Each entry looks like the following:
1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 ...
where 1 is the label (1 for spam, -1 for not spam), and each pair corresponds to a word and its frequency. E.g. 9:3 corresponds to word 9, which occurs 3 times in this e-mail sample.
I need to read this dataset and store it in a structure. Since it's a very big and sparse dataset, I'm looking for a neat data structure that stores the following:
- the index of each e-mail
- its label (1 or -1)
- each word and its frequency per e-mail
- I also need to create a corpus of all words and their frequencies, together with the label information
Any suggestions for such a data structure?
4 Answers
I would generate a class: you store your sparse vector in bowdict and hold all your data in a list of Documents, so you can get an aggregation over all docs with a given label.
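A minimal sketch of such a class, the loading, and that aggregation, assuming the input format from the question (the file name train.txt and the helper parse_line are illustrative):

    from collections import Counter

    class Document(object):
        def __init__(self, index, label, bowdict):
            self.index = index      # index of the e-mail
            self.label = label      # 1 (spam) or -1 (not spam)
            self.bowdict = bowdict  # sparse bag-of-words: {word_id: frequency}

    def parse_line(line):
        # "1 9:3 94:1 ..." -> (label, {word_id: frequency})
        parts = line.split()
        label = int(parts[0])
        bowdict = {int(w): int(c) for w, c in (p.split(':') for p in parts[1:])}
        return label, bowdict

    docs = []
    with open('train.txt') as f:
        for i, line in enumerate(f):
            label, bowdict = parse_line(line)
            docs.append(Document(i, label, bowdict))

    def aggregate(docs, label):
        # sum the bowdicts of all docs carrying the given label
        total = Counter()
        for doc in docs:
            if doc.label == label:
                total.update(doc.bowdict)
        return total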
You can persist all your data with the cPickle module. Another approach would be to use scipy.sparse (http://docs.scipy.org/doc/scipy/reference/sparse.html): you can represent a bow-vector as a sparse matrix with one row. If you want to aggregate bows, you just add them up. This could be considerably faster than the simple solution above.
Further, you could store all your sparse docs in one large matrix, where a Document instance holds a reference to the matrix and the row index of the associated row.
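A sketch of that variant, continuing the docs list from the class-based sketch above (num_words, the assumed vocabulary size, is illustrative):

    import numpy as np
    from scipy.sparse import csr_matrix, vstack

    num_words = 100000  # assumed vocabulary size

    def to_sparse_row(bowdict, num_words):
        # one-row sparse matrix representing a bag-of-words vector
        cols = np.array(list(bowdict.keys()), dtype=int)
        vals = np.array(list(bowdict.values()), dtype=float)
        rows = np.zeros(len(cols), dtype=int)
        return csr_matrix((vals, (rows, cols)), shape=(1, num_words))

    # all sparse docs stacked into one large matrix, one row per document
    big = vstack([to_sparse_row(d.bowdict, num_words) for d in docs], format='csr')

    # aggregating all docs with a given label is just a row sum
    labels = np.array([d.label for d in docs])
    spam_total = big[labels == 1].sum(axis=0)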
If you assume you don't care about multiple occurrences of each word in an e-mail (that is, your features are booleans), then all you really need to know is:
For each feature, what is the count of positive associations and negative associations?
You can do this online very easily in one pass, keeping track of just those two numbers for each feature.
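A minimal one-pass sketch of those counts, again assuming the input format from the question (train.txt is illustrative):

    from collections import defaultdict

    # per feature: in how many spam / non-spam e-mails does it occur?
    pos_count = defaultdict(int)
    neg_count = defaultdict(int)

    with open('train.txt') as f:
        for line in f:
            parts = line.split()
            label = int(parts[0])
            for pair in parts[1:]:
                word = int(pair.split(':')[0])  # boolean feature: the count is ignored
                if label == 1:
                    pos_count[word] += 1
                else:
                    neg_count[word] += 1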
Non-boolean features mean you'll have to discretize them somehow, but you aren't really asking how to do that.
https://github.com/Yelp/sqlite3dbm
http://packages.python.org/sqlite3dbm/
This is like a Python dictionary, except that it stores everything you give it on disk, so it is persistent. It is not going to bloat up tons of memory, because it writes everything to disk. You can have one program set up these files and a different one use them for classification, without having to worry about serialization problems.
You can cleanly model the first problem (storing each e-mail with its label and word frequencies) as a mapping from the e-mail index to its label and sparse word-frequency dict:
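A sketch using the standard-library shelve module, which exposes this kind of persistent dict-like interface (the sqlite3dbm package linked above is meant to provide the same kind of persistent mapping):

    import shelve

    # key: e-mail index, value: (label, {word_id: frequency})
    emails = shelve.open('emails.db')
    emails['0'] = (1, {9: 3, 94: 1, 109: 1})  # the sample entry from the question
    emails.close()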
You can model the second problem (the corpus) as a mapping from each word to its frequency counts per label:
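A matching sketch (the per-label dict layout is an assumption):

    import shelve

    # key: word id, value: its total frequency per label
    corpus = shelve.open('corpus.db')
    corpus['9'] = {'spam': 3, 'not_spam': 0}  # illustrative counts for word 9
    corpus.close()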
I would start with some relational database (SQLite is easy to set up), and use the following table structure:
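One possible schema, sketched with the standard-library sqlite3 module (table and column names are assumptions): one row per e-mail, and one row per (e-mail, word) pair.

    import sqlite3

    conn = sqlite3.connect('spam.db')
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS entries (
            entry_id INTEGER PRIMARY KEY,  -- index of the e-mail
            label    INTEGER NOT NULL      -- 1 (spam) or -1 (not spam)
        );
        CREATE TABLE IF NOT EXISTS frequencies (
            entry_id INTEGER NOT NULL REFERENCES entries(entry_id),
            word_id  INTEGER NOT NULL,
            freq     INTEGER NOT NULL
        );
    """)
    conn.commit()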
Against that schema you can run queries like the ones sketched below:
- to get the entries,
- to get the word frequencies for some entry,
- to get the word-frequency corpus, and
- to include the word itself, if you want.
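Query sketches, continuing with the cur cursor from the schema above (the words table in the last query is an extra assumption for mapping word ids to actual words):

    # all entries with their labels
    cur.execute("SELECT entry_id, label FROM entries")

    # word frequencies for one entry (here entry 0)
    cur.execute("SELECT word_id, freq FROM frequencies WHERE entry_id = ?", (0,))

    # corpus: total frequency per word and label
    cur.execute("""
        SELECT f.word_id, e.label, SUM(f.freq)
        FROM frequencies f JOIN entries e ON e.entry_id = f.entry_id
        GROUP BY f.word_id, e.label
    """)

    # same, including the word itself via an assumed words(word_id, word) table
    cur.execute("""
        SELECT w.word, e.label, SUM(f.freq)
        FROM frequencies f
        JOIN entries e ON e.entry_id = f.entry_id
        JOIN words   w ON w.word_id  = f.word_id
        GROUP BY w.word, e.label
    """)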