I need a neat data structure suggestion to store a very large dataset (training Naive Bayes in Python)

Posted 2024-12-05 02:42:23

I am going to implement Naive Bayes classifier with Python and classify e-mails as Spam or Not spam. I have a very sparse and long dataset with many entries. Each entry is like the following:

1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 ...

Where 1 is the label (1 for spam, -1 for not spam), and each pair corresponds to a word and its frequency. E.g. 9:3 corresponds to the word 9, which occurs 3 times in this e-mail sample.

I need to read this dataset and store it in a structure. Since it's a very big and sparse dataset, I'm looking for a neat data structure to store the following variables:

  • the index of each e-mail
  • label of it (1 or -1)
  • word and its frequency per each e-mail
  • I also need to create a corpus of all words and their frequency with the label information

Any suggestions for such a data structure?

4 Comments

鱼忆七猫命九 2024-12-12 02:42:23

I would generate a class

class Document(object):

    def __init__(self, index, label, bowdict):
        self.index = index
        self.label = label
        self.bowdict = bowdict

You store your sparse vector in bowdict, e.g.

{ 9:3, 94:1, 109:1,  ... } 

and hold all your data in a list of Documents

To get an aggregation over all docs with a given label:

from collections import defaultdict

def aggregate(docs, label):
    bow = defaultdict(int)
    for doc in docs:
        if doc.label == label:
           for (word, counter) in doc.bowdict.items():
                bow[word] += counter  
    return bow    

You can persist all your data with the cPickle module (pickle in Python 3).

Another approach would be to use http://docs.scipy.org/doc/scipy/reference/sparse.html. You can represent a bow vector as a sparse matrix with one row. If you want to aggregate bows, you just have to add them up. This could be considerably faster than the simple solution above.

Further you could store all your sparse docs in one large matrix, where a Document instance holds a reference to the matrix, and a row index for the associated row.
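
The sparse-matrix variant can be sketched roughly as follows; the vocabulary size and the sample lines here are assumptions for illustration, not taken from the actual dataset:

```python
# Parse "label idx:freq idx:freq ..." lines into a label array and a
# scipy CSR matrix with one bow row per e-mail.
import numpy as np
from scipy.sparse import csr_matrix

VOCAB_SIZE = 500  # hypothetical upper bound on word indices

def parse_dataset(lines, vocab_size=VOCAB_SIZE):
    labels, rows, cols, data = [], [], [], []
    for row, line in enumerate(lines):
        parts = line.split()
        labels.append(int(parts[0]))
        for pair in parts[1:]:
            word, freq = pair.split(":")
            rows.append(row)
            cols.append(int(word))
            data.append(int(freq))
    matrix = csr_matrix((data, (rows, cols)), shape=(len(lines), vocab_size))
    return np.array(labels), matrix

lines = ["1 9:3 94:1 109:1", "-1 9:1 163:2"]
labels, X = parse_dataset(lines)

# Aggregating all bows with a given label is just a row sum.
spam_bow = np.asarray(X[labels == 1].sum(axis=0)).ravel()
print(spam_bow[9])   # 3
```

Summing rows this way replaces the defaultdict loop in aggregate() with one vectorized operation.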

够钟 2024-12-12 02:42:23

If you assume you don't care about multiple occurrences of each word in an e-mail (that is, your features are booleans), then all you really need to know is:

For each feature, what is the count of positive associations and negative associations?

You can do this online very easily in one pass, keeping track of just those two numbers for each feature.

Non-boolean features mean you'll have to discretize the features somehow, but you aren't really asking about how to do that.
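
The one-pass counting idea might look like this; the sample lines are made up for illustration:

```python
# For each word, count in how many spam (+1) and non-spam (-1) e-mails
# it appears, ignoring per-mail frequencies (boolean features).
from collections import defaultdict

# word -> [spam count, not-spam count]
counts = defaultdict(lambda: [0, 0])

def update(line):
    parts = line.split()
    label = int(parts[0])
    for pair in parts[1:]:
        word = int(pair.split(":")[0])
        counts[word][0 if label == 1 else 1] += 1

for line in ["1 9:3 94:1", "-1 9:1 163:2", "1 9:2"]:
    update(line)

print(counts[9])   # [2, 1]: word 9 appears in 2 spam and 1 non-spam mail
```

Because each line is processed once and then discarded, memory stays proportional to the vocabulary, not the number of e-mails.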

这个俗人 2024-12-12 02:42:23

https://github.com/Yelp/sqlite3dbm
http://packages.python.org/sqlite3dbm/

This is like a Python dictionary, except that it stores everything you give it on disk, so it's persistent! It won't bloat your memory, because it writes things to disk. You can have one program set up these files and a different one use them for classification, without having to worry about serialization problems.

You can cleanly model the first problem as

doc_to_info[doc_id] = {'label': 'label_0', 'word_freqs': {'this': 3, 'is': 4, ...}}

You can model the second problem as

word_to_freq[word] = {'label_0': 42, 'label_1': 314}
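
The two mappings can be sketched with the standard library's shelve module as a stand-in for sqlite3dbm (both expose a persistent dict-like store backed by a file); the paths and sample data here are made up:

```python
# Two on-disk, dict-like stores: doc_id -> {label, word_freqs}
# and word -> per-label corpus frequencies.
import os
import shelve
import tempfile

tmpdir = tempfile.mkdtemp()
doc_path = os.path.join(tmpdir, "doc_to_info")
word_path = os.path.join(tmpdir, "word_to_freq")

# One program writes the mappings...
with shelve.open(doc_path) as doc_to_info:
    doc_to_info["doc_0"] = {"label": "label_0",
                            "word_freqs": {"this": 3, "is": 4}}
with shelve.open(word_path) as word_to_freq:
    word_to_freq["this"] = {"label_0": 42, "label_1": 314}

# ...and another can reopen them later for classification.
with shelve.open(doc_path) as doc_to_info:
    print(doc_to_info["doc_0"]["word_freqs"]["is"])   # 4
```
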

满意归宿 2024-12-12 02:42:23

I would start with some relational database (SQLite is easy to set up), and use the following table structure:

Word
-----
Number    INT   -- The word number in your data
Word      TEXT  -- The word itself


Entry
-----
ID        INT  -- Some number to make it unique
Spam      INT  -- -1 or 1 as you described


Entry_Word
----------
EntryID   INT  -- The entry this row corresponds to
WordNo    INT  -- The number of the word
Frequency INT  -- The number of occurrences of the word

To get the entries you can use

SELECT ID, Spam
FROM Entry

To get the word frequencies for some entry, you can use:

SELECT WordNo, Frequency
FROM Entry_Word
WHERE EntryID = ?

To get the word frequency corpus, you can use:

SELECT
    WordNo,
    -SUM(MIN(0,Spam*Frequency)) AS NotSpamFrequency,
    SUM(MAX(0,Spam*Frequency)) AS SpamFrequency
FROM Entry
INNER JOIN Entry_Word ON EntryID = ID
GROUP BY WordNo

You could also include the word itself, if you want:

SELECT
    Word,
    WordNo,
    -SUM(MIN(0,Spam*Frequency)) AS NotSpamFrequency,
    SUM(MAX(0,Spam*Frequency)) AS SpamFrequency
FROM Entry
INNER JOIN Entry_Word ON EntryID = ID
LEFT JOIN Word ON Number = WordNo
GROUP BY Word, WordNo
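
The schema and the corpus query can be wired up with Python's built-in sqlite3 module; the sample rows below are made up for illustration, and the not-spam sum is negated so that both frequency columns come out positive:

```python
# Create the tables, load a few sample entries, and run the
# per-word corpus aggregation query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Word  (Number INT, Word TEXT);
    CREATE TABLE Entry (ID INT, Spam INT);
    CREATE TABLE Entry_Word (EntryID INT, WordNo INT, Frequency INT);
""")
conn.executemany("INSERT INTO Entry VALUES (?, ?)", [(1, 1), (2, -1)])
conn.executemany("INSERT INTO Entry_Word VALUES (?, ?, ?)",
                 [(1, 9, 3), (1, 94, 1), (2, 9, 1)])

rows = conn.execute("""
    SELECT WordNo,
           -SUM(MIN(0, Spam*Frequency)) AS NotSpamFrequency,
            SUM(MAX(0, Spam*Frequency)) AS SpamFrequency
    FROM Entry
    INNER JOIN Entry_Word ON EntryID = ID
    GROUP BY WordNo
""").fetchall()
print(sorted(rows))   # [(9, 1, 3), (94, 0, 1)]
```

Swapping ":memory:" for a file path makes the database persistent, so the loading and classification steps can live in separate programs.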