Efficiently processing (and generating) large text files


As part of my work, I am working with very large text files and, in part, analyzing them for word and phrase frequency. I am running into difficulties of computing time, memory restrictions, and in extracting relevant information.

For this program, I am taking a large text file (say 50MB) which has already been cleaned up and turned into lower case. But otherwise it is just unstructured text. I am trying to generate lists of 'bigrams,' 'trigrams,' 'quadgrams,' and 'fivegrams' - respectively, combinations of recurring two-, three-, four-, and five-word phrases (e.g. "i am" is a bigram, "i am free" is a trigram, "i am free always" is a quadgram).
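
For concreteness, such n-grams are just the overlapping sublists produced by Partition with offset 1; a tiny illustration (toy input, not from the actual data):

Partition[{"i", "am", "free", "always"}, 2, 1]
(* {{"i", "am"}, {"am", "free"}, {"free", "always"}} *)

Partition[{"i", "am", "free", "always"}, 3, 1]
(* {{"i", "am", "free"}, {"am", "free", "always"}} *)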

What am I currently doing?

Here is my current code, where inputlower is an all lowercase String (scraped web data w/ Mathematica).

inputlower=Import["/directory/allTextLowered.txt"];
bigrams = 
  Sort[Tally[Partition[inputlower, 2, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/bigrams.txt", bigrams];    
Clear[bigrams];
trigrams = 
  Sort[Tally[Partition[inputlower, 3, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/trigrams.txt", trigrams];
Clear[trigrams];    
quadgrams = 
  Sort[Tally[Partition[inputlower, 4, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/quadrams.txt", quadgrams];
Clear[quadgrams];
fivegrams = 
  Sort[Tally[Partition[inputlower, 5, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/fivegrams.txt", fivegrams];

In a way, it works well: I do get information generated, and on smaller scales I find that this code works fast enough that I can have something approximating a workable Manipulate[] program. But when we're dealing with big inputs...

What is wrong with it when I use big files?

Most importantly, my output files are too big to be usable. Is there a way to specify a breaking point in the code: for example, I do not want any 'bigrams' that only appear once? If that proves to still leave too much information, would there be a way to specify that I do not want any 'bigrams' in the file unless they appear more than say 10 times? i.e. if "my cheese" appears 20 times, I want to know about it, but if "i pad" only appears once, maybe losing it would make the file more manageable?

Secondly, these processes take a long time: it took well over two or three hours to generate the bigram output alone. Am I approaching this problem in an efficient manner?

Thirdly, if I did have a large bigram file (~650MB+) that contained ALL of the information, is there a way for Mathematica to access information without loading it all into memory - i.e. taking a file named bigrams.txt, learning that it contains {{"i","am"},55} without bogging the system down?

Edit

[As of 7 Dec 11, I've removed the example file that I've put up - thanks to all again]

5 Answers

方觉久 2024-12-25 02:46:37


Introduction

What I will propose is different from most suggestions given before, and based on a combination of indexing, hash-tables, packed arrays, Compress, .mx files and DumpSave, and a few other things. The basic idea is to preprocess the file in a smart way, and save the preprocessed definitions in an .mx file for fast loading. Rather than base most of the work on reading from disk, I propose to shift the emphasis and do most of the work in memory, but to find ways to load data from disk, store it in RAM, work with it, and save it back to disk in a time- and memory-efficient manner. In trying to achieve this goal, I will use most of the efficient Mathematica constructions I am aware of, both for in-memory work and for interaction with the file system.

Code

Here is the code:

Clear[words];
words[text_String] :=  ToLowerCase[StringCases[text, WordCharacter ..]];

(* Rules to replace words with integer indices, and back *)
Clear[makeWordIndexRules];
makeWordIndexRules[sym_Symbol, words : {__String}] :=
   With[{distinctWords = DeleteDuplicates[words]},
    sym["Direct"] = Dispatch[Thread[distinctWords -> Range[Length[distinctWords]]]];
    sym["Inverse"] = Dispatch[Thread[ Range[Length[distinctWords]] -> distinctWords]];
    sym["keys"] = distinctWords;
];

(* Make a symbol with DownValues / OwnValues self-uncompressing *)
ClearAll[defineCompressed];
SetAttributes[defineCompressed, HoldFirst];
defineCompressed[sym_Symbol, valueType_: DownValues] :=
  With[{newVals = 
     valueType[sym] /.
       Verbatim[RuleDelayed][
         hpt : Verbatim[HoldPattern][HoldPattern[pt_]], rhs_] :>
            With[{eval = Compress@rhs}, hpt :> (pt = Uncompress@ eval)]
        },
        ClearAll[sym];
        sym := (ClearAll[sym]; valueType[sym] = newVals; sym)
];


(* Get a list of indices corresponding to a full list of words in a text *)
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
    Developer`ToPackedArray[words /. sym["Direct"]];

(* Compute the combinations and their frequencies *)
Clear[getSortedNgramsAndFreqs];
getSortedNgramsAndFreqs[input_List, n_Integer] :=
   Reverse[#[[Ordering[#[[All, 2]]]]]] &@ Tally[Partition[input, n, 1]];


(* 
 ** Produce n-grams and store them in a hash-table. We split combinations from
 ** their frequencies, and assume indices for input, to utilize packed arrays 
 *)
Clear[produceIndexedNgrams];
produceIndexedNgrams[sym_Symbol, input_List, range : {__Integer}] :=
 Do[
   With[{ngramsAndFreqs = getSortedNgramsAndFreqs[input, i]},
     sym["NGrams", i] = Developer`ToPackedArray[ngramsAndFreqs[[All, 1]]];
     sym["Frequencies", i] =  Developer`ToPackedArray[ngramsAndFreqs[[All, 2]]]
   ],
   {i, range}];


(* Higher-level function to preprocess the text and populate the hash-tables *)
ClearAll[preprocess];
SetAttributes[preprocess, HoldRest];
preprocess[text_String, inputWordList_Symbol, wordIndexRuleSym_Symbol,
    ngramsSym_Symbol, nrange_] /; MatchQ[nrange, {__Integer}] :=
  Module[{},
    Clear[inputWordList, wordIndexRuleSym, ngramsSym];
    inputWordList = words@text;
    makeWordIndexRules[wordIndexRuleSym, inputWordList];
    produceIndexedNgrams[ngramsSym,
    getWordsIndices[wordIndexRuleSym, inputWordList], nrange]
  ];

(* Higher-level function to make the definitions auto-uncompressing and save them *)
ClearAll[saveCompressed];
SetAttributes[saveCompressed, HoldRest];
saveCompressed[filename_String, inputWordList_Symbol, wordIndexRuleSym_Symbol, 
   ngramsSym_Symbol] :=
Module[{},
   defineCompressed /@ {wordIndexRuleSym, ngramsSym};
   defineCompressed[inputWordList, OwnValues];
   DumpSave[filename, {inputWordList, wordIndexRuleSym, ngramsSym}];
];

The above functionality is very memory-hungry: processing @Ian's file took nearly 5 GB of RAM at some point. However, this is worth it, and one can also test the above with smaller files if there is not enough RAM. Generally, large files can be split into several parts to work around this problem.

Preprocessing and saving in optimized format

We now start. Preprocessing takes about a minute on my machine:

test = Import["C:\\Temp\\lowered-text-50.txt", "Text"];

In[64]:= preprocess[test,inputlower,wordIndexRules,ngrams,{2,3}];//Timing
Out[64]= {55.895,Null}

The symbols inputlower, wordIndexRules, ngrams here are some symbols I chose to use for the list of words in a file and for hash-tables. Here are some additional inputs illustrating how these symbols are used and what they mean:

In[65]:= ByteCount[inputlower]
Out[65]= 459617456

In[69]:= inputlower[[1000;;1010]]
Out[69]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[67]:= toNumbers = inputlower[[1000;;1010]]/.wordIndexRules["Direct"]
Out[67]= {58,220,28,58,392,393,25,1,216,25,1}

In[68]:= toWords =toNumbers/. wordIndexRules["Inverse"]
Out[68]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[70]:= {ngrams["NGrams",2],ngrams["Frequencies",2]}//Short
Out[70]//Short= {{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}

The main idea here is that we use integer indices instead of words (strings), which allows us to utilize packed arrays for n-grams.
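
To see why packing matters, here is a small side check (synthetic data, not part of the original session); the exact byte counts are only indicative:

packed = Developer`ToPackedArray[RandomInteger[10^5, {10^6, 2}]];
Developer`PackedArrayQ[packed]                 (* True *)
ByteCount[packed]                              (* on the order of 16 MB: 2*10^6 machine integers *)
ByteCount[Developer`FromPackedArray[packed]]   (* several times larger once unpacked *)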

Compressing and saving takes another half a minute:

In[71]:= saveCompressed["C:\\Temp\\largeTextInfo.mx", inputlower, 
      wordIndexRules, ngrams] // Timing
Out[71]= {30.405, Null}

The resulting .mx file is about 63MB, which is about the size of the original file. Note that since a part of what we save is a (self-compressed) variable inputlower, which contains all the input words in the original order, we are not losing any information as compared to the original file. In principle, one can from now on start working with the new .mx file only.

Working with the optimized file

We now start a new session by quitting the kernel. It takes almost no time to load the file (.mx format is extremely efficient):

In[1]:= Get["C:\\Temp\\largeTextInfo.mx"] // Timing
Out[1]= {0.016, Null}

Loading the list of words takes some time (self-uncompressing):

In[2]:= inputlower//Short//Timing
Out[2]= {6.52,{la,présente,collection,numérisée,<<8000557>>,quicktime,3,0}}

but we don't use it for anything - it is stored just in case. Loading the 2-grams and their frequencies:

In[3]:= Timing[Short[ngrams2 = {ngrams["NGrams",2],ngrams["Frequencies",2]}]]
Out[3]= {0.639,{{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}}

Note that most of the time here was spent on self-uncompressing, which is selective (so, e.g., ngrams["NGrams",3] is still compressed). Loading the 3-grams and their frequencies:

In[4]:= Timing[Short[ngrams3 = {ngrams["NGrams",3],ngrams["Frequencies",3]}]]
Out[4]= {1.357,{{{11333,793,11334},{793,11334,11356},<<4642628>>,{18,21,22},{20,18,21}},{<<1>>}}}

The timings are decent, given the size of the lists. Note that neither DumpSave/Get nor Compress/Uncompress unpacks packed arrays, so our data is stored in Mathematica's memory pretty efficiently:

In[5]:= Developer`PackedArrayQ/@ngrams3
Out[5]= {True,True}

Here we uncompress the rules relating indices to words:

In[6]:= Timing[Short[wordIndexRules["Inverse"]]]
Out[6]= {0.905,Dispatch[{1->la,2->présente,<<160350>>,160353->7631,160354->jomac},-<<14>>-]}

This is enough to start working with the data, but in the next section I outline some hints on how to make this work yet more efficient.

Efficient work with uncompressed data

If we try to find, for example, all positions of 3-grams with frequency 1, the naive way is this:

In[8]:= Position[ngrams["Frequencies",3],1,{1}]//Short//Timing
Out[8]= {1.404,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

However, we can leverage the fact that we work with integer indices (instead of words) stored in a packed array. Here is one version of the custom position function (due to Norbert Pozar):

(* Read the positions of explicitly stored elements out of a SparseArray's internal structure *)
extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]; 

(* Positions where x equals n: with default element 1, only the zeros of Unitize[x - n]
   (i.e. the matches) are stored explicitly in the SparseArray *)
positionExtr[x_List, n_] := 
    extractPositionFromSparseArray[SparseArray[Unitize[x - n], Automatic, 1]]

Using this, we get the result about 10 times faster (a function compiled to C can be about twice as fast again):

In[9]:= positionExtr[ngrams["Frequencies",3],1]//Short//Timing
Out[9]= {0.156,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

Here are some more convenience functions:

Clear[getNGramsWithFrequency];
getNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=
  Extract[ngramSym["NGrams", n], positionExtr[ngramSym["Frequencies", n], freq]];

Clear[deleteNGramsWithFrequency];
deleteNGramsWithFrequency[{ngrams_List, freqs_List}, freq_Integer] :=  
    Delete[#, positionExtr[freqs, freq]] & /@ {ngrams, freqs};

deleteNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=  
   deleteNGramsWithFrequency[{ngramSym["NGrams", n], ngramSym["Frequencies", n]}, freq];

Using these, we can get many things done pretty efficiently. For example, delete the 2-grams with frequency 1:

In[15]:= deleteNGramsWithFrequency[ngrams,2,1]//Short//Timing
Out[15]= {0.218,{{{793,791},{25,1},{4704,791},<<696333>>,{29,66},{36,37},{18,21}},{<<1>>}}}

Or, delete the 2-grams with frequencies of 100 or less (a sub-optimal way to do this, but still pretty fast):

In[17]:= (twogramsLarger100 = 
  Fold[deleteNGramsWithFrequency,deleteNGramsWithFrequency[ngrams,2,1],Range[2,100]])
  //Short//Timing
Out[17]= {0.344,{{{793,791},{25,1},{4704,791},{25,10},<<6909>>,
  {31,623},{402,13},{234,25}},{<<1>>}}}

The main idea is that integer indices play a role of "pointers" for words, and most things can be done with them. When needed, we can return to the normal words:

In[18]:= twogramsLarger100/.wordIndexRules["Inverse"]//Short//Timing
Out[18]= {0.063,{{{of,the},{de,la},<<6912>>,{société,du},{processus,de}},{<<1>>}}}

Concluding remarks

The speed-up achieved here seems substantial. One can control how much RAM is occupied by the data, by loading data in fine-grained chunks. The memory usage itself has been hugely optimized by utilizing packed arrays. The memory savings on disk are due to the combination of Compress and DumpSave. Hash-tables, Dispatch-ed rules and self-uncompressing are techniques used to make it more convenient.

There is plenty of room here for further refinements. One can split the data into smaller chunks and compress / save those separately, to avoid high memory usage in intermediate steps. One can also split the data according to frequency ranges, and save the split data into separate files, to speed up the load / self-uncompress stage. For many files, one would need to generalize this, since global symbols were used for the hashes here. This seems a good place to apply some OOP techniques. Generally, this is just a starting point, but my message is that this approach, IMO, has good potential for efficient work with such files.
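
As a rough illustration of the chunking idea (my own sketch, not part of the workflow above): one can Tally n-grams over overlapping chunks of the integer-index list and merge the partial tallies, so that no single Tally ever sees the whole input; chunkSize is a hypothetical tuning parameter.

Clear[mergeTallies, chunkedNgramTally];

(* Combine several {ngram, count} tally lists into one, summing counts of equal n-grams *)
mergeTallies[tallies_List] :=
  {#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join @@ tallies, First];

(* Tally n-grams chunk by chunk; consecutive chunks overlap by n-1 elements, so n-grams
   crossing a chunk boundary are neither lost nor double-counted *)
chunkedNgramTally[input_List, n_Integer, chunkSize_Integer] :=
  Module[{starts = Range[1, Length[input] - n + 1, chunkSize]},
    mergeTallies[
      Tally[Partition[input[[# ;; Min[# + chunkSize + n - 2, Length[input]]]], n, 1]] & /@ starts]];

Applied to getWordsIndices[wordIndexRules, inputlower], this should give the same pairs as a single Tally[Partition[..., n, 1]], just not in sorted order.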

半仙 2024-12-25 02:46:37


These slides are the current best wisdom on how to deal with importing and working with large sets of data:

http://library.wolfram.com/infocenter/Conferences/8025/

It covers some of the subjects mentioned here and gives some graphs which will show you how much of a speed up you can expect from switching away from Import.

暮凉 2024-12-25 02:46:37


These are my suggestions:

  1. I suggest using ReadList[file, Word]. Usually it is much faster than Import. This will also break it into words.

  2. You can consider working with gzip-compressed files too. Import/Export support these seamlessly, but ReadList does not. For disk-limited operations this will actually be faster than reading/writing uncompressed data (see the sketch after this list).

  3. Your Sorts may be slow (I haven't tested your operations with large files, so I'm not sure). See yesterday's question on how to do this fast.
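
A minimal sketch of the gzip point from item 2 (file paths hypothetical): naming the file with a .gz extension should make Export and Import apply GZIP transparently around the usual text representation.

Export["/directory/bigrams.txt.gz", bigrams];          (* written gzip-compressed *)
bigramsBack = Import["/directory/bigrams.txt.gz"];     (* decompressed on read *)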

You can't break out from Tally before it finishes, but you can always use Select, Cases or DeleteCases to trim the bigram list before exporting.
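
For example, a sketch of the threshold trimming the question asks about, applied to the bigrams list from the question's code (the cutoff 10 is arbitrary):

frequentBigrams = Select[bigrams, #[[2]] > 10 &];    (* keep bigrams seen more than 10 times *)
noSingletons    = DeleteCases[bigrams, {_, 1}];      (* or just drop those seen exactly once *)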

Finally, as an answer to your last question: I am afraid Mathematica is only efficient/convenient to work with if you load all the data into memory. The system seems to be conceived to work well with in-memory data only. This is from personal experience.


EDIT Working with your 50 MB text file is slow, but still bearable on my (rather old and slow) machine. Just make sure you use SortBy:

In[1]:= $HistoryLength = 0; (* save memory *)

In[2]:= Timing[
 data = ReadList["~/Downloads/lowered-text-50.txt", Word, 
    WordSeparators -> {" ", "\t", ".", ","}];]

Out[2]= {6.10038, Null}

In[3]:= Timing[counts = Tally@Partition[data, 2, 1];]

Out[3]= {87.3695, Null}

In[4]:= Timing[counts = SortBy[counts, Last];]

Out[4]= {28.7538, Null}

In[5]:= Timing[counts = DeleteCases[counts, {_, 1}];]

Out[5]= {3.11619, Null}

I couldn't get ReadList to handle UTF-8 properly, so you may need to stick to Import there.
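
A sketch of that fallback (path as in the timings above): Import with the "Text" element reads the file as a single string, which can then be split into words.

data = StringSplit[Import["~/Downloads/lowered-text-50.txt", "Text"]];   (* whitespace-split word list *)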

海的爱人是光 2024-12-25 02:46:37


To extend a comment I made, Read is a useful alternative to ReadList or Import. Like ReadList, you specify a type, and if you specify String it reads in the entire line. So, you can process the entire file, one (or more) lines at a time. The only difficulty is that you have to watch for EndOfFile by hand. For instance,

strm = OpenRead[file];
While[ (line = Read[ strm, String ]) =!= EndOfFile,
 (* Do something with the line *)
];
Close[ strm ];

To extend this to multiple lines at once, replace String above with a list of Strings whose length is the number of lines you wish to process at once; for more than a few lines, this is best built with ConstantArray[String, n]. Of course, Word can be used instead, to process the file on a word-by-word basis.
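
An alternative sketch for batch reading (my own variant, not from the answer): ReadList on an already open stream with an explicit count returns at most that many lines per call, and simply returns fewer (or {}) near the end of the file:

strm = OpenRead[file];
While[ (batch = ReadList[strm, String, 1000]) =!= {},
 (* process up to 1000 lines at a time *)
];
Close[strm];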

Processing files line-by-line has a downside: if you need to Abort the process, strm will be left open. So, I suggest wrapping the code in CheckAbort, or using the functionality described here.
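
A minimal sketch of the CheckAbort wrapping mentioned above, so the stream gets closed even if the loop is aborted:

strm = OpenRead[file];
CheckAbort[
 While[ (line = Read[strm, String]) =!= EndOfFile,
  (* Do something with the line *)
 ],
 Close[strm]; Abort[]   (* close the stream, then let the abort propagate *)
];
Close[strm];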

早茶月光 2024-12-25 02:46:37


You might look at "String Patterns", which are Mathematica's version of Regular Expressions. Perhaps something like StringCases[data, RegularExpression["\\w+?\\W+?\\w+?"]], which should return all the matching sequences of word-whitespace-word. I can't say whether this will be faster than your partition code or not.

There's a "Tips For Efficient Matching" at the bottom of that page.

You could apply "DeleteDuplicates" to trim the list before sorting.

If I were doing this in another language, I would store the n-grams in a Hash Table, with the text as key, and the instance count as a value. This would work well with a line-by-line file parser. However, it seems using a Hash Table in Mathematica is not straightforward.
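
For what it is worth, indexed DownValues can play the role of such a hash table; here is a rough sketch (file path taken from the question, bigrams spanning line boundaries are ignored, and harvesting all counts back out of DownValues[count] afterwards is left out):

Clear[count];
count[_] = 0;    (* default value for bigrams not seen yet *)

strm = OpenRead["/directory/allTextLowered.txt"];
While[ (line = Read[strm, String]) =!= EndOfFile,
 With[{w = StringSplit[line]},
  If[Length[w] >= 2,
   Scan[(count[#]++) &, Partition[w, 2, 1]]]]];
Close[strm];

count[{"i", "am"}]   (* frequency of this particular bigram *)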

And one observation: Instead of running 4 passes, you could just generate all the 5-grams to a single file, and use simple command line text processing to generate the 2-, 3- and 4-grams from that file. Of course, this is only useful after you get the 5-gram extractor to run in a reasonable time.
