Monitoring the progress of Cases on a large body of text

Posted on 2024-12-10 04:36:14

I'm working with a very large body of text (~290MB of plain text in one file). After importing it into Mathematica 8, I'm starting to break it down into lowercase words and so on, so that I can begin textual analysis.

The problem is that these processes take a long time. Would there be a way to monitor these operations through Mathematica? For operations with a variable, I've used ProgressIndicator etc. But this is different. My searching of documentation and StackOverflow has not turned up anything similar.

In the following, I would like to monitor the process of the Cases[ ] command:

input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];

千秋岁 2024-12-17 04:36:14

Something like StringCases[ToLowerCase[input], WordCharacter..] seems to be a bit faster. I would also probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
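As a rough illustration of that suggestion, the following sketch compares the two approaches on a tiny made-up string (the `sample` text and the variable names are assumptions for demonstration only, not part of the original question). It only checks that both give the same word list; it says nothing about speed on a 290MB file.

```mathematica
(* Small stand-in for the real text; this sample string is made up *)
sample = "The quick, brown fox -- it JUMPS!";

(* Approach from the question: split on non-word characters,
   then drop the empty strings left between adjacent delimiters *)
viaSplit = DeleteCases[
   StringSplit[ToLowerCase[sample], Except[WordCharacter]], ""];

(* Suggested alternative: pull out runs of word characters directly *)
viaCases = StringCases[ToLowerCase[sample], WordCharacter ..];

(* Compare the two word lists *)
viaSplit === viaCases
```

Wrapping each variant in AbsoluteTiming on a larger slice of the real input would be one way to check whether the difference matters in practice.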

哑剧 2024-12-17 04:36:14

It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:

input = ExampleData[{"Text", "PrideAndPrejudice"}];

wordList =
  Module[{charCount = 0, wordCount = 0, allWords}
  , PrintTemporary[
      Row[
        { "Characters: "
        , ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
        }]]

  ; allWords = StringSplit[
        ToLowerCase[input]
      , (_ /; (++charCount; False)) | Except[WordCharacter]
      ]

  ; PrintTemporary[
      Row[
        { "Words:      "
        , ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
        }]]

  ; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]

  ]

The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails -- but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
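To see the always-failing guard in isolation, here is a minimal sketch on a four-element list, without any progress bar (the list contents and the names `count` and `kept` are assumptions chosen for illustration). The first alternative increments the counter and then fails, so the match falls through to Except[""].

```mathematica
(* Counter incremented as a side effect of the pattern match *)
count = 0;

(* The guarded wildcard never matches: its condition bumps the counter
   and returns False, so Except[""] decides whether the element is kept *)
kept = Cases[{"a", "", "b", ""}, (_ /; (++count; False)) | Except[""]];

(* kept holds the non-empty strings; count records how often the guard ran *)
{kept, count}
```

The result of Cases is the same as without the guard; only the counter changes, which is what the ProgressIndicator reads through Dynamic.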

机场等船 2024-12-17 04:36:14

It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then watch the iterator with Monitor to see the progress. For example, if your text consists of lines terminated by a newline, you could do something like this:

Module[{list, t = 0},
 list = ReadList["/users/USER/alltext.txt", "String"];
 Monitor[wordlist = 
   Flatten@Table[
     StringCases[ToLowerCase[list[[t]]], WordCharacter ..], 
      {t, Length[list]}], 
  Labeled[ProgressIndicator[t/Length[list]], N@t/Length[list], Right]];
 Print["Ready"]]

On a file of about 3 MB this took only marginally more time than Joshua's suggestion.
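For trying the idea without a file on disk, a self-contained variant might look like the sketch below. The in-memory `lines` list stands in for the output of ReadList and is an assumption, as are the variable names; on such a small input the progress bar will only flash by.

```mathematica
(* Stand-in for ReadList[file, "String"]; these lines are made up *)
lines = {"The quick brown fox", "jumps over", "the lazy dog"};

(* Process one line per iteration; Monitor shows the iterator i
   as a progress bar while Table is running *)
words = Monitor[
   Flatten@Table[
     StringCases[ToLowerCase[lines[[i]]], WordCharacter ..],
     {i, Length[lines]}],
   ProgressIndicator[i, {0, Length[lines]}]];

words
```

Because the work is split per line, the granularity of the progress display depends on how evenly the text is distributed across lines.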

傻比既视感 2024-12-17 04:36:14

I don't know how Cases works internally, but List processing can be time consuming, especially if the List is being built as it goes. Since the number of terms in the processed expression isn't known in advance, that is likely what is happening with Cases. So, I'd try something slightly different: replacing "" with Sequence[]. For instance, this List

{"5", "6", "7", Sequence[]}

becomes

{"5", "6", "7"}.

So, try

bigList /. "" -> Sequence[]

It should run faster, since it isn't building a large List from scratch.
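A small check of the Sequence[] replacement against DeleteCases, on a made-up five-element list (the contents of `bigList` and the names `viaReplace` and `viaDelete` are illustrative assumptions). This only confirms that the two forms give the same result; whether the replacement is actually faster on a large list would need to be timed.

```mathematica
(* Tiny stand-in for the real word list; contents are made up *)
bigList = {"5", "", "6", "", "7"};

(* Replace each empty string with Sequence[], which splices to nothing *)
viaReplace = bigList /. "" -> Sequence[];

(* The DeleteCases form, for comparison *)
viaDelete = DeleteCases[bigList, ""];

(* Compare the two results *)
viaReplace === viaDelete
```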