Monitoring the progress of Cases on a large body of information
I'm working with a very large body of text (~290MB of plain text in one file). After importing it into Mathematica 8, I'm starting to break it down into lowercase words, etc. so I can begin textual analysis.
The problem is that these processes take a long time. Is there a way to monitor these operations from within Mathematica? For operations with a variable, I've used ProgressIndicator etc., but this is different. My searching of the documentation and StackOverflow has not turned up anything similar.
In the following, I would like to monitor the progress of the Cases[ ] command:
input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];
Comments (4)
Something like
StringCases[ToLowerCase[input], WordCharacter..]
seems to be a little faster. I would also probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
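To make the comparison concrete, here is a minimal sketch of the two variants side by side. The sample string is a hypothetical stand-in for the imported file, and the names viaStringCases and viaDeleteCases are made up for illustration; wrapping each line in AbsoluteTiming on the real data would show whether the speed difference holds there.

```mathematica
(* Hypothetical sample text standing in for the imported 290MB file. *)
input = "The quick brown fox. The lazy dog!";

(* Variant 1: pull out runs of word characters directly. *)
viaStringCases = StringCases[ToLowerCase[input], WordCharacter ..];

(* Variant 2: split on non-word characters, then drop the empty strings. *)
viaDeleteCases =
  DeleteCases[StringSplit[ToLowerCase[input], Except[WordCharacter]], ""];

(* Check that both variants produce the same word list for this sample. *)
viaStringCases === viaDeleteCases
```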

It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The approach temporarily shows two progress bars: the first shows the number of characters processed by StringSplit, and the second shows the number of words processed by Cases.
The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails, but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
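The counter-injection technique can be sketched roughly as follows. This is a reconstruction from the description, not the answerer's original code: the sample string and the names charCount, wordCount and allWords are assumptions, and the counters are only approximate indicators of progress. Evaluating a condition for every character and word is also likely to add overhead of its own.

```mathematica
(* Hypothetical sample text; the question's input would come from Import. *)
input = "The quick brown fox. The lazy dog!";

wordList = Module[{charCount = 0, wordCount = 0, allWords},
   (* First bar: roughly the number of characters examined by StringSplit. *)
   PrintTemporary[
    ProgressIndicator[Dynamic[charCount], {0, StringLength[input]}]];
   (* The guarded wildcard bumps the counter, then always fails,
      so the real delimiter pattern is tried as the alternative. *)
   allWords = StringSplit[
     ToLowerCase[input],
     (_ /; (++charCount; False)) | Except[WordCharacter]];
   (* Second bar: the number of list elements examined by Cases. *)
   PrintTemporary[
    ProgressIndicator[Dynamic[wordCount], {0, Length[allWords]}]];
   Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]
   ];
```

Both progress bars are printed with PrintTemporary, so they disappear once the evaluation finishes and only the value of wordList remains.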

It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then monitor the iterator using Monitor to see the progress. For example, if your text consists of lines terminated by a newline, you could split it on newlines and process one line at a time. On a file of about 3 MB this took only marginally more time than Joshua's suggestion.
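A minimal sketch of the chunked approach, assuming newline-terminated lines as the chunks. The sample string and the name lines are hypothetical, and the per-line StringCases call is borrowed from the earlier suggestion as one possible way to process each chunk, not necessarily what was used for the 3 MB timing.

```mathematica
(* Hypothetical sample text with newline-terminated lines. *)
input = "The quick brown fox.\nThe lazy dog!\nDone";

(* Split once into lines; each line is one chunk. *)
lines = StringSplit[ToLowerCase[input], "\n"];

(* Monitor redraws the progress bar as the Table iterator i advances. *)
wordList = Monitor[
   Flatten[
    Table[
     StringCases[lines[[i]], WordCharacter ..],
     {i, Length[lines]}]],
   ProgressIndicator[i, {0, Length[lines]}]];
```

The chunk size is a trade-off: very small chunks update the display smoothly but add per-iteration overhead, while very large chunks make the bar jump in coarse steps.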

I don't know how Cases works, but List processing can be time consuming, especially if it is building the List as it goes. Since there is an unknown number of terms present in the processed expression, that is likely what is occurring with Cases. So, I'd try something slightly different: replacing "" with Sequence[]. A List that contains empty strings then becomes the same List with those entries spliced out. Applied to the output of StringSplit, this should operate faster, as it is not building up a large List from nothing.
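The Sequence[] replacement can be illustrated with a small sketch. The sample list and string are hypothetical, and whether this is actually faster than Cases on the full file is something to measure, for example with AbsoluteTiming, rather than assume.

```mathematica
(* Replacing "" with Sequence[] splices the empty strings out of the list. *)
small = {"the", "", "fox", "", ""} /. "" -> Sequence[];

(* The same replacement applied to the output of StringSplit. *)
input = "The quick brown fox. The lazy dog!";  (* hypothetical sample text *)
wordList =
  StringSplit[ToLowerCase[input], Except[WordCharacter]] /. "" -> Sequence[];
```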