Perl code to list all words that follow a given string in a text file
This is difficult to describe, but it would be useful for extracting data from the output I am dealing with (I hope to use this code for a large number of purposes).
Here is an example:
Say I have a text file with words and some special characters ($, #, !, etc) that reads:
blah blah
blah add this word to the list: 1234.56 blah blah
blah blah
blah now don't forget to add this word to the list: PINAPPLE blah blah
And for bonus points,
it would be nice to know that the script
would be able to add this word to the list: 1!@#$%^&*()[]{};:'",<.>/?asdf blah blah
blah blah
As the example implies, I would like to add whatever "word" (defined as any string that does not contain spaces in this context) to some form of list such that I can extract elements of the list as list[2] list[3] or list(4) list(5), or something along those lines.
This would be very versatile, and after some questioning in another thread and another forum, I am hoping that having it in Perl would make it relatively fast in execution, so that it works well even for large text files.
I intend to use this to read data from output files generated from different programs regardless of structure of the output file, i.e. if I know the string to search for, I can get the data.
Comments (3)
I think there are some missing words in your question :)
But this sounds like what you want (assuming even the "large text files" fit in memory - if not, you'd loop through line by line pushing onto $list instead).
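A minimal sketch of the slurp-and-match idea (not the answerer's original code). It assumes the marker is the phrase 'add this word to the list:' from the question and that a "word" is any run of non-whitespace characters; the sub name words_after is made up for illustration.

```perl
use strict;
use warnings;

# Return an array ref holding every whitespace-delimited word that directly
# follows $marker in $text. \Q...\E quotes regex metacharacters in the marker.
sub words_after {
    my ($text, $marker) = @_;
    my $list = [ $text =~ /\Q$marker\E\s*(\S+)/g ];
    return $list;
}

# Hypothetical usage: perl words_after.pl output.txt
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };   # undefined $/ reads the whole file
    close $fh;
    my $list = words_after($text, 'add this word to the list:');
    print "$_\n" for @$list;             # elements are $list->[0], $list->[1], ...
}
```

In list context a /g match with one capture group returns all captures at once, which is why no explicit loop is needed here.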
If the string for the searches is the same, let Perl do the processing by using the search phrase as input record separator:
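A rough sketch of what this can look like (an illustration, not the original answer's code). The sub name first_words_by_record and the filehandle argument are assumptions; the marker is the phrase from the question.

```perl
use strict;
use warnings;

# Read from $fh with $marker as the input record separator ($/), and
# collect the first word of every record.
sub first_words_by_record {
    my ($fh, $marker) = @_;
    local $/ = $marker;      # each read now ends right after the marker
    my @list;
    while (my $record = <$fh>) {
        push @list, $1 if $record =~ /^\s*(\S+)/;
    }
    return @list;
}

# Hypothetical usage: perl by_record.pl output.txt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my @list = first_words_by_record($fh, 'add this word to the list:');
    close $fh;
    print "$_\n" for @list;
}
```

Every record after the first starts immediately behind a marker, so its first word is the wanted one. The first record starts at the beginning of the file.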
The above is "almost ok": because of the way the records are split, it will also save the first word of the file in $list[0]. But this way is very easy to comprehend (imho).
Q: why not simply look the strings up with one regex over the entire data (as has already been suggested here)? Because in my experience, record-wise processing with a per-record regular expression (probably a very complicated regex in a real use case) will be faster, especially on very large files. That's the reason.
Real world test
To back this claim up, I performed some tests with a 200MB data file containing 10,000 of
your markers. The test source follows:
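A sketch of how such a comparison could be set up (not the original test source, and it makes no claim about the numbers it will print on any given system). The three sub names, the time_it helper and the marker are assumptions for illustration; Time::HiRes is a core module.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $MARKER = 'add this word to the list:';   # assumed marker phrase

# Strategy 1: read line by line, one regex per line.
sub by_line {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @list;
    while (my $line = <$fh>) {
        push @list, $1 while $line =~ /\Q$MARKER\E\s*(\S+)/g;
    }
    close $fh;
    return \@list;
}

# Strategy 2: read block-wise, using the marker as input record separator.
sub by_record {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/ = $MARKER;
    my @list;
    my $preamble = <$fh>;    # text before the first marker, discarded
    while (my $record = <$fh>) {
        push @list, $1 if $record =~ /^\s*(\S+)/;
    }
    close $fh;
    return \@list;
}

# Strategy 3: slurp the whole file and scan it with a single regex.
sub by_slurp {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    return [ $text =~ /\Q$MARKER\E\s*(\S+)/g ];
}

# Run one strategy on $file and print word count and elapsed wall time.
sub time_it {
    my ($name, $code, $file) = @_;
    my $t0   = time();
    my $list = $code->($file);
    printf "%-10s %8d words  %.3f s\n", $name, scalar @$list, time() - $t0;
    return $list;
}

# Hypothetical usage: perl compare.pl data.dat
if (@ARGV) {
    my $file = shift @ARGV;
    my @strategies = (
        [ 'by_line',   \&by_line   ],
        [ 'by_record', \&by_record ],
        [ 'by_slurp',  \&by_slurp  ],
    );
    time_it($_->[0], $_->[1], $file) for @strategies;
}
```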
which outputs the following timing results:
which basically means that the simple line-wise reading and the block-wise reading by custom input record separator are about 2.3 times faster (one pass in ~4 sec) than slurping the file and scanning it with a regular expression.
This basically says that if you are processing files of this size on a system like mine ;-), you should read line by line if your search problem is confined to one line, and read by custom input record separator if your search problem spans more than one line (my $0.02).
Want to make the test too? This one:
creates the 200MB input file 'data.dat'.
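A generator along these lines could look like the following sketch (not the original script). The filler text, the marker line layout and the sub name make_data are assumptions; only the file name, the size and the marker count come from the post.

```perl
use strict;
use warnings;

# Write roughly $size bytes of filler lines to $file and replace up to
# $n_markers of them, at even intervals, with a line containing the marker.
# Returns the number of marker lines actually written.
sub make_data {
    my ($file, $size, $n_markers) = @_;
    my $filler = join(' ', ('blah') x 15) . "\n";
    my $lines  = int($size / length($filler));
    my $every  = int($lines / $n_markers) || 1;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    my $written = 0;
    for my $i (1 .. $lines) {
        if ($i % $every == 0 && $written < $n_markers) {
            printf {$fh} "blah add this word to the list: %d blah blah\n", ++$written;
        }
        else {
            print {$fh} $filler;
        }
    }
    close $fh or die "Cannot close $file: $!";
    return $written;
}

# Hypothetical usage: perl make_data.pl data.dat 209715200 10000
make_data(@ARGV) if @ARGV == 3;
```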
Regards
rbo
How about:
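A sketch in the spirit of this answer (not its original code): a line-by-line loop with an inner loop that strips one marker and its word per pass. It uses a non-greedy '.*?' rather than the greedy '.*' discussed below, so that every marker on a line is found. The sub name words_from_fh is made up.

```perl
use strict;
use warnings;

# Collect the word after each 'add this word to the list:' marker,
# reading line by line from $fh.
sub words_from_fh {
    my ($fh) = @_;
    my @list;
    while (my $line = <$fh>) {
        # Each substitution removes the text up to and including the first
        # remaining marker and its word, so the loop ends when none is left.
        while ($line =~ s/^.*?add this word to the list:\s*(\S+)//) {
            push @list, $1;
        }
    }
    return @list;
}

# Hypothetical usage: perl words.pl output.txt
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for words_from_fh($fh);
    close $fh;
}
```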
This allows for long lines containing more than one of the 'add' markers. If there definitely can only be one, replace the inner 'while' with 'if'. (Except, of course, that I used a greedy '.*' which snaffles up everything to the last occurrence of the match...)
With a selectable marker:
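One way to make the marker selectable, as a sketch rather than the answer's own code: pass it in as a plain string and quote it with \Q...\E so that characters such as '$' or '(' are matched literally. The sub name words_after_marker is hypothetical.

```perl
use strict;
use warnings;

# Collect the word after every occurrence of the literal string $marker,
# reading line by line from $fh.
sub words_after_marker {
    my ($fh, $marker) = @_;
    my @list;
    while (my $line = <$fh>) {
        while ($line =~ s/^.*?\Q$marker\E\s*(\S+)//) {
            push @list, $1;
        }
    }
    return @list;
}

# Hypothetical usage: perl words.pl 'add this word to the list:' output.txt
if (@ARGV == 2) {
    my ($marker, $file) = @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    print "$_\n" for words_after_marker($fh, $marker);
    close $fh;
}
```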
With no repeats:
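Repeats can be filtered with a hash of words already seen. This is a sketch under the same assumptions as before, with a made-up sub name; it keeps the first occurrence of each word and preserves order.

```perl
use strict;
use warnings;

# Like a plain marker search, but each distinct word is stored only once.
sub unique_words_after_marker {
    my ($fh, $marker) = @_;
    my (@list, %seen);
    while (my $line = <$fh>) {
        while ($line =~ s/^.*?\Q$marker\E\s*(\S+)//) {
            # $seen{...}++ is false only the first time a word turns up.
            push @list, $1 unless $seen{$1}++;
        }
    }
    return @list;
}
```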
Etc.
And, as @ysth points out, you (I) don't need the substitution. Perl DWIMs correctly with a g-qualified match in the inner loop:
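A sketch of that variant (again an illustration with a hypothetical sub name): in scalar context a /g match resumes where the previous match on the same string ended, so the inner loop walks through all markers on a line without modifying it.

```perl
use strict;
use warnings;

# Collect the word after every occurrence of $marker without altering
# the line: a scalar-context /g match continues from the last match position.
sub words_after_marker_g {
    my ($fh, $marker) = @_;
    my @list;
    while (my $line = <$fh>) {
        while ($line =~ /\Q$marker\E\s*(\S+)/g) {
            push @list, $1;
        }
    }
    return @list;
}
```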