Faster file searching in Perl
My current algorithm uses a naive linear search to retrieve data from several data files by matching strings, and it is too slow.
It is something like this (pseudo code):
while count < total number of files
    open current file
    extract line from this file
    build an arrayofStrings from this line
    foreach string in arrayofStrings
        foreach file in arrayofDataReferenceFiles
            search in these files
    close file
    increment count
For a large real life job, a process can take about 6 hours to complete.
Basically, I have a large set of strings that the program uses to search through the same set of files (for example, 10 files in one run and perhaps 3 in the next). Since the reference data files can change, I do not think it is smart to build a permanent index of these files.
I'm pretty much a beginner and am not aware of any faster techniques for unsorted data.
I was thinking since the search gets repetitive after a while, is it possible to prebuild an index of locations of specific lines in the data reference files without using any external perl libraries once the file array gets built (files are known)? This script is going to be ported onto a server that probably only has standard Perl installed.
I figured it might be worth spending 3-5 minutes building some sort of index for a search before processing the job.
Is there a specific concept of indexing/searching that applies to my situation?
Thanks everyone!
Comments (3)
It is difficult to understand exactly what you're trying to achieve.
I assume the data set does not fit in RAM.
If you are trying to match each line in many files against a set of patterns, it may be better to read each line in once, then match it against all the patterns while it's in memory before moving on. This reduces I/O compared with looping over the files once per pattern.
On the other hand, if the matching is what's taking the time, you're probably better off using a library which can match lots of patterns simultaneously.
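To make the single-pass idea concrete, here is a minimal sketch using only core Perl. It assumes the search strings are plain literals rather than regexes, and `find_matches` is a hypothetical name, not something from the original script. Each reference file is read once, and every line is checked against all strings while it is in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: read every reference file exactly once and test each
# line against all search strings while the line is still in memory.
# Returns a hash ref: search string => list of [file, line number] pairs.
sub find_matches {
    my ($strings, $files) = @_;
    my %hits;
    for my $file (@$files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            for my $string (@$strings) {
                # index() does a literal substring test (no regex involved);
                # $. is the current line number of the handle being read
                push @{ $hits{$string} }, [ $file, $. ]
                    if index($line, $string) >= 0;
            }
        }
        close $fh;
    }
    return \%hits;
}
```

This still does one comparison per string per line, so it cuts the I/O but not the CPU work. If the matching itself turns out to be the bottleneck, that is where a multi-pattern matcher would come in.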
You could probably replace the inner search over the data reference files with a preprocessing step to build a DBM file (i.e. an on-disk hash) as a reverse index which maps each word in your reference files to a list of the files containing that word (or whatever you need). The Perl core includes DBM support.
You'd normally access this stuff through tie, but that's not important: every Perl should have some support for at least one hash-on-disk library without needing non-core packages installed.
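As a rough sketch of such a reverse index, the following uses `SDBM_File`, which ships with core Perl. The helper names `build_reverse_index` and `files_containing` are hypothetical, and the sketch assumes whitespace-separated words and file names without commas. Note that SDBM limits each key/value pair to roughly 1 KB, so a word that occurs in many files would need a different DBM backend or a more compact value format.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;    # on-disk hash shipped with core Perl

# Hypothetical helper: build an on-disk reverse index mapping each word to a
# comma-separated list of the reference files that contain it.
sub build_reverse_index {
    my ($db_path, @files) = @_;
    # Start from scratch; SDBM stores its data in these two files
    unlink "$db_path.pag", "$db_path.dir";
    tie my %index, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $db_path: $!";

    for my $file (@files) {
        my %seen;    # words already recorded for this file
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            for my $word (split ' ', $line) {
                next if $seen{$word}++;
                my $old = $index{$word};
                $index{$word} = defined $old ? "$old,$file" : $file;
            }
        }
        close $fh;
    }
    untie %index;
}

# Hypothetical helper: look a word up without reading the reference files.
sub files_containing {
    my ($db_path, $word) = @_;
    tie my %index, 'SDBM_File', $db_path, O_RDONLY, 0666
        or die "Cannot tie $db_path: $!";
    my $list = $index{$word};
    untie %index;
    return defined $list ? split(/,/, $list) : ();
}
```

With the index built once per job, each search string costs one hash lookup to find the candidate files, and only those files need to be opened and scanned.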
As MarkR said, you want to read each line from each file no more than one time. The pseudocode you posted looks like you're reading each line of each file multiple times (once for each word that is searched for), which will slow things down considerably, especially on large searches. Reversing the order of the two innermost loops should (judging by the posted pseudocode) fix this.
But, also, you said, "Since the reference data files can change, I do not think it is smart to build a permanent index of these files." This is, most likely, incorrect. If performance is a concern (if you're getting 6-hour runtimes, I'd say that probably makes it a concern) and, on average, each file gets read more than once between changes to that particular file, then building an index on disk (or even... using a database!) would be a very smart thing to do. Disk space is very cheap these days; time that people spend waiting for results is not.
Even if files frequently undergo multiple changes without being read, on-demand indexing (when you want to check the file, first look to see whether an up-to-date index exists and, if not, build one before doing the search) would be an excellent approach. When a file gets searched more than once, you benefit from the index; when it doesn't, building the index first and then doing a search off the index will be slower than a linear search by such a small margin as to be largely irrelevant.
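One way on-demand indexing could look with only core modules is sketched below. The names `load_index` and `lines_with`, the `.idx` cache file next to the data file, and the word-to-byte-offset layout are all assumptions made for illustration. The cache is rebuilt whenever it is missing or older than the data file.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);    # core module for saving Perl data to disk

# Hypothetical helper: return a hash ref (word => byte offsets of the lines
# containing it) for $file, rebuilding the cached copy only when needed.
sub load_index {
    my ($file) = @_;
    my $cache = "$file.idx";    # assumed cache location

    # Reuse the cache if it is at least as new as the data file. mtime has
    # one-second resolution, so an edit within the same second can be missed.
    if (-e $cache && (stat $cache)[9] >= (stat $file)[9]) {
        return retrieve($cache);
    }

    my %index;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $offset = tell $fh;    # byte offset where the next line starts
    while (my $line = <$fh>) {
        my %seen;
        for my $word (split ' ', $line) {
            push @{ $index{$word} }, $offset unless $seen{$word}++;
        }
        $offset = tell $fh;
    }
    close $fh;
    store(\%index, $cache);
    return \%index;
}

# Hypothetical helper: fetch matching lines by seeking straight to them.
sub lines_with {
    my ($file, $word) = @_;
    my $index = load_index($file);
    my @lines;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    for my $offset (@{ $index->{$word} || [] }) {
        seek $fh, $offset, 0;    # 0 = offset is absolute
        my $line = <$fh>;
        chomp $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}
```

Calling `lines_with($file, $word)` repeatedly within one job pays the indexing cost once per file. Later jobs reuse the saved index until the data file changes.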