Multiline pattern matching
Problem:
In a large file (plain text), there are some "interesting" lines which contain some specific words. The aim is to extract all those lines that contain such words. However, in some cases, even if a line contains such words, it may not be really "interesting", depending on its context (contents of lines above and below that line). Such lines should be excluded.
My algorithm:
I have one regex for each interesting word and apply it to every line of the file. If a match is found, I check whether the line is excluded by its context by applying a second set of regexes, which can span several lines. If one of those matches, the line is not interesting and I move on to the remaining lines. If not, I record the line as an interesting line and move on to the next line.
To check if a line was excluded, I create a new string that looks like:
N number of lines above current line\n The current line\n N number of lines below current line
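The check described above can be sketched roughly as follows. This is only an illustration of the approach as stated in the question: the class name ContextFilter, the method name, and the way patterns are passed in are all assumptions, not the asker's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical names throughout; the patterns are supplied by the caller.
final class ContextFilter {
    // Returns the 0-based indexes of lines that match `interesting` and whose
    // surrounding window of n lines above and below matches none of `exclusions`.
    static List<Integer> interestingLines(List<String> lines, Pattern interesting,
                                          List<Pattern> exclusions, int n) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!interesting.matcher(lines.get(i)).find()) {
                continue;
            }
            // Build "n lines above \n current line \n n lines below".
            int from = Math.max(0, i - n);
            int to = Math.min(lines.size(), i + n + 1);
            String window = String.join("\n", lines.subList(from, to));
            boolean excluded = false;
            for (Pattern p : exclusions) {
                if (p.matcher(window).find()) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                result.add(i);
            }
        }
        return result;
    }
}
```

Note that this version allocates a fresh window string for every candidate line and runs every exclusion regex over it, which is one plausible source of the cost.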
This takes an awful amount of time.
My question: Is there a better way of doing this?
Thanks for your time.
Comments (3)
Regex is not necessarily fast. There are faster string search algorithms out there.
How about a more heuristic-based approach?
Process the file from start to finish. For every occurrence of a word of interest, store the line number and the offset within that line in a lookup structure.
Once the lookup structure is populated, work through it using something like the following algorithm:
The key here is that you process the file only once, then use the metadata structure to do the actual context checking. It should be quite a bit faster if you use the right data structures.
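As one possible reading of this suggestion (not the answerer's own algorithm), the sketch below indexes word occurrences in a single pass with plain substring search and then answers context questions from the index alone. It assumes, as a simplification, that a line is excluded when some "excluding" word occurs within n lines of it; the class WordIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lookup structure: word -> (line number -> column offsets in that line).
final class WordIndex {
    private final Map<String, TreeMap<Integer, List<Integer>>> hits = new HashMap<>();

    // Single pass over the input, using String.indexOf instead of regexes.
    // `words` must contain both the interesting and the excluding words.
    static WordIndex build(List<String> lines, List<String> words) {
        WordIndex index = new WordIndex();
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            String line = lines.get(lineNo);
            for (String word : words) {
                for (int col = line.indexOf(word); col >= 0; col = line.indexOf(word, col + 1)) {
                    index.hits.computeIfAbsent(word, w -> new TreeMap<>())
                              .computeIfAbsent(lineNo, l -> new ArrayList<>())
                              .add(col);
                }
            }
        }
        return index;
    }

    // True if `word` occurs on any line in [line - n, line + n].
    boolean occursNear(String word, int line, int n) {
        TreeMap<Integer, List<Integer>> byLine = hits.get(word);
        return byLine != null && !byLine.subMap(line - n, true, line + n, true).isEmpty();
    }

    // Lines containing `interesting` with none of `excluders` within n lines.
    List<Integer> interestingLines(String interesting, List<String> excluders, int n) {
        List<Integer> result = new ArrayList<>();
        TreeMap<Integer, List<Integer>> byLine = hits.get(interesting);
        if (byLine == null) {
            return result;
        }
        for (int line : byLine.keySet()) {
            boolean excluded = false;
            for (String ex : excluders) {
                if (occursNear(ex, line, n)) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                result.add(line);
            }
        }
        return result;
    }
}
```

The TreeMap keeps line numbers sorted, so each "is there an excluding word nearby" question is a range lookup rather than a rescan of the text. Exclusion rules that genuinely need a regex over the context would still have to run on the few lines that survive this filter.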
A lot depends on the form of your data.
How complex is your context? Do you backtrack on finding interesting matches? If so, try to avoid backtracking. Perhaps you can first identify the context that leads to interesting matches on the following lines.
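One way to avoid going back over earlier lines is a single forward scan that remembers where the last excluding line was seen and holds candidates until enough following lines have passed. The sketch below assumes the context rule can be reduced to "an excluding pattern matches some line within n lines"; ForwardScan and its parameters are hypothetical names. For a file, the iterator could come from BufferedReader.lines().iterator().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical single-pass scanner; each line is read and tested exactly once.
final class ForwardScan {
    static List<Integer> scan(Iterator<String> lines, Pattern interesting,
                              Pattern excluder, int n) {
        List<Integer> result = new ArrayList<>();
        Deque<Integer> pending = new ArrayDeque<>(); // candidates still waiting for following lines
        int lastExcluder = -1;                       // line number of the most recent excluding line
        int lineNo = 0;
        while (lines.hasNext()) {
            String line = lines.next();
            // Candidates more than n lines back can no longer be excluded: confirm them.
            while (!pending.isEmpty() && lineNo - pending.peekFirst() > n) {
                result.add(pending.pollFirst());
            }
            if (excluder.matcher(line).find()) {
                lastExcluder = lineNo;
                pending.clear(); // everything still pending is within n lines of this line
            }
            if (interesting.matcher(line).find()) {
                boolean excludedByEarlier = lastExcluder >= 0 && lineNo - lastExcluder <= n;
                if (!excludedByEarlier) {
                    pending.addLast(lineNo);
                }
            }
            lineNo++;
        }
        result.addAll(pending); // end of input: nothing can exclude these any more
        return result;
    }
}
```

Memory use is bounded by the number of candidates in the last n lines, so the file never has to be held in memory or re-read.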
Also, do you need Java for this? With Unix/Linux CLI tools you can do quite powerful and quick manipulation of text files.
Please post your algorithm and what your data looks like. It doesn't need to be real data, just realistic data.
Use the multiline switch (?m) in your regex and include the preceding and following lines in your pattern. With (?m), ^ and $ match at every line boundary inside the input rather than only at its very start and end, so a single pattern can describe several consecutive lines. Something like this:
And use that to match all your input as a single String.
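A minimal sketch of that idea, under stated assumptions: the words ERROR and retry are placeholders, only a following-line exclusion is shown, and the class name WholeInputMatch is hypothetical. The whole input is matched as one String with a single compiled pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: one pattern applied to the entire input at once.
final class WholeInputMatch {
    // (?m) lets ^ and $ match at every line boundary inside the input.
    // The negative lookahead rejects an ERROR line when the next line contains "retry".
    static final Pattern INTERESTING =
            Pattern.compile("(?m)^.*ERROR.*$(?!\\n.*retry)");

    // Returns the full text of every line that matches and is not excluded.
    static List<String> interestingLines(String wholeInput) {
        List<String> result = new ArrayList<>();
        Matcher m = INTERESTING.matcher(wholeInput);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Because . does not match line terminators by default, each .* stays within its own line; the (?s) flag would change that if a pattern needs to run across line breaks. This sketch assumes lines end in \n, and for very large files holding the whole input in one String may itself be a constraint.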