Multiline pattern matching

Posted on 2024-12-03 11:16:57


Problem:

In a large file (plain text), there are some "interesting" lines which contain some specific words. The aim is to extract all those lines that contain such words. However, in some cases, even if a line contains such words, it may not be really "interesting", depending on its context (contents of lines above and below that line). Such lines should be excluded.

My algorithm:

I have one regex for each interesting word and apply these regexes to each line of the file. If a match is found, I check whether the line is excluded (depending on its context) by applying another set of regexes, which can potentially span multiple lines. If one of those matches, the line is not interesting and I move on to the remaining lines. If not, I register the line as an interesting line and move on to the next line.

To check if a line was excluded, I create a new string that looks like:

N number of lines above current line\n
The current line\n
N number of lines below current line
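
For reference, the approach described above might look roughly like the following in Java. This is only a minimal sketch: the class name `WindowFilter`, the method `interestingLines`, and its parameters are hypothetical, and it assumes the whole file has already been read into a list of lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; all names here are illustrative, not from the original post.
class WindowFilter {
    /**
     * Returns the 0-based indices of lines that match `interesting` and whose
     * context window (n lines above, the line itself, n lines below) matches
     * none of the `exclusions` patterns.
     */
    static List<Integer> interestingLines(List<String> lines, Pattern interesting,
                                          List<Pattern> exclusions, int n) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (!interesting.matcher(lines.get(i)).find()) {
                continue; // not a candidate, so no window is built
            }
            int from = Math.max(0, i - n);
            int to = Math.min(lines.size(), i + n + 1);
            // Build the context window string only for candidate lines.
            String window = String.join("\n", lines.subList(from, to));
            boolean excluded = false;
            for (Pattern p : exclusions) {
                if (p.matcher(window).find()) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                result.add(i);
            }
        }
        return result;
    }
}
```

The patterns are passed in precompiled, so `Pattern.compile` runs once per regex rather than once per line.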

This takes an awful amount of time.

My question: Is there a better way of doing this?

Thanks for your time.


红ご颜醉 2024-12-10 11:16:57


Regex is not necessarily fast; there are faster string search algorithms out there.

How about a more heuristic-based approach?

Process the file from start to finish, storing the line number and in-line offset of every word of interest in a lookup structure.
Once the lookup structure is populated, process it using something like the following algorithm:

for elem in selected_word_items:
    check line + index of related search items in structure.
    if within_desired_range:
        flag_for_further_processing()

The key here is that you process the file only once and then use the metadata structure to do the actual context checking. It should be quite a bit faster if you use the right data structures.
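
One possible way to sketch this idea in Java is shown below. It makes simplifying assumptions that are not part of the answer: the excluding context is reduced to a single marker string on a nearby line, plain substring search stands in for a faster string search algorithm, and only line numbers (not in-line offsets) are stored. The names `HitIndex`, `interestingLines`, `excludeMarker`, and `range` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch; names and the single-marker simplification are assumptions.
class HitIndex {
    static List<Integer> interestingLines(List<String> lines, String word,
                                          String excludeMarker, int range) {
        List<Integer> wordHits = new ArrayList<>();
        TreeSet<Integer> markerHits = new TreeSet<>();

        // Single pass over the file: record line numbers of both kinds of hits.
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.contains(word)) {
                wordHits.add(i);
            }
            if (line.contains(excludeMarker)) {
                markerHits.add(i);
            }
        }

        // Context check on the metadata only; the text is not scanned again.
        List<Integer> result = new ArrayList<>();
        for (int hit : wordHits) {
            // Smallest marker line >= hit - range, or null if there is none.
            Integer nearest = markerHits.ceiling(hit - range);
            boolean excluded = nearest != null && nearest <= hit + range;
            if (!excluded) {
                result.add(hit);
            }
        }
        return result;
    }
}
```

The sorted `TreeSet` makes each context check a logarithmic-time lookup instead of a rescan of the surrounding lines.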


遇到 2024-12-10 11:16:57

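A lot depends on the form of your data.

How complex is your context? Do you discard interesting matches after finding them? If so, try to avoid backtracking. Maybe you can first identify the context that leads to interesting matches in the following lines.

Also, do you need Java? With Unix/Linux CLI tools you can perform very powerful and fast operations on text files.

Please post your algorithm and what your data looks like. It does not need to be the real data, just realistic data.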


风月客 2024-12-10 11:16:57


Use the DOTALL switch (?s) in your regex and include the pre and post lines in your query. With (?s), . also matches line terminators, so the regex can work over multiple lines (a line break is just another character). Note that the multiline switch (?m) only changes where ^ and $ match and does not let . cross line breaks. Something like this:

String regex = "(?s)pre lines.*?interesting words.*?post lines";

And use that to match all your input as a single String.
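
A small Java sketch of matching the whole input as one String follows. In `java.util.regex`, `.` does not match line terminators unless DOTALL mode, (?s), is enabled, so the usage example relies on (?s) to let `.*?` span lines. The class `WholeInputMatch`, the method `findAll`, and the BEGIN/END markers in the usage note are hypothetical placeholders for the real pre and post context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class WholeInputMatch {
    /** Returns capture group 1 of every match of `regex` in `input`. */
    static List<String> findAll(String input, String regex) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1)); // the regex must define at least one group
        }
        return found;
    }
}
```

For example, `WholeInputMatch.findAll(text, "(?s)BEGIN.*?(ERROR[^\\n]*).*?END")` captures an ERROR line only when it sits between a BEGIN and an END marker. An unbounded `.*?` can stretch across many lines on a large input, so it may be worth limiting how far the context part of the pattern is allowed to reach.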
