
.NET RegEx - First N chars of first M lines

Posted 2024-10-09 17:04:46 · 1464 characters · 1 view · 0 comments

I want 4 general RegEx expressions for the following 4 basic cases:

1. Up to A chars starting after B chars from start of line, on up to C lines starting after D lines from start of file
2. Up to A chars starting after B chars from start of line, on up to C lines occurring before D lines from end of file
3. Up to A chars starting before B chars from end of line, on up to C lines starting after D lines from start of file
4. Up to A chars starting before B chars from end of line, on up to C lines starting before D lines from end of file

These would allow selecting arbitrary text blocks anywhere in the file.

So far I have only managed to come up with expressions that work for lines and chars separately:

(?<=(?m:^[^\r]{N}))[^\r]{1,M} = UP TO M chars OF EVERY LINE, AFTER FIRST N chars
[^\r]{1,M}(?=(?m:.{N}\r$)) = UP TO M chars OF EVERY LINE, BEFORE LAST N chars

The above 2 expressions are for chars, and they return MANY matches (one for each line).

(?<=(\A([^\r]*\r\n){N}))(?m:\n*[^\r]*\r$){1,M} = UP TO M lines AFTER FIRST N lines
(((?=\r?)\n[^\r]*\r)|((?=\r?)\n[^\r]+\r?)){1,M}(?=((\n[^\r]*\r)|(\n[^\r]+\r?)){N}\Z) = UP TO M lines BEFORE LAST N lines from end

These 2 expressions are the equivalents for lines, but they always return just ONE match.

The task is to combine these expressions to allow for scenarios 1-4. Can anyone help?

Note that the case in the title of the question, is just a subclass of scenario #1, where both B = 0 and D = 0.

EXAMPLE 1: Characters 3-6 of lines 3-5. A total of 3 matches.

SOURCE:

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

RESULT:

<match>ne3 </match>
<match>ne4 </match>
<match>ne5 </match>

EXAMPLE 2: Last 4 characters of 2 lines before 1 last line. A total of 2 matches.

SOURCE:

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

RESULT:

<match>ah 4</match>
<match>ah 5</match>

禾厶谷欠 2024-10-16 17:04:46

Here's one regex for basic case 2:

Regex regexObj = new Regex(
    @"(?<=              # Assert that the following can be matched before the current position
     ^                # Start of line
     .{2}             # 2 characters (B = 2)
    )                 # End of lookbehind assertion
    .{1,3}            # Match 1-3 characters (A = 3)
    (?=               # Assert that the following can be matched after the current position
     [^\r\n]*         # rest of the current line (in .NET, $ matches only before \n, so .*$ could not be followed by \r\n)
     (?:\r\n.*){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     \z               # end of the string
    )", 
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

In the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

ne2
ne3
ne4

(ne2 starts at the third character (B=2) in the fifth-to-last line (C+D = 5), etc.)
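
The numbers in the pattern above can also be filled in from variables. A minimal sketch, assuming CRLF line endings and no line break after the last line; `Case2Demo` and `BuildCase2` are hypothetical names that are not part of the answer, and the rest of the current line is written as `[^\r\n]*` so that the result does not depend on how `$` treats `\r`:

```csharp
using System;
using System.Text.RegularExpressions;

static class Case2Demo
{
    // Hypothetical helper: up to a chars after the first b chars of a line,
    // on up to c lines that are followed by at least d more lines (c >= 1).
    public static Regex BuildCase2(int a, int b, int c, int d)
    {
        string pattern =
            "(?<=^.{" + b + "})" +                           // exactly b chars since the line start
            ".{1," + a + "}" +                               // the text to return
            "(?=[^\\r\\n]*" +                                // rest of the current line
            "(?:\\r\\n.*){" + d + "," + (d + c - 1) + "}" +  // d to d+c-1 further lines
            "\\z)";                                          // then end of input
        return new Regex(pattern, RegexOptions.Multiline);
    }

    static void Main()
    {
        string text = "line1 blah 1\r\nline2 blah 2\r\nline3 blah 3\r\n" +
                      "line4 blah 4\r\nline5 blah 5\r\nline6 blah 6";

        // A = 3, B = 2, C = 3, D = 2, the same values as in the pattern above
        foreach (Match m in BuildCase2(3, 2, 3, 2).Matches(text))
            Console.WriteLine(m.Value);   // prints ne2, ne3, ne4 (one per line)
    }
}
```

Building the pattern by concatenation rather than `string.Format` avoids having to escape the quantifier braces.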

烟─花易冷 2024-10-16 17:04:46

For starters, here's an answer for "Basic Case 1":

Regex regexObj = new Regex(
    @"(?<=            # Assert that the following can be matched before the current position
     \A               # Start of string
     (?:.*\r\n){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     .{2}             # 2 characters (B = 2)
    )                 # End of lookbehind assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.IgnorePatternWhitespace);

You can now iterate over the matches using

Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // matched text: matchResults.Value
    // match start: matchResults.Index
    // match length: matchResults.Length
    matchResults = matchResults.NextMatch();
}

So, in the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

ne3
ne4
ne5
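
The case from the question's title (first N chars of the first M lines) is this pattern with B = 0 and D = 0. A sketch of building the pattern from the four values, assuming CRLF line endings and lines long enough to contain the requested characters; `Case1Demo` and `BuildCase1` are made-up names used only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class Case1Demo
{
    // Hypothetical helper: up to a chars after the first b chars of a line,
    // on up to c lines after the first d lines (c must be at least 1).
    public static Regex BuildCase1(int a, int b, int c, int d)
    {
        string pattern =
            "(?<=\\A(?:.*\\r\\n){" + d + "," + (d + c - 1) + "}" +  // d to d+c-1 complete lines
            ".{" + b + "})" +                                        // then exactly b chars
            ".{1," + a + "}";                                        // the text to return
        return new Regex(pattern);
    }

    static void Print(Regex regex, string text)
    {
        foreach (Match m in regex.Matches(text))
            Console.WriteLine("[" + m.Value + "]");
    }

    static void Main()
    {
        string text = "line1 blah 1\r\nline2 blah 2\r\nline3 blah 3\r\n" +
                      "line4 blah 4\r\nline5 blah 5\r\nline6 blah 6";

        Print(BuildCase1(3, 2, 3, 2), text);   // [ne3] [ne4] [ne5]
        Print(BuildCase1(5, 0, 2, 0), text);   // [line1] [line2]
    }
}
```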

空城仅有旧梦在 2024-10-16 17:04:46

Edit: Based on your comments, it sounds like this really is something out of your control. The reason I posted this answer is that I feel like often, especially when it comes to regular expressions, developers get easily caught up in the technical challenge and lose sight of the actual goal: solving the problem. I know I'm this way too. I think it's just an unfortunate consequence of being both technically and creatively minded.

So I wanted to refocus you, if possible, on the problem at hand, and stress that, in the presence of a well-stocked toolset, Regex is not the right tool for this job. If it's the only tool at your disposal for reasons outside your control, then, of course, you have no choice.

I figured you probably had real reasons for demanding a Regex solution; but since those reasons weren't fully explained, I felt there was still a chance you were just being stubborn ;)

You say this needs to be done in Regex, but I'm not convinced!

First of all I am restricted to .NET 2.0 [ . . . ]

No problem. Who says you need LINQ for a problem like this? LINQ just makes things easier; it doesn't make impossible things possible.

Here's one way you could implement the first case from your question, for example (and it would be fairly straightforward to refactor this into something more flexible, allowing you to cover cases 2–3 as well):

public IEnumerable<string> ScanText(TextReader reader,
                                    int start,
                                    int count,
                                    int lineStart,
                                    int lineCount)
{
    int i = 0;
    while (i < lineStart && reader.Peek() != -1)
    {
        reader.ReadLine();
        ++i;
    }

    i = 0;
    while (i < lineCount && reader.Peek() != -1)
    {
        string line = reader.ReadLine();

        if (line.Length < start)
        {
            yield return ""; // or null? or continue?
        }
        else
        {
            int length = Math.Min(count, line.Length - start);
            yield return line.Substring(start, length);
        }

        ++i;
    }
}

So there's a .NET 2.0-friendly solution to the general problem, without using regular expressions (or LINQ).
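
As a usage sketch, here is the same logic as ScanText above, made static and condensed so that the sample is self-contained, run against the sample text from the question. The class name `ScanTextDemo` and the chosen arguments are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ScanTextDemo
{
    // Same logic as ScanText above, written as a static iterator.
    public static IEnumerable<string> ScanText(TextReader reader, int start, int count,
                                               int lineStart, int lineCount)
    {
        for (int i = 0; i < lineStart && reader.Peek() != -1; i++)
            reader.ReadLine();                        // skip the first lineStart lines

        for (int i = 0; i < lineCount && reader.Peek() != -1; i++)
        {
            string line = reader.ReadLine();
            if (line.Length < start)
                yield return "";                      // line is too short to reach 'start'
            else
                yield return line.Substring(start, Math.Min(count, line.Length - start));
        }
    }

    static void Main()
    {
        string text = "line1 blah 1\nline2 blah 2\nline3 blah 3\n" +
                      "line4 blah 4\nline5 blah 5\nline6 blah 6";

        using (StringReader reader = new StringReader(text))
        {
            // Characters 3-6 of lines 3-5 (EXAMPLE 1 in the question):
            // skip 2 chars, take 4; skip 2 lines, take 3.
            foreach (string s in ScanText(reader, 2, 4, 2, 3))
                Console.WriteLine("<match>" + s + "</match>");
        }
    }
}
```

Because `TextReader.ReadLine` accepts `\n`, `\r` and `\r\n` as line terminators, this version does not care which line ending the file uses.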

Secondly I need the flexibility of RegEx to allow for more sophisticated expressions that will build on these [ . . . ]

Maybe I'm just being dense; what's preventing you from starting with something non-Regex, and then using Regex for more "sophisticated" behavior on top of that? If you need to do additional processing on the lines returned by ScanText above, for instance, you can certainly do so using Regex. But to insist on using Regex from the start seems... I don't know, just unnecessary.

Unfortunately due to nature of the project it has to be done in RegEx [ . . . ]

If that's truly the case, then very well. But if your reasons are only those from the excerpts above, then I disagree that this particular aspect of the problem (scanning certain characters from certain lines of text) needs to be addressed using Regex, even if Regex will be required for other aspects of the problem not covered in the scope of this question.

If, on the other hand, you're being forced to use Regex for some arbitrary reason—say, someone chose to write in some requirement/spec, possibly without putting much thought into it, that regular expressions would be used for this task—well, I would personally advise fighting against it. Explain to whoever is in a position to change this requirement that Regex is not necessary and that the problem can easily be solved without using Regex... or using a combination of "normal" code and Regex.

The only other possibility I can think of (though this may be the result of my own lack of imagination) that would explain you needing to use Regex for the problem you've described in your question is that you're restricted to using a particular tool that exclusively accepts regular expressions as user input. But your question is tagged .net, and so I have to assume there is some degree to which you can write your own code to be used in solving this problem. And if that's the case, then I will say it again: I don't think you need Regex ;)

榆西 2024-10-16 17:04:46

And finally one solution for basic case 4:

Regex regexObj = new Regex(
    @"(?=             # Assert that the following can be matched after the current position
     .{8}             # 8 characters (B = 8)
     (?:\r\n.*){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     \z               # end of the string
    )                 # End of lookahead assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.IgnorePatternWhitespace);

In the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

this will match

2 b
3 b
4 b

(2 b because it's three characters (A = 3), starting at the 8th-to-last character (B = 8) in the fifth-to-last line (C+D = 5), etc.)

故事↓在人 2024-10-16 17:04:46

Here's one for basic case 3:

Regex regexObj = new Regex(
    @"(?<=            # Assert that the following can be matched before the current position
     \A               # Start of string
     (?:.*\r\n){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     .*               # any number of characters
    )                 # End of lookbehind assertion
    (?=               # Assert that the following can be matched after the current position
     [^\r\n]{8}       # 8 characters (B = 8), not counting the line break
     \r?$             # end of line (in .NET, $ matches only before \n, so allow for the \r)
    )                 # End of lookahead assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

So in the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

3 b
4 b
5 b

(3 b because it's 3 characters (A = 3), starting at the 8th-to-last character (B = 8), starting in the third line (D = 2), etc.)
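
The same pattern can be assembled from the four values. This is only a sketch under stated assumptions: CRLF line endings, lines of at least B characters, and the hypothetical names `Case3Demo` and `BuildCase3`. The end of the line is written as `\r?$` because in .NET `$` with `RegexOptions.Multiline` matches before `\n` only, so the `\r` has to be accounted for explicitly:

```csharp
using System;
using System.Text.RegularExpressions;

static class Case3Demo
{
    // Hypothetical helper: up to a chars starting at the b-th-to-last char of a line,
    // on up to c lines after the first d lines (c must be at least 1).
    public static Regex BuildCase3(int a, int b, int c, int d)
    {
        string pattern =
            "(?<=\\A(?:.*\\r\\n){" + d + "," + (d + c - 1) + "}.*)" +  // inside one of the wanted lines
            "(?=[^\\r\\n]{" + b + "}\\r?$)" +                           // exactly b chars left on the line
            ".{1," + a + "}";                                           // the text to return
        return new Regex(pattern, RegexOptions.Multiline);
    }

    static void Main()
    {
        string text = "line1 blah 1\r\nline2 blah 2\r\nline3 blah 3\r\n" +
                      "line4 blah 4\r\nline5 blah 5\r\nline6 blah 6";

        // A = 3, B = 8, C = 3, D = 2, the same values as in the pattern above
        foreach (Match m in BuildCase3(3, 8, 3, 2).Matches(text))
            Console.WriteLine(m.Value);   // prints "3 b", "4 b", "5 b" (one per line)
    }
}
```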

森末i 2024-10-16 17:04:46

Why don't you just do something like this:

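// Assuming the text has been read into a string named sourceString
string[] splitString = sourceString.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None); // accepts CRLF and LF line endings
int lineCount = Math.Min(M, splitString.Length); // the text may have fewer than M lines
string[] newStrings = new string[lineCount];
for (int i = 0; i < lineCount; i++) {
    // First N characters; use Substring(N) instead for everything after the first N characters
    newStrings[i] = splitString[i].Substring(0, Math.Min(N, splitString[i].Length));
}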

You don't need RegEx, you don't need LINQ.

Well I reread the start of your question and you could simply parameterize the start and end of the for loop and the Split to get exactly what you need.
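
One way such a parameterization might look is sketched below. `BlockSlicer`, `Slice` and the two boolean flags are hypothetical names introduced here, and "before B chars from end of line" is read as "the A chars that end B chars before the end of the line", which is the reading that reproduces EXAMPLE 2 from the question:

```csharp
using System;
using System.Collections.Generic;

static class BlockSlicer
{
    // a = max chars, b = chars to skip, c = max lines, d = lines to skip.
    // charsFromEnd / linesFromEnd choose the end from which b and d are counted.
    public static List<string> Slice(string text, int a, int b, bool charsFromEnd,
                                     int c, int d, bool linesFromEnd)
    {
        string[] lines = text.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);

        // Window of line indexes [first, last), clamped to the available lines
        int first = linesFromEnd ? lines.Length - d - c : d;
        int last = Math.Min(first + c, lines.Length);
        first = Math.Max(first, 0);

        List<string> result = new List<string>();
        for (int i = first; i < last; i++)
        {
            string line = lines[i];

            // Window of char indexes [start, end), clamped to the line
            int start = charsFromEnd ? line.Length - b - a : b;
            int end = Math.Min(start + a, line.Length);
            start = Math.Max(start, 0);

            result.Add(end > start ? line.Substring(start, end - start) : "");
        }
        return result;
    }

    static void Main()
    {
        string text = "line1 blah 1\r\nline2 blah 2\r\nline3 blah 3\r\n" +
                      "line4 blah 4\r\nline5 blah 5\r\nline6 blah 6";

        // EXAMPLE 1: characters 3-6 of lines 3-5
        foreach (string s in Slice(text, 4, 2, false, 3, 2, false))
            Console.WriteLine("<match>" + s + "</match>");

        // EXAMPLE 2: last 4 characters of the 2 lines before the last line
        foreach (string s in Slice(text, 4, 0, true, 2, 1, true))
            Console.WriteLine("<match>" + s + "</match>");
    }
}
```

Note that a text ending in a line break yields one extra empty element from `Split`, which shifts the counts taken from the end of the file by one.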

恏ㄋ傷疤忘ㄋ疼 2024-10-16 17:04:46

Please excuse two points:

1. The solutions I propose aren't entirely Regex-based. I know, I read that you need pure Regex solutions. But I dug into this interesting problem and quickly concluded that using regexes for it overcomplicates things. I didn't feel able to answer with pure Regex solutions. I found the following ones and show them here; maybe they will give you ideas.
2. I don't know C# or .NET, only Python. As regexes are nearly the same in all languages, I thought I would answer with just regexes, which is why I started looking into the problem. I show my solutions in Python all the same, because I think they are easy to understand anyway.

I think it's very difficult to capture all the occurrences of the characters you want in a text with a single regex, because finding several characters in several lines looks to me like a problem of finding matches nested inside matches (maybe I'm just not skilled enough with regexes).

So I thought it better to first find the occurrences in all lines and put them in a list, and then to select the wanted occurrences by slicing the list.

For finding the characters in a line, a regex seemed fine to me at first. Hence the solution with the function selectRE().

Afterwards, I realized that selecting the characters in a line is the same as slicing the line at suitable indexes, which is the same as slicing a list. Hence the function select().

I give the two solutions together, so that the equality of the results of the two functions can be verified.

import re

def selectRE(a,which_chars,b,x,which_lines,y,ch):
    ch = ch[:-1] if ch.endswith('\n') else ch # drop the trailing newline to obtain an exact number of lines
    NL = ch.count('\n') +1 # number of lines

    def pat(a,which_chars,b):
        # Build a per-line pattern whose single group captures the wanted characters
        if which_chars=='to':
            p = ('.{'+str(a-1)+'}' if a else '') + '(.{'+str(b-a+1)+'}).*(?:\n|$)'
        elif which_chars=='before':
            p = '.*(.{'+str(a)+'})'+('.{'+str(b)+'}' if b else '')+'(?:\n|$)'
        elif which_chars=='after':
            p = ('.{'+str(b)+'}' if b else '')+'(.{'+str(a)+'}).*(?:\n|$)'
        print(repr(p))
        return re.compile(p)

    # Turn the line specification into a slice [x:y] over the list of lines
    if   which_lines=='to'    :  x   = x-1
    elif which_lines=='before':  x,y = NL-x-y,NL-y
    elif which_lines=='after' :  x,y = y,y+x

    return pat(a,which_chars,b).findall(ch)[x:y]


def select(a,which_chars,b,x,which_lines,y,ch):
    ch = ch[:-1] if ch.endswith('\n') else ch # drop the trailing newline to obtain an exact number of lines
    NL = ch.count('\n') +1 # number of lines

    # Turn the character specification into a slice [a:b] over each line
    if   which_chars=='to'    :  a   = a-1
    elif which_chars=='after' :  a,b = b,a+b

    # Turn the line specification into a slice [x:y] over the list of lines
    if   which_lines=='to'    :  x   = x-1
    elif which_lines=='before':  x,y = NL-x-y,NL-y
    elif which_lines=='after' :  x,y = y,y+x

    return [ line[len(line)-a-b:len(line)-b] if which_chars=='before' else line[a:b]
             for i,line in enumerate(ch.splitlines()) if x<=i<y ]


ch = '''line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6
'''
print(ch,'\n')

print('Characters 3-6 of lines 3-5. A total of 3 matches.')
print(selectRE(3,'to',6,3,'to',5,ch))
print(  select(3,'to',6,3,'to',5,ch))
print()
print('Characters 1-5 of lines 4-5. A total of 2 matches.')
print(selectRE(1,'to',5,4,'to',5,ch))
print(  select(1,'to',5,4,'to',5,ch))
print()
print('7 characters before the last 3 chars of lines 2-6. A total of 5 matches.')
print(selectRE(7,'before',3,2,'to',6,ch))
print(  select(7,'before',3,2,'to',6,ch))
print()
print('6 characters before the 2 last characters of 3 lines before the 3 last lines.')
print(selectRE(6,'before',2,3,'before',3,ch))
print(  select(6,'before',2,3,'before',3,ch))
print()
print('4 last characters of 2 lines before 1 last line. A total of 2 matches.')
print(selectRE(4,'before',0,2,'before',1,ch))
print(  select(4,'before',0,2,'before',1,ch))
print()
print('last 1 character of 4 last lines. A total of 4 matches.')
print(selectRE(1,'before',0,4,'before',0,ch))
print(  select(1,'before',0,4,'before',0,ch))
print()
print('7 characters before the last 3 chars of 3 lines after the 2 first lines. A total of 3 matches.')
print(selectRE(7,'before',3,3,'after',2,ch))
print(  select(7,'before',3,3,'after',2,ch))
print()
print('5 characters before the 3 last chars of the 5 first lines')
print(selectRE(5,'before',3,5,'after',0,ch))
print(  select(5,'before',3,5,'after',0,ch))
print()
print('Characters 3-6 of the 4 first lines')
print(selectRE(3,'to',6,4,'after',0,ch))
print(  select(3,'to',6,4,'after',0,ch))
print()
print('9 characters after the 2 first chars of the 3 lines after the 1 first line')
print(selectRE(9,'after',2,3,'after',1,ch))
print(  select(9,'after',2,3,'after',1,ch))

Result:

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6


Characters 3-6 of lines 3-5. A total of 3 matches.
'.{2}(.{4}).*(?:\n|$)'
['ne3 ', 'ne4 ', 'ne5 ']
['ne3 ', 'ne4 ', 'ne5 ']

Characters 1-5 of lines 4-5. A total of 2 matches.
'.{0}(.{5}).*(?:\n|$)'
['line4', 'line5']
['line4', 'line5']

7 characters before the last 3 chars of lines 2-6. A total of 5 matches.
'.*(.{7}).{3}(?:\n|$)'
['ne2 bla', 'ne3 bla', 'ne4 bla', 'ne5 bla', 'ne6 bla']
['ne2 bla', 'ne3 bla', 'ne4 bla', 'ne5 bla', 'ne6 bla']

6 characters before the 2 last characters of 3 lines before the 3 last lines.
'.*(.{6}).{2}(?:\n|$)'
['1 blah', '2 blah', '3 blah']
['1 blah', '2 blah', '3 blah']

4 last characters of 2 lines before 1 last line. A total of 2 matches.
'.*(.{4})(?:\n|$)'
['ah 4', 'ah 5']
['ah 4', 'ah 5']

last 1 character of 4 last lines. A total of 4 matches.
'.*(.{1})(?:\n|$)'
['3', '4', '5', '6']
['3', '4', '5', '6']

7 characters before the last 3 chars of 3 lines after the 2 first lines. A total of 3 matches.
'.*(.{7}).{3}(?:\n|$)'
['ne3 bla', 'ne4 bla', 'ne5 bla']
['ne3 bla', 'ne4 bla', 'ne5 bla']

5 characters before the 3 last chars of the 5 first lines
'.*(.{5}).{3}(?:\n|$)'
['1 bla', '2 bla', '3 bla', '4 bla', '5 bla']
['1 bla', '2 bla', '3 bla', '4 bla', '5 bla']

Characters 3-6 of the 4 first lines
'.{2}(.{4}).*(?:\n|$)'
['ne1 ', 'ne2 ', 'ne3 ', 'ne4 ']
['ne1 ', 'ne2 ', 'ne3 ', 'ne4 ']

9 characters after the 2 first chars of the 3 lines after the 1 first line
'.{2}(.{9}).*(?:\n|$)'
['ne2 blah ', 'ne3 blah ', 'ne4 blah ']
['ne2 blah ', 'ne3 blah ', 'ne4 blah ']

And now I will study the tricky solutions of Tim Pietzcker
