.NET RegEx - First N chars of first M lines
I want 4 general RegEx expressions for the following 4 basic cases:
- Up to A chars starting after B chars from start of line on up to C lines starting after D lines from start of file
- Up to A chars starting after B chars from start of line on up to C lines occurring before D lines from end of file
- Up to A chars starting before B chars from end of line on up to C lines starting after D lines from start of file
- Up to A chars starting before B chars from end of line on up to C lines starting before D lines from end of file
These would allow selecting arbitrary text blocks anywhere in the file.
So far I have managed to come up with cases that only work for lines and chars separately:
(?<=(?m:^[^\r]{N}))[^\r]{1,M}
= UP TO M chars OF EVERY LINE, AFTER FIRST N chars
[^\r]{1,M}(?=(?m:.{N}\r$))
= UP TO M chars OF EVERY LINE, BEFORE LAST N chars
The above 2 expressions are for chars, and they return MANY matches (one for each line).
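For comparison, here is a minimal Python sketch of the same two per-line ideas. It is an illustration only, not the .NET expressions above: it assumes "\n" line endings (so `.` stands in for `[^\r]`), and the function names `chars_after` and `chars_before` are made up for this example.

```python
import re

def chars_after(text, n, m):
    """Up to m chars of every line, after its first n chars."""
    # The lookbehind has a fixed width (n), which Python's re module requires.
    pattern = r"(?<=^.{%d}).{1,%d}" % (n, m)
    return re.findall(pattern, text, flags=re.MULTILINE)

def chars_before(text, n, m):
    """Up to m chars of every line, before its last n chars."""
    pattern = r".{1,%d}(?=.{%d}$)" % (m, n)
    return re.findall(pattern, text, flags=re.MULTILINE)

# Hypothetical sample mirroring the SOURCE text used in the examples.
sample = "\n".join("line%d blah %d" % (i, i) for i in range(1, 7))

print(chars_after(sample, 2, 4))   # one match per line, e.g. 'ne3 ' for line 3
print(chars_before(sample, 0, 4))  # one match per line, e.g. 'ah 4' for line 4
```

As with the original expressions, each call returns one match per line; restricting the matches to a range of lines is the part that still has to be combined in.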
(?<=(\A([^\r]*\r\n){N}))(?m:\n*[^\r]*\r$){1,M}
= UP TO M lines AFTER FIRST N lines
(((?=\r?)\n[^\r]*\r)|((?=\r?)\n[^\r]+\r?)){1,M}(?=((\n[^\r]*\r)|(\n[^\r]+\r?)){N}\Z)
= UP TO M lines BEFORE LAST N lines from end
These 2 expressions are the equivalents for lines, but they always return just ONE match.
The task is to combine these expressions to allow for scenarios 1-4. Can anyone help?
Note that the case in the title of the question is just a subclass of scenario #1, where both B = 0 and D = 0.
EXAMPLE 1: Characters 3-6 of lines 3-5. A total of 3 matches.
SOURCE:
line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6
RESULT:
<match>ne3 </match>
<match>ne4 </match>
<match>ne5 </match>
EXAMPLE 2: Last 4 characters of the 2 lines before the last line. A total of 2 matches.
SOURCE:
line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6
RESULT:
<match>ah 4</match>
<match>ah 5</match>
Here's one regex for basic case 2:
In the text
it will match
(ne2 starts at the third character (B=2) in the fifth-to-last line (C+D = 5), etc.)

For starters, here's an answer for "Basic Case 1":
You can now iterate over the matches using
So, in the text
it will match

Edit: Based on your comments, it sounds like this really is something out of your control. The reason I posted this answer is that I feel that often, especially when it comes to regular expressions, developers get easily caught up in the technical challenge and lose sight of the actual goal: solving the problem. I know I'm this way too. I think it's just an unfortunate consequence of being both technically and creatively minded.
So I wanted to refocus you, if possible, on the problem at hand, and stress that, in the presence of a well-stocked toolset, Regex is not the right tool for this job. If it's the only tool at your disposal for reasons outside your control, then, of course, you have no choice.
I figured you probably had real reasons for demanding a Regex solution; but since those reasons weren't fully explained, I felt there was still a chance you were just being stubborn ;)
You say this needs to be done in Regex, but I'm not convinced!
No problem. Who says you need LINQ for a problem like this? LINQ just makes things easier; it doesn't make impossible things possible.
Here's one way you could implement the first case from your question, for example (and it would be fairly straightforward to refactor this into something more flexible, allowing you to cover cases 2-3 as well):
So there's a .NET 2.0-friendly solution to the general problem, without using regular expressions (or LINQ).
Maybe I'm just being dense; what's preventing you from starting with something non-Regex, and then using Regex for more "sophisticated" behavior on top of that? If you need to do additional processing on the lines returned by ScanText above, for instance, you can certainly do so using Regex. But to insist on using Regex from the start seems... I don't know, just unnecessary.
If that's truly the case, then very well. But if your reasons are only those from the excerpts above, then I disagree that this particular aspect of the problem (scanning certain characters from certain lines of text) needs to be addressed using Regex, even if Regex will be required for other aspects of the problem not covered in the scope of this question.
If, on the other hand, you're being forced to use Regex for some arbitrary reason (say, someone wrote into a requirement or spec, possibly without putting much thought into it, that regular expressions would be used for this task), well, I would personally advise fighting against it. Explain to whoever is in a position to change this requirement that Regex is not necessary and that the problem can easily be solved without using Regex... or using a combination of "normal" code and Regex.
The only other possibility I can think of (though this may be the result of my own lack of imagination) that would explain you needing to use Regex for the problem you've described in your question is that you're restricted to using a particular tool that exclusively accepts regular expressions as user input. But your question is tagged .net, and so I have to assume there is some degree to which you can write your own code to be used in solving this problem. And if that's the case, then I will say it again: I don't think you need Regex ;)
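As a rough illustration of the non-regex approach argued for in this answer, here is a minimal Python sketch of a streaming scan for scenario 1. It is not the answer's own listing: `scan_text` and its parameters are hypothetical names chosen for this example, and it assumes the input is any iterable of lines, such as an open file.

```python
import io
from itertools import islice

def scan_text(reader, a, b, c, d):
    """Yield up to `a` chars after the first `b` chars of each line,
    for up to `c` lines after skipping the first `d` lines."""
    # islice reads lazily, so lines past d + c are never consumed.
    for line in islice(reader, d, d + c):
        chunk = line.rstrip("\r\n")[b:b + a]
        if chunk:  # lines with b or fewer characters contribute nothing
            yield chunk

# Hypothetical sample mirroring the SOURCE text of the question.
source = io.StringIO("".join("line%d blah %d\n" % (i, i) for i in range(1, 7)))

# Characters 3-6 of lines 3-5 (EXAMPLE 1 of the question).
print(list(scan_text(source, a=4, b=2, c=3, d=2)))
```

Because it only iterates once and stops early, the same function works on large files without loading them into memory.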

And finally one solution for basic case 4:
In the text
this will match
(2 b because it's three characters (A = 3), starting at the 8th-to-last character (B = 8) in the fifth-to-last line (C+D = 5), etc.)

Here's one for basic case 3:
So in the text
it will match
(3 b because it's 3 characters (A = 3), starting at the 8th-to-last character (B = 8), starting in the third line (D = 2), etc.)

Why don't you just do something like this:
You don't need RegEx, you don't need LINQ.
Well, I reread the start of your question, and you could simply parameterize the start and end of the for loop and the Split to get exactly what you need.
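A minimal Python sketch of what such a parameterized split-and-loop could look like for all four scenarios follows. The name `select_block` and the two flags are hypothetical, and it assumes the reading used by the question's own expressions: "before B chars from end of line" means the chunk ends B characters before the line end, and likewise for lines counted from the end of the file.

```python
def select_block(text, a, b, c, d, from_line_end=False, from_file_end=False):
    """Return up to `a` chars per line on up to `c` lines.

    b and d are offsets from the start of the line/file, or from the
    end when the matching flag is set.
    """
    lines = text.splitlines()
    if from_file_end:
        stop = len(lines) - d            # skip the last d lines
        picked = lines[max(stop - c, 0):max(stop, 0)]
    else:
        picked = lines[d:d + c]          # skip the first d lines

    result = []
    for line in picked:
        if from_line_end:
            stop = len(line) - b         # skip the last b chars
            chunk = line[max(stop - a, 0):max(stop, 0)]
        else:
            chunk = line[b:b + a]        # skip the first b chars
        if chunk:
            result.append(chunk)
    return result

# Hypothetical sample mirroring the SOURCE text of the question.
sample = "\n".join("line%d blah %d" % (i, i) for i in range(1, 7))

print(select_block(sample, 4, 2, 3, 2))   # EXAMPLE 1 of the question
print(select_block(sample, 4, 0, 2, 1,
                   from_line_end=True, from_file_end=True))  # EXAMPLE 2
```

The two flags map directly onto the four basic cases: neither set is case 1, only `from_file_end` is case 2, only `from_line_end` is case 3, and both set is case 4.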

Excuse me for two points:
I propose solutions that aren't entirely regex-based. I know, I read that you need pure regex solutions. But I dug into this interesting problem and quickly concluded that using regexes for it overcomplicates things. I didn't feel able to answer with pure regex solutions. I found the following ones and show them here; maybe they can give you ideas.
I don't know C# or .NET, only Python. As regexes are nearly the same in all languages, I thought I was going to answer with just regexes, which is why I started looking into the problem. I show my solutions in Python all the same, because I think they are easy to understand anyway.
I think it's very difficult to capture all the occurrences of the letters you want in a text with a single regex, because finding several letters in several lines seems to me to be a problem of finding nested matches within matches (maybe I am not skilled enough in regexes).
So I thought it better to first find all the occurrences of the letters in all lines and put them in a list, and then select the wanted occurrences by slicing the list.
For finding the letters in a line, a regex seemed OK to me at first. Hence the solution with the function selectRE().
Afterwards, I realized that selecting the letters in a line is the same as slicing the line at suitable indexes, which is the same as slicing a list. Hence the function select().
I give the two solutions together, so that the equality of the results of the two functions can be verified.
result
And now I will study the tricky solutions of Tim Pietzcker.
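A minimal sketch of the two approaches this answer describes, written for scenario 1 only. This is not the author's original code: the bodies of `select_re` and `select` below are assumptions based on the description (a per-line regex followed by slicing the list of hits, versus plain slicing of lines and characters).

```python
import re

def select_re(text, a, b, c, d):
    """Match up to `a` chars after `b` chars in every line with a regex,
    then slice the list of per-line hits to keep lines d .. d+c."""
    per_line = re.compile(r".{%d}(.{1,%d})" % (b, a))
    hits = []
    for line in text.splitlines():
        m = per_line.match(line)
        # Keep one entry per line so list indexes stay aligned with lines.
        hits.append(m.group(1) if m else "")
    return [h for h in hits[d:d + c] if h]

def select(text, a, b, c, d):
    """Same selection using only slicing of lines and characters."""
    chunks = (line[b:b + a] for line in text.splitlines()[d:d + c])
    return [chunk for chunk in chunks if chunk]

# Hypothetical sample mirroring the SOURCE text of the question.
sample = "\n".join("line%d blah %d" % (i, i) for i in range(1, 7))

# Both functions should agree on characters 3-6 of lines 3-5.
print(select_re(sample, 4, 2, 3, 2))
print(select(sample, 4, 2, 3, 2))
print(select_re(sample, 4, 2, 3, 2) == select(sample, 4, 2, 3, 2))
```

The slicing version needs no pattern at all, which is the point the answer makes about the regex overcomplicating the problem.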