Using regex to extract data blocks that share the same title

Posted on 2025-02-05 09:26:07

I have a long string that is built like this:

[[title]]
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]

[[title]]
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]

[[title]]
a = "a3"
b = "3"
c = "3"

[[title]]
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]

My goal is to extract the text under each title (without the title itself) and put it into a slice.
I've tried anchoring on the attribute keys (like d and e), but sometimes they don't exist.

Here is my regex so far:

(?m)(((\[\[title]]\s*\n)(?:^.+$\n)+?)(d.*?$)(\s*e(.|\n)*?])?)

I want to find a way to extract the data after each title, up to \n or the end of the string.

Edit:

I'm using Go, so I can't use lookaround (lookahead/lookbehind) syntax.

Thanks!

Comments (4)

顾北清歌寒 2025-02-12 09:26:07

You can use the following pattern, which matches from [[title]] to an empty line.

\[\[title]](.*?)^$

Flags: gms

Explanation

  • \[\[title]] Match [[title]]
  • ( Capturing group
    • .*? Non-greedy match till next match
  • ) Close group
  • ^$ Using m (multiline) flag this means an empty line

See the demo with the Golang regex engine
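
Since the question targets Go, here is a minimal sketch of how this pattern might be applied with the standard regexp package. The m and s flags are written inline as (?ms), and FindAllStringSubmatch stands in for the g flag. The helper name extractBlocks and the sample input are illustrative assumptions, not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// titleBlock captures everything between a [[title]] header and the next
// empty line. (?m) makes ^ and $ match at line boundaries; (?s) lets "."
// match newlines so the capture can span several lines.
var titleBlock = regexp.MustCompile(`(?ms)\[\[title]](.*?)^$`)

// extractBlocks (hypothetical helper) returns the body of every block,
// with the surrounding newlines trimmed.
func extractBlocks(input string) []string {
	var blocks []string
	for _, m := range titleBlock.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	// Illustrative input; note the trailing newline after the last block.
	input := "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n[[title]]\na = \"a3\"\nb = \"3\"\n"
	for i, block := range extractBlocks(input) {
		fmt.Printf("--- block %d ---\n%s\n", i, block)
	}
}
```

With this pattern the last block is only matched when the input ends with a newline, because ^ needs a preceding line break to match at the very end of the text.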

清晰传感 2025-02-12 09:26:07

This seems to work. It's not as simple or elegant as @ArtyomVancyan's answer, but it has the small advantage that it doesn't need a newline at the end of the input string:

[Demo]

(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+

Explanation:

  • (?m): multi line modifier.
  • (?:\[\[title]]\n(<text until next closing square bracket or blank line>))+: find one or more blocks starting with [[title]]\n and followed by <text until next closing square bracket or blank line>, and capture those texts.
  • (?:.*\n)+?(?:\]|^$): two consecutive non-capturing subgroups. The first is a run of lines, (?:.*\n)+, made non-greedy by the trailing ?. The second is either a closing square bracket, ], or an empty line, ^$. That is, a run of lines ending at the first line that starts with a closing square bracket or at the first blank line.
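
A minimal Go sketch of this pattern in use, assuming the blocks are separated by blank lines as in the question. The function name blocksUntilBracketOrBlank and the sample input are made up for illustration; only capture group 1 is read from each match.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// blockPattern matches a [[title]] line followed by the fewest lines needed
// to reach either a line starting with "]" or an empty line.
var blockPattern = regexp.MustCompile(`(?m)(?:\[\[title]]\n((?:.*\n)+?(?:\]|^$)))+`)

// blocksUntilBracketOrBlank (hypothetical helper) collects capture group 1
// of every match and trims the trailing newline left by blank-line endings.
func blocksUntilBracketOrBlank(input string) []string {
	var blocks []string
	for _, m := range blockPattern.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	// Illustrative input with no newline after the final "]".
	input := "[[title]]\na = \"a3\"\nb = \"3\"\n\n[[title]]\na = \"a4\"\ne = [\n \"4\",\n]"
	for i, block := range blocksUntilBracketOrBlank(input) {
		fmt.Printf("--- block %d ---\n%s\n", i, block)
	}
}
```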
情域 2025-02-12 09:26:07

Instead of crafting a regex, which seems fraught with peril, you'll probably be better served by building a custom parser for your custom format, or you may find you can repurpose an existing INI config parser implementation.

If the titles are always defined as sitting within pairs of [[ ]] at the start of a block, you could use a regex to find them, but only to separate the blocks out.

If you're not interested in the content (though surely that is the next step) and you're sure the structure is as simple as you show, you could also just split twice on these delimiters directly:

>>> long_string_config = """ """  # input data omitted for brevity
>>> for block in filter(None, (a.split("]]")[-1].strip() for a in long_string_config.split("[["))):
...    print("---")
...    print(block)
...
---
a = "1"
b = "1"
c = "1"
d = "1"
e = [
 "1",
 "1",
]
---
a = "2"
b = "2"
c = "2"
d = "2"
e = [
 "2",
]
---
a = "a3"
b = "3"
c = "3"
---
a = "a4"
b = "4"
c = "4"
e = [
 "4",
]
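
Because the asker works in Go, the same idea can be sketched there too. This version is a variation rather than a literal port: it splits once on the literal header [[title]] instead of splitting on [[ and ]] separately, which assumes every header is exactly that string. The name splitOnTitle and the sample input are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOnTitle (hypothetical helper) splits the input on the literal
// "[[title]]" header and keeps every non-empty, trimmed chunk.
func splitOnTitle(input string) []string {
	var blocks []string
	for _, part := range strings.Split(input, "[[title]]") {
		if block := strings.TrimSpace(part); block != "" {
			blocks = append(blocks, block)
		}
	}
	return blocks
}

func main() {
	// Illustrative input in the same shape as the question's data.
	input := "[[title]]\na = \"1\"\n\n[[title]]\nb = \"2\"\ne = [\n \"2\",\n]\n"
	for _, block := range splitOnTitle(input) {
		fmt.Println("---")
		fmt.Println(block)
	}
}
```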
殤城〤 2025-02-12 09:26:07

You might use a pattern that repeats the possible format of the lines under the title.

Each line starts with word characters, followed by = and then either a "..." part or a [...] part.

\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)

Explanation

  • \[\[title]] Match [[title]]
  • ( Capture group 1
    • (?: Non capture group
      • \r?\n Match a newline
      • \w+\s*=\s* Match 1+ word chars, then = between optional whitespace chars
      • (?: Non capture group for the alternatives
        • "[^"]*" Match a quoted "..." value
        • | Or
        • \[[^\]\[]*] Match a bracketed [...] value
      • ) Close non capture group
    • )* Close non capture group and optionally repeat
  • ) Close group 1

Regex demo
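
A minimal Go sketch of this pattern, assuming values never contain embedded double quotes or nested brackets (the character classes in the pattern would stop at them). The helper name extractKeyValueBlocks and the sample input are illustrative, not from the original answer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// kvBlock matches a [[title]] header followed by any number of
// `key = "value"` or `key = [ ... ]` lines; group 1 holds those lines.
var kvBlock = regexp.MustCompile(`\[\[title]]((?:\r?\n\w+\s*=\s*(?:"[^"]*"|\[[^\]\[]*]))*)`)

// extractKeyValueBlocks (hypothetical helper) returns group 1 of every
// match with the leading newline trimmed.
func extractKeyValueBlocks(input string) []string {
	var blocks []string
	for _, m := range kvBlock.FindAllStringSubmatch(input, -1) {
		blocks = append(blocks, strings.TrimSpace(m[1]))
	}
	return blocks
}

func main() {
	// Illustrative input; no trailing newline is required here.
	input := "[[title]]\na = \"1\"\ne = [\n \"1\",\n]\n\n[[title]]\na = \"a3\""
	for i, block := range extractKeyValueBlocks(input) {
		fmt.Printf("--- block %d ---\n%s\n", i, block)
	}
}
```

Unlike the blank-line approaches, this one stops at the first line that does not look like a key/value pair, so it does not depend on how the blocks are separated.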
