Regex - applying to a text file
I have a text file with the following structure:
KEYWORD0 DataKey01-DataValue01 DataKey02-DataValue02 ... DataKey0N-DataValue0N
KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12 DataKey13-DataValue13
_________DataKey14-DataValue14 DataKey1N-DataValue1N (1)
// It is significant that the additional data keys are on a new line
(1) The underscores are not part of the data; I used them to align the data.
Question: How do I use a regex to convert my data to this format?
<KEYWORD0>
<DataKey00>DataValue00</DataKey00>
<DataKey01>DataValue01</DataKey01>
<DataKey02>DataValue02</DataKey02>
<DataKey0N>DataValue0N</DataKey0N>
</KEYWORD0>
<KEYWORD1>
<DataKey10>DataValue10</DataKey10>
<DataKey11>DataValue11</DataKey11>
<DataKey12>DataValue12</DataKey12>
<DataKey13>DataValue13</DataKey13>
<DataKey14>DataValue14</DataKey14>
<DataKey1N>DataValue1N</DataKey1N>
</KEYWORD1>
Comments (4)
Regex is for masochists. Here is a very simple text parser in VB.NET (converted from C#, so check for bugs):
Note that I added a root element to the XML because the .NET XML objects only like reading and writing conformant XML.
Also note that the code uses an extension I wrote for String.Split.
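The VB.NET code from this answer did not survive on this page. As a rough illustration of the same idea (a split-based, line-by-line parser that wraps everything in a root element), here is a minimal Python sketch. It is not the answerer's code: the function name parse_records, the root element name "root", and the rule that a line starting with whitespace continues the previous record are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def parse_records(text):
    """Build an XML tree from KEYWORD / Key-Value lines (hypothetical helper)."""
    root = ET.Element("root")  # assumed root name; the answer only says a root was added
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tokens = line.split()
        # Assumption: an unindented line starts a new record,
        # an indented line continues the previous one.
        if not line[0].isspace():
            current = ET.SubElement(root, tokens[0])
            tokens = tokens[1:]
        if current is None:
            raise ValueError("continuation line before any keyword")
        for token in tokens:
            key, sep, value = token.partition("-")
            if not sep:
                continue  # not a Key-Value pair, ignore it
            ET.SubElement(current, key).text = value
    return root

sample = (
    "KEYWORD0 DataKey01-DataValue01 DataKey02-DataValue02\n"
    "KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12\n"
    "         DataKey14-DataValue14 DataKey1N-DataValue1N\n"
)
root = parse_records(sample)
print(ET.tostring(root, encoding="unicode"))
```

Building the tree with ElementTree instead of concatenating strings means the output is always well-formed, which is the same reason the answer gives for adding a root element.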
^(\w)\s*((\w)\s*)(\r\n^\s+(\w)\s*)*
This is starting to get in the neighborhood, but I think this is just easier to do in a programming language: just process the file line by line.
You need to use the Groups and Matches feature of Regex in .NET and apply something like:
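The .NET snippet that followed this sentence is missing from the page. To show what a groups-and-matches approach can look like, here is a sketch in Python, whose re module offers comparable match and group objects. The patterns RECORD and PAIR and the function to_xml are hypothetical names written for this example, and the sketch assumes that continuation lines start with a space or tab.

```python
import re

# One match per record: "keyword" is the first word of an unindented line,
# "body" is the rest of that line plus any indented continuation lines.
RECORD = re.compile(
    r"^(?P<keyword>\S+)(?P<body>[^\n]*(?:\n[ \t]+[^\n]*)*)", re.MULTILINE
)
# One match per Key-Value token inside a record body.
PAIR = re.compile(r"(?P<key>[^-\s]+)-(?P<value>\S+)")

def to_xml(text):
    out = []
    for record in RECORD.finditer(text):
        keyword = record.group("keyword")
        out.append(f"<{keyword}>")
        for pair in PAIR.finditer(record.group("body")):
            key, value = pair.group("key"), pair.group("value")
            out.append(f"<{key}>{value}</{key}>")
        out.append(f"</{keyword}>")
    return "\n".join(out)

sample = (
    "KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12\n"
    "         DataKey14-DataValue14\n"
)
print(to_xml(sample))
```

The outer loop walks the record matches and the inner loop walks the pair matches within each record, so the nesting of the loops mirrors the nesting of the output.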
If the DataValue and DataKey items can't contain '<', '>' or '-' chars or spaces, you can do something like this:
Read your file into a string and do a replaceAll with a regex similar to this:
([^- \t]+)-([^- \t]+)
and use this as the replacement:
<$1>$2</$1>
This will convert something like this:
DataKey01-DataValue01
into something like this:
<DataKey01>DataValue01</DataKey01>
After that you need to run another global replace, this time with this regex:
^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)
and replace with <$1>$2</$1> again.
This should do the trick.
I don't program in VB.NET, so I have no idea if the actual syntax is correct (you might need to double or quadruple the '\' in some cases). You should make sure to enable the Multiline option for the second pass.
To explain:
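In the first regex, ([^- \t]+)-([^- \t]+):
'([^- \t]+)' will match any string of chars not containing ' ' or '-' or '\t'. This is marked as $1 (notice the parentheses around it).
'-' will match the '-' char.
'([^- \t]+)' will again match any string of chars not containing ' ' or '-' or '\t'. This is also marked as $2 (notice the parentheses around it).
So a matched 'ab-cd' string is replaced with '<ab>cd</ab>'.
After this step every Key-Value token in the file has become a <Key>Value</Key> element, while the keywords are still untouched.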
In the second regex, ^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+):
'^([^ \t]+)' marks and matches any string of chars other than ' ' or '\t' beginning at the start of the line (this is $1).
'(' begins a mark.
'\s+' matches white space.
'(?:' starts a non-marked group.
'<[^>]+>' matches an opening XML tag: '<ab>'.
'[^<]+' matches the inside of a tag: 'cd'.
'</[^>]+>' matches a closing tag: '</ab>'.
'[\s\n]*' matches some optional white space or newlines.
')+' closes the non-marked group and repeats it at least one time.
')' closes the mark (this is $2).
The replacement is straightforward now.
Hope it helps.
But you should probably try to make a simple parser if this is not a one-off job :)
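The two passes described in this answer can be tried out with a short Python sketch. It is only an approximation of what the answer proposes for .NET: the function name convert is hypothetical, Python writes the replacement as \1 and \2 rather than $1 and $2, and the first pattern is tightened to [^-\s] because on a whole-file string the class [^- \t] would also match a newline and let a value run into the next line.

```python
import re

# Pass 1: Key-Value -> <Key>Value</Key>.
# Assumption: [^-\s] is used instead of the answer's [^- \t] so that a
# key or value can never swallow a line break.
PAIR = re.compile(r"([^-\s]+)-([^-\s]+)")

# Pass 2: the answer's second regex, unchanged, with the Multiline option
# so that ^ matches at the start of every line.
RECORD = re.compile(
    r"^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)", re.MULTILINE
)

def convert(text):
    text = PAIR.sub(r"<\1>\2</\1>", text)       # wrap every pair
    return RECORD.sub(r"<\1>\2</\1>", text)     # wrap every record

sample = (
    "KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12\n"
    "         DataKey14-DataValue14\n"
)
print(convert(sample))
```

The whitespace of the input, including the indentation of continuation lines, is carried over into the output unchanged, so the result is valid as an XML fragment but is not pretty-printed.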