Regex - applying to a text file

Posted 2024-10-18 05:41:47

I have a text file with the following structure:

KEYWORD0 DataKey01-DataValue01 DataKey02-DataValue02 ... DataKey0N-DataValue0N

KEYWORD1 DataKey11-DataValue11 DataKey12-DataValue12 DataKey13-DataValue13
_________DataKey14-DataValue14 DataKey1N-DataValue1N (1)

// It is significant that the additional datakeys are on a new line

(1) The underscores are not part of the data. I used them to align the data.

Question: How do I use a regex to convert my data to this format?

<KEYWORD0>
    <DataKey00>DataValue00</DataKey00>
    <DataKey01>DataValue01</DataKey01>
    <DataKey02>DataValue02</DataKey02>
    <DataKey0N>DataValue0N</DataKey0N>
</KEYWORD0>
<KEYWORD1>
    <DataKey10>DataValue10</DataKey10>
    <DataKey11>DataValue11</DataKey11>
    <DataKey12>DataValue12</DataKey12>
    <DataKey13>DataValue13</DataKey13>
    <DataKey14>DataValue14</DataKey14>
    <DataKey1N>DataValue1N</DataKey1N>
</KEYWORD1>

Comments (4)

晨曦÷微暖 2024-10-25 05:41:47

Regex is for masochists. Here is a very simple text parser in VB.NET (converted from C#, so check for bugs):

Imports System.IO
Imports System.Xml

Public Class MyFileConverter
    Public Sub Parse(inputFilename As String, outputFilename As String)
        Using reader As New StreamReader(inputFilename)
            Using writer As New StreamWriter(outputFilename)
                Parse(reader, writer)
            End Using
        End Using
    End Sub

    Public Sub Parse(reader As TextReader, writer As TextWriter)
        Dim line As String
        Dim state As Integer = 0

        Dim xmlWriter As New XmlTextWriter(writer)
        xmlWriter.WriteStartDocument()
        xmlWriter.WriteStartElement("Keywords")
        ' Root element required for conformance
        While (InlineAssignHelper(line, reader.ReadLine())) IsNot Nothing
            If line.Length = 0 Then
                If state > 0 Then
                    xmlWriter.WriteEndElement()
                End If
                state = 0
                Continue While
            End If

            Dim parts As String() = line.Split(Function(c) [Char].IsWhiteSpace(c), StringSplitOptions.RemoveEmptyEntries)
            Dim index As Integer = 0

            If state = 0 Then
                state = 1
                ' The first token on the first line of a record is the keyword.
                ' (The converted Math.Max/Interlocked.Increment expression picked parts(1), not parts(0).)
                xmlWriter.WriteStartElement(parts(index))
                index += 1
            End If

            While index < parts.Length
                Dim keyvalue As String() = parts(index).Split("-"C)
                xmlWriter.WriteStartElement(keyvalue(0))
                xmlWriter.WriteString(keyvalue(1))
                xmlWriter.WriteEndElement()
                index += 1
            End While
        End While

        If state > 0 Then
            xmlWriter.WriteEndElement()
        End If
        xmlWriter.WriteEndElement()
        xmlWriter.WriteEndDocument()
        ' Push any buffered output to the underlying writer before it is disposed
        xmlWriter.Flush()
    End Sub
    Private Shared Function InlineAssignHelper(Of T)(ByRef target As T, value As T) As T
        target = value
        Return value
    End Function
End Class

Note that I added a root element to the XML because the .NET XML classes only like reading and writing conformant XML.

Also note that the code uses an extension method I wrote for String.Split that takes a predicate; the Function(c) overload used above is not part of the framework.

不交电费瞎发啥光 2024-10-25 05:41:47

^(\w)\s*((\w)\s*)(\r\n^\s+(\w)\s*)*

This gets into the neighborhood, but I think this is just easier to do in a programming language: process the file line by line.
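A minimal sketch of the line-by-line idea, written in Python rather than VB.NET purely for illustration. The function name `records_to_xml` is hypothetical, and the sketch assumes that a line starting with whitespace continues the previous record, that blank lines carry no data, and that each data token has the form key-value with no hyphen inside the key.

```python
from xml.sax.saxutils import escape


def records_to_xml(text):
    """Convert 'KEYWORD key-value ...' records into nested elements.

    Assumption: a line that starts with whitespace continues the
    previous record; any other non-blank line starts a new record.
    """
    out = []
    keyword = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no data
        tokens = line.split()
        if not line[0].isspace():
            # New record: close the previous one, then open this keyword.
            if keyword is not None:
                out.append(f"</{keyword}>")
            keyword = tokens.pop(0)
            out.append(f"<{keyword}>")
        for token in tokens:
            # Split on the first hyphen only: key on the left, value on the right.
            key, _, value = token.partition("-")
            out.append(f"    <{key}>{escape(value)}</{key}>")
    if keyword is not None:
        out.append(f"</{keyword}>")
    return "\n".join(out)
```

The output has no single root element, so it would need to be wrapped in one before being handed to an XML parser.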

一片旧的回忆 2024-10-25 05:41:47

You need to use the Groups and Matches feature of Regex in .NET and apply something like:

([A-Z\d]+)(\s([A-Za-z\d]+)\-([A-Za-z\d]+))*
  1. Find a Match and select the first Group to find the KEYWORD
  2. Loop through the Captures of Groups 3 and 4 to get each DataKey and DataValue for that KEYWORD
  3. Go to 1
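
The steps above rely on .NET keeping every capture of a repeated group. As a rough illustration in Python, whose `re` module keeps only the last capture of a repeated group, the sketch below swaps in a two-step approach: one pattern finds a keyword plus its run of pairs, and a second pattern pulls the pairs out of that run. The names `RECORD`, `PAIR` and `parse_records` are hypothetical, and `\s+` replaces the single `\s` so that indented continuation lines are picked up; that change is an assumption about the input, not part of the answer.

```python
import re

# Keyword followed by any number of whitespace-separated key-value pairs.
RECORD = re.compile(r"([A-Z\d]+)((?:\s+[A-Za-z\d]+-[A-Za-z\d]+)*)")
# A single key-value pair inside the run captured by RECORD.
PAIR = re.compile(r"([A-Za-z\d]+)-([A-Za-z\d]+)")


def parse_records(text):
    """Return a list of (keyword, [(key, value), ...]) tuples."""
    records = []
    for match in RECORD.finditer(text):
        keyword, body = match.group(1), match.group(2)
        # Step 2 of the list above: collect every pair for this keyword.
        records.append((keyword, PAIR.findall(body)))
    return records
```

Because `\s+` also matches newlines, a record ends where the next token is not a key-value pair, which is how the following keyword starts a new match.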
多孤肩上扛 2024-10-25 05:41:47

If the DataValue and DataKey items can't contain <, >, or '-' characters or spaces, you can do something like this:

Read your file into a string and do a replaceAll with a regex similar to this: ([^- \t]+)-([^- \t]+) and use this as the replacement: <$1>$2</$1>. This will convert something like DataKey01-DataValue01 into <DataKey01>DataValue01</DataKey01>.

After that you need to run another global replace, this time with the regex ^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+), and replace with <$1>$2</$1> again.

This should do the trick.

I don't program in VB.NET, so I have no idea whether the actual syntax is correct (you might need to double or quadruple the \ in some cases). Make sure to enable the Multiline option for the second pass.

To explain:

([^- \t]+)-([^- \t]+)
  • ([^- \t]+) will match any run of characters that contains no space, - or \t. This is captured as $1 (notice the parentheses around it)
  • - will match the - char
  • ([^- \t]+) will again match any run of characters that contains no space, - or \t. This is captured as $2 (notice the parentheses around it)
  • The replacement converts each matched ab-cd string into <ab>cd</ab>
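
As a rough illustration of this first pass, here is a Python sketch (Python writes the answer's $1 and $2 as \1 and \2). The names `PAIR` and `wrap_pairs` are hypothetical. The sketch makes one deliberate change: the character class uses `\s` instead of a literal space and `\t`, because a class that does not exclude newlines lets a value at the end of a line run on into the next line's keyword.

```python
import re

# key-value, where neither side contains a hyphen or any whitespace.
# Assumption: \s is used instead of ' \t' so a value stops at a line break.
PAIR = re.compile(r"([^-\s]+)-([^-\s]+)")


def wrap_pairs(text):
    """Rewrite every key-value token as <key>value</key>."""
    return PAIR.sub(r"<\1>\2</\1>", text)
```

Keywords are left alone because they are not followed by a hyphen, so the pattern never matches them.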

After this step the file looks like:

KEYWORD0 <DataKey00>DataValue00</DataKey00> <DataKey01>DataValue01</DataKey01>
   <DataKey02>DataValue02</DataKey02> <DataKey0N>DataValue0N</DataKey0N>

KEYWORD1 <DataKey10>DataValue10</DataKey10> <DataKey11>DataValue11</DataKey11>
   <DataKey12>DataValue12</DataKey12> <DataKey13>DataValue13</DataKey13>
   <DataKey14>DataValue14</DataKey14> <DataKey1N>DataValue1N</DataKey1N>

^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)

  • ^([^ \t]+) capture any run of characters other than space or \t at the beginning of the line (this is $1)
  • ( begin a capture group
    • \s+ white space
    • (?: start a non-capturing group
      • <[^>]+> match an opening xml tag: <ab>
      • [^<]+ match the content inside the tag: bc
      • </[^>]+> match a closing tag: </ab>
      • [\s\n]* some optional white space or newlines
    • )+ close the non-capturing group and repeat it at least once
  • ) close the capture group (this is $2)

The replacement is straightforward now.
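
A sketch of this second pass in Python, using the regex exactly as given above. `re.MULTILINE` plays the role of the Multiline option, and \1 and \2 stand in for $1 and $2. The names `RECORD` and `wrap_records` are hypothetical, and the input is assumed to be text in which the key-value pairs have already been turned into elements.

```python
import re

# Keyword at the start of a line, then one or more <tag>text</tag> elements,
# possibly spread over several lines.
RECORD = re.compile(
    r"^([^ \t]+)(\s+(?:<[^>]+>[^<]+</[^>]+>[\s\n]*)+)",
    re.MULTILINE,  # lets ^ match at the start of every line
)


def wrap_records(text):
    """Wrap each keyword and its run of elements in <KEYWORD>...</KEYWORD>."""
    return RECORD.sub(r"<\1>\2</\1>", text)
```

Since $2 also swallows the whitespace after the last element, the closing keyword tag ends up after any trailing blank lines of the record. The result still lacks a single root element, so it is a sequence of elements rather than a complete XML document.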

Hope it helps.

But you should probably try to write a simple parser if this is not a one-off job :)
