Best way to parse space-separated text

Posted on 2024-07-05 04:07:24

I have a string like this:

 /c SomeText\MoreText "Some Text\More Text\Lol" SomeText

I want to tokenize it, but I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.

This is in C# btw.

EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.

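// Requires: using System; using System.Collections.Generic;
private string[] tokenize(string input)
{
    // Split on single spaces; quoted phrases are stitched back together below.
    string[] tokens = input.Split(' ');
    List<String> output = new List<String>();

    for (int i = 0; i < tokens.Length; i++)
    {
        // A piece that opens a quote without also closing it starts a multi-word phrase.
        bool closesItself = tokens[i].Length > 1 && tokens[i].EndsWith("\"");
        if (tokens[i].StartsWith("\"") && !closesItself)
        {
            string temp = tokens[i];
            int k;
            for (k = i + 1; k < tokens.Length; k++)
            {
                temp += " " + tokens[k];
                if (tokens[k].EndsWith("\""))
                {
                    break;
                }
            }
            output.Add(temp);
            // Resume at the closing piece; the loop's i++ then moves past it.
            i = k;
        }
        else
        {
            output.Add(tokens[i]);
        }
    }

    return output.ToArray();
}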

Comments (6)

迟月 2024-07-12 04:07:24

The computer science term for what you're doing is lexical analysis; an introduction to that topic gives a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.

The simplest way to do this is to define a word as a regular expression:

([^"\s]+)\s*|"([^"]+)"\s*

This expression states that a "word" is either (1) non-quote, non-whitespace text followed by optional whitespace, or (2) non-quote text surrounded by quotes (again followed by optional whitespace). Note the use of capturing parentheses to pick out the desired text.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.

' Requires: Imports System.Text.RegularExpressions
Dim token As String
Dim r As Regex = New Regex("([^""\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")

While m.Success
    ' Group 1 holds an unquoted word; group 2 holds the text inside quotes.
    token = m.Groups(1).ToString
    If token.Length = 0 AndAlso m.Groups.Count > 1 Then
        token = m.Groups(2).ToString
    End If
    Console.WriteLine(token)
    m = m.NextMatch
End While

Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scene a little better :)
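
Since the question is about C#, the same idea can be sketched there as well. This is only an illustration: the class name RegexTokenizer and the method name Tokenize are made up for the example, and the character class is written as [^"\s] so that only quotes and whitespace are excluded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class RegexTokenizer
{
    // Group 1: an unquoted word. Group 2: the text between a pair of quotes.
    private static readonly Regex Word =
        new Regex(@"([^""\s]+)\s*|""([^""]+)""\s*");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in Word.Matches(input))
        {
            // Exactly one of the two groups takes part in each match.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return tokens;
    }
}
```

For the string in the question this should give four tokens, with the quoted phrase returned without its quotes. Unterminated quotes and empty quoted strings ("") are silently skipped by this pattern, so it is a sketch for well-formed input rather than a validating parser.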

2024-07-12 04:07:24

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split space-delimited text. It handles strings within quotes (e.g., "this is one token" thisistokentwo) well.

Note that just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. It's part of the entire Framework.
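
A minimal sketch of calling it from C#, assuming the project references Microsoft.VisualBasic. The wrapper class FieldParserTokenizer is a made-up name; TextFieldParser, SetDelimiters, HasFieldsEnclosedInQuotes and ReadFields are the parser's own members.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical wrapper around TextFieldParser.
public static class FieldParserTokenizer
{
    public static string[] Tokenize(string input)
    {
        using (var parser = new TextFieldParser(new StringReader(input)))
        {
            parser.SetDelimiters(" ");               // split on single spaces
            parser.HasFieldsEnclosedInQuotes = true; // keep quoted phrases together
            // ReadFields parses one line and returns null when no input is left.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

The parser was designed for delimited files, so it is worth checking how it treats leading or repeated spaces in your own input before relying on it.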

好久不见√ 2024-07-12 04:07:24

There is the state machine approach.

    private enum State
    {
        None = 0,
        InTokin,
        InQuote
    }

    private static IEnumerable<string> Tokinize(string input)
    {
        input += ' '; // ensure we end on whitespace
        State state = State.None;
        State? next = null; // setting the next state implies that we have found a tokin
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            switch (state)
            {
                default:
                case State.None:
                    if (char.IsWhiteSpace(c))
                        continue;
                    else if (c == '"')
                    {
                        state = State.InQuote;
                        continue;
                    }
                    else
                        state = State.InTokin;
                    break;
                case State.InTokin:
                    if (char.IsWhiteSpace(c))
                        next = State.None;
                    else if (c == '"')
                        next = State.InQuote;
                    break;
                case State.InQuote:
                    if (c == '"')
                        next = State.None;
                    break;
            }
            if (next.HasValue)
            {
                yield return sb.ToString();
                sb = new StringBuilder();
                state = next.Value;
                next = null;
            }
            else
                sb.Append(c);
        }
    }

It can easily be extended for things like nested quotes and escaping. Returning as IEnumerable<string> allows your code to only parse as much as you need. There aren't any real downsides to that kind of lazy approach as strings are immutable so you know that input isn't going to change before you have parsed the whole thing.

See: http://en.wikipedia.org/wiki/Automata-Based_Programming
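
As a rough illustration of the "extend it for escaping" remark, here is one possible variant. Everything specific to it is an assumption rather than part of the answer above: the name EscapingTokenizer, the use of two flags instead of the enum, the convention that a doubled quote ("") inside a quoted phrase means one literal quote, and the choice to let a quote that touches other text join the same token.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical variant of the state machine with doubled-quote escaping.
public static class EscapingTokenizer
{
    public static IEnumerable<string> Tokenize(string input)
    {
        var sb = new StringBuilder();
        bool inQuote = false; // between an opening and a closing quote
        bool inToken = false; // a token has started (it may still be empty)

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (inQuote)
            {
                if (c != '"')
                {
                    sb.Append(c);
                }
                else if (i + 1 < input.Length && input[i + 1] == '"')
                {
                    sb.Append('"'); // "" inside quotes is one literal quote
                    i++;            // skip the second quote of the pair
                }
                else
                {
                    inQuote = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuote = true;
                inToken = true;
            }
            else if (char.IsWhiteSpace(c))
            {
                if (inToken)
                {
                    yield return sb.ToString();
                    sb.Clear();
                    inToken = false;
                }
            }
            else
            {
                sb.Append(c);
                inToken = true;
            }
        }

        // Flush the last token; an unterminated quote is returned as-is.
        if (inToken)
        {
            yield return sb.ToString();
        }
    }
}
```

Because the method is an iterator, a call such as EscapingTokenizer.Tokenize(input).First() scans only as far as the end of the first token, which is the lazy behaviour described above.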

一个人的夜不怕黑 2024-07-12 04:07:24

You also might want to look into regular expressions. That might help you out. Here is a sample ripped off from MSDN...

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.        
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n   {1}", 
                          matches.Count, 
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",  
                              groups["word"].Value, 
                              groups[0].Index, 
                              groups[1].Index);
        }

    }

}
// The example produces the following output to the console:
//       3 matches found in:
//          The the quick brown fox  fox jumped over the lazy dog dog.
//       'The' repeated at positions 0 and 4
//       'fox' repeated at positions 20 and 25
//       'dog' repeated at positions 50 and 54
爺獨霸怡葒院 2024-07-12 04:07:24

Craig is right — use regular expressions. Regex.Split may be more concise for your needs.
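
One way this might look with Regex.Split, as a sketch only: the separator is a run of spaces that is followed by an even number of quotes, which means the spaces are not inside a quoted phrase. The class name SplitTokenizer is hypothetical, and the input is assumed to have balanced quotes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper built on Regex.Split.
public static class SplitTokenizer
{
    // Spaces count as a separator only when an even number of quotes follows them.
    private static readonly Regex Separator =
        new Regex(@" +(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    public static string[] Tokenize(string input)
    {
        // Trim first so leading or trailing spaces do not produce empty tokens.
        return Separator.Split(input.Trim());
    }
}
```

Unlike the capture-group approach, the quoted phrase keeps its surrounding quotes here, and the lookahead rescans the rest of the string for every run of spaces, so this is concise rather than fast.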

茶色山野 2024-07-12 04:07:24

[^\t]+\t|"[^"]+"\t

Using a regex definitely looks like the best bet; however, this one just returns the whole string. I'm trying to tweak it, but I haven't had much luck so far.

string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, @"[^\t]+\t|""[^""]+""\t");
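
A likely explanation, sketched below under the assumption that the input is separated by spaces rather than tabs: Regex.Split treats the pattern as the separator and returns the text between matches, so a pattern that never matches yields the whole input as a single element. Regex.Matches returns the matched text itself, which is usually what a tokenizer wants. The class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of Regex.Split versus Regex.Matches.
public static class SplitVersusMatches
{
    public const string TabPattern = @"[^\t]+\t|""[^""]+""\t";

    // With no tab in the input the pattern never matches, so Split
    // returns a one-element array holding the whole input.
    public static string[] SplitOnTabPattern(string input)
    {
        return Regex.Split(input, TabPattern);
    }

    // Matches returns the tokens themselves. The pattern here excludes
    // whitespace instead of tabs and leaves the separator out of the match.
    public static string[] MatchTokens(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"""[^""]+""|[^""\s]+");
        var tokens = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            tokens[i] = matches[i].Value;
        }
        return tokens;
    }
}
```

In this sketch the quoted token keeps its quotes; call Trim('"') on each value if they are not wanted.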