Google-like search query tokenization and string splitting

Posted on 2024-08-14 11:18:36

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:

the quick "brown fox" jumps over the "lazy dog"

I would like to have a string array with the following tokens:

the
quick
brown fox
jumps
over
the
lazy dog

As you can see, the tokens preserve the spaces within double quotes.

I'm looking for some examples of how I could do this in C#, preferably without regular expressions; however, if a regex makes the most sense and is the most performant, then so be it.

Also, I would like to know how I could extend this to handle other special characters, for example putting a - in front of a term to force its exclusion from the search query, and so on.

Comments (4)

謌踐踏愛綪 2024-08-21 11:18:36

So far, this looks like a good candidate for a regular expression. If the requirements get significantly more complicated, a more elaborate tokenizing scheme may be necessary, but you should avoid that route unless you need it, as it is significantly more work. (On the other hand, for complex grammars a regex quickly turns into a dog and should likewise be avoided.)

This regex should solve your problem:

("[^"]+"|\w+)\s*

Here is a C# example of its usage:

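// Requires: using System.Text.RegularExpressions;
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] is the captured token (quoted phrases keep their quotes).
    // Groups[0] is the whole match and would also include the trailing whitespace.
    string token = m.Groups[1].Value;
}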

The real benefit of this method is that it can be easily extended to include your "-" requirement, like so:

string data = "the quick \"brown fox\" jumps over " +
              "the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";

MatchCollection mc = Regex.Matches(data, pattern);
foreach (Match m in mc)
{
    // Groups[1] holds the token, including any leading '-' and the quotes.
    string token = m.Groups[1].Value;
}

Now, I hate reading regexes as much as the next guy, but if you split it up, this one is quite easy to read:

(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*

Explanation

  1. If possible match a minus sign, followed by a " followed by everything until the next "
  2. Otherwise match a " followed by everything until the next "
  3. Otherwise match a - followed by any word characters
  4. Otherwise match as many word characters as you can
  5. Put the result in a group
  6. Swallow up any following space characters
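The steps above can be wrapped in a small helper. This is only a sketch under a few assumptions: the names QueryTokenizer and Tokenize are made up for illustration, the pattern is a slight variation of the one above (named groups, so the quotes and the minus sign are stripped from the token text), and each term is returned as a tuple of its text and an "excluded" flag.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class QueryTokenizer
{
    // An optional leading '-', then either a quoted phrase or a run of word
    // characters. .NET allows the same group name in both alternatives;
    // only the alternative that matched captures anything.
    private static readonly Regex TermPattern = new Regex(
        @"(?<minus>-)?(?:""(?<text>[^""]+)""|(?<text>\w+))");

    public static List<(string Text, bool Excluded)> Tokenize(string query)
    {
        var terms = new List<(string Text, bool Excluded)>();
        foreach (Match m in TermPattern.Matches(query))
        {
            // "text" holds the term without quotes; "minus" succeeded only
            // if the term was prefixed with '-'.
            terms.Add((m.Groups["text"].Value, m.Groups["minus"].Success));
        }
        return terms;
    }
}
```

Calling QueryTokenizer.Tokenize on the sample query yields "lazy cat" and "energetic" with Excluded set to true. One caveat that this sketch shares with the pattern above: a hyphen inside a word, as in well-known, is read as the start of an excluded term, so a stricter pattern may be needed if such input is expected.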

初见 2024-08-21 11:18:36

I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser, which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure, it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET Framework.

To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.

Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
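For reference, here is a minimal sketch of that approach. It makes two assumptions that are not in the answer: the wrapper name TextFieldParserTokenizer is invented for illustration, and the string is passed through a StringReader (TextFieldParser also accepts a TextReader) instead of an ASCII-encoded MemoryStream, which avoids the byte-encoding step.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical wrapper around TextFieldParser, for illustration only.
public static class TextFieldParserTokenizer
{
    public static string[] Tokenize(string query)
    {
        // StringReader hands the text to the parser without converting it to bytes.
        using (var parser = new TextFieldParser(new StringReader(query)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");               // split fields on a single space
            parser.HasFieldsEnclosedInQuotes = true; // keep "brown fox" as one field
            // ReadFields returns null when there is nothing to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

Because the parser was designed for delimited files, it is likely to be stricter than a search box needs: unbalanced quotes raise a MalformedLineException, and runs of several spaces may come back as empty fields that have to be filtered out.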

梦里的微风 2024-08-21 11:18:36

Walk through the string character by character, like this (sort of pseudocode):

array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word

// Rest
if not empty word:
    append word to words
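A possible C# rendering of the pseudocode above, as a sketch rather than a definitive implementation. The names CharTokenizer, Tokenize and Flush are invented here, and the code deviates from the pseudocode in two small ways: it splits on any whitespace character instead of only ' ', and it skips an empty quoted phrase ("") instead of adding an empty token.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical C# translation of the pseudocode, for illustration only.
public static class CharTokenizer
{
    public static List<string> Tokenize(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (inQuotes)
            {
                if (c == '"')
                {
                    // Closing quote: the phrase is complete.
                    Flush(words, word);
                    inQuotes = false;
                }
                else
                {
                    word.Append(c); // spaces are kept inside quotes
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
            }
            else if (char.IsWhiteSpace(c))
            {
                Flush(words, word); // whitespace ends the current word
            }
            else
            {
                word.Append(c);
            }
        }

        // Whatever is left at the end, including an unterminated phrase.
        Flush(words, word);
        return words;
    }

    // Adds the pending word, if there is one, and resets the buffer.
    private static void Flush(List<string> words, StringBuilder word)
    {
        if (word.Length > 0)
        {
            words.Add(word.ToString());
            word.Clear();
        }
    }
}
```

With this loop a leading - simply stays attached to the token, so -energetic comes back as "-energetic" and -"lazy cat" as "-lazy cat". A caller could detect exclusion by checking whether a token starts with '-'.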

小兔几 2024-08-21 11:18:36

I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's answer. I thought I would share it here even though the question asks for C#. Hope that's okay.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> convertQueryToWords(String q) {
    List<String> words = new ArrayList<>();
    // A quoted phrase or a run of word characters, followed by optional whitespace.
    Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
    Matcher matcher = pattern.matcher(q);
    while (matcher.find()) {
        // group(1) is the token without the trailing whitespace.
        String token = matcher.group(1);
        // Strip the surrounding quotes from a quoted phrase.
        words.add(token.replace("\"", "").trim());
    }
    return words;
}