Full-text tokenizer

Posted on 2024-08-28 03:11:37

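This should be an ideal case of not reinventing the wheel, but so far my search has been in vain.

Instead of writing a tokenizer myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text search. Performance is very important: I will be parsing many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (as far as I know) and requires some heuristics. This has been done a thousand times before, probably in a thousand different ways, but I can't find even one of them :)

Any good pointers?

Thanks!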

美人如玉 2024-09-04 03:11:37

The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::deque<std::string> word_list;
   strtk::for_each_line("data.txt",
                        [&word_list](const std::string& line)
                        {
                           const std::string delimiters = "\t\r\n ,,.;:'\""
                                                          "!@#$%^&*_-=+`~/\\"
                                                          "()[]{}<>";
                           strtk::parse(line,delimiters,word_list);
                        });

   std::cout << strtk::join(" ",word_list) << std::endl;

   return 0;
}

More examples can be found here: http://www.codeproject.com/KB/recipes/Tokenizer.aspx

时光倒影 2024-09-04 03:11:37

If performance is the main concern, you should probably stick with good old strtok, which is sure to be fast:

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}
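
One caveat with strtok is that it writes NUL bytes into the buffer and keeps hidden static state, so it cannot tokenize a const string or be used from several threads at once. Below is a minimal sketch of a non-destructive alternative, assuming C++17 and the same delimiter set as the example above. The name split_view is made up for illustration, and whether it is as fast as strtok on your data is something to measure rather than assume.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split `text` on any character in `delims`,
// skipping empty tokens (the same behaviour strtok has).
// The returned views point into `text`, so `text` must outlive them.
std::vector<std::string_view> split_view(std::string_view text,
                                         std::string_view delims)
{
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Skip any run of delimiters before the next token.
        std::size_t begin = text.find_first_not_of(delims, pos);
        if (begin == std::string_view::npos) break;   // only delimiters left

        // The token ends at the next delimiter or at the end of the input.
        std::size_t end = text.find_first_of(delims, begin);
        if (end == std::string_view::npos) end = text.size();

        tokens.push_back(text.substr(begin, end - begin));
        pos = end;
    }
    return tokens;
}
```

Calling split_view("- This, a sample string.", " ,.-") yields the four tokens This, a, sample and string without copying or modifying the input.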

风轻花落早 2024-09-04 03:11:37

If your tokens aren't too hard to parse, a regular expression library will probably work well.
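
A minimal sketch of the regular-expression approach using the standard <regex> header. The token pattern (a run of ASCII letters, digits or apostrophes) and the function name regex_tokens are assumptions chosen for illustration, not a recommendation for a real index. With gigabytes of input it would be worth benchmarking this against the other options.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every match of the token pattern in `text`.
std::vector<std::string> regex_tokens(const std::string& text)
{
    // Assumed token definition: one or more ASCII letters, digits or apostrophes.
    // Compiled once and reused, since constructing a std::regex is costly.
    static const std::regex word_re("[A-Za-z0-9']+");

    std::vector<std::string> tokens;
    // sregex_iterator walks over all non-overlapping matches;
    // a default-constructed iterator marks the end of the sequence.
    for (std::sregex_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For the input "This is,  a test" this returns This, is, a and test.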

怎樣才叫好 2024-09-04 03:11:37

I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.

There's also the ICU tokenizer, which handles Unicode.

仙气飘飘 2024-09-04 03:11:37

I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.

Here's an ultra-trivial example of it tokenizing a sentence into words:

#include <sstream>
#include <iostream>
#include <string>

int main()
{
   std::stringstream sentence("This is a sentence with a bunch of words");
   std::string word;
   // Test the extraction itself, so a failed read at the end of the
   // stream does not produce an extra, empty token.
   while (sentence >> word)
   {
      std::cout << "Got token: " << word << std::endl;
   }
}

janks@phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words

The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.

The beauty of them is that they are also a std::istream and a std::ostream respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.

Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings. It might be faster for you, though note that it has been deprecated since C++98. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed simply by writing good C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though: your CPU can waste thousands of cycles in the time it takes to read a block of data from a hard disk or network.
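
Whitespace splitting alone leaves punctuation attached to words ("words." and "words" become different tokens), which matters for a search index. Here is a minimal sketch of one possible normalization step on top of std::istringstream, assuming ASCII text. The rule itself (strip leading and trailing non-alphanumerics, then lowercase) and the name normalized_tokens are assumptions for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: whitespace-split `text`, trim non-alphanumeric
// characters from both ends of each word, and lowercase what remains.
std::vector<std::string> normalized_tokens(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) {
        std::size_t first = 0;
        std::size_t last = word.size();
        // The casts to unsigned char avoid undefined behaviour in the
        // <cctype> functions for negative char values.
        while (first < last &&
               !std::isalnum(static_cast<unsigned char>(word[first]))) ++first;
        while (last > first &&
               !std::isalnum(static_cast<unsigned char>(word[last - 1]))) --last;
        if (first == last) continue;   // nothing but punctuation

        std::string token = word.substr(first, last - first);
        for (char& c : token)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        tokens.push_back(token);
    }
    return tokens;
}
```

With this rule, "Hello, World! -- it's" becomes hello, world and it's; punctuation inside a word is kept, and a word made only of punctuation is dropped.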

娇纵 2024-09-04 03:11:37

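You can use the Ragel State Machine Compiler to create a tokenizer (or lexer).

The generated code has no external dependencies.

I suggest looking at the clang.rl example for relevant syntax and usage.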
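
To give a feel for the state-machine style of tokenizer, here is a hand-written sketch. It is not Ragel output and uses no Ragel syntax; it only illustrates a single-pass scanner with two states, under the assumption that a token is a run of ASCII alphanumerics. The names ScanState and scan_tokens are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// The two states of the scanner: between tokens, or inside one.
enum class ScanState { Between, InToken };

// Hypothetical helper: scan `text` once and collect alphanumeric runs.
std::vector<std::string> scan_tokens(const std::string& text)
{
    std::vector<std::string> tokens;
    ScanState state = ScanState::Between;
    std::size_t start = 0;

    // Iterate one position past the end; the end of input is treated as
    // a non-token character so that a trailing token is flushed.
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool is_word =
            i < text.size() &&
            std::isalnum(static_cast<unsigned char>(text[i]));

        switch (state) {
        case ScanState::Between:
            if (is_word) {            // a token starts here
                start = i;
                state = ScanState::InToken;
            }
            break;
        case ScanState::InToken:
            if (!is_word) {           // the token ended just before i
                tokens.push_back(text.substr(start, i - start));
                state = ScanState::Between;
            }
            break;
        }
    }
    return tokens;
}
```

For "foo(bar, 42);" this produces foo, bar and 42. A generator such as Ragel becomes more attractive once the token rules grow beyond what is comfortable to maintain by hand.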

空宴 2024-09-04 03:11:37

Well, I would begin by searching Boost and... hop: Boost.Tokenizer

The nice thing? By default it breaks on whitespace and punctuation because it's meant for text, so you won't forget a symbol.

From the introduction:

#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is,  a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

// prints
This
is
a
test

// note how the ',' and ' ' were nicely removed

And there are additional features: