Full-text tokenizer
This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.
Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important: I will parse many gigabytes of text.
Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (afaik) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)
Any good pointers?
Thanks!
Comments (7)
The C++ String Toolkit Library (StrTk) has a solution to your problem.
More examples can be found here: http://www.codeproject.com/KB/recipes/Tokenizer.aspx
If performance is the main issue, you should probably stick with good old strtok, which is sure to be fast.
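As a rough illustration of the strtok approach, here is a minimal sketch. The function name `tokenize_strtok` and the delimiter set are assumptions made for this example, not part of the original answer. Because strtok writes into the buffer it scans, the sketch works on its own copy of the text.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on the characters in `delims`.
// `text` is taken by value because std::strtok overwrites delimiters
// with '\0' in the buffer it is given.
std::vector<std::string> tokenize_strtok(
    std::string text, const char* delims = " \t\r\n.,;:!?\"()") {
    std::vector<std::string> tokens;
    for (char* tok = std::strtok(&text[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
    }
    return tokens;
}
```

Keep in mind that strtok keeps hidden static state, so it is not safe to call from several threads at once or to nest one tokenizing loop inside another.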
A regular expression library might work well if your tokens aren't too difficult to parse.
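A minimal sketch of the regular-expression idea, using the standard `<regex>` header as a stand-in for whichever library you choose. The function name `tokenize_regex` and the pattern (runs of ASCII letters and digits) are assumptions for illustration; a real search index would likely need a richer pattern, and std::regex may be too slow for many gigabytes of input, so measure before committing to it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every run of ASCII letters/digits in `text`.
std::vector<std::string> tokenize_regex(const std::string& text) {
    // Compiled once; building a std::regex is comparatively expensive.
    static const std::regex word_re("[A-Za-z0-9]+");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```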
I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.
There's also the ICU tokenizer, which handles Unicode.
I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.
Here's an ultra-trivial example of it tokenizing a sentence into words:
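The snippet from the original answer is not preserved, so what follows is only a minimal sketch of what such an example might look like. The function name `split_words` is an assumption. It relies on the stream extraction operator, which skips whitespace and reads one word at a time.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `sentence` on whitespace using operator>>.
std::vector<std::string> split_words(const std::string& sentence) {
    std::istringstream in(sentence);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {  // extraction fails at end of input, ending the loop
        words.push_back(word);
    }
    return words;
}
```

Note that this splits on whitespace only, so punctuation stays attached to the neighbouring word; for a search index you would probably strip or split on punctuation as well.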
The std::stringstream class is "bi-directional", in that it supports both input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream. The beauty of them is that they are also a std::istream and a std::ostream respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings (note that it has been deprecated since C++98). It might be faster for you. But anyway, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed simply by writing good C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, then you can look elsewhere. These classes are probably fast enough, though. Your CPU can waste thousands of cycles in the time it takes to read a block of data from a hard disk or network.
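For the "get something working, then benchmark it" advice, a timing harness can be as small as the sketch below. The names `BenchResult` and `time_stream_tokenizer` are assumptions for this example; it times a whitespace tokenizer built on std::istringstream with std::chrono::steady_clock and reports how many tokens it saw.

```cpp
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical result type: token count plus elapsed wall-clock seconds.
struct BenchResult {
    std::size_t tokens;
    double seconds;
};

// Hypothetical harness: tokenizes `text` on whitespace and times the loop.
BenchResult time_stream_tokenizer(const std::string& text) {
    const auto start = std::chrono::steady_clock::now();
    std::istringstream in(text);
    std::string word;
    std::size_t count = 0;
    while (in >> word) {
        ++count;
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {count, elapsed.count()};
}
```

Running it over a representative sample of your real data, with optimizations enabled, gives a baseline against which any alternative tokenizer can be compared.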
You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).
The generated code has no external dependencies.
I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.
Well, I would begin by searching Boost and... hop: Boost.Tokenizer
The nice thing? By default it breaks on white space and punctuation because it's meant for text, so you won't forget a symbol.
From the introduction:
And there are additional features: iterators, so you can use it with an istream directly (and thus with an ifstream), and a few options (like keeping empty tokens, etc.).
Check it out!