C++: need to compare one string against 200,000 words
In my C++ program, the user types in a string, for example "foo".
I need to compare this string against the words stored in my TXT files and report its word class: "this string is a noun!" (or an adjective, and so on).
I have a few TXT files: one with nouns, a second with adjectives, and so on. Each file contains about 200,000 words.
How can I efficiently compare the string "foo" against the strings in my files?
What should I use?
Comments (6)
Use a trie data structure for this. Building it will cost some memory, but lookups are fast: checking a word takes time proportional to the length of the word, not to the number of words stored.
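As a rough illustration of the trie idea, here is a minimal sketch. It assumes words consist only of lowercase ASCII letters 'a' to 'z', which is an assumption not stated in the question; the class name Trie and its methods are made up for this example.

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal trie over lowercase ASCII letters 'a'..'z' (an assumption;
// real word lists may need a wider alphabet or a map per node).
class Trie {
public:
    // Inserts a word. Returns false (and stores nothing) if the word
    // contains a character outside 'a'..'z'.
    bool insert(const std::string& word) {
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;
        }
        Node* node = &root_;
        for (char c : word) {
            auto& child = node->children[c - 'a'];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->is_word = true;
        return true;
    }

    // Returns true only if exactly this word was inserted before.
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;
            node = node->children[c - 'a'].get();
            if (!node) return false;
        }
        return node->is_word;
    }

private:
    struct Node {
        std::array<std::unique_ptr<Node>, 26> children{};
        bool is_word = false;  // true if a stored word ends at this node
    };
    Node root_;
};
```

One possible way to use it is to build one Trie per file (one for nouns, one for adjectives) and call contains() on each with the user's input.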
Put your words in std::set<std::string> containers and do a lookup on them. This gives O(log n) time for an access, which is probably sufficient for what you are doing.
You can also use std::map<std::string, std::string>, where the key is the word and the value is the class (e.g. "noun").
I would recommend using SQLite for your files instead.
You could compute a CRC of each key and store the keys and values (int) in a table. Create an index on the key field.
When you want to do a lookup, compute the CRC of the word and look it up in the table.
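To make the CRC-keyed lookup concrete without depending on SQLite, the sketch below models the indexed table in memory with std::unordered_map. This is only a stand-in for the database table, not the SQLite approach itself. The names crc32_of, Row and CrcIndex are hypothetical, and the choice of CRC-32 (the variant used by zlib and PNG) is an assumption, since the answer does not say which CRC to use. Because different words can share a CRC, the sketch also stores the word and compares it on lookup.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
std::uint32_t crc32_of(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // XOR in the polynomial only when the low bit is set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// One table row: the word itself plus an integer class id.
struct Row {
    std::string word;
    int word_class;
};

// In-memory stand-in for a table indexed by the CRC of the word.
class CrcIndex {
public:
    void add(const std::string& word, int word_class) {
        rows_[crc32_of(word)].push_back(Row{word, word_class});
    }

    // Returns the class id, or -1 if the word is not stored.
    int find(const std::string& word) const {
        auto it = rows_.find(crc32_of(word));
        if (it == rows_.end()) return -1;
        for (const Row& row : it->second) {
            if (row.word == word) return row.word_class;  // guard against CRC collisions
        }
        return -1;
    }

private:
    std::unordered_map<std::uint32_t, std::vector<Row>> rows_;
};
```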
A radix tree will use less memory for strings than a 'regular' trie if you have a lot of strings with common roots or prefixes. That is probably the case for a dictionary, where words come in many forms, although it would depend on the language.
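A minimal radix tree sketch follows, to show where the memory saving comes from: each edge carries a whole substring instead of a single character, and an edge is split only when two words diverge in the middle of it. The class RadixTree and its members are illustrative names, and node_count() exists only to make the compression visible.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

class RadixTree {
    struct Node {
        std::string label;     // substring on the edge leading into this node
        bool is_word = false;  // true if a stored word ends here
        std::map<char, std::unique_ptr<Node>> children;  // keyed by first char of child's label
    };

    // Length of the common prefix of label and word[pos..].
    static std::size_t common_prefix(const std::string& label,
                                     const std::string& word, std::size_t pos) {
        std::size_t n = 0;
        while (n < label.size() && pos + n < word.size() && label[n] == word[pos + n]) ++n;
        return n;
    }

    static std::size_t count(const Node& node) {
        std::size_t total = 1;
        for (const auto& entry : node.children) total += count(*entry.second);
        return total;
    }

    Node root_;

public:
    void insert(const std::string& word) {
        Node* node = &root_;
        std::size_t pos = 0;
        while (pos < word.size()) {
            auto it = node->children.find(word[pos]);
            if (it == node->children.end()) {
                // No edge starts with this character: hang the rest as one leaf.
                auto leaf = std::make_unique<Node>();
                leaf->label = word.substr(pos);
                leaf->is_word = true;
                node->children.emplace(word[pos], std::move(leaf));
                return;
            }
            Node* child = it->second.get();
            const std::size_t common = common_prefix(child->label, word, pos);
            if (common < child->label.size()) {
                // The word diverges inside the edge: split it at `common`.
                auto mid = std::make_unique<Node>();
                mid->label = child->label.substr(0, common);
                child->label.erase(0, common);
                mid->children.emplace(child->label[0], std::move(it->second));
                it->second = std::move(mid);
                child = it->second.get();
            }
            node = child;
            pos += common;
        }
        node->is_word = true;
    }

    bool contains(const std::string& word) const {
        const Node* node = &root_;
        std::size_t pos = 0;
        while (pos < word.size()) {
            auto it = node->children.find(word[pos]);
            if (it == node->children.end()) return false;
            const Node* child = it->second.get();
            // The whole edge label must match the next part of the word.
            if (word.compare(pos, child->label.size(), child->label) != 0) return false;
            pos += child->label.size();
            node = child;
        }
        return node->is_word;
    }

    // Number of nodes including the root, for comparing against a plain trie.
    std::size_t node_count() const { return count(root_); }
};
```

Whether this actually saves memory over a plain trie in practice depends on the word list and on the per-node overhead of the containers used, so it is worth measuring on the real files.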
Do you just need to confirm whether it matches anything?
If so, use a trie.
You can store the external file indexed as a B-tree or as a chained hash table. Either would provide really fast lookup times and a minimum number of seeks to locate the data.