C++ Search Performance

Published 2024-11-08 23:49:55


What I have is two text files. One contains a list of roughly 70,000 names (~1.5MB). The other contains text which will be obtained from miscellaneous sources. That is, this file's contents will change each time the program is executed (~0.5MB). Essentially, I want to be able to paste some text into a text file and see which names from my list are found. Kind of like the find function (Ctrl+F), but with 70,000 keywords.

In any case, what I have thus far is:

#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main()
{
     ifstream namesfile("names.txt");   //names list
     ifstream miscfile("misc.txt");     //misc text
     vector<string> vecnames;           //vector to hold names
     vector<string> vecmisc;            //vector to hold misc text
     size_t found;

     string s;
     string t;

     while (getline(namesfile,s))
         vecnames.push_back(s);

     while (getline(miscfile,t))
         vecmisc.push_back(t);

     //outer loop iterates through the names list
     for (vector<string>::size_type i = 0; i != vecnames.size(); ++i) {
         //inner loop iterates through the lines of the misc text file
         for (vector<string>::size_type j = 0; j != vecmisc.size(); ++j) {
             found = vecmisc[j].find(vecnames[i]);
             if (found != string::npos) {
                 cout << vecnames[i] << endl;
                 break;
             }
         }
     }

     cout << "SEARCH COMPLETE";

     //to keep the console application from exiting
     getchar();

     return 0;
 }

Now this works great as far as extracting the data I need; however, it is terribly slow and obviously inefficient, since each name requires that I potentially search the entire file again, which gives (70,000 × the number of lines in the misc text file) iterations. If anyone could help, I would certainly appreciate it. Some sample code is most welcomed. Additionally, I'm using Dev C++ if that makes any difference. Thanks.


无声静候 2024-11-15 23:49:55


Use a hash set such as std::unordered_set (std::hash_set was a pre-standard extension). Insert all your keywords into the set, then traverse the large document and, each time you come to a word, test whether the set contains that word.

全部不再 2024-11-15 23:49:55


Using a vector, the best-case search time you're going to get is O(log N) complexity using a binary search algorithm, and that only works for a sorted list. If you include the time it takes to make sorted insertions into a list, the final amortized complexity for a sorted linear container (arrays, lists), as well as for non-linear containers such as binary-search trees, is O(N log N). That basically means that if you add more elements to the list, the time it takes both to add those elements and to find them later on increases at a rate a little faster than the list's linear growth rate (i.e., if you double the size of the list, it will take a little over double the time to sort it, and then any search on the list will still be quite fast ... to double the search time, the list would have to grow to the square of its existing number of elements).

A good hash-table implementation, on the other hand (such as std::unordered_map), along with a good hash algorithm that avoids too many collisions, has an amortized complexity of O(1) ... that means there's a constant look-up time for any given element, no matter how many elements there are, making searches very fast. The main penalty of a hash table over a linear list or binary-search tree is its actual memory footprint. To avoid too many collisions, a good hash table will want a size equal to some large prime number at least greater than 2*N, where N is the total number of elements you plan to store in the table. But the "wasted space" is the trade-off for efficient and extremely fast look-ups.

睫毛上残留的泪 2024-11-15 23:49:55


While a map of any kind is the simplest solution, Scott Meyers makes a good case for a sorted vector and binary_search from <algorithm> (in Effective STL).

Using a sorted vector, your code would look something like

#include <algorithm>

...

int vecsize = vecnames.size();
sort(vecnames.begin(), vecnames.begin() + vecsize);
for (vector<string>::size_type j = 0; j != vecmisc.size(); ++j)
{
  bool found = binary_search(vecnames.begin(), vecnames.begin() + vecsize, vecmisc[j]);
  if (found) std::cout << vecmisc[j] << std::endl;
}

The advantages of using a sorted vector and binary_search are

1) There is no tree to traverse; binary_search begins at (end-start)/2 and keeps dividing the range by 2, so it takes at most log(n) steps to search the range.

2) There is no key/value pair. You get the simplicity of a vector without the overhead of a map.

3) The vector's elements live in a contiguous range (which is why you should call reserve before populating the vector; inserts are faster), so searching through the vector's elements rarely crosses page boundaries (slightly faster).

4) It's cool.
