Searching for a word in a txt file that contains 100,000,000 words
I have a txt file with 100,000,000 words, one per line.
I want to write a function that takes a word as input and searches whether that word is present in the txt file or not.
I have tried this with a map and with a trie, but I'm getting a std::bad_alloc error because of the large number of words. Can anyone suggest how to solve the issue?
2 Answers
Data structures are quite important when programming. If possible I would recommend that you use something like a binary tree. This would require sorting the text file, though. If you cannot sort the text file, the best way would be to just iterate over the text file until you get the word that you wanted. Also, your question should contain more information to allow us to diagnose your problem more easily.
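A minimal sketch of that linear-scan fallback, assuming one word per line as the question describes; the file name and the word searched for are placeholders:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Stream the file one line at a time, so memory use stays constant
    // no matter how many words the file holds.
    bool contains_word(const std::string& path, const std::string& target) {
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            if (line == target) return true;
        }
        return false;
    }

    int main() {
        std::cout << std::boolalpha
                  << contains_word("words.txt", "zebra") << '\n';
    }

This is O(n) per lookup, which is only acceptable if you search rarely; the answer below describes how to do much better for repeated lookups.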
I assume you want to search this word list over and over, because for a small number of searches you can just search linearly through the file.
Parsing the word list into a suffix tree takes about 20 times the size of the file, more if not optimized. Since you ran out of memory constructing a trie of the word list, I assume it's really big. So let's not keep it in memory, but process it a bit so you can search faster.
The solution I would propose is to do a dictionary search.
So first turn every whitespace into a newline so you have one word per line instead of multiple lines with multiple words and then sort the file and store it. While you are at it you can remove duplicates. That is our dictionary. Remember the length of the longest word (L) while you do that.
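A sketch of the normalization step under those assumptions: it splits the input on whitespace so the output has one word per line and records the longest word length L. The file names are placeholders, and the sorting and deduplication would have to be done separately, since the data does not fit in memory (for example with an external sort such as GNU sort -u):

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        std::ifstream in("words.txt");                 // original file (placeholder name)
        std::ofstream out("words_one_per_line.txt");   // input for the external sort
        std::string word;
        std::size_t longest = 0;                       // this becomes L
        while (in >> word) {                           // operator>> splits on any whitespace
            out << word << '\n';
            longest = std::max(longest, word.size());
        }
        std::cout << "L = " << longest << '\n';
    }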
To access the dictionary you need a helper function to read a word at offset X, which can be in the middle of some word. The function should seek to offset X - L and read 2 * L bytes into a buffer, then search backward and forward from the middle of the buffer to find the word at offset X.
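A sketch of that helper, assuming the sorted, newline-delimited dictionary built above and C++17 or later; the name word_at and the two extra padding bytes are my own choices, not part of the answer:

    #include <algorithm>
    #include <fstream>
    #include <string>

    // Read the whole word that covers byte offset x in the dictionary file.
    // L is the length of the longest word, so a window of 2*L bytes around x
    // (plus a little padding for the delimiters) always contains the full word.
    std::string word_at(std::ifstream& dict, std::streamoff x, std::streamoff L) {
        std::streamoff start = std::max<std::streamoff>(0, x - L);
        dict.clear();                                  // reset EOF flags from earlier reads
        dict.seekg(start);

        std::string buf(static_cast<std::size_t>(2 * L + 2), '\0');
        dict.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        buf.resize(static_cast<std::size_t>(dict.gcount()));

        // Position of x inside the buffer; if it lands on a newline (or past the
        // end of the file), step back so we return the word ending there.
        std::size_t mid = static_cast<std::size_t>(x - start);
        while (mid > 0 && (mid >= buf.size() || buf[mid] == '\n')) --mid;

        // Walk back to the previous newline and forward to the next one.
        std::size_t begin = 0;
        if (mid > 0) {
            std::size_t prev_nl = buf.rfind('\n', mid - 1);
            if (prev_nl != std::string::npos) begin = prev_nl + 1;
        }
        std::size_t end = buf.find('\n', mid);
        if (end == std::string::npos) end = buf.size();

        return buf.substr(begin, end - begin);
    }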
Now to search you open the dictionary and read the words at offset left = 0 and offset right = size_of_file, i.e. the first and the last word. If your search term is less than the first word or greater than the last word, you are done: the word is not there. If you found the search term, you are done too. Next, in a binary search you would take the std::midpoint of left and right, read the word at that offset, check whether the search term is less or greater, and recurse into that interval. This requires O(log n) reads to find the word or to determine that it is not present.

A dictionary search can do better. Instead of using the midpoint you can approximate where the word should be in the dictionary. Say your dictionary goes from "Aal" to "Zoo" and you are searching for "Zebra". Would you open the dictionary in the middle? No, you would open it near the end, because "Zebra" is much closer to "Zoo" than to "Aal". So you need a function that gives you a value M between 0 and 1 for where the search term is located relative to the left and right word. Your "midpoint" for the search is then left + (right - left) * M. Then, like with binary search, determine whether the search term is in the left or right interval and recurse. A dictionary search takes only O(log log n) reads on average if the word list has a reasonably uniform distribution.
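A sketch of the search itself, reusing the word_at helper from above. The names dictionary_search and estimate are mine, and estimating M from the first four bytes of each word is just one possible choice; fixing M at 0.5 would turn this back into the plain binary search described first:

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <string>

    std::string word_at(std::ifstream& dict, std::streamoff x, std::streamoff L); // from the sketch above

    // Map a word to a value in [0, 1] using its first four bytes, so that the
    // relative position M = (t - lo) / (hi - lo) can be estimated.
    double estimate(const std::string& w) {
        double value = 0.0, scale = 1.0;
        for (std::size_t i = 0; i < 4; ++i) {
            scale /= 256.0;
            value += (i < w.size() ? static_cast<unsigned char>(w[i]) : 0) * scale;
        }
        return value;
    }

    bool dictionary_search(const std::string& path, const std::string& target, std::streamoff L) {
        std::ifstream dict(path, std::ios::binary);
        dict.seekg(0, std::ios::end);
        std::streamoff left = 0, right = static_cast<std::streamoff>(dict.tellg()) - 1;

        std::string low  = word_at(dict, left, L);   // first word
        std::string high = word_at(dict, right, L);  // last word
        if (target < low || target > high) return false;
        if (target == low || target == high) return true;

        while (right - left > 1) {
            // Estimate where the target sits between the current endpoints;
            // clamping keeps the probe strictly inside the interval so it always shrinks.
            double lo = estimate(low), hi = estimate(high), t = estimate(target);
            double m = (hi > lo) ? (t - lo) / (hi - lo) : 0.5;
            std::streamoff mid = left + static_cast<std::streamoff>((right - left) * m);
            mid = std::clamp(mid, left + 1, right - 1);

            std::string word = word_at(dict, mid, L);
            if (word == target) return true;
            if (word < target) { left = mid;  low  = word; }
            else               { right = mid; high = word; }
        }
        return false;   // both endpoints were checked and nothing can lie between them
    }

    int main() {
        // "dict.txt" and L = 32 are placeholders: the sorted dictionary built
        // earlier and the longest word length recorded while building it.
        std::cout << std::boolalpha << dictionary_search("dict.txt", "zebra", 32) << '\n';
    }

Each probe is one seek plus a read of roughly 2 * L bytes, so even the plain binary-search variant needs only a few dozen reads for a file of this size.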