Counting unique words in a file? Good linear search alternative?
I'm using a naive approach to this problem: I put the words in a linked list and do a linear search through it. But it takes too much time on large files.
I was thinking of using a binary search tree, but I don't know if it works well with strings. I've also heard of skip lists, but haven't really learned them yet.
And I also have to use the C language...
Comments (7)
You can put all of the words into a trie and then count the number of words after you have processed the whole file.
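A minimal sketch of the trie idea, assuming words are reduced to the letters a-z (case folded, other characters skipped) and that the character set is ASCII. The names `trie_node`, `trie_insert`, `trie_count` and `trie_free` are made up for this illustration.

```c
#include <stdlib.h>

#define ALPHABET 26   /* assumption: only the letters a-z are stored */

/* One trie node: one child per letter, plus a flag marking the end of a word. */
struct trie_node {
    struct trie_node *child[ALPHABET];
    int is_word;
};

/* Insert a word. Returns 1 if it was new, 0 if it was already present
 * (or contained no letters), -1 if memory ran out. */
int trie_insert(struct trie_node *root, const char *word)
{
    struct trie_node *node = root;

    for (; *word != '\0'; word++) {
        int c = (unsigned char)*word;
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';              /* fold to lower case */
        if (c < 'a' || c > 'z')
            continue;                    /* skip anything that is not a letter */
        if (node->child[c - 'a'] == NULL) {
            node->child[c - 'a'] = calloc(1, sizeof *node);
            if (node->child[c - 'a'] == NULL)
                return -1;
        }
        node = node->child[c - 'a'];
    }
    if (node == root || node->is_word)
        return 0;
    node->is_word = 1;
    return 1;
}

/* Count the stored words by walking the whole trie. */
size_t trie_count(const struct trie_node *node)
{
    size_t n;

    if (node == NULL)
        return 0;
    n = node->is_word ? 1 : 0;
    for (int i = 0; i < ALPHABET; i++)
        n += trie_count(node->child[i]);
    return n;
}

/* Free a node and everything below it. */
void trie_free(struct trie_node *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < ALPHABET; i++)
        trie_free(node->child[i]);
    free(node);
}
```

The root is an empty node obtained with calloc(1, sizeof(struct trie_node)). Each insert costs time proportional to the length of the word, regardless of how many words are already stored.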
Binary Search Trees work fine for strings.
If you don't care about having the words in sorted order, you can just use a hash table.
You're counting the number of unique words in the file?
Why don't you construct a simple hash table? For each word in your list, add it to the hash table. Any duplicates will be discarded since they are already in the table, and at the end you can just count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).
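A rough sketch of such a table, assuming separate chaining, a fixed number of buckets and a djb2-style string hash. None of these choices come from the answer, and the names `word_set`, `set_add` and `hash_str` are invented for the example.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* assumption: fixed table size, never resized */

struct entry {
    struct entry *next;   /* next entry in the same bucket */
    char *word;           /* private copy of the word */
};

struct word_set {
    struct entry *bucket[NBUCKETS];
    size_t count;         /* number of distinct words stored so far */
};

/* djb2-style string hash. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;

    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Add a word. Returns 1 if it was new, 0 if it was a duplicate,
 * -1 if memory ran out. The counter only moves on a real insert. */
int set_add(struct word_set *set, const char *word)
{
    size_t i = hash_str(word) % NBUCKETS;
    size_t len = strlen(word) + 1;
    struct entry *e;

    for (e = set->bucket[i]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return 0;                 /* already present: discard */

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(len);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->word, word, len);
    e->next = set->bucket[i];         /* push onto the front of the chain */
    set->bucket[i] = e;
    set->count++;
    return 1;
}
```

Start from a zeroed struct word_set (for example from calloc), call set_add() for every word read from the file, and read the answer from the count field at the end.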
The first upgrade to your algorithm could be keeping the list sorted, so your linear search can stop early (you only search until you find an element greater than yours), but this is still a naive solution.
Better approaches are a binary search tree or, even better, a prefix tree (or trie, already mentioned in another answer).
In "The C Programming Language" by K&R you have the exact example of what you are looking for.
The first example of "self-referential structures" (section 6.5) is a binary search tree used for counting the occurrences of every word in the input. (You don't need to count :P)
The structure is something like this:
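As a hedged sketch: K&R section 6.5 declares a `struct tnode` holding a word, a count and two child pointers. The version below drops the count, since only uniqueness matters here. The function `tree_add` is a name invented for this example and is not the book's code.

```c
#include <stdlib.h>
#include <string.h>

/* Node modeled on K&R's struct tnode, without the count field. */
struct tnode {
    char *word;             /* private copy of the word */
    struct tnode *left;     /* words that compare lower */
    struct tnode *right;    /* words that compare higher */
};

/* Insert a word into the tree rooted at *rootp, ordered by strcmp().
 * Returns 1 if the word was new, 0 if it was already in the tree,
 * -1 if memory ran out. */
int tree_add(struct tnode **rootp, const char *word)
{
    struct tnode **link = rootp;
    struct tnode *node;
    size_t len = strlen(word) + 1;

    /* Walk down until the word is found or an empty link is reached. */
    while (*link != NULL) {
        int cmp = strcmp(word, (*link)->word);
        if (cmp == 0)
            return 0;
        link = cmp < 0 ? &(*link)->left : &(*link)->right;
    }

    node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->word = malloc(len);
    if (node->word == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->word, word, len);
    node->left = NULL;
    node->right = NULL;
    *link = node;           /* hook the new leaf into the tree */
    return 1;
}
```

The caller starts with a NULL root and increments its own counter whenever tree_add() returns 1. The tree is not rebalanced, so input that arrives already sorted degrades it to a linked list.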
In the book you can see the whole example of what you want to do.
Binary search trees work well with any type of data that can be ordered, and will be better than a linear search in a list.
Sorry for my poor English, and correct me if I got something wrong; I'm very much a noob with C :p
EDIT: I can't add comments to other answers, but I have read a comment from the OP saying "The list isn't sorted so I can't use binary search". It makes no sense to use binary search on a linked list anyway. Why? Binary search is efficient when access to a random element is fast, as in an array. In a doubly linked list, your worst-case access will take n/2 steps. You could put a lot of extra pointers in the list (giving access to key elements), but that is a bad solution.
I'm putting the words in a linked list and just making a linear search into it.
If you go through the whole list to check whether each word W is present, then it's surely slow: O(n^2) overall, where n is the size of the list.
The simplest way is probably to use a hash table. It's easy to implement yourself (unlike some tree structures), and even C should have some libraries for that. You'll get O(n) expected complexity.
Edit: some C hash table implementations:
http://en.wikipedia.org/wiki/Hash_table#Independent_packages
If you're on a UNIX system, then you could use the bsearch() or hsearch() family of functions instead of a linear search.
If you need something simple and readily available, then see man tsearch for a simple binary search tree. But this is a plain binary search tree, with no guarantee of balancing.
Depending on the number of unique words, a plain C array + realloc() + qsort() + bsearch() might be an option too. That's what I use when I need a no-frills, faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
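One possible reading of the array approach, as a sketch: collect every word into an array (grown with realloc() by the caller, not shown here), sort it once with qsort(), squeeze out adjacent duplicates, and use bsearch() for later lookups. The helper names `cmp_words`, `sort_unique` and `contains_word` are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for an array of strings: qsort() and bsearch()
 * pass pointers to the array elements, so dereference once. */
static int cmp_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Sort words[0..n-1] in place and move one copy of each distinct word
 * to the front of the array. Returns the number of distinct words. */
size_t sort_unique(const char **words, size_t n)
{
    size_t unique = 1;

    if (n == 0)
        return 0;
    qsort(words, n, sizeof *words, cmp_words);
    for (size_t i = 1; i < n; i++)
        if (strcmp(words[i], words[unique - 1]) != 0)
            words[unique++] = words[i];   /* first word of a new run */
    return unique;
}

/* Binary search in the sorted, deduplicated prefix of the array.
 * Returns 1 if the word is present, 0 otherwise. */
int contains_word(const char **words, size_t unique, const char *word)
{
    return bsearch(&word, words, unique, sizeof *words, cmp_words) != NULL;
}
```

Sorting costs O(n log n) comparisons for n words in total, after which the distinct words sit in words[0] through words[unique - 1] in sorted order.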
More advanced options are often platform-specific (e.g. glib on Linux).
P.S. Another structure that is very easy to implement is a hash table. It is less efficient for strings, but it can quickly be made blazing fast by throwing memory at the problem.