Counting unique words in a file? Good alternatives to linear search?

Posted 2024-09-15 16:47:21

I'm using a naive approach to this problem: I put the words in a linked list and do a linear search over it. But it takes too much time on large files.

I was thinking of using a Binary Search Tree, but I don't know whether it works well with strings. I've also heard of Skip Lists, but haven't really learned them yet.

And I also have to use the C language...


Comments (7)

弃爱 2024-09-22 16:47:21

You can put all of the words into a trie and then count the number of words after you have processed the whole file.
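
For illustration, here is a minimal trie sketch in C, assuming words consist only of lowercase ASCII letters a-z; the node layout and the insert() helper are illustrative choices, not something specified in the answer above:

/* Minimal trie sketch: a word counts as new the first time a node is
   marked as the end of a word. Assumes lowercase ASCII a-z input. */
#include <stdlib.h>

#define ALPHABET 26

struct trie {
    struct trie *child[ALPHABET];
    int is_word;                     /* 1 if a word ends at this node */
};

static struct trie *new_node(void)
{
    return calloc(1, sizeof(struct trie));   /* zeroed children, is_word = 0 */
}

/* Inserts word into the trie rooted at root; returns 1 if it was new. */
int insert(struct trie *root, const char *word)
{
    struct trie *node = root;
    for (; *word; word++) {
        int i = *word - 'a';
        if (node->child[i] == NULL)
            node->child[i] = new_node();
        node = node->child[i];
    }
    if (node->is_word)
        return 0;                    /* seen before */
    node->is_word = 1;
    return 1;
}

Calling insert() for every word read from the file and summing the return values gives the unique-word count.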

旧城空念 2024-09-22 16:47:21

Binary Search Trees work fine for strings.

If you don't care about having the words in sorted order, you can just use a hash table.

夜清冷一曲。 2024-09-22 16:47:21

You're counting the number of unique words in the file?

Why don't you construct a simple hash table? That way, for each word in your list, you add it into the hash table. Any duplicates are discarded since they are already in the hash table, and at the end you can count the number of elements in the data structure (by storing a counter and incrementing it each time you add to the table).
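
A minimal chained-hash-table sketch along these lines; the table size, the hash function and the add_word() name are arbitrary illustrative choices:

/* Chained hash table sketch: add_word() returns 1 exactly once per
   distinct word, so summing its return values gives the unique count. */
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 65536

struct node {
    char *word;
    struct node *next;
};

static struct node *table[TABLE_SIZE];

static unsigned hash(const char *s)          /* simple djb2-style hash */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Adds word if unseen; returns 1 if it was new, 0 if a duplicate. */
int add_word(const char *word)
{
    unsigned i = hash(word);
    for (struct node *n = table[i]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return 0;                        /* duplicate, discard */
    struct node *n = malloc(sizeof *n);
    n->word = strdup(word);                  /* keep our own copy */
    n->next = table[i];
    table[i] = n;
    return 1;
}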

汹涌人海 2024-09-22 16:47:21

The first upgrade to your algorithm could be keeping the list sorted, so your linear search could be faster (you only search until you find an element greater than yours), but this is still a naive solution.

The best approaches are Binary Search Trees and, even better, a prefix tree (or trie, already mentioned in another answer).

In "The C Programming Language" From K&R you have the exact example of what you are looking for.
The first example of "autoreferenced data structs" (6.5) is a binary search tree used for counting the ocurrences of every word in a string. (You don't need to count :P)

the structure is something like this:

struct tnode {                  /* the tree node: */
        char *word;             /* points to the text */
        struct tnode *left;     /* left child */
        struct tnode *right;    /* right child */
};

In the book you can see the whole example of what you want to do.
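
As a rough sketch of the insertion step, in the spirit of K&R's addtree() but trimmed of the count field (only uniqueness matters here), with error handling omitted:

#include <string.h>
#include <stdlib.h>

/* Insert w into the tree rooted at p; duplicates are simply ignored. */
struct tnode *addtree(struct tnode *p, const char *w)
{
    int cond;

    if (p == NULL) {                      /* a new word has arrived */
        p = malloc(sizeof(struct tnode));
        p->word = strdup(w);              /* store our own copy */
        p->left = p->right = NULL;
    } else if ((cond = strcmp(w, p->word)) < 0) {
        p->left = addtree(p->left, w);    /* smaller: go left */
    } else if (cond > 0) {
        p->right = addtree(p->right, w);  /* greater: go right */
    }
    /* cond == 0: the word is already in the tree, nothing to do */
    return p;
}

strcmp() provides the ordering, which is why a binary search tree works just as well for strings as for numbers.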

Binary Search Trees work well with any type of data that can accept an ordering, and will be better than a linear search in a list.

Sorry for my poor English, and correct me if I got something wrong; I'm very much a noob with C :p

EDIT: I can't add comments to other answers, but I have read a comment from the OP saying "The list isn't sorted so I can't use binary search". It is nonsense to use binary search on a linked list. Why? Binary search is efficient when access to a random element is fast, like in an array. In a doubly linked list, your worst access will be n/2 steps. You could place extra pointers into the list (giving access to key elements), but that is a poor solution.

就是爱搞怪 2024-09-22 16:47:21

"I'm putting the words in a linked list and just making a linear search into it."
If, to check whether word W is present, you go through the whole list, then it's surely slow: O(n^2) overall, where n is the size of the list.

The simplest way is probably a hash table. It's easy to implement yourself (unlike some tree structures), and even C should have some libraries for that. You'll get O(n) complexity.

EDIT: Some C hash table implementations:
http://en.wikipedia.org/wiki/Hash_table#Independent_packages

万劫不复 2024-09-22 16:47:21

If you're on a UNIX system, then you could use the bsearch() or hsearch() family of functions instead of a linear search.
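
A minimal sketch using the POSIX hsearch() interface from <search.h>; the table size and the use of fscanf() to pull whitespace-delimited words are illustrative assumptions:

/* Count unique words with the POSIX hash table API.
   Note: an hsearch() table cannot grow, so the size passed to
   hcreate() must be a generous upper bound on the unique words. */
#include <search.h>
#include <stdio.h>
#include <string.h>

size_t count_unique(FILE *fp)
{
    char buf[128];
    size_t unique = 0;

    hcreate(100000);                      /* guessed upper bound */
    while (fscanf(fp, "%127s", buf) == 1) {
        ENTRY e = { buf, NULL };
        if (hsearch(e, FIND) == NULL) {   /* not seen before */
            e.key = strdup(buf);          /* the table keeps the pointer */
            hsearch(e, ENTER);
            unique++;
        }
    }
    return unique;                        /* call hdestroy() when finished */
}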

老街孤人 2024-09-22 16:47:21

If you need something simple and easily available, then see man tsearch for a simple binary search tree. But this is a plain binary search tree, not a balanced one.
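
A small sketch of how tsearch()/tfind() from <search.h> could be used here; insert_word() is a hypothetical helper name:

#include <search.h>
#include <string.h>
#include <stdlib.h>

static void *root = NULL;   /* tsearch() keeps the tree behind an opaque root */

static int compare(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Returns 1 if word was new (and inserted), 0 if already present. */
int insert_word(const char *word)
{
    if (tfind(word, &root, compare) != NULL)
        return 0;                             /* already in the tree */
    tsearch(strdup(word), &root, compare);    /* the tree stores the pointer */
    return 1;
}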

Depending on the number of unique words, a plain C array + realloc() + qsort() + bsearch() might be an option too. That's what I use when I need a no-frills, faster-than-linear search in plain portable C. (Otherwise, if possible, I opt for C++ and std::map/std::set.)
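
For counting, the array approach can skip bsearch() entirely: collect every word, sort once, then count how many adjacent entries differ. A sketch, with words/nwords as illustrative names and the word-reading and realloc() bookkeeping left out:

#include <stdlib.h>
#include <string.h>

/* qsort()/bsearch() comparator for an array of char * */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

size_t count_unique_sorted(char **words, size_t nwords)
{
    if (nwords == 0)
        return 0;
    qsort(words, nwords, sizeof *words, cmp_str);

    size_t unique = 1;                    /* the first word is always new */
    for (size_t i = 1; i < nwords; i++)
        if (strcmp(words[i], words[i - 1]) != 0)
            unique++;
    return unique;
}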

More advanced options are often platform specific (e.g. glib on Linux).

P.S. Another very easy-to-implement structure is a hash table. It's less efficient for strings but very easy to implement, and can quickly be made blazing fast by throwing memory at the problem.
