词频统计
在预面试中,我面临这样的问题:
给定一个由单个空格分隔的单词组成的字符串,按单词在字符串中出现的次数降序打印单词。
例如,输入字符串“abb”将生成以下输出:
b : 2
a : 1
首先,我想说,输入字符串是由单字母单词还是由多字母单词组成并不那么清楚。如果是前者,事情可能会很简单。
这是我的想法:
int c[26] = {0};
char *pIn = strIn;
while (*pIn != 0 && *pIn != ' ')
{
++c[*pIn];
++pIn;
}
/* how to sort the array c[26] and remember the original index? */
我可以统计输入字符串中每个单字母单词的频率,并且可以对其进行排序(使用 QuickSort 或其他方式)。但是在计数数组排序后,如何获取与计数相关的单字母单词,以便稍后将它们成对打印出来?
如果输入字符串由多字母单词组成,我计划使用 map
来跟踪频率。但同样,如何对映射的键值对进行排序?
问题是用 C 或 C++ 编写的,欢迎提出任何建议。
谢谢!
In an pre-interview, I am faced with a question like this:
Given a string consists of words separated by a single white space, print out the words in descending order sorted by the number of times they appear in the string.
For example an input string of “a b b” would generate the following output:
b : 2
a : 1
Firstly, I'd say it is not so clear that whether the input string is made up of single-letter words or multiple-letter words. If the former is the case, it could be simple.
Here is my thought:
int c[26] = {0};
char *pIn = strIn;
while (*pIn != 0 && *pIn != ' ')
{
++c[*pIn];
++pIn;
}
/* how to sort the array c[26] and remember the original index? */
I can get the statistics of the frequecy of every single-letter word in the input string, and I can get it sorted (using QuickSort or whatever). But after the count array is sorted, how to get the single-letter word associated with the count so that I can print them out in pair later?
If the input string is made of of multiple-letter word, I plan to use a map<const char *, int>
to track the frequency. But again, how to sort the map's key-value pair?
The question is in C or C++, and any suggestion is welcome.
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我将使用
std::map
来存储单词及其计数。然后我会使用这样的东西来获取单词:最后,您只需要弄清楚如何按频率顺序打印它们,我将把它作为练习留给您。
I would use a
std::map<std::string, int>
to store the words and their counts. Then I would use something this to get the words:finally, you just need to figure out how to print them in order of frequency, I'll leave that as an exercise for you.
您假设只需要 26 个选项,这绝对是错误的,因为您的雇主也希望允许使用多字符单词(甚至可能是数字?)。
这意味着您将需要一个可变长度的数组。我强烈建议使用矢量,或者更好的是使用地图。
要查找字符串中的字符序列,请查找当前位置(从 0 开始)和下一个空格的位置。然后就是这个词了。将当前位置设置为空格并再次执行。不断重复此操作,直到结束。
通过使用地图,您已经可以获得可用的字数/计数。
如果您申请的工作需要大学技能,我强烈建议通过添加某种哈希函数来优化地图。然而,从问题的难度来看,我认为情况并非如此。
You're definitely wrong in assuming that you need only 26 options, 'cause your employer will want to allow multiple-character words as well (and maybe even numbers?).
This means you're going to need an array with a variable length. I strongly recommend using a vector or, even better, a map.
To find the character sequences in the string, find your current position (start at 0) and the position of the next space. Then that's the word. Set the current position to the space and do it again. Keep repeating this until you're at the end.
By using the map you'll already have the word/count available.
If the job you're applying for requires university skills, I strongly recommend optimizing the map by adding some kind of hashing function. However, judging by the difficulty of the question I assume that that is not the case.
以 C 语言为例:
我喜欢暴力、简单的算法,所以我会这样做:
对输入字符串进行分词以给出未排序的单词数组。我实际上必须物理地移动每个单词(因为每个单词的长度都是可变的);我认为我需要一个 char* 数组,我将使用它作为 qsort( ) 的参数。
qsort( ) (降序)该单词数组。 (在 qsort() 的 COMPAR 函数中,假设较大的单词是较小的单词,以便数组获得降序排序顺序。)
3.a.遍历现在已排序的数组,查找相同单词的子数组。子数组的结尾和下一个子数组的开头由我看到的第一个不相同的单词表示。
3.b.当我到达子数组的末尾(或排序数组的末尾)时,我知道(1)单词和(2)子数组中相同单词的数量。
编辑新步骤 4: 在另一个数组(称为 array2)中保存子数组中单词的 char* 以及子数组中相同单词的计数。
当排序数组中不再有单词时,我就完成了。是时候打印了。
qsort( ) array2 按词频。
遍历 array2,打印每个单词及其频率。
我受够了!我们去吃午饭吧。
Taking the C-language case:
I like brute-force, straightforward algos so I would do it in this way:
Tokenize the input string to give an unsorted array of words. I'll have to actually, physically move each word (because each is of variable length); and I think I'll need an array of char*, which I'll use as the arg to qsort( ).
qsort( ) (descending) that array of words. (In the COMPAR function of qsort(), pretend that bigger words are smaller words so that the array acquires descending sort order.)
3.a. Go through the now-sorted array, looking for subarrays of identical words. The end of a subarray, and the beginning of the next, is signalled by the first non-identical word I see.
3.b. When I get to the end of a subarray (or to the end of the sorted array), I know (1) the word and (2) the number of identical words in the subarray.
EDIT new step 4: Save, in another array (call it array2), a char* to a word in the subarry and the count of identical words in the subarray.
When no more words in sorted array, I'm done. it's time to print.
qsort( ) array2 by word frequency.
go through array2, printing each word and its frequency.
I'M DONE! Let's go to lunch.
在我之前的所有答案都没有给出真正的答案。
让我们思考一个可能的解决方案。
有一种或多或少的标准方法来计算容器中的东西。
我们可以使用关联容器,例如
std::map
或std::unordered_map
。在这里,我们将“键”(在本例中为单词)与计数和值(在本例中为特定单词的计数)相关联。幸运的是,这些地图有一个非常好的索引
operator[]
。这将查找给定的键,如果找到,则返回对该值的引用。如果未找到,则它将使用该密钥创建一个新条目并返回对新条目的引用。因此,在这两种情况下,我们都会获得用于计数的值的引用。然后我们可以简单地写:这看起来非常直观。
完成此操作后,您就已经有了频率表。可以使用
std::map
按键(单词)排序,也可以不排序,但使用std::unordered_map
可以更快地访问。现在您想根据频率/计数进行排序。不幸的是,这对于地图来说是不可能的。
因此,我们需要使用第二个容器,例如 ```std::vector`````,然后我们可以对任何给定谓词进行 unsing
std::sort
排序,或者,我们可以复制将值放入容器中,例如隐式对其元素进行排序的std::multiset
。为了获取
std::string
的单词,我们只需使用std::istringstream
和标准提取运算符>>
。没什么大不了的。因为要为 std 容器编写所有这些长名称,所以我们使用
using
关键字创建别名。毕竟,我们现在编写超紧凑的代码,只需几行代码即可完成任务:
All the answers prior to mine did not give really an answer.
Let us think on a potential solution.
There is a more or less standard approach for counting something in a container.
We can use an associative container like a
std::map
or astd::unordered_map
. And here we associate a "key", in this case the word, to a count, with a value, in this case the count of the specific word.And luckily the maps have a very nice index
operator[]
. This will look for the given key and, if found, return a reference to the value. If not found, then it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:And that looks really intuitive.
After this operation, you have already the frequency table. Either sorted by the key (the word), by using a
std::map
or unsorted, but faster accessible with astd::unordered_map
.Now you want to sort according to the frequency/count. Unfortunately this is not possible with maps.
Therefore we need to use a second container, like a ```std::vector`````which we then can sort unsing
std::sort
for any given predicate, or, we can copy the values into a container, like astd::multiset
that implicitely orders its elements.For getting out the words of a
std::string
we simply use astd::istringstream
and the standard extraction operator>>
. No big deal at all.And because writing all this long names for the std containers, we create alias names, with the
using
keyword.After all this, we now write ultra compact code and fulfill the task with just a few lines of code: