TreeMap problem
I am trying to count the frequency of words in a text file, but I have to use a different approach. For example, if the file contains BRAIN-ISCHEMIA and ISCHEMIA-BRAIN, I need to count BRAIN-ISCHEMIA twice (and drop ISCHEMIA-BRAIN), or vice versa. Here is my code:
    // Mapping of String->Integer (word -> frequency)
    HashMap<String, Integer> frequencyMap = new HashMap<String, Integer>();
    // Iterate through each line of the file
    String[] temp;
    String currentLine;
    String currentLine2;
    while ((currentLine = in.readLine()) != null) {
        // Remove this line if you want words to be case sensitive
        currentLine = currentLine.toLowerCase();
        temp = currentLine.split("-");
        currentLine2 = temp[1] + "-" + temp[0];
        // Iterate through each word of the current line
        // Delimit words based on whitespace, punctuation, and quotes
        StringTokenizer parser = new StringTokenizer(currentLine);
        while (parser.hasMoreTokens()) {
            String currentWord = parser.nextToken();
            Integer frequency = frequencyMap.get(currentWord);
            // Add the word if it doesn't already exist, otherwise increment the
            // frequency counter.
            if (frequency == null) {
                frequency = 0;
            }
            frequencyMap.put(currentWord, frequency + 1);
        }
        StringTokenizer parser2 = new StringTokenizer(currentLine2);
        while (parser2.hasMoreTokens()) {
            String currentWord2 = parser2.nextToken();
            Integer frequency = frequencyMap.get(currentWord2);
            // Add the word if it doesn't already exist, otherwise increment the
            // frequency counter.
            if (frequency == null) {
                frequency = 0;
            }
            frequencyMap.put(currentWord2, frequency + 1);
        }
    }
    // Display our nice little Map
    System.out.println(frequencyMap);
But for the following file:
ISCHEMIA-GLUTAMATE
ISCHEMIA-BRAIN
GLUTAMATE-BRAIN
BRAIN-TOLERATE
BRAIN-TOLERATE
TOLERATE-BRAIN
GLUTAMATE-ISCHEMIA
ISCHEMIA-GLUTAMATE
I am getting the following output:
{glutamate-brain=1, ischemia-glutamate=3, ischemia-brain=1, glutamate-ischemia=3, brain-tolerate=3, brain-ischemia=1, tolerate-brain=3, brain-glutamate=1}
I think the problem is in the second while block. Any light on this problem would be highly appreciated.
Comments (3)
From an algorithm perspective, you may want to consider the following approach:
For each string, split it, sort the parts, then re-combine them (e.g. DEF-ABC becomes ABC-DEF, and ABC-DEF stays ABC-DEF). Then use the result as the key for your frequency count.
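A minimal sketch of the split, sort, re-combine idea, using only the standard library. The class name `PairFrequency` and the method names `canonicalKey` and `count` are made up for illustration, and lower-casing the input is carried over from the question's code rather than required by the approach.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original post.
class PairFrequency {

    // Map "ISCHEMIA-BRAIN" and "BRAIN-ISCHEMIA" to the same key, "brain-ischemia".
    static String canonicalKey(String line) {
        String[] parts = line.trim().toLowerCase(Locale.ROOT).split("-");
        Arrays.sort(parts); // alphabetical order makes the key independent of word order
        return String.join("-", parts);
    }

    // Count every line exactly once, under its canonical key.
    static Map<String, Integer> count(List<String> lines) {
        // TreeMap is used only so the printed output is sorted; a HashMap works too.
        Map<String, Integer> frequencyMap = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            frequencyMap.merge(canonicalKey(line), 1, Integer::sum);
        }
        return frequencyMap;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "ISCHEMIA-GLUTAMATE", "ISCHEMIA-BRAIN", "GLUTAMATE-BRAIN", "BRAIN-TOLERATE",
                "BRAIN-TOLERATE", "TOLERATE-BRAIN", "GLUTAMATE-ISCHEMIA", "ISCHEMIA-GLUTAMATE");
        // prints {brain-glutamate=1, brain-ischemia=1, brain-tolerate=3, glutamate-ischemia=3}
        System.out.println(count(lines));
    }
}
```

Because each line is added once under a single key, the counts sum to the number of input lines (8 here), rather than every pair being counted under both orderings.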
If you need to hold onto the exact original item, just include it in your key, so the key would have two parts: the ordinal (the re-combined string) and the original.
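One way to read the "keep the original" suggestion is sketched below. Instead of a composite key, it nests a second map under each re-combined key, which is a variation on the advice rather than a literal implementation of it. The names `PairFrequencyWithOriginals`, `canonicalKey`, `count`, and `total` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original post.
class PairFrequencyWithOriginals {

    // Sort the hyphen-separated parts so both orderings share one key.
    static String canonicalKey(String item) {
        String[] parts = item.split("-");
        Arrays.sort(parts);
        return String.join("-", parts);
    }

    // Result shape: re-combined key -> (original spelling -> count).
    static Map<String, Map<String, Integer>> count(List<String> lines) {
        Map<String, Map<String, Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String original = line.trim().toLowerCase(Locale.ROOT);
            if (original.isEmpty()) {
                continue; // skip blank lines
            }
            grouped.computeIfAbsent(canonicalKey(original), k -> new TreeMap<>())
                   .merge(original, 1, Integer::sum);
        }
        return grouped;
    }

    // Combined frequency of all original spellings that share one key.
    static int total(Map<String, Integer> variants) {
        int sum = 0;
        for (int n : variants.values()) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("BRAIN-TOLERATE", "BRAIN-TOLERATE", "TOLERATE-BRAIN");
        Map<String, Map<String, Integer>> grouped = count(lines);
        // prints {brain-tolerate={brain-tolerate=2, tolerate-brain=1}}
        System.out.println(grouped);
        // prints 3
        System.out.println(total(grouped.get("brain-tolerate")));
    }
}
```

The inner map records which spellings actually occurred, while `total` recovers the combined count the question asks for.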
Disclaimer: I stole the sweet trick suggested by Kevin Day for my implementation.
I still want to post just to let you know that using the right data structure (Multiset/Bag) and the right libraries (google-guava) will not only simplify the code but also make it efficient.
Code
Output
Thanks everyone for your help. Here is how I solved it-