Counting word frequency across documents
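I have a directory containing 1000 .txt files. For every word, I want to know how many of the 1000 documents it occurs in. So even if the word "cow" occurs 100 times in document X, it still counts as one. If it also occurs in a different document, the count is incremented by one. So the maximum is 1000, if "cow" appears in every single document. How can I do this in a simple way without using any external libraries? Here's what I have so far: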
private Hashtable<String, Integer> getAllWordCount()
{
    Hashtable<String, Integer> result = new Hashtable<String, Integer>();
    HashSet<String> words = new HashSet<String>();
    try {
        for (int j = 0; j < fileDirectory.length; j++){
            File theDirectory = new File(fileDirectory[j]);
            File[] children = theDirectory.listFiles();
            for (int i = 0; i < children.length; i++){
                Scanner scanner = new Scanner(new FileReader(children[i]));
                while (scanner.hasNext()){
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    if (words.contains(text) == false){
                        if (result.get(text) == null)
                            result.put(text, 1);
                        else
                            result.put(text, result.get(text) + 1);
                        words.add(text);
                    }
                }
            }
            words.clear();
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println(result.size());
    return result;
}
Comments (2)
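You also need a HashSet<String> in which you store each unique word you've read from the current file. Then, after every word read, check whether it is in the set; if it isn't, increment the corresponding value in the result map (or add a new entry if there was none, as you already do) and add the word to the set. Don't forget to reset the set when you start reading a new file, though.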
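To make the "reset the set per file" point concrete, here is a minimal sketch using only the standard library. The class name DocumentFrequency, the method countDocumentFrequency, and the UTF-8 encoding are my own assumptions, not part of the question; the word cleanup regex is copied from the posted code. Note that in the posted snippet words.clear() sits after the inner loop, so the set is cleared once per directory rather than once per file. Creating a fresh set for each file avoids that.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper class; the names are illustrative, not from the question.
class DocumentFrequency {

    // For each word, counts the number of files in which it appears at least once.
    // Matching is case-sensitive, as in the posted code.
    static Map<String, Integer> countDocumentFrequency(List<Path> files) throws IOException {
        Map<String, Integer> result = new HashMap<>();
        for (Path file : files) {
            // A fresh set per file takes the place of resetting a shared set.
            Set<String> seenInThisFile = new HashSet<>();
            // Assumption: the files are UTF-8 text.
            try (Scanner scanner = new Scanner(file, "UTF-8")) {
                while (scanner.hasNext()) {
                    String word = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    if (word.isEmpty()) {
                        continue; // the token was punctuation only
                    }
                    // add() returns true only the first time a word is seen in this file.
                    if (seenInThisFile.add(word)) {
                        result.merge(word, 1, Integer::sum);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DocumentFrequency <directory>");
            return;
        }
        List<Path> files = new ArrayList<>();
        File[] children = new File(args[0]).listFiles();
        if (children != null) { // listFiles() returns null if the path is not a directory
            for (File child : children) {
                if (child.isFile()) {
                    files.add(child.toPath());
                }
            }
        }
        Map<String, Integer> df = countDocumentFrequency(files);
        System.out.println(df.size() + " distinct words");
    }
}
```

The try-with-resources block also closes each Scanner, which matters when looping over a thousand files.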
How about this?