Word count frequency across documents

Posted on 2024-10-21 03:56:56

I have a directory containing 1000 .txt files. For every word, I want to know how many of the 1000 documents it occurs in. So even if the word "cow" occurs 100 times in document X, it still counts as one. If it also occurs in a different document, the count is incremented by one. So the maximum is 1000, reached if "cow" appears in every single document. How do I do this the easy way, without using any external library? Here's what I have so far:

    private Hashtable<String, Integer> getAllWordCount()
    {
        Hashtable<String, Integer> result = new Hashtable<String, Integer>();
        HashSet<String> words = new HashSet<String>();
        try {   
            for (int j = 0; j < fileDirectory.length; j++){
                File theDirectory = new File(fileDirectory[j]);
                File[] children = theDirectory.listFiles();

                for (int i = 0; i < children.length; i++){
                    Scanner scanner = new Scanner(new FileReader(children[i]));

                    while (scanner.hasNext()){
                        String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                        if (words.contains(text) == false){
                            if (result.get(text) == null)
                                result.put(text, 1);
                            else
                                result.put(text, result.get(text) + 1);
                            words.add(text);
                        }
                    }
                }
                words.clear();
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println(result.size());
        return result;
    }

故人的歌 2024-10-28 03:56:56

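You also need a HashSet in which to store each unique word read from the current file.

Then, after reading each word, check whether it is already in the set. If it is not, increment the corresponding value in the result map (or add a new entry if there is none yet, as you already do) and add the word to the set.

Don't forget to reset the set when you start reading a new file, though.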
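
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class name `DocumentFrequency` and the method `count` are made up for the example, and the documents are passed in as strings rather than read from files so that the counting logic stands on its own. It relies on `Set.add` returning `true` only the first time a word is added, which replaces the separate `contains` check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper class, not part of the original question.
class DocumentFrequency {
    // Returns, for each word, the number of documents that contain it at least once.
    static Map<String, Integer> count(String[] documents) {
        Map<String, Integer> result = new HashMap<String, Integer>();
        for (String document : documents) {
            // Fresh set per document: the words already counted for this document.
            Set<String> seen = new HashSet<String>();
            Scanner scanner = new Scanner(document);
            while (scanner.hasNext()) {
                String word = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                if (word.isEmpty()) {
                    continue; // the token consisted only of punctuation
                }
                // add() returns true only the first time the word is seen in this document.
                if (seen.add(word)) {
                    Integer old = result.get(word);
                    result.put(word, old == null ? 1 : old + 1);
                }
            }
            scanner.close();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] documents = { "cow cow cow, milk", "the cow jumped", "no animals here" };
        Map<String, Integer> counts = count(documents);
        System.out.println(counts.get("cow"));  // 2
        System.out.println(counts.get("milk")); // 1
    }
}
```

Note that the comparison is case-sensitive, so "Cow" and "cow" are counted as different words unless you lower-case each token first.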

眉黛浅 2024-10-28 03:56:56

How about this?

private Hashtable<String, Integer> getAllWordCount()
{
    Hashtable<String, Integer> result = new Hashtable<String, Integer>();
    HashSet<String> words = new HashSet<String>();
    try {   
        for (int j = 0; j < fileDirectory.length; j++){
            File theDirectory = new File(fileDirectory[j]);
            File[] children = theDirectory.listFiles();
            for (int i = 0; i < children.length; i++){
                Scanner scanner = new Scanner(new FileReader(children[i]));
                // Collect the unique words of the current file.
                while (scanner.hasNext()){
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    words.add(text);
                }
                scanner.close();
                // Count each unique word once for this file.
                for (String word : words) {
                    Integer count = result.get(word);
                    if (count == null) {
                        result.put(word, 1);
                    } else {
                        result.put(word, count + 1);
                    }
                }
                // Reset the set before reading the next file.
                words.clear();
            }
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println(result.size());
    return result;
}
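
A slightly more defensive variant of the same idea might look like the sketch below. It assumes Java 8 or later (for `Map.merge` and try-with-resources), and the class name `DirectoryWordCount` and the method `countInDirectory` are hypothetical names chosen for the example. Compared with the method above, it handles a single directory, returns an empty map when `listFiles()` returns `null`, skips subdirectories, and ignores tokens that become empty once punctuation is stripped.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

// Hypothetical helper class, not part of the original answer.
class DirectoryWordCount {
    // Counts, for each word, how many regular files directly inside dir contain it.
    static Map<String, Integer> countInDirectory(File dir) throws IOException {
        Map<String, Integer> result = new HashMap<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return result; // dir is not a directory, or it could not be read
        }
        for (File child : children) {
            if (!child.isFile()) {
                continue; // skip subdirectories
            }
            // Fresh set per file: the unique words of this file only.
            Set<String> words = new HashSet<>();
            // try-with-resources closes the scanner, and with it the file.
            try (Scanner scanner = new Scanner(child)) {
                while (scanner.hasNext()) {
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    if (!text.isEmpty()) {
                        words.add(text);
                    }
                }
            }
            // Each unique word adds exactly one to its document count.
            for (String word : words) {
                result.merge(word, 1, Integer::sum);
            }
        }
        return result;
    }
}
```

Calling `countInDirectory(new File(fileDirectory[j]))` for each entry and summing the returned maps would give the same totals as the loop over `fileDirectory` above.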
