How to determine the term frequency of terms in each document?

Posted 11-01 18:07

I'm building an inverted index, but I can't seem to get the correct frequencies when I check the database. I read everywhere that you should use a HashMap, but I'm not quite sure if this is the correct method of doing so. Any ideas?

public class Tokenize {

    public static void createIndex() throws Exception {

        ArrayList<Dokument> dok = new QueryHandler().getDokuments();
        ArrayList<String> queries = new ArrayList<String>();
        ArrayList<String> queries2 = new ArrayList<String>();
        HashMap<String, Integer> frek = new HashMap<String, Integer>();

        for (int d = 0; d < dok.size(); d++) {
            String token = "";
            int frekvens = 0;

            try {
                Dokument document = dok.get(d);
                StringTokenizer st = new StringTokenizer(document.dokument());
                while (st.hasMoreTokens()) {
                    token = st.nextToken();
                    token.replaceAll("[']", "");
                    token.replaceAll("[,]", "");
                    token.replaceAll("[)]", "");
                    token.replaceAll("[(]", "");
                    token.replaceAll("[.]", "");
                    frekvens++;
                    frek.put(token, frekvens);

                    queries.add("INSERT IGNORE INTO termindeks (docID, term) values ("+document.docID()+", '"+token+"')");
                    queries2.add("INSERT IGNORE INTO invertedindeks (term, docID, termfrekvens) values ('"+token+"', "+document.docID()+", "+ frekvens+")");
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println(token);
            }
        }

        String[] ffs = new String[queries.size()];
        ffs = queries.toArray(ffs);
        getDB().runQueriesIgnoreException(queries.toArray(ffs));

        String[] ffs2 = new String[queries2.size()];
        ffs2 = queries2.toArray(ffs2);
        getDB().runQueriesIgnoreException(queries2.toArray(ffs2));
    }
}

不语却知心 2024-11-08 18:07:50

You should get the current value for the token first, increment it, and then put it back into the map.

Like this, inside your loop:

Integer frekvens = frek.get(token); // remove the other frekvens variable, it is no longer needed (or find a better name for this one)
if (frekvens == null) { frekvens = 0; } // first time this token is seen
frekvens++;
frek.put(token, frekvens);
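
For reference, here is a minimal, self-contained sketch of that get/increment/put pattern applied to a single document. The class name `TermFrequencyExample` and the method `countTerms` are made up for illustration and are not part of the asker's code. The sketch also assumes the punctuation stripping was meant to take effect: `String` is immutable, so the result of `replaceAll` has to be assigned, which the code in the question does not do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class TermFrequencyExample {

    // Counts how often each token occurs in the text of ONE document.
    // (Hypothetical helper; the name and signature are assumptions.)
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> frek = new HashMap<String, Integer>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            // Strings are immutable, so the result of replaceAll must be kept.
            String token = st.nextToken().replaceAll("[',().]", "");
            if (token.isEmpty()) {
                continue; // the token consisted only of punctuation
            }
            Integer frekvens = frek.get(token); // null the first time a token is seen
            if (frekvens == null) {
                frekvens = 0;
            }
            frek.put(token, frekvens + 1);
        }
        return frek;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTerms("the cat saw the dog, the end.");
        System.out.println(counts.get("the")); // 3
        System.out.println(counts.get("dog")); // 1
    }
}
```

Calling `countTerms` once per document gives each document its own map, so counts from one document cannot leak into the next.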

丢了幸福的猪 2024-11-08 18:07:50

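The idea is correct, but as far as I can see you are not using the HashMap correctly. You have to get the value associated with the key before you increment it, i.e.

Integer i = map.get(token);
if (i == null) { i = 0; } // get() returns null the first time a token is seen
i += 1;
map.put(token, i);

EDIT

Another option would be to use an AtomicInteger instead of an Integer. An AtomicInteger is mutable, so the count can be incremented in place without putting it back into the map.

Map<String, AtomicInteger> map = new HashMap<String, AtomicInteger>();
// Java 8+: create the counter the first time a token is seen, then increment it
map.computeIfAbsent(token, k -> new AtomicInteger()).getAndIncrement();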
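
To show how the AtomicInteger variant might fit into a per-document index, here is a rough sketch. Everything in it is an assumption made for illustration: the class `AtomicCountSketch`, the method `buildRows`, using the array index as the document ID, and the plain "term,docID,frequency" row format standing in for the asker's INSERT statements. The point is only that the counting map is created fresh for each document and that rows are emitted after the whole document has been counted, not once per token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class AtomicCountSketch {

    // Returns one "term,docID,frequency" row per distinct term per document.
    // (Hypothetical helper; the array index stands in for a real document ID.)
    static List<String> buildRows(String[] docs) {
        List<String> rows = new ArrayList<String>();
        for (int docId = 0; docId < docs.length; docId++) {
            // A fresh map per document; LinkedHashMap keeps first-seen order.
            Map<String, AtomicInteger> counts = new LinkedHashMap<String, AtomicInteger>();
            for (String token : docs[docId].split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // produced by leading whitespace
                }
                // Create the counter on first sight, then increment in place.
                counts.computeIfAbsent(token, k -> new AtomicInteger()).getAndIncrement();
            }
            // Emit rows only after the whole document has been counted.
            for (Map.Entry<String, AtomicInteger> e : counts.entrySet()) {
                rows.add(e.getKey() + "," + docId + "," + e.getValue().get());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : buildRows(new String[] {"a b a", "b c"})) {
            System.out.println(row);
        }
        // prints:
        // a,0,2
        // b,0,1
        // b,1,1
        // c,1,1
    }
}
```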

About the author: 断舍离
