How do I determine the term frequency of each term in each document?

Posted 11-01 18:07 · 2184 characters · 3 views · 0 comments

I'm building an inverted index, but I can't seem to get the correct frequencies when I check the database. I read everywhere that you should use a HashMap, but I'm not quite sure if this is the correct method of doing so. Any ideas?

public class Tokenize {

    public static void createIndex() throws Exception {

        ArrayList<Dokument> dok = new QueryHandler().getDokuments();
        ArrayList<String> queries = new ArrayList<String>();
        ArrayList<String> queries2 = new ArrayList<String>();
        HashMap<String, Integer> frek = new HashMap<String, Integer>();

        for (int d = 0; d < dok.size(); d++) {
            String token = "";
            int frekvens = 0;

            try {
                Dokument document = dok.get(d);
                StringTokenizer st = new StringTokenizer(document.dokument());
                while (st.hasMoreTokens()) {
                    token = st.nextToken();
                    token.replaceAll("[']", "");
                    token.replaceAll("[,]", "");
                    token.replaceAll("[)]", "");
                    token.replaceAll("[(]", "");
                    token.replaceAll("[.]", "");
                    frekvens++;
                    frek.put(token, frekvens);

                    queries.add("INSERT IGNORE INTO termindeks (docID, term) values ("+document.docID()+", '"+token+"')");
                    queries2.add("INSERT IGNORE INTO invertedindeks (term, docID, termfrekvens) values ('"+token+"', "+document.docID()+", "+ frekvens+")");
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println(token);
            }
        }

        String[] ffs = new String[queries.size()];
        ffs = queries.toArray(ffs);
        getDB().runQueriesIgnoreException(queries.toArray(ffs));

        String[] ffs2 = new String[queries2.size()];
        ffs2 = queries2.toArray(ffs2);
        getDB().runQueriesIgnoreException(queries2.toArray(ffs2));
    }
}

Comments (2)

不语却知心 2024-11-08 18:07:50

You should first get the current value for the token, increment it, and then put it back into the map.

Like this in your loop:

Integer frekvens = frek.get(token); //remove the other frekvens as it's not needed - or find a better name for this one ;)
if (frekvens == null) { frekvens = 0; } // first occurrence of this token
frekvens++;
frek.put(token, frekvens);
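
Applied to a whole document, the get/increment/put pattern above might look like the following sketch. It assumes one fresh map per document, so that counts do not carry over from one document to the next, and it assigns the result of `replaceAll` back to the token, because Java strings are immutable and the call does not modify the original. The names `TermFrequencySketch` and `countTerms` are made up for illustration and are not part of the asker's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper, not part of the original Tokenize class.
public class TermFrequencySketch {

    // Counts how often each token occurs in a single document's text.
    public static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> frek = new HashMap<String, Integer>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            // Strings are immutable, so the stripped result must be assigned.
            String token = st.nextToken().replaceAll("[',().]", "");
            if (token.isEmpty()) {
                continue; // the token consisted only of punctuation
            }
            Integer frekvens = frek.get(token);
            if (frekvens == null) {
                frekvens = 0; // first time this token is seen in the document
            }
            frek.put(token, frekvens + 1);
        }
        return frek;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTerms("the cat (the dog), the end.");
        System.out.println(counts.get("the")); // 3
        System.out.println(counts.get("end")); // 1
    }
}
```

With a map like this, the `invertedindeks` rows would presumably be written once per map entry after the document has been fully tokenized, rather than once per token inside the loop, so that the final count is the value that gets stored.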

丢了幸福的猪 2024-11-08 18:07:50

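The idea is right, but as far as I can see you are not using the HashMap correctly. You have to fetch the value currently associated with the key, increment it, and store it back, i.e.

Integer i = map.get(token);
if (i == null) { i = 0; } // get() returns null for a token not seen before
i += 1;
map.put(token, i);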
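
On Java 8 or later, the same get/increment/put sequence can be written as a single call to `Map.merge`, which also covers the case where the token is not in the map yet. This is an alternative to the snippet above, not something the answer itself used; `MergeCountSketch` and `count` are illustrative names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class, not part of the original code.
public class MergeCountSketch {

    // Counts occurrences of each token with Map.merge (Java 8+).
    public static Map<String, Integer> count(String[] tokens) {
        Map<String, Integer> map = new HashMap<>();
        for (String token : tokens) {
            // Stores 1 if the key is absent, otherwise the old value plus 1.
            map.merge(token, 1, Integer::sum);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = count(new String[] {"a", "b", "a"});
        System.out.println(map.get("a")); // 2
        System.out.println(map.get("b")); // 1
    }
}
```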

EDIT

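Another option would be to use an AtomicInteger instead of an Integer, because an AtomicInteger is mutable and can be incremented in place without putting it back into the map.

Map<String, AtomicInteger> map = new HashMap<String, AtomicInteger>();
AtomicInteger count = map.get(token);
if (count == null) { count = new AtomicInteger(); map.put(token, count); } // first occurrence
count.getAndIncrement();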
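
A minimal, self-contained sketch of the AtomicInteger variant is shown below. It assumes Java 8 or later so that `Map.computeIfAbsent` can create the counter the first time a token is seen; `AtomicCountSketch` and `count` are illustrative names, not part of the original code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class, not part of the original code.
public class AtomicCountSketch {

    // Counts occurrences of each token using mutable AtomicInteger counters.
    public static Map<String, AtomicInteger> count(String[] tokens) {
        Map<String, AtomicInteger> map = new HashMap<>();
        for (String token : tokens) {
            // Create the counter on first sight, then increment it in place.
            map.computeIfAbsent(token, k -> new AtomicInteger()).incrementAndGet();
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, AtomicInteger> map = count(new String[] {"a", "b", "a"});
        System.out.println(map.get("a").get()); // 2
        System.out.println(map.get("b").get()); // 1
    }
}
```

The AtomicInteger is used here only as a mutable counter. A plain HashMap is still not safe for concurrent modification, so this sketch is meant for single-threaded indexing.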