How do I determine the term frequency of each term in every document?
I'm building an inverted index, but I can't seem to get the correct frequencies when I check the database. I read everywhere that you should use a HashMap, but I'm not quite sure if this is the correct method of doing so. Any ideas?
public class Tokenize {
    public static void createIndex() throws Exception {
        ArrayList<Dokument> dok = new QueryHandler().getDokuments();
        ArrayList<String> queries = new ArrayList<String>();
        ArrayList<String> queries2 = new ArrayList<String>();
        HashMap<String, Integer> frek = new HashMap<String, Integer>();
        for (int d = 0; d < dok.size(); d++) {
            String token = "";
            int frekvens = 0;
            try {
                Dokument document = dok.get(d);
                StringTokenizer st = new StringTokenizer(document.dokument());
                while (st.hasMoreTokens()) {
                    token = st.nextToken();
                    token.replaceAll("[']", "");
                    token.replaceAll("[,]", "");
                    token.replaceAll("[)]", "");
                    token.replaceAll("[(]", "");
                    token.replaceAll("[.]", "");
                    frekvens++;
                    frek.put(token, frekvens);
                    queries.add("INSERT IGNORE INTO termindeks (docID, term) values ("+document.docID()+", '"+token+"')");
                    queries2.add("INSERT IGNORE INTO invertedindeks (term, docID, termfrekvens) values ('"+token+"', "+document.docID()+", "+ frekvens+")");
                }
            }
            catch (Exception e) {
                e.printStackTrace();
                System.out.println(token);
            }
        }
        String[] ffs = new String[queries.size()];
        ffs = queries.toArray(ffs);
        getDB().runQueriesIgnoreException(queries.toArray(ffs));
        String[] ffs2 = new String[queries2.size()];
        ffs2 = queries2.toArray(ffs2);
        getDB().runQueriesIgnoreException(queries2.toArray(ffs2));
    }
}
Comments (2)
丢了幸福的猪, 2024-11-08 18:07:50
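The idea is right, but as far as I can tell you are not using the HashMap correctly. You have to fetch the value currently associated with the key, increment it, and store it again, i.e.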
Integer i = map.get(token);    // null if the token has not been seen yet
i = (i == null) ? 1 : i + 1;   // start at 1, otherwise increment
map.put(token, i);
EDIT
Another option is to use AtomicInteger instead of Integer, because AtomicInteger is mutable and can be incremented in place.
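Map<String, AtomicInteger> map = new HashMap<String, AtomicInteger>();
// create the counter the first time the token is seen, then increment it in place
map.computeIfAbsent(token, k -> new AtomicInteger()).getAndIncrement();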
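To see the mutable-counter idea end to end, here is a minimal, self-contained sketch. The class name `AtomicTermCounter`, the method `count`, and the sample sentence are made up for illustration and are not part of the question's code. It assumes Java 8 or later, since `computeIfAbsent` is used to create the counter on a token's first occurrence so the lookup never returns null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class, for illustration only.
class AtomicTermCounter {

    // Counts how often each whitespace-separated token occurs in one text.
    static Map<String, AtomicInteger> count(String text) {
        Map<String, AtomicInteger> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            // Create the counter on first sight, then increment it in place.
            counts.computeIfAbsent(token, k -> new AtomicInteger()).incrementAndGet();
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, AtomicInteger> counts = count("to be or not to be");
        System.out.println(counts.get("to").get()); // prints 2
        System.out.println(counts.get("or").get()); // prints 1
    }
}
```

Because the map holds mutable counters, nothing has to be put back after the increment. In single-threaded code a plain Integer with get and put works just as well, so this is mostly a matter of taste.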
You should get the value for the token first, increment it and put it again.
Like this in your loop:
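The snippet that belonged to this answer is not preserved here. What follows is only a sketch of what "get, increment, put" could look like inside the tokenizing loop, not the answerer's actual code. The class `TermFrequencySketch` and the method `countTerms` are hypothetical names. Two further assumptions are built in: the map is created fresh for each document so counts do not leak between documents, and the result of `replaceAll` is assigned back, because Java strings are immutable and the call otherwise has no effect.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper class, for illustration only.
class TermFrequencySketch {

    // Returns term -> number of occurrences for a single document's text.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> frek = new HashMap<>(); // fresh map per document
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            // Strip ' , ( ) . in one pass and keep the returned string.
            String token = st.nextToken().replaceAll("[',().]", "");
            if (token.isEmpty()) {
                continue; // the token consisted of punctuation only
            }
            Integer current = frek.get(token);                  // get the value first
            frek.put(token, current == null ? 1 : current + 1); // increment and put it back
        }
        return frek;
    }

    public static void main(String[] args) {
        Map<String, Integer> frek = countTerms("the cat (the dog), the end.");
        System.out.println(frek.get("the")); // prints 3
        System.out.println(frek.get("end")); // prints 1
    }
}
```

With the counts collected this way, the INSERT statements would be built after the token loop, one per map entry, so that each (term, docID) row carries the final count rather than a running counter.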