Removing stop words in Java --- need help
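I'm using a method that removes stop words, defined in a file, from the query string I pass to it. The code works fine.
Now I need to handle one more case: if the query string consists only of stop words, nothing should be removed.
For example, suppose the stop-word file contains "is", "was" and "and".
If the query is "I was a student", the output should be "I a student".
But if the query is "and is", the output should stay "and is".
Below is the method I wrote to remove stop words.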
public static String removeStopWords(String query) throws UnsupportedEncodingException
{
    String[] queryTerms = query.split("&");
    String queryString = "";
    StringBuffer sb = new StringBuffer();
    // Extract the value of the "q" parameter and normalize its whitespace.
    for (int i = 0; i < queryTerms.length; i++) {
        if (queryTerms[i].startsWith("q=") && !queryTerms[i].startsWith("q.orig")) {
            queryString = queryTerms[i].replaceAll("q=", "").trim().replace("+", " ").replaceAll("\\s+", " ").trim();
        }
    }
    if (!queryString.equalsIgnoreCase("")) {
        String[] tokens = queryString.split("\\s+");
        List lStopWords = StopWordDataLoad.getlQueryStringStopword();
        List<String> lTokens = new ArrayList<String>();
        boolean noStopWord = false;
        // Keep only the tokens that are not stop words.
        for (String s : tokens)
            if (!lStopWords.contains(s)) {
                if (sb.length() == 0) sb.append(s);
                else sb.append(" ").append(s);
            } else noStopWord = true;
        queryString = sb.toString().replaceAll("\\s+", " ");
        // Return the original query if nothing is left or no stop word was found.
        if (queryString.equalsIgnoreCase("") || noStopWord == false) return query;
    }
    else return query;
    String fque = "";
    String finQue = "";
    ArrayList<String> list = new ArrayList<String>();
    // Rebuild the parameter list, replacing "q" with the filtered, URL-encoded value.
    for (int i = 0; i < queryTerms.length; i++) {
        if (queryTerms[i].startsWith("q=") && !queryTerms[i].startsWith("q.orig")) {
            fque = "q=" + URLEncoder.encode(queryString, PropertyLoader.getHttpEncoding());
            list.add(fque);
        } else if (!queryTerms[i].equalsIgnoreCase("")) list.add(queryTerms[i]);
    }
    // Join the parameters back together with "&".
    ListIterator<String> iter = list.listIterator();
    while (iter.hasNext()) {
        String str = iter.next();
        finQue = finQue + "&" + str;
    }
    return finQue.trim();
}
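The requirement described in the question (drop stop words, but leave the query alone when nothing else would remain) can be sketched on a plain, already-decoded string of tokens. This is only a minimal illustration under stated assumptions, not the original method: the class name `StopWordFilter`, the method `removeUnlessAllStopWords`, and the hard-coded stop-word set are hypothetical stand-ins for whatever `StopWordDataLoad.getlQueryStringStopword()` returns, and the `q=` parameter parsing and URL encoding are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordFilter {

    // Hypothetical stand-in for the list loaded from the stop-word file.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "was", "and"));

    /**
     * Removes stop words from a whitespace-separated query. If every token
     * is a stop word, the whitespace-normalized query is returned as is.
     */
    public static String removeUnlessAllStopWords(String text) {
        String normalized = text.trim().replaceAll("\\s+", " ");
        if (normalized.isEmpty()) {
            return normalized;
        }
        List<String> kept = new ArrayList<>();
        for (String token : normalized.split(" ")) {
            // Case-sensitive match, as with List.contains in the question's code.
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        // Nothing survived: the query held only stop words, so keep it.
        return kept.isEmpty() ? normalized : String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(removeUnlessAllStopWords("I was a student")); // I a student
        System.out.println(removeUnlessAllStopWords("and is"));          // and is
    }
}
```

Collecting the surviving tokens in a list and checking `kept.isEmpty()` at the end keeps the "only stop words" case in one place, separate from the filtering loop.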
Comments (1)
Just change the last line to this: