Removing items from a set that don't match the criteria
For a school project, the goal is to do a fuzzy match of a query string to a lyric string inside a Song object. The overall data structure is a TreeMap of unique words paired with sets of songs that contain that word in the lyrics.
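A minimal sketch of the index described above, to make the structure concrete. The `Song` class, the `tokenize` rule (lower-case, split on anything that is not a letter, digit or apostrophe) and the method names are assumptions for illustration, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class LyricIndexSketch {
    /** Hypothetical stand-in for the project's Song class. */
    static final class Song {
        final String title;
        final String lyrics;

        Song(String title, String lyrics) {
            this.title = title;
            this.lyrics = lyrics;
        }
    }

    /** Lower-cases the text and returns its non-empty words (assumed tokenizing rule). */
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Maps each unique word to the set of songs whose lyrics contain it. */
    static Map<String, Set<Song>> buildIndex(Iterable<Song> songs) {
        Map<String, Set<Song>> index = new TreeMap<>();
        for (Song song : songs) {
            for (String word : tokenize(song.lyrics)) {
                index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(song);
            }
        }
        return index;
    }

    /** Songs containing every query word, in any order: the preliminary match set. */
    static Set<Song> candidates(Map<String, Set<Song>> index, String query) {
        Set<Song> result = null;
        for (String word : tokenize(query)) {
            Set<Song> hits = index.getOrDefault(word, Collections.<Song>emptySet());
            if (result == null) {
                result = new LinkedHashSet<>(hits); // copy so the index is not modified
            } else {
                result.retainAll(hits);             // intersect with the next word's songs
            }
        }
        return result == null ? new LinkedHashSet<Song>() : result;
    }
}
```

Intersecting the per-word sets only guarantees that all the words are present somewhere in the lyrics; it says nothing about their order, which is why a second filtering pass is still needed.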
I have my preliminary match set of songs that contain the query string. The twist here is that I have to assign each result song a rank based on the number of characters in the match section, spaces inclusive. For example, searching for "she loves you" returns these among the matches:
"... She loves you ..." The Beatles, rank= 13
"... She just loves you ..." Bonnie Raitt, rank=18
"... She loves me, well you ..." Elvis Presley, rank=23
The code I'm using to sort the results is:
    for (int i = 0; i < lyrics.length; i++) {
        if (lyrics[i].equals(query[0])) { // got the start point
            start = i; // adjust the start index point
            // loop through lyrics from start point
            for (int j = 1; j < query.length; j++) {
                if (lyrics[j].equals(query[query.length - 1])) {
                    end = i; // found the last word
                }
                // if next lyric word doesn't match this query word
                if (!lyrics[i + j].equals(query[j])) {
                    // advance loop through lyrics. when a match is found, i is adjusted to
                    // the match index
                    for (int k = i + j + 1; k < lyrics.length; k++) {
                        if (lyrics[k].equals(query[j]) || lyrics[k].equals(query[0]))
                            i = k++;
                    } // end inner advance loop
                } // end query string test
            } // end query test loop
            song.setRanks(start, end); // start and end points for the rank algorithm.
        } // end start point test
    } // end lyrics loop
Since every song in the result set contains the query words in no particular order, not all of them should appear in the result printout. Using this algorithm, how can I set a trigger to remove a song from the set if the query is not matched at any length?
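One possible trigger, sketched under assumptions: let the ranking routine return a sentinel such as -1 when the query words never occur in order, and remove the song through the set's `Iterator` when that happens. `Song`, `bestRank` and `rankAndPrune` are hypothetical names, not the project's API; lyrics are assumed to be lower-cased and split on single spaces, and punctuation is not stripped.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class MatchFilterSketch {
    /** Hypothetical stand-in for the project's Song class. */
    static final class Song {
        final String title;
        final String[] lyricWords; // lower-cased lyrics split on single spaces
        int rank = -1;             // -1 means "no in-order match found"

        Song(String title, String lyrics) {
            this.title = title;
            this.lyricWords = lyrics.toLowerCase(Locale.ROOT).split(" ");
        }
    }

    /**
     * Returns the smallest character count (spaces included) of a stretch of
     * lyrics that contains all query words in order, or -1 if there is none.
     */
    static int bestRank(String[] lyrics, String[] query) {
        if (query.length == 0) {
            return -1;
        }
        int best = -1;
        for (int start = 0; start < lyrics.length; start++) {
            if (!lyrics[start].equals(query[0])) {
                continue; // a match can only begin at the first query word
            }
            int next = 1;                       // index of the next query word to find
            int chars = lyrics[start].length(); // characters covered so far
            int i = start;
            while (next < query.length && ++i < lyrics.length) {
                chars += 1 + lyrics[i].length(); // one space plus the word
                if (lyrics[i].equals(query[next])) {
                    next++;
                }
            }
            if (next == query.length && (best < 0 || chars < best)) {
                best = chars;
            }
        }
        return best;
    }

    /** Stores each song's rank and drops songs that have no in-order match. */
    static void rankAndPrune(Set<Song> candidates, String[] query) {
        for (Iterator<Song> it = candidates.iterator(); it.hasNext(); ) {
            Song song = it.next();
            song.rank = bestRank(song.lyricWords, query);
            if (song.rank < 0) {
                it.remove(); // Iterator.remove is safe while iterating
            }
        }
    }

    public static void main(String[] args) {
        Set<Song> candidates = new LinkedHashSet<>();
        candidates.add(new Song("A", "She loves you"));
        candidates.add(new Song("B", "She just loves you"));
        candidates.add(new Song("C", "You loves she"));
        rankAndPrune(candidates, new String[] {"she", "loves", "you"});
        for (Song s : candidates) {
            System.out.println(s.title + " rank=" + s.rank);
        }
    }
}
```

With the three sample songs in `main`, A and B keep ranks 13 and 18 and C is removed. Removing through the iterator avoids the `ConcurrentModificationException` that calling `candidates.remove(song)` inside a for-each loop would raise.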
Edit: Is Lucene a solution to this? This is a gray area in the project, and one I will bring up in class tomorrow. Our instructor is allowing us to choose whatever data structures we like for this project, but I don't know if using another implementation for string matching will pass muster.
Edit 2, @belisarius: I don't see how edit distance applies here. The most common application of Levenshtein distance requires a String a of length n and a String b of length m, and the distance is the number of edits required to make a equal b. For this project, all that is required is the rank, in characters, of a match whose start and end points are unknown. With some changes to the code posted above, I am finding the start and end points accurately. What I need is a way to remove the non-matches from the set if the lyrics don't fit the search in any fashion.
Comments (2)
You probably want to have a look at the Levenshtein distance. The Apache commons-lang library implemented it in version 2.1, in the StringUtils class.
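For reference, the distance itself is small enough to write without a library. The following is a plain-Java sketch of the standard dynamic-programming formulation, not the commons-lang source; the class and method names are made up for illustration.

```java
class LevenshteinSketch {
    /** Number of single-character insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row of the table
        int[] curr = new int[b.length() + 1]; // distances for the row being filled
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `distance("she loves you", "she just loves you")` is 5, the length of the inserted "just ". Note that this compares two whole strings, whereas the ranking in the question measures the span of a match inside much longer lyrics.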
A Patricia trie might just do it for you.
Have a look at this one and see if it has what you need.
http://code.google.com/p/patricia-trie/