Finding the part of a string that contains another string, plus possible words in between

Posted 2024-10-01 18:16:02

For the last project of the semester, the goal is to run searches for a particular phrase on a lyrics String inside a Song object, then rank the results by the length of the matching substring. The lyrics are read from a file and keep that file's line breaks.

For example, searching for "She loves you" would return matches such as these:

The Beatles: "... She loves you, yeah, yeah, yeah ..." Rank= 13 characters
Bonnie Raitt: "... She just loves you ..." Rank= 18 characters
Elvis Presley: "... You're asking if she loves me\r\nWell, you don't know..." Rank= 23 characters

As you can see from the last example, matches can span multiple lines.
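To make the ranking concrete: in all three examples the rank equals the number of characters from the first matched word through the last, counting spaces, punctuation and the `\r\n` line break. A minimal sketch of that calculation follows; `RankExample` and `rank` are hypothetical names, not part of the project described here.

```java
class RankExample {
    /**
     * Rank of a match: the number of characters in lyrics[start, endExclusive),
     * including spaces, punctuation and line-break characters.
     */
    static int rank(String lyrics, int start, int endExclusive) {
        return lyrics.substring(start, endExclusive).length();
    }

    /**
     * Convenience helper (an assumption for illustration): rank of the span that
     * begins at the first occurrence of firstWord and ends with the next
     * occurrence of lastWord. Returns -1 if either word is missing.
     */
    static int rankBetween(String lyrics, String firstWord, String lastWord) {
        int start = lyrics.indexOf(firstWord);
        if (start < 0) return -1;
        int last = lyrics.indexOf(lastWord, start);
        if (last < 0) return -1;
        return rank(lyrics, start, last + lastWord.length());
    }
}
```

With the Elvis Presley line, `rankBetween("You're asking if she loves me\r\nWell, you don't know", "she", "you")` spans "she loves me\r\nWell, you" and returns 23, which matches the rank quoted above.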

I have all the songs in a TreeMap<String, TreeSet<Song>>, so I get all the songs that match the first word in the query. The difficulty I'm having is searching the String for matches, since a regex won't work in this situation.

When the Song object is constructed, I dump the lyrics into a Set so I can search for a single word. To build it I use String.split("[^a-zA-Z]") to separate out the individual words and weed out the punctuation marks. I want to run my phrase search on the array that split returns. The process I'm using goes like this:
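A minimal sketch of that tokenization step, assuming the intended pattern is a character class that matches anything other than an ASCII letter. The `+` quantifier, the empty-string filter and the lower-casing are additions for illustration; `Tokenizer` and `words` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class Tokenizer {
    /** Splits lyrics into lower-case words, discarding punctuation and whitespace. */
    static List<String> words(String lyrics) {
        return Arrays.stream(lyrics.split("[^a-zA-Z]+"))  // '+' collapses runs of delimiters
                .filter(w -> !w.isEmpty())                // split yields a leading "" if the text starts with a delimiter
                .map(w -> w.toLowerCase(Locale.ROOT))     // assumption: matching is case-insensitive
                .collect(Collectors.toList());
    }
}
```

One consequence of splitting on every non-letter is that contractions break apart: "You're" becomes the two tokens "you" and "re".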

break up the query into a String array
  for each Song in the set
    if (song.lyrics.contains(query))
      great, break loop to next song

    otherwise
      int queryCounter = 0;
      find the first index in the lyrics String array that matches query[queryCounter]
        using that as the start point, iterate through the String array for matches
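The "otherwise" branch of this process can be sketched roughly as follows, assuming the lyrics and the query have already been split into word arrays with matching case. `GreedyMatch` and `greedySpan` are hypothetical names; the real project code is not shown in the question.

```java
class GreedyMatch {
    /**
     * Anchors on the first occurrence of query[0], then scans forward and
     * takes each remaining query word at its earliest position.
     * Returns {startIndex, endIndex} into words, or null if there is no match.
     */
    static int[] greedySpan(String[] words, String[] query) {
        if (query.length == 0) return null;
        int queryCounter = 0; // index of the next query word to find
        int start = -1;
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(query[queryCounter])) {
                if (queryCounter == 0) start = i; // fixed at the first hit, never revisited
                queryCounter++;
                if (queryCounter == query.length) return new int[] {start, i};
            }
        }
        return null;
    }
}
```

For the words of "she just loves you" and the query "she loves you" this returns {0, 3}. Because the start point is fixed at the first occurrence of the first query word, a later and tighter match in the same song is never considered.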

When the iteration is complete, a Rank object is created to hold the Song, the search phrase, and the start and end points of the array section that matches. The Rank object has a method that counts the number of characters, compensating for whitespace, to calculate the rank. The Rank is then inserted into a PriorityQueue, from which the top ten matches out of the original matchSet are pulled.
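A sketch of how such a Rank object and queue might fit together. Everything here is an assumption for illustration: the real Rank class is not shown, so this stand-in stores a label and character offsets, and orders itself so that shorter spans come out of the PriorityQueue first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class RankQueue {
    /** Hypothetical stand-in for the Rank object: a song label plus the matched character span. */
    static final class Rank implements Comparable<Rank> {
        final String song;
        final int start;        // offset of the first matched character
        final int endExclusive; // offset just past the last matched character

        Rank(String song, int start, int endExclusive) {
            this.song = song;
            this.start = start;
            this.endExclusive = endExclusive;
        }

        int length() {
            return endExclusive - start;
        }

        @Override
        public int compareTo(Rank other) {
            return Integer.compare(length(), other.length()); // shorter span ranks first
        }
    }

    /** Removes and returns up to limit song labels from the queue, best match first. */
    static List<String> top(PriorityQueue<Rank> queue, int limit) {
        List<String> songs = new ArrayList<>();
        while (songs.size() < limit && !queue.isEmpty()) {
            songs.add(queue.poll().song);
        }
        return songs;
    }
}
```

Note that `top` drains the entries it returns; PriorityQueue only guarantees ordering through `poll`, not through iteration, so the results have to be polled one at a time.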

The problem is that this doesn't prevent false positives, and match ranks can get skewed. For example, Aerosmith's "Beyond Beautiful" contains "... she loves me she loves you not ...". My process matches the whole span "she loves me she loves you", from the first "she" through "you", rather than the shorter "she loves you" inside it, so instead of a rank of 13 I get a rank of 27.
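One possible way to avoid the inflated span, offered as a sketch rather than a definitive fix: try every occurrence of the first query word as a starting point, match the remaining words forward from each one, and keep the shortest resulting span. The names `ShortestSpan` and `shortestRank` are hypothetical, and lower-casing both sides is an assumption. Word offsets come from `java.util.regex.Matcher`, so the rank is measured on the original lyrics, punctuation and line breaks included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortestSpan {
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    /**
     * Returns the character length of the shortest span of lyrics that
     * contains the query words in order, or -1 if there is no such span.
     */
    static int shortestRank(String lyrics, String query) {
        List<String> words = new ArrayList<>();
        List<int[]> offsets = new ArrayList<>(); // {start, endExclusive} of each word
        Matcher m = WORD.matcher(lyrics);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
            offsets.add(new int[] {m.start(), m.end()});
        }
        List<String> q = new ArrayList<>();
        Matcher qm = WORD.matcher(query);
        while (qm.find()) q.add(qm.group().toLowerCase(Locale.ROOT));
        if (q.isEmpty()) return -1;

        int best = -1;
        for (int start = 0; start < words.size(); start++) {
            if (!words.get(start).equals(q.get(0))) continue; // candidate starts only
            int next = 1;    // index of the next query word to find
            int end = start; // index of the last matched lyric word
            for (int i = start + 1; i < words.size() && next < q.size(); i++) {
                if (words.get(i).equals(q.get(next))) {
                    next++;
                    end = i;
                }
            }
            if (next == q.size()) { // every query word was found from this start
                int length = offsets.get(end)[1] - offsets.get(start)[0];
                if (best < 0 || length < best) best = length;
            }
        }
        return best;
    }
}
```

For "she loves me she loves you not" and the query "She loves you", the start at the first "she" yields a long span, while the start at the second "she" yields "she loves you", so the method returns 13. The scan is quadratic in the worst case, which is likely acceptable for song-length text.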

What changes are necessary for me to weed out the false positives and incorrect rankings?
