Finding the part of a String that contains another String, with possible intervening words

Posted 2024-10-01 18:16:02

For the last project of the semester, the goal is to run searches of a particular phrase on a lyric String inside a Song object, then rank the results based on the length of the substring match. The lyrics were read from a file and keep the line breaks of that file.

For example, searching for "She loves you" would return these in the sample matches:

The Beatles: "... She loves you, yeah, yeah, yeah ..." Rank= 13 characters
Bonnie Raitt: "... She just loves you ..." Rank= 18 characters
Elvis Presley: "... You're asking if she loves me\r\nWell, you don't know..." Rank= 23 characters

As you can see from the last example, matches can span multiple lines.

I have all the songs in a TreeMap<String, TreeSet<Song>>, so I get all the songs that match the first word in the query. The difficulty I'm having is searching the String for matches, since a regex won't work in this situation.

When the Song object is constructed, I dump the lyrics into a Set so I can run single-word searches; to do that I use String.split("[^a-zA-Z]") to separate out the individual words and weed out the punctuation marks. I want to run my search on the resulting array of words. The process I'm using goes like:

break up the query into a String array
  for each Song in the set
    if (song.lyrics.contains(query))
      great, break loop to next song

    otherwise
      int queryCounter = 0;
      find first index point in String array that matches query[queryCounter]
        using that as the start point, iterate through the String array for matches

When the iteration is complete, a Rank object is created to hold the Song, the search phrase, and the start and end points of the array section that matches. The Rank object has a method that counts the number of characters and compensates for whitespace to calculate the rank. The Rank is then inserted into a PriorityQueue, from which the top ten matches from the original matchSet will be pulled.
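A minimal sketch of the ranking step described above, assuming a smaller rank (shorter match) is better. The `Rank` class here is a hypothetical stand-in with only a title and a length, not the actual class from the project, and `top` is an illustrative helper name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class RankQueueSketch {
    /** Hypothetical stand-in for the Rank object: a song title and its match length. */
    static final class Rank {
        final String title;
        final int length;

        Rank(String title, int length) {
            this.title = title;
            this.length = length;
        }
    }

    /** Returns up to {@code limit} titles, shortest match first. */
    static List<String> top(List<Rank> ranks, int limit) {
        // The queue orders Rank objects by match length, smallest first.
        PriorityQueue<Rank> queue =
                new PriorityQueue<>(Comparator.comparingInt((Rank r) -> r.length));
        queue.addAll(ranks);

        List<String> result = new ArrayList<>();
        while (!queue.isEmpty() && result.size() < limit) {
            result.add(queue.poll().title);
        }
        return result;
    }
}
```

Calling `top(ranks, 10)` would give the ten best matches. Ties between equal lengths come out in no guaranteed order, so a secondary comparison (for example on title) would be needed if a stable order matters.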

The problem is that this doesn't prevent false positives, and match ranks can get skewed. For example, Aerosmith's Beyond Beautiful contains "... she loves me she loves you not ...". With my process, the match starts at the first "she" and runs through to "you", covering "she loves me she loves you", so instead of a rank of 13, I will get a rank of 27.

What changes are necessary for me to weed out the false positives and incorrect rankings?

Comments (1)

亢潮 2024-10-08 18:16:02

I would like to add to what jjinguy said:

Basically, in the 'otherwise' block, after you find the first index that matches the start, you also have to look for other possible start points, and reset your start if you find another one.

I would keep a list of all possible matches in a song, and finally use the one that has the best rank. Simply resetting the start point might not catch the match with the best rank.

Maybe that isn't the best way, but the concern is still there.
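The idea of trying every possible start point and keeping the best-ranked match could be sketched as follows. This is only one possible reading, under stated assumptions: words are runs of ASCII letters, comparison is case-insensitive, and the rank is the number of characters in the raw lyrics from the start of the first matched word to the end of the last one (which reproduces the 13, 18 and 23 in the question's examples). The names `PhraseSpanFinder` and `bestRank` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhraseSpanFinder {
    // Assumption: a word is a run of ASCII letters, as in the question's split pattern.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Returns the length of the shortest in-order match of the query words, or -1 if none. */
    static int bestRank(String lyrics, String query) {
        // Record every lyric word together with its character offsets.
        List<String> words = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        List<Integer> ends = new ArrayList<>();
        Matcher m = WORD.matcher(lyrics);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
            starts.add(m.start());
            ends.add(m.end());
        }

        List<String> q = new ArrayList<>();
        Matcher qm = WORD.matcher(query);
        while (qm.find()) {
            q.add(qm.group().toLowerCase(Locale.ROOT));
        }
        if (q.isEmpty()) {
            return -1;
        }

        int best = -1;
        // Try every occurrence of the first query word as a start point.
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equals(q.get(0))) {
                continue;
            }
            int qi = 1; // index of the next query word to find
            int j = i;  // index of the last lyric word examined
            while (qi < q.size() && ++j < words.size()) {
                if (words.get(j).equals(q.get(qi))) {
                    qi++;
                }
            }
            if (qi == q.size()) {
                // j now points at the last matched word.
                int length = ends.get(j) - starts.get(i);
                if (best < 0 || length < best) {
                    best = length;
                }
            }
        }
        return best;
    }
}
```

For a fixed start point, taking the earliest occurrence of each following query word gives the shortest span from that start, so the minimum over all start points is the shortest match in the song. With the Aerosmith line from the question, the scan from the second "she" yields 13, which replaces the longer span found from the first "she".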
