Using LIKE to select all similar rows that may be duplicates?
After importing information about songs into my SQLite database, I want to use a SELECT statement to find all possible duplicate songs using these criteria:
The songName in one row is similar or equal to the songName in any other row of the same table (Songs), AND the artistID is the same in both rows. This should work without knowing the contents of songName in advance. If I wanted to compare a known song name against all others in the database, I could use "songName LIKE '%known name%'", but how do I find all duplicates without a known name?
Example Songs Table:
id  songName           artistID  duration
-----------------------------------------
0   This is a song     5         3:43
1   Another song       3         3:23
2   01-This is a song  5         3:42
3   song               4         4:01
4   song               4         6:33
5   Another record     2         2:45
Expected Results:
id  songName           artistID  duration
-----------------------------------------
0   This is a song     5         3:43
2   01-This is a song  5         3:42
3   song               4         4:01
4   song               4         6:33
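One way to express the criteria directly, without a hash column, is a self-join that applies LIKE in both directions between rows of the same artist. The following is only a sketch in Python (using the standard sqlite3 module against an in-memory copy of the example table); it treats "similar" as "one name contains the other", which is an assumption, and any % or _ characters inside a song name would act as wildcards.

```python
import sqlite3

# In-memory copy of the example Songs table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT)"
)
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43"),
        (1, "Another song", 3, "3:23"),
        (2, "01-This is a song", 5, "3:42"),
        (3, "song", 4, "4:01"),
        (4, "song", 4, "6:33"),
        (5, "Another record", 2, "2:45"),
    ],
)

# Pair every row with every other row of the same artist and keep it
# when either name contains the other (assumed meaning of "similar").
QUERY = """
SELECT DISTINCT a.id, a.songName, a.artistID, a.duration
FROM Songs AS a
JOIN Songs AS b
  ON a.artistID = b.artistID
 AND a.id <> b.id
 AND (a.songName LIKE '%' || b.songName || '%'
      OR b.songName LIKE '%' || a.songName || '%')
ORDER BY a.id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On the example data this returns rows 0, 2, 3 and 4. The join compares every pair of rows per artist, so it may become slow on a large table.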
EDIT:
Since the idea of creating a hash and comparing hashes has been suggested, I'm thinking about using this pseudo-function to create a hash for each song name:
'needs Imports System.Text.RegularExpressions at the top of the file
'removeNoiseWords, number2word and removeUnsupChars are helpers that still have to be written
Public Function createHash(ByVal phrase As String) As String
    'convert to lower case
    phrase = LCase(phrase)
    'split the phrase into words
    Dim words() As String = phrase.Replace("_", " ").Split(" "c)
    Dim hash As String = ""
    For w = 0 To words.Length - 1
        'remove noise words (a, an, the, etc.)
        words(w) = removeNoiseWords(words(w))
        'convert 1 or 2-digit numbers to corresponding words
        words(w) = number2word(words(w))
    Next
    'rebuild using replaced words and remove spaces
    hash = String.Join("", words)
    'convert upper ascii into alphabetic (i.e. ñ = n, Ö = O, etc.)
    hash = removeUnsupChars(hash, True)
    'strip away all remaining non-alphanumeric characters
    hash = Regex.Replace(hash, "[^A-Za-z0-9]", "")
    Return hash
End Function
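For comparison, the same normalisation steps can be sketched in Python with the helper functions filled in. The noise-word list, the number_to_word helper and the use of Unicode decomposition to strip accents are all assumptions made for illustration, not the helpers the pseudo-function refers to.

```python
import re
import unicodedata

# Assumed noise-word list; extend as needed.
NOISE_WORDS = {"a", "an", "the"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_word(word):
    """Spell out a word that is a 1- or 2-digit number; return others unchanged."""
    if not re.fullmatch(r"[0-9]{1,2}", word):
        return word
    n = int(word)
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + (ONES[n % 10] if n % 10 else "")


def create_hash(phrase):
    """Normalise a song name into a comparison key."""
    # lower-case, treat underscores as spaces, split into words
    words = phrase.lower().replace("_", " ").split()
    # drop noise words and spell out short numbers
    words = [number_to_word(w) for w in words if w not in NOISE_WORDS]
    joined = "".join(words)
    # decompose accented letters (ñ -> n + combining tilde) and drop non-ASCII
    ascii_only = (unicodedata.normalize("NFKD", joined)
                  .encode("ascii", "ignore").decode("ascii"))
    # strip everything that is not a letter or digit
    return re.sub(r"[^A-Za-z0-9]", "", ascii_only)


print(create_hash("The Señor_Song"))  # senorsong
print(create_hash("Track 7"))         # trackseven
```

Because the phrase is split on whitespace only, a prefix such as "01-" stays attached to the following word and is not spelled out; whether that is desirable depends on the data.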
Once the hash is computed, I'll store it with each record and then select duplicates with a count(hash) > 1 condition. I'll then use .NET code to check whether the artistID is the same for the returned records.
This solution seems to be working fine so far. Here is the SQLite statement I use to find the duplicate songs:
SELECT COUNT(*), hash FROM Songs GROUP BY hash HAVING COUNT(hash) > 1 ORDER BY hash;
This gives me a list of all the hashes that occur more than once. I store these results in an array, then cycle through the array and simply use this statement to get the detailed info:
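'open the connection once instead of once per hash
SQLconnect.Open()
For i = 0 To dupeHashes.Count - 1
    SQLcommand = SQLconnect.CreateCommand
    'bind the hash as a parameter rather than concatenating it into the SQL text
    '(AddWithValue assumes a System.Data.SQLite-style provider)
    SQLcommand.CommandText = "SELECT * FROM Songs WHERE hash = @hash;"
    SQLcommand.Parameters.AddWithValue("@hash", dupeHashes(i))
    SQLreader = SQLcommand.ExecuteReader()
    While SQLreader.Read()
        'get whatever data needed for each duplicate song
    End While
    SQLreader.Close()
    SQLcommand.Dispose()
Next
SQLconnect.Close()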
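The two steps (find the repeated hashes, then fetch the rows for each one) could probably be merged into a single statement that also applies the artistID condition in SQL. The sketch below uses Python's sqlite3 module and an in-memory table; the hash values are made up for illustration and row 6 is an extra, hypothetical row added to show that the same hash under a different artist is not reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT, hash TEXT)"
)
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?, ?)",
    [
        # hash values are illustrative, not produced by a real function
        (0, "This is a song", 5, "3:43", "thisissong"),
        (1, "Another song", 3, "3:23", "anothersong"),
        (2, "01-This is a song", 5, "3:42", "thisissong"),
        (3, "song", 4, "4:01", "song"),
        (4, "song", 4, "6:33", "song"),
        (5, "Another record", 2, "2:45", "anotherrecord"),
        (6, "Song", 9, "2:00", "song"),  # hypothetical: same hash, other artist
    ],
)

# Group by artist and hash, keep groups with more than one row,
# then join back to Songs to get the full duplicate records.
QUERY = """
SELECT s.id, s.songName, s.artistID, s.duration, s.hash
FROM Songs AS s
JOIN (
    SELECT artistID, hash
    FROM Songs
    GROUP BY artistID, hash
    HAVING COUNT(*) > 1
) AS d
  ON s.artistID = d.artistID AND s.hash = d.hash
ORDER BY s.hash, s.id
"""

dupes = conn.execute(QUERY).fetchall()
for row in dupes:
    print(row)
```

Grouping on artistID as well as hash removes the need to compare artists in application code, and an index on (artistID, hash) may help on larger tables.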
Comments (2)
Personally, I would add an extra field in which you store some sort of "hash" of the title. A good function for this would strip away every non-alphabetic character, including spaces, drop any articles (such as "the", "a", "an"), then compute the Soundex code of the title and prefix it with the artistId string.
So in your case you would get:
From there, getting only the rows having ...count(Hash)>1 should be easy...
Note also that I suggest Soundex, but you can write your own function, or adapt an existing one, so that some elements carry more weight than others.
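As a rough illustration of this idea, here is a small Python sketch of a key built from the artist ID and a Soundex code. It implements the common American Soundex rules by hand; the "-" separator, the article list and the function names are assumptions. Soundex only encodes the first letter and three following consonant groups, so quite different titles can share a key.

```python
# Letter-to-digit table for American Soundex.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

ARTICLES = {"the", "a", "an"}  # assumed list of articles to drop


def soundex(text):
    """Return the 4-character Soundex code of the ASCII letters in text."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def title_key(artist_id, title):
    """Build '<artistId>-<soundex>' after dropping articles from the title."""
    words = [w for w in title.lower().split() if w not in ARTICLES]
    return f"{artist_id}-{soundex(''.join(words))}"


print(title_key(5, "This is a song"))     # 5-T225
print(title_key(5, "01-This is a song"))  # 5-T225
print(title_key(4, "song"))               # 4-S520
```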
I get the general idea of the question, but one point needs to be clarified:
Why does the result not include the record
1   Another song       3         3:23
since it could be treated as a duplicate of these records?
3   song               4         4:01
4   song               4         6:33
I just wrote a simple script in T-SQL to solve this; it is not efficient, so treat it only as a reference.