Using LIKE to select all similar rows that may be duplicates?

Posted 2024-10-31 06:18:44


After importing information about songs into my SQLite database, I want to use a SELECT statement to find all possible duplicate songs using the following criteria:

The songName in one row is similar or equal to the songName in any other row of the same table (Songs), AND artistID is the same in both rows. This should work without knowing the contents of songName in advance. If I wanted to compare a known song name against all others in the database, I could use "songName LIKE '%known name%'", but how do I find all duplicates without a known name?

Example Songs Table:

id  songName            artistID  duration
--------------------------------------------
0   This is a song      5         3:43
1   Another song        3         3:23
2   01-This is a song   5         3:42
3   song                4         4:01
4   song                4         6:33
5   Another record      2         2:45

Expected Results:

id  songName            artistID  duration
--------------------------------------------
0   This is a song      5         3:43
2   01-This is a song   5         3:42
3   song                4         4:01
4   song                4         6:33

EDIT:

Since the idea of creating a hash and comparing hashes has been suggested, I'm thinking about using this pseudo-function to create a hash for each song name:

Imports System.Text.RegularExpressions

Public Function createHash(ByVal phrase As String) As String
    'convert to lower case
    phrase = LCase(phrase)

    'split the phrase into words, treating underscores as spaces
    Dim words() As String = phrase.Replace("_", " ").Split(" "c)

    'removeNoiseWords, number2word and removeUnsupChars are helpers not shown here
    For w As Integer = 0 To words.Length - 1
        'remove noise words (a, an, the, etc.)
        words(w) = removeNoiseWords(words(w))
        'convert 1 or 2-digit numbers to corresponding words
        words(w) = number2word(words(w))
    Next

    'rebuild using replaced words and remove spaces
    Dim hash As String = String.Join("", words)

    'convert upper ascii into alphabetic (e.g. ñ = n, Ö = O, etc.)
    hash = removeUnsupChars(hash, True)

    'strip away all remaining non-alphanumeric characters
    hash = Regex.Replace(hash, "[^A-Za-z0-9]", "")
    Return hash
End Function
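
For readers who want to try the idea outside .NET, here is a rough Python port of the pseudo-function above. It is only a sketch: the noise-word list, the number spelling, and the accent folding via unicodedata are my assumptions standing in for the removeNoiseWords, number2word and removeUnsupChars helpers, which the post does not show.

```python
import re
import unicodedata

# Assumed noise-word list; the post only says "a, an, the, etc."
NOISE_WORDS = {"a", "an", "the"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_word(word):
    """Spell out a 1- or 2-digit number; return any other word unchanged."""
    if not re.fullmatch(r"[0-9]{1,2}", word):
        return word
    n = int(word)
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (ONES[ones] if ones else "")


def create_hash(phrase):
    """Normalize a song name into a comparison key (hypothetical port)."""
    # Lower-case, treat underscores as spaces, split into words
    words = phrase.lower().replace("_", " ").split()
    # Drop noise words and spell out short numbers
    words = [number_to_word(w) for w in words if w not in NOISE_WORDS]
    joined = "".join(words)
    # Fold accented letters to plain ASCII (e.g. ñ -> n)
    decomposed = unicodedata.normalize("NFKD", joined)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Strip everything that is not a letter or digit
    return re.sub(r"[^a-z0-9]", "", ascii_only)
```

One thing this port makes visible: "This is a song" becomes "thisissong", but "01-This is a song" becomes "01thisissong", because the track prefix is glued to the first word and is not a standalone number. Matching rows 0 and 2 of the example would need an extra step that strips such prefixes, which the pseudo-function does not describe.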

Once the hash is computed, I'll store it with each record and then select duplicates using count(hash) > 1. I'll then use .NET code to check whether the artistID is the same for the returned records.

This solution seems to be working fine so far. Here is the SQLite statement I use to find the duplicate songs:

SELECT count(*), hash FROM Songs GROUP BY hash HAVING count(hash) > 1 ORDER BY hash;
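
As a possible variation, the artistID check could be moved into the query itself by grouping on both columns, so that no follow-up filtering in application code is needed. The sketch below uses Python's sqlite3 module with an in-memory copy of the example table; the hash values are hand-written placeholders (assuming the track prefix has already been stripped), and row 6 is an extra row I added to show that equal hashes from different artists are not grouped together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, hash TEXT)"
)
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "thisissong"),
        (1, "Another song", 3, "anothersong"),
        (2, "01-This is a song", 5, "thisissong"),  # assumed: prefix stripped
        (3, "song", 4, "song"),
        (4, "song", 4, "song"),
        (5, "Another record", 2, "anotherrecord"),
        (6, "Song", 9, "song"),  # hypothetical row: same hash, other artist
    ],
)


def duplicate_groups(connection):
    """Return (artistID, hash, count) for every hash an artist has twice or more."""
    return connection.execute(
        "SELECT artistID, hash, count(*) FROM Songs "
        "GROUP BY artistID, hash HAVING count(*) > 1 "
        "ORDER BY artistID, hash"
    ).fetchall()


groups = duplicate_groups(conn)
```

With this data, only artist 4's "song" pair and artist 5's "thisissong" pair come back; the row for artist 9 shares a hash with artist 4 but is left out.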

This gives me a list of all the hashes that occur more than once. I store these results in an array, then cycle through the array and simply use this statement to get the detailed info:

    'open the connection once and reuse a single parameterized command
    SQLconnect.Open()
    SQLcommand = SQLconnect.CreateCommand
    SQLcommand.CommandText = "SELECT * FROM Songs WHERE hash = @hash;"
    For i = 0 To dupeHashes.Count - 1
        SQLcommand.Parameters.Clear()
        SQLcommand.Parameters.AddWithValue("@hash", dupeHashes(i))
        SQLreader = SQLcommand.ExecuteReader()
        While SQLreader.Read()
            'get whatever data needed for each duplicate song
        End While
        SQLreader.Close()
    Next
    SQLcommand.Dispose()
    SQLconnect.Close()
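
The per-hash loop could also be collapsed into one statement by joining the table against its own duplicate groups. This is a sketch rather than a drop-in replacement: it uses Python's sqlite3 module for illustration, assumes a stored hash column as described above, and the hash values in the sample data are hand-written placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT, hash TEXT)"
)
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43", "thisissong"),
        (1, "Another song", 3, "3:23", "anothersong"),
        (2, "01-This is a song", 5, "3:42", "thisissong"),  # assumed hash
        (3, "song", 4, "4:01", "song"),
        (4, "song", 4, "6:33", "song"),
        (5, "Another record", 2, "2:45", "anotherrecord"),
    ],
)


def duplicate_rows(connection):
    """Fetch the full rows of every duplicate group in a single query."""
    return connection.execute(
        """
        SELECT s.id, s.songName, s.artistID, s.duration
        FROM Songs AS s
        JOIN (SELECT artistID, hash
              FROM Songs
              GROUP BY artistID, hash
              HAVING count(*) > 1) AS d
          ON d.artistID = s.artistID AND d.hash = s.hash
        ORDER BY s.artistID, s.hash, s.id
        """
    ).fetchall()


rows = duplicate_rows(conn)
```

Ordering by artistID and hash keeps the members of each duplicate group next to each other, which makes the result easier to walk through in application code.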


青瓷清茶倾城歌 2024-11-07 06:18:44

Personally, I would add an extra field in which you store some sort of "hash" of the title. A good function for this would drop any articles (like "the", "a", "an"), strip every non-alphabetic character, including spaces, then compute the Soundex code of the title and prefix it with the artistID string.

So in your case you would get:

id  songName            artistID  duration  Hash
----------------------------------------------------
0   This is a song      5         3:43      5.T0021
1   Another song        3         3:23      3.A9872
2   01-This is a song   5         3:42      5.T0021
3   song                4         4:01      4.S0332
4   song                4         6:33      4.S0332
5   Another record      2         2:45      2.A7622

From there, getting only the rows with count(Hash) > 1 should be easy.

Note also that I suggest Soundex, but you can write your own function, or adapt an existing one, so that some elements carry more weight than others.
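
To make the suggestion concrete, here is a small Python sketch of such a key. The soundex function follows the common American Soundex rules (first letter plus three digits); the article list and the title_key helper name are my own assumptions. The codes it produces are real Soundex codes, so they differ from the illustrative values in the table above.

```python
import re

ARTICLES = {"a", "an", "the"}  # assumed article list

# Standard Soundex digit for each consonant; vowels, h, w and y have none
CODES = {
    ch: digit
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6"))
    for ch in letters
}


def soundex(text):
    """Return the 4-character Soundex code of text, or "" if it has no letters."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def title_key(artist_id, song_name):
    """Build "<artistID>.<soundex>" from a title with articles and non-letters removed."""
    words = [w for w in re.findall(r"[a-z]+", song_name.lower())
             if w not in ARTICLES]
    return f"{artist_id}.{soundex(''.join(words))}"
```

Because only letters are kept, "01-This is a song" and "This is a song" get the same key for artist 5. The flip side is that four characters are a coarse fingerprint, so unrelated long titles by the same artist can collide. SQLite also ships a soundex() SQL function, but only in builds compiled with the SQLITE_SOUNDEX option, so computing the key in application code is the more portable route.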


人海汹涌 2024-11-07 06:18:44

I get the general idea of the question, but one point needs clarifying:
why doesn't the result include the
1 Another song 3 3:23
record, since it could be treated as a duplicate of the
3 song 4 4:01
4 song 4 6:33
records?

I wrote a simple T-SQL script to solve this. It is inefficient and compares song names only (not artistID), so treat it as a reference only.

-- Drop the temp tables left over from a previous run, if any
if object_id('tempdb..#t') is not null drop table #t;
if object_id('tempdb..#result') is not null drop table #result;

create table #t
(
    id int,
    songName varchar(100),
    artistID int,
    duration varchar(20)
);

insert into #t
select 0, 'This is a song'   , 5, '3:43' union all
select 1, 'Another song'     , 3, '3:23' union all
select 2, '01-This is a song', 5, '3:42' union all
select 3, 'song'             , 4, '4:01' union all
select 4, 'song'             , 4, '6:33' union all
select 5, 'Another record'   , 2, '2:45';

select * from #t;

-- Empty table with the same columns as #t, used to collect the duplicates
select * into #result from #t where 1 = 0;

declare @sName varchar(100);
declare @id int;
declare @count int;

declare c cursor for
select id, songName from #t;

open c;
fetch next from c into @id, @sName;
while @@FETCH_STATUS = 0
begin
    -- Skip rows that are already in the result
    select @count = count(*) from #result where id = @id;
    if @count = 0
    begin
        -- Count rows whose name contains this name (the row itself always matches)
        select @count = count(*) from #t where songName like '%' + @sName + '%';
        if @count > 1
        begin
            insert into #result
            select * from #t
            where songName like '%' + @sName + '%'
              and id not in (select id from #result);
        end
    end
    fetch next from c into @id, @sName;
end
close c;
deallocate c;

select * from #result;
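
For comparison, the same containment test can be written without a cursor as a self-join, and in SQLite, which is what the question uses. The sketch below runs it through Python's sqlite3 module and, unlike the script above, also requires artistID to match. That condition is why "Another song" (artistID 3) is absent from the expected results: it contains "song", but no other row belongs to artist 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Songs (id INTEGER PRIMARY KEY, songName TEXT, "
    "artistID INTEGER, duration TEXT)"
)
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?, ?, ?)",
    [
        (0, "This is a song", 5, "3:43"),
        (1, "Another song", 3, "3:23"),
        (2, "01-This is a song", 5, "3:42"),
        (3, "song", 4, "4:01"),
        (4, "song", 4, "6:33"),
        (5, "Another record", 2, "2:45"),
    ],
)

# A row is a possible duplicate if another row by the same artist has a
# name that contains, or is contained in, this row's name.
rows = conn.execute(
    """
    SELECT DISTINCT a.id, a.songName, a.artistID, a.duration
    FROM Songs AS a
    JOIN Songs AS b
      ON a.artistID = b.artistID
     AND a.id <> b.id
     AND (a.songName LIKE '%' || b.songName || '%'
          OR b.songName LIKE '%' || a.songName || '%')
    ORDER BY a.id
    """
).fetchall()
```

Two caveats: the join compares every pair of rows per artist, so it gets slow on large tables, and any % or _ inside a song name is treated as a LIKE wildcard unless it is escaped.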
