根据重音查找重复字段

发布于 2024-12-09 21:30:10 字数 1013 浏览 0 评论 0 原文

好吧,这让我很烦恼。 我从客户那里得到了一个电话簿数据库,其中一些结果包含带重音的名称,

有些我的意思主要是城市字段或类别。 这让我的查询结果看起来很荒谬。

DB 字符集:UTF-8

例如:

CompanyName |城市 |等等...

DemoCompany |豪普特大街 18 号 |无论如何

演示公司 |上街 18 号 |无论

数据库有大约 360k 条记录......所以手动检查不是一个选择。 任何人都知道我如何找到重音/非重音值? 类似于重复列检查...

编辑: 当我查询表时,我得到了两者的结果,这不是问题。 问题是,当我显示结果时,有些显示有重音,有些则没有。

编辑:

CREATE TABLE `enc` (
  `company` varchar(255) DEFAULT NULL,
  `address` varchar(255) DEFAULT NULL,
  `postcode` varchar(255) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  `Telefon1` varchar(255) DEFAULT NULL,
  `Telefon2` varchar(255) DEFAULT NULL,
  `Telefon3` varchar(255) DEFAULT NULL,
  `Telefon4` varchar(255) DEFAULT NULL,
  `Telefon5` varchar(255) DEFAULT NULL,
  `Branche1` varchar(255) DEFAULT NULL,
  `Branche2` varchar(255) DEFAULT NULL,
  `Branche3` varchar(255) DEFAULT NULL,
  `Branche4` varchar(255) DEFAULT NULL,
  `Branche5` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$

ok, this is bugging me.
i got a phonebook DB from a client where some of the results containts accented names,

and by some i mean mainly the city field,or category.
which makes my query results look ridiculous.

DB Charset: UTF-8

for example:

CompanyName | City | etc...

DemoCompany | Hauptstraße 18 | Whatever

DemoCompany | Hauptstrabe 18 | Whatever

the DB has around 360k records.... so manual checking is not an option.
anyone has an idea how can i find the accented/not accented values ?
something like a duplicate column check...

EDIT:
when i query the table, i get results for both, that is not the problem.
the problem is, when i display the results, some are displayed with accent, and some without.

EDIT:

CREATE TABLE `enc` (
  `company` varchar(255) DEFAULT NULL,
  `address` varchar(255) DEFAULT NULL,
  `postcode` varchar(255) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  `Telefon1` varchar(255) DEFAULT NULL,
  `Telefon2` varchar(255) DEFAULT NULL,
  `Telefon3` varchar(255) DEFAULT NULL,
  `Telefon4` varchar(255) DEFAULT NULL,
  `Telefon5` varchar(255) DEFAULT NULL,
  `Branche1` varchar(255) DEFAULT NULL,
  `Branche2` varchar(255) DEFAULT NULL,
  `Branche3` varchar(255) DEFAULT NULL,
  `Branche4` varchar(255) DEFAULT NULL,
  `Branche5` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8$

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

今天小雨转甜 2024-12-16 21:30:10

您可以从这样的事情开始,这将显示是否存在彼此完全重复的行(及其计数):

SELECT 
    CompanyName, City, etc... 
  , COUNT(*) AS DuplicateCount
FROM 
    TableToCheck
GROUP BY
    CompanyName, City, etc...            --- all columns except the Primary Key
HAVING 
    COUNT(*) > 1

如果您只想查找重复的地址,您可以这样做:

SELECT 
    Address
  , COUNT(*) AS DuplicateCount
FROM 
    TableToCheck
GROUP BY
    Address                     
HAVING 
    COUNT(*) > 1

我想再次阅读您的问题我误解了你在问什么。如果您不想查找重复项(因为没有),但您想查找带重音的单词(并可能用不带重音的单词替换它们):

您现在拥有的表可能正在使用不区分大小写的排序规则(例如 utf_general_ci< /code> 或 utf_unicode_ci),因此您可以将表复制到具有相同字符集但区分大小写的排序规则的新表中,例如 utf_bin

然后,您可以创建一个重音字符列表,然后编写一个查询来在新表的字段中检查此列表(这将非常慢):

SELECT nt.*
FROM NewTable AS nt 
  JOIN AccentedList AS al
WHERE nt.field LIKE CONCAT('%', al.AccentedChar, '%')
GROUP BY nt.PK

或者运行查询来替换这些字符,例如 'ß''ss'

You can start with something like this, that will show if there are rows that are exact duplicates of each other (and their count):

SELECT 
    CompanyName, City, etc... 
  , COUNT(*) AS DuplicateCount
FROM 
    TableToCheck
GROUP BY
    CompanyName, City, etc...            --- all columns except the Primary Key
HAVING 
    COUNT(*) > 1

If you want to find only duplicate addresses, you do something like this:

SELECT 
    Address
  , COUNT(*) AS DuplicateCount
FROM 
    TableToCheck
GROUP BY
    Address                     
HAVING 
    COUNT(*) > 1

Reading your question again, I think I misunderstood what you are asking. If you don't want to find duplicates (as there are not) but you want to find accented words (and replace them with unaccented perhaps):

The table you have now is probably using a case insensitive collation (like utf_general_ci or utf_unicode_ci), so you could copy the table into a new one that has same charset but a case sensitive collation, like utf_bin.

You could then create a list of accented characters and then write a query to check for this list in fields of your new table (this will be real slow):

SELECT nt.*
FROM NewTable AS nt 
  JOIN AccentedList AS al
WHERE nt.field LIKE CONCAT('%', al.AccentedChar, '%')
GROUP BY nt.PK

or run a query to REPLACE() those characters, like 'ß' with 'ss' for example.

小女人ら 2024-12-16 21:30:10

您不仅需要考虑重音符号,还需要考虑许多其他等效字符:

  • 在德语中,您可以将 'ß' 写为 'ss',ä 为 'ae','ü' 为 'ue' 等等,
  • 在意大利语和法语中,您可以搜索不带重音符号的字母,但重音符号有时也会用撇号替换(例如,意大利语中的 giocherò 为 giochero')

如果找到,请编写一个函数来比较字符串而不考虑这些差异,或者您可以尝试使用函数进行匹配利用语音差异。

示例如下(许多数据库都实现了它们):

  • Soundex
  • 距离相似度
  • Jaro Winkler

Mysql 有一个 SOUNDEX 函数,对于其他函数,您必须定义自己的函数(网上有几个示例)。

结果并不完美,但寻找类似的条目将有助于手动检查。

You don't only have to consider accents but many other equivalent characters:

  • in German you can write 'ß' as 'ss', ä as 'ae', 'ü' as 'ue' and so on
  • in Italian and French you can search for letters without the accent but the accent is also sometimes substituted with an apostrophe (e.g., giocherò as giochero' in Italian)

If found write a function the compares the strings without considering these differences or you could try to match using a function that leverages phonetic differences.

Examples are (many databases implement them):

  • Soundex
  • Distance similarity
  • Jaro Winkler

Mysql has a SOUNDEX function, for the others you will have to define your own function (there are several examples on the web).

The results are not perfect but looking for similar entries will help a manual check.

执笔绘流年 2024-12-16 21:30:10

我很确定这是语音搜索的情况。您可以创建一个临时(可能位于内存中)表,将行的语音等效项插入其中,然后计算有多少重复项。这对于名称(Meyer、Mayer)以及街道(Straße、Strasse)非常有效。

I'm pretty sure this is a case for a phonetic search. You could create a temporary (possible memory located) table, insert the phonetic equivalent of the row into it, then take a count of how many are duplicates. This works very well for names (Meyer, Mayer) as well as Streets (Straße, Strasse).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文