根据重音查找重复字段
好吧,这让我很烦恼。 我从客户那里得到了一个电话簿数据库,其中一些结果包含带重音的名称,
有些我的意思主要是城市字段或类别。 这让我的查询结果看起来很荒谬。
DB 字符集:UTF-8
例如:
CompanyName |城市 |等等...
DemoCompany |豪普特大街 18 号 |无论如何
演示公司 |上街 18 号 |无论
数据库有大约 360k 条记录......所以手动检查不是一个选择。 任何人都知道我如何找到重音/非重音值? 类似于重复列检查...
编辑: 当我查询表时,我得到了两者的结果,这不是问题。 问题是,当我显示结果时,有些显示有重音,有些则没有。
编辑:
CREATE TABLE `enc` (
`company` varchar(255) DEFAULT NULL,
`address` varchar(255) DEFAULT NULL,
`postcode` varchar(255) DEFAULT NULL,
`city` varchar(255) DEFAULT NULL,
`Telefon1` varchar(255) DEFAULT NULL,
`Telefon2` varchar(255) DEFAULT NULL,
`Telefon3` varchar(255) DEFAULT NULL,
`Telefon4` varchar(255) DEFAULT NULL,
`Telefon5` varchar(255) DEFAULT NULL,
`Branche1` varchar(255) DEFAULT NULL,
`Branche2` varchar(255) DEFAULT NULL,
`Branche3` varchar(255) DEFAULT NULL,
`Branche4` varchar(255) DEFAULT NULL,
`Branche5` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$
ok, this is bugging me.
i got a phonebook DB from a client where some of the results containts accented names,
and by some i mean mainly the city field,or category.
which makes my query results look ridiculous.
DB Charset: UTF-8
for example:
CompanyName | City | etc...
DemoCompany | Hauptstraße 18 | Whatever
DemoCompany | Hauptstrabe 18 | Whatever
the DB has around 360k records.... so manual checking is not an option.
anyone has an idea how can i find the accented/not accented values ?
something like a duplicate column check...
EDIT:
when i query the table, i get results for both, that is not the problem.
the problem is, when i display the results, some are displayed with accent, and some without.
EDIT:
CREATE TABLE `enc` (
`company` varchar(255) DEFAULT NULL,
`address` varchar(255) DEFAULT NULL,
`postcode` varchar(255) DEFAULT NULL,
`city` varchar(255) DEFAULT NULL,
`Telefon1` varchar(255) DEFAULT NULL,
`Telefon2` varchar(255) DEFAULT NULL,
`Telefon3` varchar(255) DEFAULT NULL,
`Telefon4` varchar(255) DEFAULT NULL,
`Telefon5` varchar(255) DEFAULT NULL,
`Branche1` varchar(255) DEFAULT NULL,
`Branche2` varchar(255) DEFAULT NULL,
`Branche3` varchar(255) DEFAULT NULL,
`Branche4` varchar(255) DEFAULT NULL,
`Branche5` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8$
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以从这样的事情开始,这将显示是否存在彼此完全重复的行(及其计数):
如果您只想查找重复的地址,您可以这样做:
我想再次阅读您的问题我误解了你在问什么。如果您不想查找重复项(因为没有),但您想查找带重音的单词(并可能用不带重音的单词替换它们):
您现在拥有的表可能正在使用不区分大小写的排序规则(例如
utf_general_ci< /code> 或
utf_unicode_ci
),因此您可以将表复制到具有相同字符集但区分大小写的排序规则的新表中,例如utf_bin
。然后,您可以创建一个重音字符列表,然后编写一个查询来在新表的字段中检查此列表(这将非常慢):
或者运行查询来替换这些字符,例如
'ß'
和'ss'
。You can start with something like this, that will show if there are rows that are exact duplicates of each other (and their count):
If you want to find only duplicate addresses, you do something like this:
Reading your question again, I think I misunderstood what you are asking. If you don't want to find duplicates (as there are not) but you want to find accented words (and replace them with unaccented perhaps):
The table you have now is probably using a case insensitive collation (like
utf_general_ci
orutf_unicode_ci
), so you could copy the table into a new one that has same charset but a case sensitive collation, likeutf_bin
.You could then create a list of accented characters and then write a query to check for this list in fields of your new table (this will be real slow):
or run a query to
REPLACE()
those characters, like'ß'
with'ss'
for example.您不仅需要考虑重音符号,还需要考虑许多其他等效字符:
如果找到,请编写一个函数来比较字符串而不考虑这些差异,或者您可以尝试使用函数进行匹配利用语音差异。
示例如下(许多数据库都实现了它们):
Mysql 有一个
SOUNDEX
函数,对于其他函数,您必须定义自己的函数(网上有几个示例)。结果并不完美,但寻找类似的条目将有助于手动检查。
You don't only have to consider accents but many other equivalent characters:
If found write a function the compares the strings without considering these differences or you could try to match using a function that leverages phonetic differences.
Examples are (many databases implement them):
Mysql has a
SOUNDEX
function, for the others you will have to define your own function (there are several examples on the web).The results are not perfect but looking for similar entries will help a manual check.
我很确定这是语音搜索的情况。您可以创建一个临时(可能位于内存中)表,将行的语音等效项插入其中,然后计算有多少重复项。这对于名称(Meyer、Mayer)以及街道(Straße、Strasse)非常有效。
I'm pretty sure this is a case for a phonetic search. You could create a temporary (possible memory located) table, insert the phonetic equivalent of the row into it, then take a count of how many are duplicates. This works very well for names (Meyer, Mayer) as well as Streets (Straße, Strasse).