Handling different types of UTF hyphens in Ruby 1.8.7
We have different types of hyphens/dashes (in some text) stored in the database. Before comparing them with user input text, I have to normalize every type of dash/hyphen to a simple hyphen/minus (ASCII 45).
The possible dashes we have to convert are:
Minus (−) U+2212: &minus; or &#8722; or &#x2212;
Hyphen-minus (-) U+002D: &#45;
Hyphen (‐) U+2010
Soft hyphen U+00AD: &shy;
Non-breaking hyphen U+2011: &#8209;
Figure dash (‒) U+2012 (8210): &#8210; or &#x2012;
En dash (–) U+2013 (8211): &ndash;, &#8211; or &#x2013;
Em dash (—) U+2014 (8212): &mdash;, &#8212; or &#x2014;
Horizontal bar (―) U+2015 (8213): &#8213; or &#x2015;
These all have to be converted to hyphen-minus (-) using gsub.
I've used the CharDet gem to detect the character encoding of the fetched string; it reports windows-1252. I've tried Iconv to convert the encoding to ASCII, but it throws an Iconv::IllegalSequence exception.
ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'
Any idea how to accomplish this?
Comments (1)
Caveat: I know nothing about Ruby, but you have problems that are nothing to do with the programming language that you are using.
You don't need to convert "Hyphen-minus(-) U+002D" to "simple hyphen/minus (ascii 45)"; they're the same thing.
You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.
Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 aka cp1252; however, this may be by accident, as chardet sometimes seems to report that as a default when it has exhausted other possibilities.
(a) These Unicode characters cannot be encoded in latin1 or cp1252 or ascii: MINUS SIGN (U+2212), HYPHEN (U+2010), NON-BREAKING HYPHEN (U+2011), FIGURE DASH (U+2012), HORIZONTAL BAR (U+2015).
What gives you the impression that they may possibly appear in the input or in the database?
(b) These Unicode characters can be encoded in cp1252 but not latin1 or ascii: EN DASH (U+2013), EM DASH (U+2014).
These (most likely the EN DASH) are what you really need to convert to an ascii hyphen/dash. What was in the string that chardet reported as windows-1252?
(c) This can be encoded in cp1252 and latin1 but not ascii: SOFT HYPHEN (U+00AD).
If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ascii will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?