Handling different types of UTF hyphens in Ruby 1.8.7

Posted on 2024-09-25 12:10:23

We have different types of hyphens/dashes (in some text) stored in the database. Before comparing them with some user input text, I have to normalize any type of dash/hyphen to a simple hyphen-minus (ASCII 45).

The possible dashes we have to convert are:

Minus (−) U+2212
Hyphen-minus (-) U+002D
Hyphen (‐) U+2010
Soft hyphen U+00AD
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012 (8210)
En dash (–) U+2013 (8211)
Em dash (—) U+2014 (8212)
Horizontal bar (―) U+2015 (8213)

These all have to be converted to hyphen-minus (-) using gsub.
I've used the CharDet gem to detect the character encoding of the fetched string; it reports windows-1252. I've tried using Iconv to convert the string to ASCII, but it throws an Iconv::IllegalSequence exception.

ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'

Any idea how to accomplish this?

Comments (1)

笛声青案梦长安 2024-10-02 12:10:23

Caveat: I know nothing about Ruby, but you have problems that have nothing to do with the programming language you are using.

You don't need to convert Hyphen-minus(-) U+002D - to simple hyphen/minus (ascii 45); they're the same thing.

You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.

Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 aka cp1252 -- however this may be by accident as chardet sometimes seems to report that as a default when it has exhausted other possibilities.

(a) These Unicode characters cannot be encoded in latin1, cp1252 or ascii:

Minus (−) U+2212
Hyphen (‐) U+2010
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012 (8210)
Horizontal bar (―) U+2015 (8213)

What gives you the impression that they may possibly appear in the input or in the database?

(b) These Unicode characters can be encoded in cp1252 but not in latin1 or ascii:

En dash (–) U+2013 (8211)
Em dash (—) U+2014 (8212)

These (most likely the EN DASH) are what you really need to convert to an ascii hyphen/dash. What was in the string that chardet reported as windows-1252?
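
One way to answer that question, and then do the replacement, is to work on the raw bytes. The sketch below is only an illustration and assumes the fetched string really is cp1252, where byte 0x96 is EN DASH and 0x97 is EM DASH. The helper names `non_ascii_bytes` and `normalize_cp1252_dashes` are hypothetical, not part of any library. It avoids Iconv entirely and is written so that it should behave the same on 1.8.7 and on later Rubies.

```ruby
# Byte values of EN DASH (0x96) and EM DASH (0x97) in cp1252.
# The /n flag makes the pattern match raw bytes.
CP1252_DASHES = /[\x96\x97]/n

# Hypothetical helper: list the non-ASCII bytes actually present in the string.
def non_ascii_bytes(bytes)
  bytes.unpack('C*').select { |b| b > 127 }.map { |b| '0x%02X' % b }
end

# Hypothetical helper: replace cp1252 en/em dash bytes with ASCII 45.
def normalize_cp1252_dashes(bytes)
  raw = bytes.dup
  # Ruby 1.9+ strings carry an encoding; 1.8.7 strings are plain bytes.
  raw.force_encoding('BINARY') if raw.respond_to?(:force_encoding)
  raw.gsub(CP1252_DASHES, '-')
end

sample = "1999\x962004 \x97 draft"
p non_ascii_bytes(sample)             # => ["0x96", "0x97"]
puts normalize_cp1252_dashes(sample)  # 1999-2004 - draft
```

If the soft hyphen should also become a hyphen-minus, `\xAD` could be added to the character class; since soft hyphens are normally invisible, deleting them instead may be the better choice.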

(c) This can be encoded in cp1252 and latin1 but not in ascii:

Soft hyphen U+00AD
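
The three groups (a), (b) and (c) can be checked mechanically. The sketch below assumes Ruby 1.9 or later, because `String#encode` does not exist in the 1.8.7 from the question, and `encodable?` is a hypothetical helper name introduced for illustration.

```ruby
# Requires Ruby 1.9+; String#encode is not available in 1.8.7.

DASHES = {
  'MINUS SIGN'          => "\u2212",
  'HYPHEN'              => "\u2010",
  'NON-BREAKING HYPHEN' => "\u2011",
  'FIGURE DASH'         => "\u2012",
  'EN DASH'             => "\u2013",
  'EM DASH'             => "\u2014",
  'HORIZONTAL BAR'      => "\u2015",
  'SOFT HYPHEN'         => "\u00AD"
}

# Hypothetical helper: true if the character has a representation in the
# target encoding, false if the conversion is undefined.
def encodable?(char, encoding)
  char.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

# Print one row per character with a yes/no for each encoding.
DASHES.each do |name, char|
  row = %w[US-ASCII ISO-8859-1 Windows-1252].map do |enc|
    "#{enc}: #{encodable?(char, enc) ? 'yes' : 'no'}"
  end
  puts "#{name.ljust(20)} #{row.join('  ')}"
end
```

Only the en dash and em dash should come out as cp1252-only, and only the soft hyphen as valid in both single-byte encodings, matching the grouping above.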

If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ascii will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?
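
For illustration only, this is roughly what an explicit "replace with ?" or "ignore" option looks like. The sketch assumes Ruby 1.9 or later and that the bytes are cp1252; `cp1252_to_ascii` is a hypothetical helper name. On 1.8.7 the rough counterpart is appending `//TRANSLIT` or `//IGNORE` to the target encoding passed to Iconv, and how those suffixes behave depends on the underlying iconv library.

```ruby
# Requires Ruby 1.9+. Assumes the input bytes are cp1252.
# Hypothetical helper: lossy conversion to ASCII, replacing every
# character that has no ASCII equivalent with the given replacement.
def cp1252_to_ascii(bytes, replacement = '?')
  text = bytes.dup.force_encoding('Windows-1252')
  text.encode('US-ASCII',
              :invalid => :replace,
              :undef   => :replace,
              :replace => replacement)
end

sample = "caf\xE9 1999\x962004"
puts cp1252_to_ascii(sample)      # non-ASCII characters become "?"
puts cp1252_to_ascii(sample, '')  # non-ASCII characters are dropped
```

Both variants lose information: the en dash turns into `?` or disappears rather than becoming `-`, so neither gives the normalized hyphen the question is after.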
