Equivalent of Iconv.conv("UTF-8//IGNORE",...) in Ruby 1.9.X?

Posted on 2024-12-12 02:35:43

I'm reading data from a remote source, and occasionally get some characters in another encoding. They're not important.

I'd like to get a "best guess" UTF-8 string, and ignore the invalid data.

Main goal is to get a string I can use, and not run into errors such as:

  • Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8:
  • invalid byte sequence in utf-8

Comments (6)

我的影子我的梦 2024-12-19 02:35:43

I thought this was it:

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

will replace all unknowns with '?'.

To ignore all unknowns, :replace => '':

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")
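As a rough illustration of what the call above does when the source string is tagged as binary (ASCII-8BIT), here is a minimal sketch. The sample byte strings are made up for the example. Note the side effect it shows: when the source encoding is ASCII-8BIT, every byte above 0x7F counts as undefined, so even bytes that would form valid UTF-8 characters are dropped.

```ruby
# Hypothetical sample data: ASCII text with one stray 0xFF byte, tagged as binary.
raw = "abc\xFFdef".force_encoding("ASCII-8BIT")

# Drop anything that cannot be converted instead of raising
# Encoding::UndefinedConversionError.
cleaned = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

# Caveat: the bytes C3 A9 are a valid UTF-8 "e with acute accent", but from
# ASCII-8BIT they are just two undefined high bytes, so they are removed too.
accented = "caf\xC3\xA9".force_encoding("ASCII-8BIT")
lossy = accented.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

puts cleaned  # prints "abcdef"
puts lossy    # prints "caf"
```

If the data is mostly UTF-8 already, this lossiness is one reason to prefer an approach that keeps the string tagged as UTF-8 and removes only the invalid sequences.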

Edit:

I'm not sure this is reliable. I've gone into paranoid-mode, and have been using:

string.encode("UTF-8", ...).force_encoding('UTF-8')

The script seems to be running OK now. But I'm pretty sure I'd gotten errors with this earlier.

Edit 2:

Even with this, I continue to get intermittent errors. Not every time, mind you. Just sometimes.

背叛残局 2024-12-19 02:35:43

String#chars or String#each_char can also be used.

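# Table 3-8. Use of U+FFFD in UTF-8 Conversion
# (http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf)
# The trailing "+" continues the expression onto the next line; a leading "+"
# on the second line would start a new statement and truncate str.
str = "\x61"+"\xF1\x80\x80"+"\xE1\x80"+"\xC2"+
      "\x62"+"\x80"+"\x63"+"\x80"+"\xBF"+"\x64"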

p [
  'abcd' == str.chars.collect { |c| (c.valid_encoding?) ? c : '' }.join,
  'abcd' == str.each_char.map { |c| (c.valid_encoding?) ? c : '' }.join
]

String#scrub can be used since Ruby 2.1.

p [
  'abcd' == str.scrub(''),
  'abcd' == str.scrub{ |c| '' }
]
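
To make the behaviour of String#scrub (Ruby 2.1 and later) a little more concrete, here is a small sketch of its three calling styles. The sample string and the hex formatting in the block are only illustrative choices.

```ruby
# Hypothetical sample: a UTF-8 string containing one invalid byte (0xFF).
dirty = "abc\xFFdef"

# With no argument, each invalid sequence becomes U+FFFD (the replacement character).
default_scrub = dirty.scrub

# With a string argument, that string is used as the replacement.
custom_scrub = dirty.scrub("?")

# With a block, the invalid bytes are yielded so they can be logged or re-encoded.
hex_scrub = dirty.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }

puts custom_scrub  # prints "abc?def"
puts hex_scrub     # prints "abc<ff>def"
```

The original string is left untouched; scrub! is the in-place variant.
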
小耗子 2024-12-19 02:35:43

This works great for me:

"String".encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "").force_encoding('UTF-8')
乖不如嘢 2024-12-19 02:35:43

To ignore all parts of the string that aren't correctly UTF-8 encoded, the following (as you originally posted) almost does what you want.

string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "")

The caveat is that encode doesn't do anything if it thinks the string is already UTF-8. So you need to change encodings, going via an encoding that can still represent the full set of Unicode characters that UTF-8 can encode. (If you don't, you'll corrupt any characters that aren't in that encoding; 7-bit ASCII would be a really bad choice!) So go via UTF-16:

string.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')
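
A short sketch of the UTF-16 round trip described above, using a made-up sample string. Unlike converting from a binary-tagged string, this keeps valid multi-byte characters and removes only the invalid byte.

```ruby
# Hypothetical sample: valid UTF-8 text ("caf" plus e-acute) followed by a
# stray 0xFF byte. The string is tagged UTF-8 but is not valid UTF-8.
dirty = "caf\u00E9\xFF!"

# Going through UTF-16 forces a real conversion, so :invalid => :replace
# actually gets a chance to drop the bad byte.
clean = dirty.encode('UTF-16', :invalid => :replace, :replace => '').encode('UTF-8')

puts dirty.valid_encoding?  # prints "false"
puts clean.valid_encoding?  # prints "true"
```

The result is a UTF-8 string equal to "caf\u00E9!", with the e-acute intact.
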
帅冕 2024-12-19 02:35:43

With a bit of help from @masakielastic I have solved this problem for my personal purposes using the #chars method.

The trick is to break each character out into its own separate block so that Ruby can fail.

Ruby needs to fail when it confronts binary data and the like. If you don't allow Ruby to go ahead and fail, it's a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass each character into a sanitizing method that allows the code to have "microfailures" (my coinage) within the string.

So, given a "dirty" string, let's say you used File#read on a picture (my case):

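# Returns true for ASCII letters, digits, and a few punctuation marks.
# Matching against an invalid byte sequence raises ArgumentError.
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  else
    false
  end
end

# File.read closes the file handle, unlike File.open(filepath).read
dirty = File.read(filepath)
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    false  # drop characters that are invalid in the string's encoding
  end
end
clean = clean_chars.join("")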

Allowing the code to fail somewhere along the way seems to be the best way to move through it. So long as you contain those failures within blocks, you can grab what is readable by the UTF-8-only-accepting parts of Ruby.

昔梦 2024-12-19 02:35:43

I have not had luck with the one-line uses of String#encode, à la string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?"). They do not work reliably for me.

But I wrote a pure-Ruby "backfill" of String#scrub for MRI 1.9 or 2.0, or any other Ruby that does not offer String#scrub.

https://github.com/jrochkind/scrub_rb

It makes String#scrub available in rubies that don't have it; if loaded in MRI 2.1, it will do nothing and you'll still be using the built-in String#scrub, so it can allow you to easily write code that will work on any of these platforms.

Its implementation is somewhat similar to some of the other char-by-char solutions proposed in other answers, but it does not use exceptions for flow control (don't do that), is tested, and provides an API compatible with MRI 2.1 String#scrub.
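
The general idea of using the built-in method where it exists and falling back otherwise can be sketched roughly as follows. This is not the scrub_rb implementation; scrub_compat is a hypothetical helper name used only for illustration, and its fallback handles just the simple string-replacement case.

```ruby
# Hypothetical helper, not part of scrub_rb or the Ruby standard library.
def scrub_compat(str, replacement = "")
  if str.respond_to?(:scrub)
    # Ruby 2.1+: use the built-in implementation.
    str.scrub(replacement)
  else
    # Older Rubies: check each character without relying on exceptions.
    str.chars.map { |c| c.valid_encoding? ? c : replacement }.join
  end
end

puts scrub_compat("abc\xFFdef")       # prints "abcdef"
puts scrub_compat("abc\xFFdef", "?")  # prints "abc?def"
```

On Rubies without String#scrub, the fallback may emit one replacement per invalid character rather than per maximal invalid sequence, so the two branches are only guaranteed to agree in simple cases like the ones shown.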
