How to get rid of non-ASCII characters in Ruby

Posted 2024-08-01 14:25:55

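I have a Ruby CGI (not Rails) that picks up photos and captions from a web form. My users are very keen on smart quotes and ligatures, which they paste in from other sources. My web app does not deal well with these non-ASCII characters. Is there a quick Ruby string manipulation routine that can get rid of non-ASCII characters?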


夏尔 2024-08-08 14:25:55

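No, there isn't, short of removing every character besides the basic ones (as other answers suggest). The best solution would be to handle these names properly, since most filesystems today have no problem with Unicode names. If your users paste in ligatures, they will surely want to get them back too. If the filesystem is your problem, abstract it away and set the filename to some MD5 hash; this also lets you easily shard uploads into buckets that scan very quickly, since they never have too many entries.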
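The MD5 idea could be sketched roughly as follows. The helper name `storage_path`, the `uploads` root, and the choice to hash the original file name (rather than the file contents or a random token) are all assumptions made for illustration; the answer does not specify them.

```ruby
require "digest/md5"

# Hypothetical helper: derives an ASCII-only storage path from any file name.
# The first two hex digits of the digest select one of 256 buckets.
def storage_path(original_name, root = "uploads")
  digest = Digest::MD5.hexdigest(original_name)
  ext = File.extname(original_name).downcase
  # Keep the extension only if it is plain ASCII letters and digits.
  ext = "" unless ext =~ /\A\.[a-z0-9]+\z/
  File.join(root, digest[0, 2], digest + ext)
end

puts storage_path("\u201CSummer\u201D \uFB01eld.JPG")
```

The original caption, smart quotes and ligatures included, can then be stored separately (in a database column, for example) and shown to the user unchanged.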


美人骨 2024-08-08 14:25:55

A quick search turned up this discussion, which suggests the following method:

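class String
  # Replaces every character whose codepoint is below 33 or above 127,
  # so spaces and control characters are replaced as well.
  # Modifies the receiver in place and returns it.
  def remove_nonascii(replacement)
    cleaned = each_char.map do |c|
      # String#ord returns the codepoint; b[0].to_i only did so on Ruby 1.8
      (c.ord < 33 || c.ord > 127) ? replacement : c
    end
    replace(cleaned.join)
  end
end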


独留℉清风醉 2024-08-08 14:25:55

This should do the trick:

ascii_only_str = str.gsub(/[^[:ascii:]]/, '')
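As a small illustration of what this does to pasted captions, the sketch below first maps a few typographic characters to ASCII look-alikes and only then strips whatever non-ASCII remains. The `TYPOGRAPHIC` table and the `asciify` helper are hypothetical names introduced here, and the table is deliberately tiny.

```ruby
# Hypothetical replacement table for common pasted characters.
TYPOGRAPHIC = {
  "\u201C" => '"', "\u201D" => '"',   # curly double quotes
  "\u2018" => "'", "\u2019" => "'",   # curly single quotes
  "\uFB01" => "fi", "\uFB02" => "fl"  # fi and fl ligatures
}.freeze

def asciify(str)
  # Swap known characters for ASCII equivalents, then drop the rest.
  mapped = str.gsub(Regexp.union(TYPOGRAPHIC.keys), TYPOGRAPHIC)
  mapped.gsub(/[^[:ascii:]]/, "")
end

caption = "\u201CSummer\u201D \uFB01eld at the caf\u00E9"
puts asciify(caption)
```

Without the mapping step, the ligature would simply vanish and "field" would come out as "eld".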


滿滿的愛 2024-08-08 14:25:55

Use String#encode

The official way to convert between string encodings as of Ruby 1.9 is to use String#encode.

To simply remove non-ASCII characters, you could do this:

some_ascii   = "abc"
# The second "ë" is written as a plain "e" plus a combining diaeresis (U+0308)
some_unicode = "áëe\u0308çüñżλφθΩ"
more_ascii   = "123ABC"
invalid_byte = "\255"

non_ascii_string = [some_ascii, some_unicode, more_ascii, invalid_byte].join

# See String#encode documentation
encoding_options = {
  :invalid           => :replace,  # Replace invalid byte sequences
  :undef             => :replace,  # Replace anything not defined in ASCII
  :replace           => '',        # Use a blank for those replacements
  :universal_newline => true       # Always break lines with \n
}

# ** passes the hash as keyword arguments, as Ruby 3 expects
ascii = non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
puts ascii.inspect
  # => "abce123ABC"

Notice that the first 5 characters in the result are "abce1" - the "á" was discarded, one "ë" was discarded, but another "ë" appears to have been converted to "e".

The reason for this is that there are sometimes multiple ways to express the same written character in Unicode. The "á" is a single Unicode codepoint. The first "ë" is, too. When Ruby sees these during this conversion, it discards them.

But the second "ë" is two codepoints: a plain "e", just like you'd find in an ASCII string, followed by a "combining diacritical mark" (U+0308, the combining diaeresis), which means "put an umlaut on the previous character". In the Unicode string, these are interpreted as a single "grapheme", or visible character. When converting this, Ruby keeps the plain ASCII "e" and discards the combining mark.
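The difference between the two forms can be checked directly. The sketch below assumes Ruby 2.2 or later for String#unicode_normalize, which the answer itself does not use; `to_ascii` is a hypothetical helper wrapping the same encode options.

```ruby
composed   = "\u00EB"   # "ë" as a single codepoint
decomposed = "e\u0308"  # "e" followed by the combining diaeresis

# Hypothetical helper: drop anything that has no ASCII representation.
def to_ascii(str)
  str.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

puts composed.codepoints.inspect    # [235]
puts decomposed.codepoints.inspect  # [101, 776]
puts to_ascii(composed).inspect     # ""
puts to_ascii(decomposed).inspect   # "e"

# Decomposing first (NFD) lets the base letter survive the conversion.
puts to_ascii(composed.unicode_normalize(:nfd)).inspect  # "e"
```

Normalizing to NFD before encoding is therefore one way to keep the base letters of accented characters without maintaining a replacement table by hand.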

If you decide you'd like to provide some specific replacement values, you could do this:

REPLACEMENTS = {
  'á' => "a",
  'ë' => 'e',
}

encoding_options = {
  :invalid   => :replace,     # Replace invalid byte sequences
  :replace => "",             # Use a blank for those replacements
  :universal_newline => true, # Always break lines with \n
  # For any character that isn't defined in ASCII, run this
  # code to find out how to replace it
  :fallback => lambda { |char|
    # If no replacement is specified, use an empty string
    REPLACEMENTS.fetch(char, "")
  },
}

# ** passes the hash as keyword arguments, as Ruby 3 expects
ascii = non_ascii_string.encode(Encoding.find('ASCII'), **encoding_options)
puts ascii.inspect
  #=> "abcaee123ABC"

Update

Some have reported issues with the :universal_newline option. I have seen this intermittently, but haven't been able to track down the cause.

When it happens, I see Encoding::ConverterNotFoundError: code converter not found (universal_newline). However, after some RVM updates, I've just run the script above under the following Ruby versions without problems:

ruby-1.9.2-p290
ruby-1.9.3-p125
ruby-1.9.3-p194
ruby-1.9.3-p362
ruby-2.0.0-preview2
ruby-head (as of 12-31-2012)

Given this, it doesn't appear to be a deprecated feature or even a bug in Ruby. If anyone knows the cause, please comment.


寻梦旅人 2024-08-08 14:25:55

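Ruby 1.9 and later:

class String
  def remove_non_ascii(replacement = "")
    # Matches every character outside the ASCII range, including those above U+FFFF
    gsub(/[^\x00-\x7F]/, replacement)
  end
end

Ruby 1.8:

class String
  def remove_non_ascii(replacement = "")
    # Strings are byte sequences in 1.8, so this matches every byte with the high bit set
    gsub(/[\x80-\xff]/, replacement)
  end
end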


孤君无依 2024-08-08 14:25:55

With a bit of help from @masakielastic I have solved this problem for my personal purposes using the #chars method.

The trick is to break the string into individual characters so that Ruby can fail on each one separately.

Ruby needs to fail when it confronts binary data and the like. If you don't let Ruby go ahead and fail, it's a tough road when it comes to this stuff. So I use the String#chars method to break the given string into an array of characters. Then I pass each character into a sanitizing method that allows "microfailures" (my coinage) within the string.

So, given a "dirty" string, let's say you used File#read on a picture (my case):

# Defined before use so the method exists when the block below runs.
def num_or_letter?(char)
  if char =~ /[a-zA-Z0-9]/
    true
  elsif char =~ Regexp.union(" ", ".", "?", "-", "+", "/", ",", "(", ")")
    true
  end
end

dirty = File.read(filepath)
clean_chars = dirty.chars.select do |c|
  begin
    num_or_letter?(c)
  rescue ArgumentError
    # The match failed on this character, so drop it
    false
  end
end
clean = clean_chars.join
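A related sketch, not the method described above: on Ruby 2.1 or later, String#scrub can remove invalid byte sequences up front, so the per-character rescue is no longer needed. The `ALLOWED` pattern mirrors the whitelist above, and `keep_allowed` is a hypothetical name.

```ruby
# Same whitelist as above: letters, digits and a little punctuation.
ALLOWED = /[a-zA-Z0-9 .?\-+\/,()]/

# Hypothetical helper: scrub invalid bytes, then keep whitelisted characters.
def keep_allowed(dirty)
  dirty.scrub("").each_char.select { |c| c =~ ALLOWED }.join
end

# A string containing one invalid byte and one non-ASCII character.
raw = "ok" + "\xFF" + " 1+2 (caf\u00E9)?"
puts keep_allowed(raw)
```

This trades the "microfailure" approach for a single cleanup pass; which one reads better is largely a matter of taste.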


凝望流年 2024-08-08 14:25:55

If you have activesupport, you can use I18n.transliterate:

I18n.transliterate("áëe\u0308çüñżλφθΩ")
# => "aee?cunz????"

Or, if you don't want the question marks:

I18n.transliterate("áëe\u0308çüñżλφθΩ", replacement: "")
# => "aeecunz"

Note that this doesn't remove invalid byte sequences; it just replaces non-ASCII characters. For my use case, though, this was what I wanted, and it was simple.


迷荒 2024-08-08 14:25:55

Here's my suggestion using Iconv.

class String
  def remove_non_ascii
    require 'iconv'
    Iconv.conv('ASCII//IGNORE', 'UTF8', self)
  end
end


猛虎独行 2024-08-08 14:25:55

class String
  # Drops ASCII control characters (codes 0-31 and 127); all other characters are kept.
  def strip_control_characters
    chars.reject { |char| char.ascii_only? && (char.ord < 32 || char.ord == 127) }.join
  end
end
