Sanitizing content from open(url).read
I am using Ruby to open a URL and read its content. The content type of the file I am reading is 'text/plain'.
The issue is that this contains some characters which I want to escape. For example, one of the characters that is coming up in the plain text is "\240" which is ASCII for a hyphen.
I am curious how this is being generated, because I don't see a hyphen anywhere in the text. Yet it exists invisibly, and "\240" shows up when I use puts to print the text in the console.
Second of all, how do I escape such instances of weird characters? Ideally, I want to escape all characters which are of the form "\[some number]". I am using
"\240".gsub(Regexp.new("\\\d+"),"")
but it doesn't seem to work.
Are there more traditional ways of sanitizing plain text content read from opening a URL?
Comments (2)
You might want to check on the character set of the text that's getting returned. It could be UTF-8, which frequently has characters that high. Ruby 1.9 has great support for character sets and switching between them. I've used
str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "?")
to force a string to standard ASCII, replacing any odd characters with a "?".
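As a rough sketch of the encoding approach above: "\240" is octal notation for the single byte 0xA0, so a pattern that looks for a literal backslash followed by digits has nothing to match. The example below assumes the response body is ISO-8859-1, where 0xA0 is the no-break space; the question does not say which charset the server actually sent. The body is simulated with a literal instead of a real open(url).read call, and the variable names are illustrative only.

```ruby
# Simulated response body (assumption: ISO-8859-1 bytes from the server).
# In practice this would be the string returned by open(url).read.
raw = "price:\xA0100".b          # binary string containing byte 0xA0 (octal 240)

# Tell Ruby how to interpret the bytes, then transcode to UTF-8.
latin1 = raw.dup.force_encoding("ISO-8859-1")
utf8   = latin1.encode("UTF-8")  # 0xA0 becomes U+00A0 (no-break space)

# The string holds one character here, not a backslash plus digits,
# so a backslash-digit pattern finds nothing.
backslash_match = utf8.match?(/\\\d+/)

# Option 1: replace the no-break space with a regular space.
spaced = utf8.gsub("\u00A0", " ")

# Option 2: force plain ASCII, replacing anything outside it with "?".
ascii = utf8.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")

puts spaced
puts ascii
```

If the server really sends UTF-8, the force_encoding argument would change accordingly; checking the Content-Type header's charset first is the safer route.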
After having a play with this, I found the following regular expression which does the trick for me: