Sanitizing content from open(url).read

Posted on 2024-11-19 09:29:14

I am using Ruby to open a URL and read its content. The content type of the file I am reading is 'text/plain'.

The issue is that this contains some characters which I want to escape. For example, one of the characters that is coming up in the plain text is "\240" which is ASCII for a hyphen.

I am curious how this is being generated, because I don't see a hyphen anywhere in the text. Yet it exists invisibly and "\240" shows up when I use puts to print the text in the console.

Second of all, how do I escape such instances of weird characters? Ideally, I want to escape all characters which are of the form "\[some number]". I am using

"\240".gsub(Regexp.new("\\\d+"),"")

but it doesn't seem to work.

Are there more traditional ways of sanitizing plain text content read from opening a URL?

Comments (2)

岁月静好 2024-11-26 09:29:14

You might want to check on the character set of the text that's getting returned. It could be UTF-8, which frequently has characters that high. Ruby 1.9 has great support for character sets and switching between them. I've used str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "?") to force a string to standard ASCII, replacing any odd characters with a ?.
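
As a rough illustration of the encoding-based approach, here is a minimal sketch. The sample string and the guess that the source is ISO-8859-1 are assumptions made for the example, not something the question confirms; in ISO-8859-1 the byte "\240" (0xA0) is a non-breaking space, which would explain why it looks invisible. `String#b` needs Ruby 2.0 or later.

```ruby
# Hypothetical sample: "Hello", one byte 0xA0 (octal \240), then "world".
# .b returns a copy tagged as ASCII-8BIT (raw bytes, no encoding assumed).
raw = "Hello\xA0world".b

# Assumption: the server really sent ISO-8859-1 (Latin-1), where 0xA0 is a
# non-breaking space. Re-label the bytes, then transcode them to UTF-8.
utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Option A: turn the non-breaking space into an ordinary space.
clean = utf8.tr("\u00A0", " ")

# Option B: the call from the answer above. Anything US-ASCII cannot
# represent is replaced with "?".
ascii = utf8.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")

puts clean
puts ascii
```

Option A keeps the word boundary readable, while option B makes every non-ASCII character visible as a "?". If the real source encoding differs, the `force_encoding` argument would have to change accordingly.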

娇女薄笑 2024-11-26 09:29:14

After having a play with this, I found the following regular expression which does the trick for me:

str.gsub(/[^\x00-\x7F]/,'')
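
A short sketch of why the pattern from the question finds nothing while this one does. The sample strings are made up for the example, and working on a binary copy via `String#b` (Ruby 2.0 or later) is an assumption added here so that `gsub` does not raise on bytes that are invalid in the string's declared encoding.

```ruby
# "\240" is an octal escape for ONE byte (0xA0). The string holds no
# backslash and no digit characters.
s = "\240".b
puts s.bytesize        # number of bytes in the string
p s.bytes              # the byte values as integers

# In a double-quoted string "\\\d+" becomes the three characters \d+,
# so this pattern looks for digit characters and finds none.
pattern = Regexp.new("\\\d+")
p pattern.source
p(s =~ pattern)

# Hypothetical sample containing the same stray byte.
raw = "caf\xA0 ok".b

# On the binary copy, delete every byte outside the 7-bit ASCII range.
ascii_only = raw.gsub(/[^\x00-\x7F]/, "")
puts ascii_only
```

Note that this removes every non-ASCII byte, so accented letters and other legitimate characters are dropped along with the stray one.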