Sanitizing the content returned by open(url).read

Posted on 2024-11-19 09:29:14

I am using Ruby to open a URL and read its content. The content type of the file I am reading is 'text/plain'.

The issue is that this contains some characters which I want to escape. For example, one of the characters that is coming up in the plain text is "\240" which is ASCII for a hyphen.

I am curious how this is being generated, because I don't see a hyphen anywhere in the text. Yet it exists invisibly and "\240" shows up when I use puts to print the text in the console.

Second of all, how do I escape such instances of weird characters? Ideally, I want to escape all characters which are of the form "\[some number]". I am using

"\240".gsub(Regexp.new("\\\d+"),"")

but it doesn't seem to work.

Are there more traditional ways of sanitizing plain text content read from opening a URL?

岁月静好 2024-11-26 09:29:14

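You may want to check the character set of the text you get back. It may be UTF-8, which will often contain high characters like this. Ruby 1.9 has good support for character sets and for converting between them. I use str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "?") to force a string into standard ASCII, replacing all odd characters with ?.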
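As a rough sketch of that approach: in a Ruby string literal, "\240" is an octal escape for the single byte 0xA0, not a backslash followed by digits, so a pattern that looks for digits has nothing to match. In ISO-8859-1 that byte is a non-breaking space, which would explain why nothing visible appears in the text. The sample string and the ISO-8859-1 charset below are assumptions for illustration; the real charset should come from the response's Content-Type header.

```ruby
# Hypothetical sample standing in for the downloaded body.
# "\xA0" is the same byte as "\240" (octal 240 == hex A0 == decimal 160).
raw = "price:\xA0100".b # binary string, as if the charset were unknown

# Assumption: the server actually sent ISO-8859-1.
# Label the bytes with that encoding, then transcode to UTF-8.
utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Approach from the answer: force the string into US-ASCII,
# replacing every character that has no ASCII equivalent with "?".
ascii = utf8.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")

# Alternative: turn the non-breaking space into an ordinary space.
cleaned = utf8.tr("\u00A0", " ")

puts ascii
puts cleaned
```

If the text is really UTF-8 rather than ISO-8859-1, the force_encoding argument would change accordingly; guessing the wrong source charset will produce different replacement results.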
