Matching a regular expression against a non-String in Ruby without conversion

Posted 2024-08-09 02:10:30

If a Ruby regular expression is matched against something that isn't a String, the to_str method is called on that object to get an actual String to match against. I want to avoid this behavior. I'd like to match regular expressions against objects that aren't Strings but can logically be thought of as randomly accessible sequences of bytes, where every access is mediated through a byte_at() method (similar in spirit to Java's CharSequence.charAt() method).
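To make the behavior concrete, here is a minimal sketch of what happens today. The class name `ByteSource` and the `to_str_calls` counter are hypothetical and exist only for illustration; the point is that the match goes through `to_str` and never touches `byte_at`.

```ruby
# Hypothetical wrapper, for illustration only: it exposes byte_at,
# but Regexp matching ignores that method and asks for to_str instead.
class ByteSource
  attr_reader :to_str_calls

  def initialize(data)
    @data = data
    @to_str_calls = 0
  end

  # The access path the question would like the regex engine to use.
  def byte_at(index)
    @data.getbyte(index)
  end

  # The access path Ruby actually uses: an implicit String conversion.
  def to_str
    @to_str_calls += 1
    @data
  end
end

src = ByteSource.new("hello world")
pattern = /wor/
offset = pattern =~ src   # Regexp#=~ converts src via to_str

puts offset               # 6
puts src.to_str_calls     # 1
```

So the whole String has to be materialized before the engine ever sees the data, which is exactly the cost the question is trying to avoid.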

For example, suppose I want to find the byte offset at which an arbitrary regular expression matches in an arbitrary file. The expression might span multiple lines, so I can't just read in a line at a time and look for a match in each line. If the file is very big, I can't fit it all in memory, so I can't just read it in as one big string. However, it would be simple enough to define a method that gets the nth byte of a file (with buffering and caching as needed for speed).
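The "nth byte of a file" accessor mentioned above could look roughly like the following. This is only a sketch under assumptions: the class name `FileBytes`, the single-block cache, and the default block size are all made up for illustration, and a real implementation would probably cache more than one block.

```ruby
require "tempfile"

# Hypothetical helper: random access to a file's bytes, keeping one
# block of the file in memory at a time.
class FileBytes
  DEFAULT_BLOCK_SIZE = 4096 # assumed value, tune as needed

  attr_reader :size

  def initialize(path, block_size = DEFAULT_BLOCK_SIZE)
    @file = File.open(path, "rb")
    @size = File.size(path)
    @block_size = block_size
    @block_index = nil
    @block = nil
  end

  # Returns the byte at offset n as an Integer, or nil if n is out of range.
  def byte_at(n)
    return nil if n < 0 || n >= @size

    index = n / @block_size
    unless index == @block_index
      # Cache miss: load the block that contains offset n.
      @file.seek(index * @block_size)
      @block = @file.read(@block_size)
      @block_index = index
    end
    @block.getbyte(n % @block_size)
  end

  def close
    @file.close
  end
end

# Usage with a tiny block size so that several blocks are exercised.
Tempfile.create("file_bytes_demo") do |tmp|
  tmp.binmode
  tmp.write("hello\nworld")
  tmp.flush

  bytes = FileBytes.new(tmp.path, 4)
  puts bytes.byte_at(6)   # 119, the byte for "w"
  bytes.close
end
```

Nothing in Ruby's Regexp API can consume an object like this directly, which is the gap the question is asking about.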

Eventually, I'd like to build a fully featured rope class, like the one in Ruby Quiz #137, and I'd like to be able to use regular expressions on ropes without the performance loss of converting them to strings.

I don't want to get up to my elbows in the innards of Ruby's regular expression implementation, so any insight would be appreciated.

Comments (1)

尬尬 2024-08-16 02:10:30

You can't. This wasn't supported in Ruby 1.8.x, probably because it's such an edge case; and in 1.9 it wouldn't even make sense. Ruby 1.9 doesn't map its strings to bytes in any user-serviceable fashion; instead it uses character code points, so that it can support the multitude of encodings that it accepts. And 1.9's new optimized regex engine, Oniguruma, is also built around the same concept of encodings and code points. Bytes just don't enter into the picture at this level.

I have a suspicion that what you're asking for is a case of premature optimization. For any reasonable Ruby object, implementing to_str shouldn't be a huge performance hurdle. If it is, then Ruby's probably the wrong tool for you, as it abstracts and insulates you from your raw data in all sorts of ways.

Your example of looking for a byte sequence in a large binary file isn't an ideal use case for Ruby -- you'd be better off using grep or some other Unix tool. If you need the results in your Ruby program, run it as a system process using backticks and process the output.
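As a rough sketch of the backtick approach, assuming a grep that supports the `-a`, `-b`, `-o`, and `-E` options (GNU grep does), something like the following would collect byte offsets. The helper name `grep_offsets` is hypothetical.

```ruby
require "shellwords"

# Hypothetical helper: runs grep as a child process and returns
# [byte_offset, matched_text] pairs.
# Assumes grep supports -a (treat binary as text), -b (print byte offset),
# -o (print only the matched part), and -E (extended regex syntax).
def grep_offsets(pattern, path)
  command = "grep -a -b -o -E -e #{Shellwords.escape(pattern)} -- #{Shellwords.escape(path)}"
  output = `#{command}`

  # Each output line looks like "OFFSET:MATCH".
  output.lines.map do |line|
    offset, text = line.chomp.split(":", 2)
    [offset.to_i, text]
  end
end
```

Note that grep is line-oriented by default, so a pattern meant to span a newline will not match this way, and the pattern uses grep's regex syntax rather than Ruby's.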
