O(n) string handling in Scheme

Posted on 2024-10-10 06:54:04

Background: I've been writing a little interpreter in Scheme (R5RS).

The reader/lexer takes a (sometimes long) string from input and tokenises it. It does this by matching the first few characters of the string against some token and returning the token and the remaining unmatched part of the string.

Problem: to return the remaining portion of the string, a new string is created every time a token is read. Each of those copies costs time proportional to the length of the remainder, so the reader as a whole is O(n^2) in the number of tokens present in the string.
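
To make the cost concrete, here is a minimal sketch of the copying pattern described above. The names `next-token-copying` and `tokenize-copying` are hypothetical, and tokens are assumed to be runs of characters separated by a single space; the real lexer matches against token patterns instead.

```scheme
;; Hypothetical illustration, not the asker's actual lexer.
;; Returns a pair (token . rest), where both parts are fresh strings.
(define (next-token-copying str)
  (let ((len (string-length str)))
    (let loop ((i 0))
      (if (and (< i len) (not (char=? (string-ref str i) #\space)))
          (loop (+ i 1))
          ;; Skip the single separating space, if there is one.
          (let ((rest-start (if (< i len) (+ i 1) i)))
            (cons (substring str 0 i)
                  ;; This call copies the whole remainder: O(length) per token.
                  (substring str rest-start len)))))))

(define (tokenize-copying str)
  (if (= (string-length str) 0)
      '()
      (let ((r (next-token-copying str)))
        (cons (car r) (tokenize-copying (cdr r))))))

(tokenize-copying "a bb ccc")   ; => ("a" "bb" "ccc")
```

The second `substring` call is the expensive one: with n tokens, the remainder is copied n times.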

Possible solution: convert the string to a list, which can be done in time O(n), then pull tokens from the list instead of the string, returning the remainder of the list instead of the remainder of the string. But this seems terribly inefficient and artificial.
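
A rough sketch of the list-based variant, under the same assumption of single-space-separated tokens. The helper names are hypothetical. `string->list` is called once, and returning the remainder is just returning a tail of the list, which copies nothing.

```scheme
;; Hypothetical sketch of the list-based approach.
;; Returns a pair (token . remaining-chars); the remainder is a shared list tail.
(define (next-token-from-list chars)
  (let loop ((cs chars) (acc '()))
    (cond ((null? cs)
           (cons (list->string (reverse acc)) '()))
          ((char=? (car cs) #\space)
           (cons (list->string (reverse acc)) (cdr cs)))
          (else
           (loop (cdr cs) (cons (car cs) acc))))))

(define (tokenize-list str)
  ;; The one O(n) conversion happens here.
  (let loop ((cs (string->list str)) (tokens '()))
    (if (null? cs)
        (reverse tokens)
        (let ((r (next-token-from-list cs)))
          (loop (cdr r) (cons (car r) tokens))))))

(tokenize-list "a bb ccc")   ; => ("a" "bb" "ccc")
```

The total work is linear in the input length, at the price of one pair per character in memory.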

Question: am I imagining it, or is there just no other way to do this efficiently in Scheme due to its purely functional outlook?

Edit: in R5RS Scheme, there isn't a way to return a pointer into a string. The "substring" function is the only function which extracts an object which is itself a string, but the Scheme standard insists that this be a newly allocated string. Why? Because strings are not immutable in Scheme R5RS; see, for example, the "string-set!" function.

One workable solution suggested below is to store an index into the string. One can then read off the characters one at a time from that index until a token is read. Too bad the regexp library I'm using for the tokenisation requires an actual string, not an index into one...
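
A minimal sketch of the index-based approach, again assuming single-space-separated tokens and using hypothetical names. It scans characters by hand with `string-ref`, so it does not address the regexp library limitation mentioned above. Only the token itself is copied; the remainder is represented by the next index.

```scheme
;; Hypothetical sketch of the index-based approach.
;; Returns a pair (token . next-index); the input string is never re-copied.
(define (next-token-at str start)
  (let ((len (string-length str)))
    (let loop ((i start))
      (if (and (< i len) (not (char=? (string-ref str i) #\space)))
          (loop (+ i 1))
          (cons (substring str start i)          ; copies only the token
                (if (< i len) (+ i 1) i))))))    ; skip one separating space

(define (tokenize-indexed str)
  (let ((len (string-length str)))
    (let loop ((start 0) (tokens '()))
      (if (>= start len)
          (reverse tokens)
          (let ((r (next-token-at str start)))
            (loop (cdr r) (cons (car r) tokens)))))))

(tokenize-indexed "a bb ccc")   ; => ("a" "bb" "ccc")
```

Each character is visited once by the scan and copied at most once into a token, so the whole pass is O(n) in the length of the input.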

Comments (1)

烂人 2024-10-17 06:54:05

Consider making a shared-substring implementation of strings (this is how Java's String.substring worked before Java 7u6, for example; later versions copy the characters). So when you want to grab a substring of a given string, rather than copying the characters, simply keep a pointer to (some location in) those characters, and a length.
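
One way this could look in plain R5RS, which has no record types, is a three-element vector holding the base string, a start offset, and a length. All of the names below (`make-slice`, `subslice`, and so on) are hypothetical; this is a sketch of the idea, not an existing library.

```scheme
;; Hypothetical "slice" type: #(base-string start length).
(define (make-slice str start len) (vector str start len))
(define (slice-base s)   (vector-ref s 0))
(define (slice-start s)  (vector-ref s 1))
(define (slice-length s) (vector-ref s 2))

(define (string->slice str)
  (make-slice str 0 (string-length str)))

;; Character access is relative to the slice, not the base string.
(define (slice-ref s i)
  (string-ref (slice-base s) (+ (slice-start s) i)))

;; O(1): shares the base string, copies no characters.
;; from and to are offsets relative to the slice s.
(define (subslice s from to)
  (make-slice (slice-base s) (+ (slice-start s) from) (- to from)))

;; Copy out only when a real string is required.
(define (slice->string s)
  (substring (slice-base s)
             (slice-start s)
             (+ (slice-start s) (slice-length s))))

(define s (string->slice "hello world"))
(define w (subslice s 6 11))
(slice->string w)                  ; => "world"
(slice->string (subslice w 1 3))   ; => "or"
```

Because R5RS strings are mutable, a `string-set!` on the base string is visible through every slice that shares it, so this only behaves like `substring` if the base string is never modified after slicing.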
