如何在 C 中实现变长“字符串”-y
我在谷歌上搜索了很多,但我找不到有关可变长度字符串通常如何在高级语言中实现的信息。我正在创建自己的这样的语言,并且不知道从哪里开始使用字符串。
我有一个描述 string
类型的结构,然后是一个分配此类“字符串”的 create
函数:
/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'
struct string {
// …
char native[1024];
};
string String__create(char native[]) {
string this = malloc(sizeof(struct string));
// …
STRCPY(this->native, native);
return this;
}
但是,这只允许 1kb 长的字符串。这有点愚蠢,并且在大多数情况下会浪费大量内存。
鉴于我必须以某种方式声明要使用的内存……我该如何实现一个可以(有效)存储(有效)无限数量字符的字符串?
I’ve googled quite a bit, but I can’t find information on how variable-length strings are generally implemented in higher-level languages. I’m creating my own such language, and am not sure where to start with strings.
I have a struct describing the string
type, and then a create
function that allocates such a ‘string’:
/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'
struct string {
// …
char native[1024];
};
string String__create(char native[]) {
string this = malloc(sizeof(struct string));
// …
STRCPY(this->native, native);
return this;
}
However, that would only allow 1kb-long strings. That’s sort of silly, and a huge waste of memory in most cases.
Given that I have to declare the memory to be used somehow… how do I go about implementing a string that can (efficiently) store an (effectively) unbounded number of characters?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
许多 C++
std::string
实现现在使用“小字符串优化”。在伪代码中:想法是将最多 12 个字符的字符串存储在
shortString
数组中。整个字符串将是连续的并且仅使用单个缓存行。较长的字符串存储在堆上。这在字符串对象中留下了 12 个备用字节。指针不会占用所有这些,因此您还可以记住在堆上分配了多少内存(>=length
)。这有助于支持以小增量增长字符串的场景。Many C++
std::string
implementations now use a "Small String Optimization". In pseudo-code:The idea is that string up to 12 characters are stored in the
shortString
array. The entire string will be contiguous and use only a single cache line. Longer strings are stored on the heap. This leaves you with 12 spare bytes in the string object. The pointer doesn't take all of that, so you can also remember how much memory you've allocated on the heap (>=length
). That helps to support scenario's in which you grow a string in small increments.常见的方法是有一个长度字段和一个指向动态分配的内存区域的指针来保存字符。
如果您想动态分配
string_t
:The common approach is to have a field for length and a pointer to a dynamically allocated region of memory to hold the characters.
If you want to dynamically allocate the
string_t
:除了其他人告诉你的之外,我还包括“容量”的概念:
不可能知道内存中分配的向量的大小,您必须存储它。如果您执行 String 结构的 sizeof,它将返回 4 个字节 * 3 个数字字段 = 12 个字节(由于结构中使用的填充,可能会更大)。另外,你无法通过sizeof获取分配的内存的长度。
这样,容量始终代表字符串所在块的实际大小。假设您的字符串变短了:您不必重新分配即可获得字符串的容量和大小之间的精确匹配。此外,您可以从头开始分配您期望字符串具有的字符,而不是初始字符串具有的字符。最后,您可以模仿 C++ 字符串 char 动态向量,并在每次字符串增长超出容量限制时将容量加倍。所有这些都会将内存操作保持在最低限度,这将转化为更好的性能(您也会浪费一些内存,但不会超过 1Kb)。
In addition to what others have told you, I'd also include the concept of "capacity":
It is not possible to know the size of the allocated vector in memory, you must store it. If you do a sizeof of the String struct, it will return you 4 bytes * 3 numeric fields = 12 bytes (probably bigger due to padding used in structures). Also, you cannot get the length of allocated memory through sizeof.
This way, capacity always bears the actual size of the chunk in which your string is. Say that your string goes shorter: you don't have to realloc in order to get an exact match between the capacity and the size of your string. Also, you can alloc from the beginning the characters you expect the string to have, and not the characters the initial string has. Finally, you can mimic the C++ string char dynamic vector and double capacity each time the string grows beyond the capacity limit. All of these will keep memory operations to a minimum, which will translate in better performance (you will also waste some memory, but never as much as 1Kb).
realloc 会将您的内存重新定位到更大的缓冲区 - 只需使用此命令即可调整字符串的大小。对字符串使用以下结构:
重要的部分是跟踪字符串的大小,并让每个字符串操作函数测试该大小是否有意义。
您可以预先分配一个大缓冲区并不断向其中添加内容,并且只有在该缓冲区已满时才重新分配 - 您必须为此实现所有功能。通过从一个不可变字符串移动到另一个(使用 realoc 语义)来改变字符串是更可取的(更不容易出错,也更安全)。
realloc will relocate your memory to a bigger buffer -- simply using this command will allow you to resize your string. Use the following struct for your string:
The important part being keeping a track of the size of your string, and having every string manipulation function testing if the size makes sense.
You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full -- you have to implement all the functions for this. It is preferable (far less error prone, and more secure) to mutate string by moving from one
immutable string
to another (using the semantics of realoc).有些人喜欢使用“rope”数据结构来存储字符串无限长度,而不是连续的字符串(C 字符串)。
简化的绳索可以这样定义:
少于 rode_leaf_max 字符的绳索的存储方式与以 null 结尾的 C 字符串相同。
包含超过rope_leaf_max字符的绳索被存储为指向左子串和右子串的根non_leaf_rope_node,(它可能依次指向左子子串和右子子串),
最终指向叶子节点,
每个叶节点至少包含完整字符串的一个字符。
一根绳子总是至少存储一个字符,所以我们总是可以知道:
如果绳节点的第一个字节非零,则该节点是存储文字字符的叶节点。
如果绳节点的第一个字节为零,则该节点存储指向左子串和右子串的指针。
(真正的绳索实现通常具有第三种绳索节点)。
通常使用绳索比使用 C 字符串需要更少的总 RAM 空间。
(包含“纽约市”等短语的节点
可以在一根绳索中多次重复使用,或者在某些实现中在两根绳索之间共享)。
有时使用绳索比使用 C 弦更快。
Some people prefer to use the "rope" data structure to store a string of characters of unbounded length, rather than a contiguous string (C string).
A simplified rope can be defined something like:
A rope with less than rope_leaf_max characters is stored the same as a null-terminated C string.
A rope containing more than rope_leaf_max characters is stored as a root non_leaf_rope_node pointing to the left and right sub-strings, (which may in turn point to left and right sub-sub-strings),
eventually pointing to leaf nodes,
and the leaf nodes each contain at least one character of the full string.
A rope always stores at least one character, so we can always tell:
If the first byte of a rope node is non-zero, that node is a leaf node storing literal characters.
If the first byte of a rope node is zero, that node stores pointers to left and right sub-strings.
(Real rope implementations often have a third kind of rope node).
Often using ropes requires less total RAM space than using C strings.
(A node containing a phrase such as "New York City"
can be re-used multiple times in one rope, or in some implementations shared between two ropes).
Sometimes using ropes is faster than using C strings.