C Tokenizer - How does it work?
How does this work?
I know to use it you pass in:
- start: string (e.g. "Item 1, Item 2, Item 3")
- delim: delimiter string (e.g. ",")
- tok: reference to a string which will hold the token
- nextpos (optional): reference to the position in the original string where the next token starts
- sdelim (optional): pointer to a character which will hold the starting delimiter of the token
- edelim (optional): pointer to a character which will hold the ending delimiter of the token
Code:
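#include <stdlib.h>
#include <string.h>
int token(char* start, char* delim, char** tok, char** nextpos, char* sdelim, char* edelim) {
    // Find beginning:
    int len = 0;
    char *scanner;
    int dictionary[8];   // 8 * 32 = 256 flags, one per possible char value
    int ptr;
    for(ptr = 0; ptr < 8; ptr++) {
        dictionary[ptr] = 0;
    }
    // Flag every delimiter character in the table.
    // Caveat: if char is signed, values above 127 are negative here and the
    // index and shift are invalid; a cast to unsigned char would avoid that.
    for(; *delim; delim++) {
        dictionary[*delim / 32] |= 1 << *delim % 32;
    }
    if(sdelim) {
        *sdelim = 0;
    }
    // Skip leading delimiters, remembering the last one seen.
    for(; *start; start++) {
        if(!(dictionary[*start / 32] & 1 << *start % 32)) {
            break;
        }
        if(sdelim) {
            *sdelim = *start;
        }
    }
    // Nothing but delimiters left: no token.
    if(*start == 0) {
        if(nextpos != NULL) {
            *nextpos = start;
        }
        *tok = NULL;
        return 0;
    }
    // Measure the token: scan until the next delimiter or end of string.
    for(scanner = start; *scanner; scanner++) {
        if(dictionary[*scanner / 32] & 1 << *scanner % 32) {
            break;
        }
        len++;
    }
    if(edelim) {
        *edelim = *scanner;
    }
    if(nextpos != NULL) {
        *nextpos = scanner;
    }
    // Copy the token into a new buffer; the caller must free it.
    *tok = (char*)malloc(sizeof(char) * (len + 1));
    if(*tok == NULL) {
        return 0;
    }
    memcpy(*tok, start, len);
    *(*tok + len) = 0;
    return len + 1;
}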
I get most of it except for:
dictionary[*delim / 32] |= 1 << *delim % 32;
and
dictionary[*start / 32] & 1 << *start % 32
Is it magic?
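For reference, the calling convention listed in the question can be sketched as below. This is a condensed restatement of token() rather than the original code: the const qualifiers, the uint32_t table, the casts to unsigned char, and the helpers in_set and count_tokens are assumptions made for this sketch. It shows that each returned token is heap-allocated and must be freed, and that nextpos can be fed back in as the next start.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: test the flag for c in a 256-bit table. */
static int in_set(const uint32_t set[8], char c)
{
    unsigned char u = (unsigned char)c;
    return (int)((set[u / 32] >> (u % 32)) & 1u);
}

/* Condensed restatement of token() from the question (see lead-in). */
static int token(const char *start, const char *delim, char **tok,
                 const char **nextpos, char *sdelim, char *edelim)
{
    uint32_t dictionary[8] = {0};
    const char *scanner;
    size_t len;

    assert(start != NULL && delim != NULL && tok != NULL);
    for (; *delim; delim++) {
        unsigned char d = (unsigned char)*delim;
        dictionary[d / 32] |= UINT32_C(1) << (d % 32);
    }
    if (sdelim) *sdelim = 0;
    /* Skip leading delimiters, remembering the last one seen. */
    for (; *start && in_set(dictionary, *start); start++) {
        if (sdelim) *sdelim = *start;
    }
    if (*start == 0) {
        if (nextpos) *nextpos = start;
        *tok = NULL;
        return 0;
    }
    /* Scan to the next delimiter or the end of the string. */
    for (scanner = start; *scanner && !in_set(dictionary, *scanner); scanner++) {
    }
    len = (size_t)(scanner - start);
    if (edelim) *edelim = *scanner;
    if (nextpos) *nextpos = scanner;
    *tok = malloc(len + 1);
    if (*tok == NULL) return 0;
    memcpy(*tok, start, len);
    (*tok)[len] = 0;
    return (int)len + 1;
}

/* Hypothetical helper: count the tokens in text, freeing each one. */
static int count_tokens(const char *text, const char *delim)
{
    const char *pos = text;
    char *tok;
    int count = 0;
    while (token(pos, delim, &tok, &pos, NULL, NULL) > 0) {
        count++;
        free(tok);
    }
    return count;
}
```

Note that a return value of 0 means either "no more tokens" or "allocation failed"; in both cases *tok is NULL, so the loop above stops either way.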
Comments (3)
Each character of the delimiter is 8 bits (sizeof(char) == 1 byte), so it can take only 256 possible values. The dictionary is split into 8 pieces (int dictionary[8]) of 32 flags each (this assumes int is at least 32 bits wide, which is typical but not guaranteed by the C standard), and 32 * 8 = 256. Together they form a 256-bit table with one flag per possible character value.
The first loop turns on the flag for each character in the delimiter string (dictionary[*delim / 32] |= 1 << *delim % 32;). The array index is *delim / 32, the character's code divided by 32. For codes from 0 to 255 this quotient is a value from 0 to 7. The remainder, computed by the modulus operation, selects which bit within that element to turn on. The net effect is that a bit of the 256-bit table is set exactly when the corresponding character appears in the delimiter string.
Determining whether a character is a delimiter is then a single lookup in the 256-bit table (dictionary[*start / 32] & 1 << *start % 32). Because % binds tighter than <<, and << binds tighter than &, this parses as dictionary[*start / 32] & (1 << (*start % 32)).
They store which characters have occurred by making an 8 x 32 = 256 table of bits stored in dictionary.
dictionary[*delim / 32] |= 1 << *delim % 32; sets the bit corresponding to *delim.
dictionary[*start / 32] & 1 << *start % 32 checks the bit.
OK, so if we send in the string "," for the delimiter, then dictionary[*delim / 32] |= 1 << *delim % 32 will be dictionary[1] = 4096. The expression dictionary[*start / 32] & 1 << *start % 32 simply checks for a matching character. What puzzles me is why they are not using direct char comparison.
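One plausible reason, offered here as an assumption rather than the author's stated intent: delim may hold several delimiter characters, so a direct comparison costs one test per delimiter for every input character, while the table is built once and then answers each test with a single index and mask. The sketch below puts the two approaches side by side; is_delim_scan, build_table, and is_delim_table are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct comparison: one test per delimiter for every input character. */
static int is_delim_scan(const char *delims, char c)
{
    assert(delims != NULL);
    for (; *delims; delims++) {
        if (*delims == c) {
            return 1;
        }
    }
    return 0;
}

/* Table approach, step 1: build the 256-bit table once per delimiter set. */
static void build_table(uint32_t table[8], const char *delims)
{
    size_t i;
    assert(delims != NULL);
    for (i = 0; i < 8; i++) {
        table[i] = 0;
    }
    for (; *delims; delims++) {
        unsigned char u = (unsigned char)*delims;
        table[u / 32] |= UINT32_C(1) << (u % 32);
    }
}

/* Table approach, step 2: each test is a single index and mask. */
static int is_delim_table(const uint32_t table[8], char c)
{
    unsigned char u = (unsigned char)c;
    return (int)((table[u / 32] >> (u % 32)) & 1u);
}
```

With a single delimiter such as "," the two approaches do about the same work, so the table mainly pays off when the delimiter set is larger or the input is long.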