C Tokenizer - How does it work?

Posted on 2024-09-12 07:49:33

How does this work?

I know that to use it, you pass in:

  • start: the string to tokenize (e.g. "Item 1, Item 2, Item 3")
  • delim: the delimiter string (e.g. ",")
  • tok: reference to a string which will hold the token
  • nextpos (optional): reference to the position in the original string where the next token starts
  • sdelim (optional): pointer to a character which will hold the starting delimiter of the token
  • edelim (optional): pointer to a character which will hold the ending delimiter of the token

Code:

#include <stdlib.h>
#include <string.h>

int token(char* start, char* delim, char** tok, char** nextpos, char* sdelim, char* edelim) {
    // Find beginning:
    int len = 0;
    char *scanner;
    int dictionary[8];
    int ptr;

    for(ptr = 0; ptr < 8; ptr++) {
        dictionary[ptr] = 0;
    }

    for(; *delim; delim++) {
        dictionary[*delim / 32] |= 1 << *delim % 32;
    }

    if(sdelim) {
        *sdelim = 0;
    }

    for(; *start; start++) {
        if(!(dictionary[*start / 32] & 1 << *start % 32)) {
            break;
        }
        if(sdelim) {
            *sdelim = *start;
        }
    }

    if(*start == 0) {
        if(nextpos != NULL) {
            *nextpos = start;
        }
        *tok = NULL;
        return 0;
    }

    for(scanner = start; *scanner; scanner++) {
        if(dictionary[*scanner / 32] & 1 << *scanner % 32) {
            break;
        }
        len++;
    }

    if(edelim) {
        *edelim = *scanner;
    }

    if(nextpos != NULL) {
        *nextpos = scanner;
    }

    *tok = (char*)malloc(sizeof(char) * (len + 1));

    if(*tok == NULL) {
        return 0;
    }

    memcpy(*tok, start, len);
    *(*tok + len) = 0;


    return len + 1;
}

I get most of it except for:

dictionary[*delim / 32] |= 1 << *delim % 32;

and

dictionary[*start / 32] & 1 << *start % 32

Is it magic?

Comments (3)

爱冒险 2024-09-19 07:49:33

Each character of the delimiter string is one byte, which is 8 bits on virtually every current platform, so it can take at most 256 distinct values.

The dictionary is split into 8 pieces (int dictionary[8]) of 32 bits each (this assumes int is at least 32 bits wide), and 32 * 8 = 256.

Together these form a 256-bit table, one bit per possible character value. The first loop turns on the bit for each character in the delimiter string (dictionary[*delim / 32] |= 1 << *delim % 32;). The array index is *delim / 32, the character code divided by 32. For codes from 0 to 255 this yields 0 to 7, selecting one of the 8 ints. The remainder, *delim % 32, is a value from 0 to 31 and selects which bit within that int to turn on. (If char is signed on your platform, bytes above 127 are negative and would produce a negative index, so the code as written is only safe for plain ASCII input.)

All this does is mark a bit of the 256-bit table as true for every character that appears in the delimiter string.

Determining whether a character is a delimiter is then a single lookup in that table (dictionary[*start / 32] & 1 << *start % 32).
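
To make the table idea concrete, here is a minimal sketch of the same technique written as a small character set. The names charset, charset_init and charset_has are made up for illustration and do not appear in the original code. It assumes 8-bit chars and uses uint32_t and an unsigned char cast, which the original does not do, so that bytes above 127 and bit 31 are handled without surprises.

```c
#include <stdint.h>

/* Hypothetical type: 8 words x 32 bits = 256 bits, one per char value. */
typedef struct {
    uint32_t words[8];
} charset;

/* Clear the table, then switch on one bit for every character in chars. */
void charset_init(charset *s, const char *chars) {
    for (int i = 0; i < 8; i++) {
        s->words[i] = 0;
    }
    for (; *chars; chars++) {
        unsigned char c = (unsigned char)*chars;        /* 0..255, never negative */
        s->words[c / 32] |= (uint32_t)1 << (c % 32);    /* word c/32, bit c%32 */
    }
}

/* Return 1 if ch was in the string passed to charset_init, else 0. */
int charset_has(const charset *s, char ch) {
    unsigned char c = (unsigned char)ch;
    return (int)((s->words[c / 32] >> (c % 32)) & 1u);
}
```

After charset_init(&s, ","), charset_has(&s, ',') is 1 and charset_has(&s, 'a') is 0. Each query costs one index and one shift, regardless of how many delimiter characters there are.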

月寒剑心 2024-09-19 07:49:33

They record which characters occur in the delimiter string by building a table of 8 x 32 = 256 bits, stored in dictionary.

dictionary[*delim / 32] |= 1 << *delim % 32;

sets the bit corresponding to *delim

dictionary[*start / 32] & 1 << *start % 32

checks the bit
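
Part of what makes the two expressions look like magic is that they rely on operator precedence: in C, % binds tighter than <<, which binds tighter than &, and |= is applied last. The sketch below spells out the grouping with explicit parentheses next to the terse form from the question. The function names are hypothetical, and the character code is passed as an int that is assumed to lie in 0..255.

```c
/* Terse form, grouped by precedence exactly as in the question. */
void set_bit_terse(int *dictionary, int c) {
    dictionary[c / 32] |= 1 << c % 32;
}

/* The same statement with the implicit grouping written out. */
void set_bit_explicit(int *dictionary, int c) {
    dictionary[c / 32] |= (1 << (c % 32));
}

/* Terse lookup: non-zero if the bit for c is set. */
int bit_is_set_terse(const int *dictionary, int c) {
    return dictionary[c / 32] & 1 << c % 32;
}

/* The same lookup with the implicit grouping written out. */
int bit_is_set_explicit(const int *dictionary, int c) {
    return dictionary[c / 32] & (1 << (c % 32));
}
```

Both pairs behave identically. The parenthesised versions are only easier to read.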

你另情深 2024-09-19 07:49:33

OK, so if we pass the string "," as the delimiter, then dictionary[*delim / 32] |= 1 << *delim % 32 sets dictionary[1] to 4096. The expression dictionary[*start / 32] & 1 << *start % 32 simply checks for a matching character.
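
The arithmetic behind that 4096 can be checked with a small helper. This is an illustrative sketch, not part of the original code: locate_bit and bit_location are hypothetical names, and the numbers assume an ASCII character set, where ',' is 44.

```c
/* Hypothetical helper: reports which array slot and which bit the
   tokenizer's table would use for a given character. */
typedef struct {
    int word;            /* index into dictionary[], 0..7 */
    int bit;             /* bit position inside that slot, 0..31 */
    unsigned long mask;  /* 1 << bit, the value OR-ed into the slot */
} bit_location;

bit_location locate_bit(char ch) {
    unsigned char c = (unsigned char)ch;   /* keep the code in 0..255 */
    bit_location loc;
    loc.word = c / 32;
    loc.bit  = c % 32;
    loc.mask = 1UL << loc.bit;
    return loc;
}
```

For ',' this gives word 44 / 32 = 1, bit 44 % 32 = 12, and mask 1 << 12 = 4096, which is why a freshly zeroed dictionary ends up with dictionary[1] equal to 4096.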

What puzzles me is why they are not using direct char comparison.
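
For comparison, here is a sketch of the direct-comparison approach using the standard strspn and strcspn functions, which test each input character against the delimiter string. The helper name next_token_bounds is hypothetical. The thread does not say why the author chose a bit table; one plausible reason is that a table lookup takes the same time however many delimiter characters there are, while a direct scan may compare each input character against every delimiter.

```c
#include <string.h>

/* Hypothetical helper: finds the next token in s without copying it.
   Stores the token's first character in *tok_start and returns its length
   (0 if only delimiters, or nothing, remain). */
size_t next_token_bounds(const char *s, const char *delims,
                         const char **tok_start) {
    s += strspn(s, delims);       /* skip the leading run of delimiters */
    *tok_start = s;
    return strcspn(s, delims);    /* count characters up to the next delimiter */
}
```

With a single delimiter such as "," the difference is unlikely to matter. The table only starts to pay off as the delimiter set grows.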
