C Tokenizer - How does it work?

Posted 2024-09-12 07:49:33, 1755 characters, 20 views, 0 comments

How does this work?

I know to use it you pass in:

start: string (e.g. "Item 1, Item 2, Item 3")
delim: delimiter string (e.g. ",")
tok: reference to a string which will hold the token
nextpos (optional): reference to the position in the original string where the next token starts
sdelim (optional): pointer to a character which will hold the starting delimiter of the token
edelim (optional): pointer to a character which will hold the ending delimiter of the token

Code:

#include <stdlib.h>
#include <string.h>

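int token(char* start, char* delim, char** tok, char** nextpos, char* sdelim, char* edelim) {
    int len = 0;
    char *scanner;
    int dictionary[8];   // 8 * 32 = 256 flags, one per possible character value
    int ptr;

    // Clear the delimiter table.
    for(ptr = 0; ptr < 8; ptr++) {
        dictionary[ptr] = 0;
    }

    // Set one bit per delimiter character. Note: where char is signed, a byte
    // above 127 is negative here and would index outside the table.
    for(; *delim; delim++) {
        dictionary[*delim / 32] |= 1 << *delim % 32;
    }

    if(sdelim) {
        *sdelim = 0;
    }

    // Find beginning: skip leading delimiters, remembering the last one skipped.
    for(; *start; start++) {
        if(!(dictionary[*start / 32] & 1 << *start % 32)) {
            break;
        }
        if(sdelim) {
            *sdelim = *start;
        }
    }

    // Only delimiters (or nothing) left: there is no token.
    if(*start == 0) {
        if(nextpos != NULL) {
            *nextpos = start;
        }
        *tok = NULL;
        return 0;
    }

    // Find end: measure the token up to the next delimiter or the end of the string.
    for(scanner = start; *scanner; scanner++) {
        if(dictionary[*scanner / 32] & 1 << *scanner % 32) {
            break;
        }
        len++;
    }

    // Report the delimiter that ended the token (0 at the end of the string).
    if(edelim) {
        *edelim = *scanner;
    }

    // Tell the caller where to resume scanning.
    if(nextpos != NULL) {
        *nextpos = scanner;
    }

    // Copy the token into a newly allocated buffer; the caller must free() it.
    *tok = (char*)malloc(sizeof(char) * (len + 1));

    if(*tok == NULL) {
        return 0;
    }

    memcpy(*tok, start, len);
    *(*tok + len) = 0;   // NUL-terminate the copy

    // Size of the allocated buffer: token length plus the terminator.
    return len + 1;
}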
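
A minimal usage sketch, assuming the token() function from the question is used unchanged. It is repeated here in condensed form only so that the block compiles on its own. The helper name print_tokens is hypothetical and not part of the original code. It illustrates the calling pattern the parameter list implies: pass the current position as start, let nextpos advance it, and free each returned token.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* token() from the question, condensed so this sketch compiles on its own. */
int token(char *start, char *delim, char **tok, char **nextpos,
          char *sdelim, char *edelim) {
    int len = 0, i;
    char *scanner;
    int dictionary[8];

    for (i = 0; i < 8; i++) dictionary[i] = 0;
    for (; *delim; delim++) dictionary[*delim / 32] |= 1 << *delim % 32;

    if (sdelim) *sdelim = 0;
    for (; *start; start++) {
        if (!(dictionary[*start / 32] & 1 << *start % 32)) break;
        if (sdelim) *sdelim = *start;
    }
    if (*start == 0) {
        if (nextpos) *nextpos = start;
        *tok = NULL;
        return 0;
    }
    for (scanner = start; *scanner; scanner++) {
        if (dictionary[*scanner / 32] & 1 << *scanner % 32) break;
        len++;
    }
    if (edelim) *edelim = *scanner;
    if (nextpos) *nextpos = scanner;

    *tok = (char *)malloc(len + 1);
    if (*tok == NULL) return 0;
    memcpy(*tok, start, len);
    (*tok)[len] = 0;
    return len + 1;
}

/* Hypothetical helper: prints every token and returns how many were found. */
int print_tokens(char *text, char *delims) {
    char *pos = text;
    char *tok = NULL;
    int count = 0;

    /* token() returns 0 at the end of input (or on allocation failure). */
    while (token(pos, delims, &tok, &pos, NULL, NULL) > 0) {
        printf("[%s]\n", tok);
        free(tok);   /* each token is malloc'd by token(), so the caller frees it */
        count++;
    }
    return count;
}
```

With the inputs from the question, print_tokens("Item 1, Item 2, Item 3", ",") finds three tokens. The second and third keep their leading space, because only ',' is in the delimiter string.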

I get most of it except for:

dictionary[*delim / 32] |= 1 << *delim % 32;

and

dictionary[*start / 32] & 1 << *start % 32

Is it magic?


爱冒险 2024-09-19 07:49:33

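Each character of the delimiter string is 8 bits wide on typical platforms (sizeof(char) == 1 byte), so it can take only 256 possible values.

The dictionary is split into 8 parts (int dictionary[8]), each holding 32 flags. This assumes int is at least 32 bits wide (sizeof(int) >= 4 bytes), which the C standard does not guarantee but which holds on most current platforms. 32 * 8 = 256.

Together this forms a 256-bit table. The code then turns on the flag for each character in the delimiter string (dictionary[*delim / 32] |= 1 << *delim % 32;). The array index is *delim / 32, that is, the character's numeric value divided by 32. Since that value ranges from 0 to 255 when the character is treated as unsigned, the division yields an index from 0 to 7 plus a remainder. The remainder, computed by the modulus operation, selects which bit to turn on.

All this does is mark certain bits of the 256-bit table as true when the corresponding character appears in the delimiter string.

Determining whether a character is a delimiter is then simply a lookup in the 256-bit table (dictionary[*start / 32] & 1 << *start % 32).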
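
The table described above can be sketched as a small standalone character set. The names charset, charset_init and charset_contains are hypothetical and do not appear in the question. Unlike the original code, this sketch converts each character to unsigned char first and uses a fixed-width uint32_t, so bytes above 127 cannot produce a negative index and the word size does not depend on the platform's int.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type: 8 words of 32 bits, one bit per possible byte value. */
typedef struct { uint32_t words[8]; } charset;   /* 8 * 32 = 256 bits */

/* Turn on the bit for every character in chars. */
void charset_init(charset *s, const char *chars) {
    memset(s->words, 0, sizeof s->words);
    for (; *chars; chars++) {
        unsigned char c = (unsigned char)*chars;   /* 0..255, never negative */
        s->words[c / 32] |= (uint32_t)1 << (c % 32);
    }
}

/* Return 1 if ch was in the string given to charset_init, otherwise 0. */
int charset_contains(const charset *s, char ch) {
    unsigned char c = (unsigned char)ch;
    return (s->words[c / 32] >> (c % 32)) & 1u;
}
```

The lookup costs one division, one shift and one mask per character, regardless of how many delimiters were registered.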


月寒剑心 2024-09-19 07:49:33

They store which characters have occurred by making an 8 x 32 = 256 table of bits stored in dictionary.

dictionary[*delim / 32] |= 1 << *delim % 32;

sets the bit corresponding to *delim, and

dictionary[*start / 32] & 1 << *start % 32

checks the bit corresponding to *start.


你另情深 2024-09-19 07:49:33

OK, so if we pass the string "," as the delimiter, then dictionary[*delim / 32] |= 1 << *delim % 32 results in dictionary[1] = 4096. The expression dictionary[*start / 32] & 1 << *start % 32 simply checks for a matching character.
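
The arithmetic behind that value can be checked directly. In ASCII, ',' is 44, and 44 = 1 * 32 + 12, so the flag lives in dictionary[1] at bit 12, and 1 << 12 is 4096. The sketch below isolates the two sub-expressions; word_index and bit_mask are hypothetical helper names, and the character is assumed to be plain ASCII. Note the operator precedence: % binds tighter than <<, which binds tighter than &, so the lookup is parsed as dictionary[i] & (1 << (c % 32)).

```c
/* Hypothetical helpers that isolate the two sub-expressions from the question. */

/* Which of the 8 ints holds the flag for c. */
int word_index(char c) { return c / 32; }

/* The single-bit mask within that int; parsed as 1 << (c % 32). */
int bit_mask(char c) { return 1 << c % 32; }
```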

What puzzles me is why they don't use a direct char comparison.
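
For comparison, a direct check might look like the sketch below, where is_delim_linear is a hypothetical name. One plausible reason for the bit table (an assumption, since the thread does not state the author's motive) is cost: a direct comparison has to walk the delimiter string for every input character, while the table lookup takes the same time however many delimiters there are. With a single delimiter such as "," the two approaches do essentially the same amount of work.

```c
#include <string.h>

/* Hypothetical helper: compare c against every delimiter in turn. */
int is_delim_linear(char c, const char *delims) {
    /* strchr also matches the terminating NUL, so exclude c == 0 explicitly. */
    return c != '\0' && strchr(delims, c) != NULL;
}
```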


About the author

深海少女心
