如何手动编写(shell)词法分析器

发布于 10-27 21:03 字数 561 浏览 1 评论 0原文

我正在开发一个 shell,一个类似 bash 的小型 shell,没有脚本(如果 while ...) 我必须手工制作词法分析器/解析器(LL)。

因此词法分析器会将命令 (char *cmd) 转换为链接列表 (t_list *list)。 LL 解析器会将链表 (t_list *list) 转换为带有 语法

所以,我知道如何制作 LL 解析器,但我不知道如何标记我的命令。

例如:ps | grep ls>>文件 ;制作&& ./a.out

=> <代码>'ps''|' 'grep' 'ls' '>>' '文件' ';' ''制作'&&' './a.out'

谢谢。

(我不想使用任何发电机)

I'm working on a shell, a small bash-like shell, without scripting (if while ...)
I have to make the lexer/parser (LL) by hand.

So the lexer will transform the command (char *cmd) to a linked list (t_list *list).
And the LL parser will transform the linked list (t_list *list) to an AST (binary tree t_btree *root) with a grammar

So, I know how to make the LL parser but I don't know how to tokenize my command.

For example: ps | grep ls >> file ; make && ./a.out

=> 'ps' '|' 'grep' 'ls' '>>' 'file' ';' ''make '&&' './a.out'

Thanks.

(I don't wanna use any generator)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

落花浅忆2024-11-03 21:03:29

(这解释了Spudd86暗示的想法)。

您需要实现一个有限状态机。有以下状态:

  • 一般状态
  • 文件名内部
  • && 标记
  • 内部 || 标记内部

对于每个状态和下一个输入字符,您必须决定下一个状态是什么,是否输出令牌。例如:

  • 当前状态:常规;字符:x =>下一个状态:内部文件名
  • 当前状态:内部文件名;字符:空格 =>下一个状态:常规;输出令牌
  • 当前状态:inside-file-name;字符:& =>下一个状态:inside-&&;输出token
  • 当前状态:inside-&&;字符:& =>下一个状态:常规;输出token
  • 当前状态:inside-&&;字符:x =>下一个状态:常规;语法错误
  • ...(令人作呕)

制定所有规则是一件非常无聊的工作(当您必须调试生成的代码时,乐趣就开始了),因此大多数人使用代码生成器来做到这一点。


编辑:一些代码(抱歉,如果语法混乱;我通常用 C++ 编程)

enum state {
    STATE_GENERAL,
    STATE_IN_FILENAME,
    ...
};

// Many characters are treated the same (e.g. 'x' and 'y') - so use categories
enum character_category
{
    CHAR_GENERAL, // can appear in filenames
    CHAR_WHITESPACE = ' ',
    CHAR_AMPERSAND = '&',
    CHAR_PIPE = '|',
    CHAR_EOF = EOF,
    ...
};

character_category translate(int c)
{
    switch (c) {
    case '&': return CHAR_AMPERSAND;
    case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
    ...
    default: return CHAR_GENERAL;
    }
}

void do_stuff()
{
    character_category cat;
    state current_state = STATE_GENERAL;
    state next_state;
    char token[100];
    char token_length = 0;
    do {
        int c = getchar();
        cat = translate(c);
        // The following implements a switch on 2 variables
        int selector = 1000 * current_state + cat;

        switch (selector)
        {
        case 1000 * STATE_GENERAL + CHAR_GENERAL:
            next_state = STATE_IN_FILENAME;
            token[token_length++] = c; // append a character to a filename token
            break;

        case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
            next_state = STATE_GENERAL; // do nothing
            break;

        case 1000 * STATE_GENERAL + CHAR_PIPE:
            next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
            break;

        // Much repetitive code already; define a macro for the case constants?
        // Have to cover all states and all character categories; good luck...

        case 1000 * STATE_IN_FILENAME + EOF:
        case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
            next_state = STATE_GENERAL;
            printf("Filename token: %s\n", token);
            break;

        default:
            printf("Bug\n"); // forgot one of the cases?
        }

        current_state = next_state;

    } while (cat != CHAR_EOF);
}

(This explains the idea hinted by Spudd86).

You need to implement a finite state machine. There are the following states:

  • General state
  • Inside a file name
  • Inside the && token
  • Inside the || token

For each state and next input character, you have to decide what is the next state, and whether to output a token. For example:

  • Current state: General; character: x => next state: inside-file-name
  • Current state: inside-file-name; character: space => next state: General; output the token
  • Current state: inside-file-name; character: & => next state: inside-&&; output the token
  • Current state: inside-&&; character: & => next state: General; output the token
  • Current state: inside-&&; character: x => next state: General; syntax error
  • ... (ad nauseum)

It's much boring work to work out all the rules (the fun starts when you must debug the resulting code), so most people use code generators to do that.


Edit: some code (sorry if the syntax is messed-up; i usually program in C++)

enum state {
    STATE_GENERAL,
    STATE_IN_FILENAME,
    ...
};

// Many characters are treated the same (e.g. 'x' and 'y') - so use categories
enum character_category
{
    CHAR_GENERAL, // can appear in filenames
    CHAR_WHITESPACE = ' ',
    CHAR_AMPERSAND = '&',
    CHAR_PIPE = '|',
    CHAR_EOF = EOF,
    ...
};

character_category translate(int c)
{
    switch (c) {
    case '&': return CHAR_AMPERSAND;
    case ' ': case '\t': case '\n': return CHAR_WHITESPACE;
    ...
    default: return CHAR_GENERAL;
    }
}

void do_stuff()
{
    character_category cat;
    state current_state = STATE_GENERAL;
    state next_state;
    char token[100];
    char token_length = 0;
    do {
        int c = getchar();
        cat = translate(c);
        // The following implements a switch on 2 variables
        int selector = 1000 * current_state + cat;

        switch (selector)
        {
        case 1000 * STATE_GENERAL + CHAR_GENERAL:
            next_state = STATE_IN_FILENAME;
            token[token_length++] = c; // append a character to a filename token
            break;

        case 1000 * STATE_GENERAL + CHAR_WHITESPACE:
            next_state = STATE_GENERAL; // do nothing
            break;

        case 1000 * STATE_GENERAL + CHAR_PIPE:
            next_state = STATE_IN_OR_TOKEN; // the first char in '||' or just '|'
            break;

        // Much repetitive code already; define a macro for the case constants?
        // Have to cover all states and all character categories; good luck...

        case 1000 * STATE_IN_FILENAME + EOF:
        case 1000 * STATE_IN_FILENAME + CHAR_WHITESPACE:
            next_state = STATE_GENERAL;
            printf("Filename token: %s\n", token);
            break;

        default:
            printf("Bug\n"); // forgot one of the cases?
        }

        current_state = next_state;

    } while (cat != CHAR_EOF);
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文