Creating a lexer/parser for lists

Posted on 2024-12-27 02:19:11

I need to create a lexer/parser which deals with input data of variable length and structure.

Say I have a list of reserved keywords:

keyWordList = ['command1', 'command2', 'command3']

and a user input string:

userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()

How would I go about writing this function:

INPUT:

tokenize(userInputList, keyWordList)

OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']

I've written a tokenizer that can identify keywords, but have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.

RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.

Comments (4)

沉睡月亮 2025-01-03 02:19:11

Something like this:

def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur        # flush the non-keyword group collected so far
            yield x          # then the keyword itself
            cur = []
        else:
            cur.append(x)
    if cur:                  # don't drop a trailing non-keyword group
        yield cur

This returns a generator, so wrap the call in list(). The final if cur check ensures a trailing group of non-keywords is not dropped.
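
A quick check, using the corrected input string from the question:

>>> list(tokenize(userInput.split(), keyWordList))
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']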

锦欢 2025-01-03 02:19:11

That is easy to do with some regex:

>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]

Now you just have to split the first element of each tuple.

For more than one level of nesting, regex is probably not a good answer.

There are some nice parsers to choose from on this page: http://wiki.python.org/moin/LanguageParsing

I think Lepl is a good one.
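
To get from those tuples to the exact nested shape in the question, a short follow-up is enough; a minimal sketch that splits each tuple's first element:

>>> result = []
>>> for words, keyword in re.findall(reg, userInput):
...     result.append(words.split())   # the non-keyword run becomes a nested list
...     result.append(keyword)
...
>>> result
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']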

骷髅 2025-01-03 02:19:11

Try this:

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()

def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)   # a set gives O(1) membership tests
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)    # flush the accumulated non-keyword group
            tokens.append(e)      # then append the keyword itself
            acc = []
        else:
            acc.append(e)
    if acc:                       # keep any trailing non-keyword group
        tokens.append(acc)
    return tokens

tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
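
Since the question mentions extending this to lists of objects other than strings, note that this version only requires the elements to be hashable (for the set membership test). A quick sketch with integers standing in for keywords (hypothetical data, same function):

tokenize([1, 2, 99, 3, 4, 98], [98, 99])
> [[1, 2], 99, [3, 4], 98]

For unhashable objects, swap the set lookup for a predicate function.
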
对风讲故事 2025-01-03 02:19:11

Or have a look at PyParsing. Quite a nice little lex/parse combination.
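
For completeness, a minimal pyparsing sketch of the same grouping (this assumes pyparsing is installed, and is just one way to express the grammar):

from pyparsing import Group, OneOrMore, Word, oneOf, printables

keyword = oneOf(keyWordList, asKeyword=True)   # match reserved words as whole tokens
other = ~keyword + Word(printables)            # any token that is not a keyword
grammar = OneOrMore(Group(OneOrMore(other)) | keyword)

grammar.parseString(userInput).asList()
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']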
