Creating a lexer/parser for lists

Posted on 2024-12-27 02:19:11

I need to create a lexer/parser which deals with input data of variable length and structure.

Say I have a list of reserved keywords:

keyWordList = ['command1', 'command2', 'command3']

and a user input string:

userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()

How would I go about writing this function:

INPUT:

tokenize(userInputList, keyWordList)

OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']

I've written a tokenizer that can identify keywords, but have been unable to figure out an efficient way to embed groups of non-keywords into lists that are a level deeper.

RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.

Comments (4)

沉睡月亮 2025-01-03 02:19:11

Something like this:

def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur        # flush the non-keyword group collected so far
            yield x          # then the keyword itself
            cur = []
        else:
            cur.append(x)
    if cur:                  # don't drop a trailing non-keyword group
        yield cur

This returns a generator, so wrap the call in list(). The final if cur check ensures a trailing group of non-keywords is not dropped.
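
A quick check, using the corrected input string from the question:

>>> list(tokenize(userInput.split(), keyWordList))
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']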

锦欢 2025-01-03 02:19:11

That is easy to do with some regex:

>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]

Now you just have to split the first element of each tuple.

For more than one level of nesting, regex is probably not a good answer.

There are some nice parsers to choose from on this page: http://wiki.python.org/moin/LanguageParsing

I think Lepl is a good one.
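
To get from those tuples to the exact nested shape in the question, a short follow-up is enough; a minimal sketch that splits each tuple's first element:

>>> result = []
>>> for words, keyword in re.findall(reg, userInput):
...     result.append(words.split())   # the non-keyword run becomes a nested list
...     result.append(keyword)
...
>>> result
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']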

骷髅 2025-01-03 02:19:11

Try this:

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()

def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)   # a set gives O(1) membership tests
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)    # flush the accumulated non-keyword group
            tokens.append(e)      # then append the keyword itself
            acc = []
        else:
            acc.append(e)
    if acc:                       # keep any trailing non-keyword group
        tokens.append(acc)
    return tokens

tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
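
Since the question mentions extending this to lists of objects other than strings, note that this version only requires the elements to be hashable (for the set membership test). A quick sketch with integers standing in for keywords (hypothetical data, same function):

tokenize([1, 2, 99, 3, 4, 98], [98, 99])
> [[1, 2], 99, [3, 4], 98]

For unhashable objects, swap the set lookup for a predicate function.
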
对风讲故事 2025-01-03 02:19:11

Or have a look at PyParsing. Quite a nice little lex/parse combination.
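
For completeness, a minimal pyparsing sketch of the same grouping (this assumes pyparsing is installed, and is just one way to express the grammar):

from pyparsing import Group, OneOrMore, Word, oneOf, printables

keyword = oneOf(keyWordList, asKeyword=True)   # match reserved words as whole tokens
other = ~keyword + Word(printables)            # any token that is not a keyword
grammar = OneOrMore(Group(OneOrMore(other)) | keyword)

grammar.parseString(userInput).asList()
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']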
