Pythonic way to implement a tokenizer

Comments (12)

那一片橙海, 2024-07-23 11:25:12

When I start something new in Python I usually look first at some modules or libraries to use. There's a 90%+ chance that something is already available.

For tokenizers and parsers this is certainly so. Have you looked at PyParsing?
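For example, a minimal sketch of what a PyParsing-based scan could look like (the toy grammar below is made up for illustration, not taken from the question):

from pyparsing import Word, alphas, nums, OneOrMore

# Integers and identifiers; PyParsing skips the whitespace between matches itself.
integer = Word(nums)
identifier = Word(alphas + "_")
tokens = OneOrMore(integer | identifier).parseString("45 pigeons 23 cows")
print(tokens.asList())   # ['45', 'pigeons', '23', 'cows']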

吹泡泡o 2024-07-23 11:25:12

I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:

  • a surface scanner: This one actually reads the text and uses regular expressions to split it up into only the most primitive tokens (operators, identifiers, numbers, ...); this one yields tuples (tokenname, scannedstring, startpos, endpos).
  • a tokenizer: This consumes the tuples from the first layer, turning them into token objects (named tuples would do as well, I think). Its purpose is to detect some long-range dependencies in the token stream, particularly strings (with their opening and closing quotes) and comments (with their opening and closing lexemes - yes, I wanted to retain comments!) and coerce them into single tokens. The resulting stream of token objects is then returned to a consuming parser.

Both are generators. The benefits of this approach were:

  • Reading of the raw text is done only in the most primitive way, with simple regexps - fast and clean.
  • The second layer is already implemented as a primitive parser, to detect string literals and comments - re-use of parser technology.
  • You don't have to strain the surface scanner with complex detections.
  • But the real parser gets tokens at the semantic level of the language to be parsed (again: strings, comments).

I feel quite happy with this layered approach.
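A minimal sketch of that layered setup (the token names, and treating only quoted strings as the long-range construct, are illustrative rather than the original code):

import re
from collections import namedtuple

Token = namedtuple("Token", ["name", "text", "start", "end"])

# Layer 1: surface scanner - one simple regex, only the most primitive pieces.
_SCAN = re.compile(r"""
      (?P<NUMBER>\d+)
    | (?P<IDENT>[A-Za-z_]\w*)
    | (?P<QUOTE>")
    | (?P<OP>[-+*/=(){};,])
    | (?P<WS>\s+)
""", re.VERBOSE)

def surface_scan(text):
    for m in _SCAN.finditer(text):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group(), m.start(), m.end())

# Layer 2: tokenizer - consumes the layer-1 tuples and coerces long-range
# constructs (here just quoted strings) into single Token objects.
def tokenize(text):
    raw = surface_scan(text)
    for name, piece, start, end in raw:
        if name == "QUOTE":
            # drive layer 1 forward until the closing quote, emit one STRING token
            for name2, _piece, _start, end2 in raw:
                if name2 == "QUOTE":
                    yield Token("STRING", text[start:end2], start, end2)
                    break
        else:
            yield Token(name, piece, start, end)

# for tok in tokenize('x = 42; s = "hi there";'):
#     print(tok)

Both layers are generators, so nothing is materialised until the consuming parser asks for the next token.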

怀念你的温柔 2024-07-23 11:25:12

I'd turn to the excellent Text Processing in Python by David Mertz.

三人与歌 2024-07-23 11:25:12

"Is there a better alternative to just simply returning a list of tuples"

I had to implement a tokenizer, but it required a more complex approach than a list of tuples, so I implemented a class for each token. You can then return a list of class instances, or, if you want to save resources, return something that implements the iterator interface and generates the next token as the parse progresses.
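For instance, such a token class might look roughly like this (the attribute names are just illustrative):

class Token(object):
    """One lexical token; richer than a bare tuple, so it can carry extras."""

    def __init__(self, type_, value, line=None, column=None):
        self.type = type_
        self.value = value
        self.line = line      # optional source position, handy for error messages
        self.column = column

    def __repr__(self):
        return "Token(%r, %r)" % (self.type, self.value)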

来世叙缘 2024-07-23 11:25:11

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re
scanner=re.Scanner([
  (r"[0-9]+",       lambda scanner,token:("INTEGER", token)),
  (r"[a-z_]+",      lambda scanner,token:("IDENTIFIER", token)),
  (r"[,.]+",        lambda scanner,token:("PUNCTUATION", token)),
  (r"\s+", None), # None == skip token.
])

results, remainder=scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple hundred lines.

橘亓 2024-07-23 11:25:11

Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.

凤舞天涯 2024-07-23 11:25:11

In many situations, especially when parsing long input streams, you may find it more useful to implement your tokenizer as a generator function. This way you can easily iterate over all the tokens without needing lots of memory to build the whole list of tokens first.

For generators, see the original proposal or other online docs.
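For example, a tiny sketch of the generator style (the token kinds are made up); the caller pulls tokens one at a time, so the whole token list never has to exist in memory:

def read_tokens(stream):
    # Lazily yield (kind, text) pairs, one per whitespace-separated word.
    for line in stream:
        for word in line.split():
            kind = "NUMBER" if word.isdigit() else "WORD"
            yield (kind, word)

# with open("input.txt") as f:
#     for kind, text in read_tokens(f):
#         print(kind, text)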

傾城如夢未必闌珊 2024-07-23 11:25:11

Thanks for your help, I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (particularly I'm concerned about passing a file object to the tokenizer):

class Tokenizer(object):

  def __init__(self,file):
     self.file = file

  def __get_next_character(self):
      return self.file.read(1)

  def __peek_next_character(self):
      character = self.file.read(1)
      self.file.seek(self.file.tell()-1,0)
      return character

  def __read_number(self):
      value = ""
      while self.__peek_next_character().isdigit():
          value += self.__get_next_character()
      return value

  def next_token(self):
      character = self.__peek_next_character()

      if character.isdigit():
          return self.__read_number()
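
One thing to watch with the seek-based peek above: at end-of-file read(1) returns an empty string, yet the code still seeks back one character, and seek() is not available on every stream. A variant that keeps a one-character lookahead buffer instead (a hypothetical sketch, not the poster's code) avoids both issues:

class BufferedTokenizer(object):
    """Variant sketch: one-character lookahead buffer instead of seeking back."""

    def __init__(self, stream):
        self.stream = stream
        self._buffer = None   # holds one peeked character, or None

    def _get_next_character(self):
        if self._buffer is not None:
            character, self._buffer = self._buffer, None
            return character
        return self.stream.read(1)

    def _peek_next_character(self):
        if self._buffer is None:
            self._buffer = self.stream.read(1)
        return self._buffer

    def _read_number(self):
        value = ""
        while self._peek_next_character().isdigit():
            value += self._get_next_character()
        return value

    def next_token(self):
        character = self._peek_next_character()
        if character.isdigit():
            return self._read_number()
        return None   # extend with other token kinds as needed
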
望喜 2024-07-23 11:25:11

"Is there a better alternative to just simply returning a list of tuples?"

Nope. It works really well.

吹梦到西洲 2024-07-23 11:25:11

"Is there a better alternative to just simply returning a list of tuples?"

That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.
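For example (a quick illustration of that module, not part of the original answer):

import io
import token
import tokenize

source = "total = price * 1.21  # with tax\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # each tok is a named tuple; this prints pairs like NAME 'total', OP '=', NUMBER '1.21'
    print(token.tok_name[tok.type], repr(tok.string))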

ぽ尐不点ル 2024-07-23 11:25:11

I have recently built a tokenizer, too, and passed through some of your issues.

Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,

_INTEGER = 0x0007
_FLOAT = 0x0008
_VARIABLE = 0x0009

and so on. I have used an underscore in front of the name to point out that those fields are somehow "private" to the module, but I really don't know if this is typical or advisable, or even how Pythonic it is. (Also, I'll probably ditch the numbers in favour of strings, because they are much more readable during debugging.)

Tokens are returned as named tuples.

from collections import namedtuple
Token = namedtuple('Token', ['value', 'type'])
# so that e.g. somewhere in a function/method I can write...
t = Token(n, _INTEGER)
# ...and return it properly

I have used named tuples because the tokenizer's client code (e.g. the parser) reads a little more clearly when using names (e.g. token.value) instead of indexes (e.g. token[0]).

Finally, I've noticed that sometimes, especially writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.

def open_reader(self, source):
    """
    Produces a file object from source.
    The source can be either a file object already, or a string.
    """
    if hasattr(source, 'read'):
        return source
    else:
        from io import StringIO
        return StringIO(source)

影子是时光的心 2024-07-23 11:25:11

This being a late answer, there is now something in the official documentation: Writing a tokenizer with the re standard library. This is content in the Python 3 documentation that isn't in the Py 2.7 docs. But it is still applicable to older Pythons.

It shows short code, an easy setup, and a generator-based approach, as several answers here have proposed.

If the docs are not Pythonic, I don't know what is :-)
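The recipe in those docs boils down to roughly this shape (a condensed sketch; the token specification below is illustrative):

import re

def tokenize(code):
    token_specification = [
        ("NUMBER",   r"\d+(?:\.\d*)?"),  # integer or decimal number
        ("ID",       r"[A-Za-z_]\w*"),   # identifiers
        ("OP",       r"[+\-*/=()]"),     # operators and parentheses
        ("SKIP",     r"[ \t]+"),         # whitespace between tokens
        ("MISMATCH", r"."),              # anything else is an error
    ]
    tok_regex = "|".join("(?P<%s>%s)" % pair for pair in token_specification)
    for mo in re.finditer(tok_regex, code):
        kind, value = mo.lastgroup, mo.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise RuntimeError("unexpected character %r" % value)
        yield (kind, value)

# list(tokenize("answer = 6 * 7"))
# -> [('ID', 'answer'), ('OP', '='), ('NUMBER', '6'), ('OP', '*'), ('NUMBER', '7')]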
