Lexical analysis or a series of regular expressions for parsing unstructured text into a structured form

I am trying to write some code that will function like Google Calendar's quick add feature. You know, the one where you can input any of the following:
1) 24th sep 2010 , Johns Birthday
2) John's Birthday , 24/9/10
3) 24 September 2010 , Birthday of John Doe
4) 24-9-2010 : John Does Birthday
5) John Does Birthday 24th of September 2010

And it can figure out that we want an event on the date 24/9/2010 and keep the rest of the material as the event text.

I want to do this in Python.

I am thinking of a design where I write regular expressions that may match all of the cases listed above and extract the date. But I am sure there is a smarter way to approach this problem, since I am clearly not trained in lexical analysis or the many styles of parsers. I am looking for a good way to approach this problem.

耀眼的星火 2024-09-14 17:47:35

NOTE: The Python code here is only a rough sketch of how this might look.

Regular Expressions are good at finding and extracting data from text in a fixed format (e.g. a DD/MM/YYYY date).
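
For instance, a single pattern is enough to pull a numeric date out of a line; the sketch below (the DATE_RE name is just a placeholder) shows the idea:

import re

# Captures fixed-format numeric dates such as 24/9/10 or 24-9-2010.
DATE_RE = re.compile(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b')

match = DATE_RE.search("John's Birthday , 24/9/10")
print(match.group(0))  # prints: 24/9/10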

A lexer/parser pair is good at processing data in a structured, but somewhat variable, format. A lexer splits text into tokens. These tokens are units of information of a given type (number, string, etc.). A parser takes this series of tokens and does something depending on the order of the tokens.

Looking at the data, you have a basic (subject, verb, object) structure in different combinations for the relation (person, 'birthday', date):

I would handle 24/9/10 and 24-9-2010 as a single token using a regex, returning it as a date type. You could probably do the same for the other dates, with a map to convert September and sep to 9.
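
One possible pre-pass for that (a rough sketch; the normalize_date and MONTHS names are just placeholders) could rewrite the spelled-out forms into the numeric one before lexing:

import re

MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# "24th of September 2010", "24 sep 2010", ... -> "24/9/2010"
WORDY_DATE_RE = re.compile(
    r'(\d{1,2})(?:st|nd|rd|th)?\s+(?:of\s+)?([A-Za-z]+)\s+(\d{2,4})')

def normalize_date(text):
    def repl(m):
        month = MONTHS.get(m.group(2)[:3].lower())
        if month is None:
            return m.group(0)  # not a month name; leave the text untouched
        return '%s/%d/%s' % (m.group(1), month, m.group(3))
    return WORDY_DATE_RE.sub(repl, text)

print(normalize_date("24th sep 2010 , Johns Birthday"))
# -> 24/9/2010 , Johns Birthday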

You could then return everything else as strings (separated by whitespace).

You then have:

  1. date ',' string 'birthday'
  2. string 'birthday' ',' date
  3. date 'birthday' 'of' string string
  4. date ':' string string 'birthday'
  5. string string 'birthday' date

NOTE: 'birthday', ',', ':' and 'of' here are keywords, so:

import re

class ParseError(Exception):
    """Raised when the token stream does not match any of the expected forms."""

class Lexer:
    DATE = 1
    STRING = 2
    COMMA = 3
    COLON = 4
    BIRTHDAY = 5
    OF = 6

    keywords = {'birthday': BIRTHDAY, 'of': OF, ',': COMMA, ':': COLON}
    date_re = re.compile(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}')

    def __init__(self, text):
        # Split into words, treating ',' and ':' as tokens of their own.
        self.words = re.findall(r"[^\s,:]+|[,:]", text)
        self.pos = 0
        self.saved = None          # one token of push-back for keep()

    def next_token(self):
        if self.saved is not None:
            token, self.saved = self.saved, None
            return token
        if self.pos >= len(self.words):
            return None, None      # end of input
        word = self.words[self.pos]
        self.pos += 1
        if self.date_re.fullmatch(word):
            return self.DATE, word
        if word.lower() in self.keywords:
            return self.keywords[word.lower()], word
        return self.STRING, word

    def keep(self, type, value):
        # Push a token back; the next call to next_token() will return it.
        self.saved = (type, value)

All except 3 use a possessive form of the person's name (written as 's or just s in the examples). This can be tricky, as 'Alexis' might simply be a name ending in s rather than the possessive of 'Alexi', but since you are restricting where the possessive can appear, it is easy to detect:

def parseNameInPluralForm(lexer):
    # Parse a name and strip a possessive suffix: "John's" / "Johns" -> "John".
    name = parseName(lexer)
    if name.endswith("'s"): name = name[:-2]
    elif name.endswith("s"): name = name[:-1]
    return name

Now, a name can be either first-name or first-name last-name (yes, I know Japan swaps these around, but from a processing perspective, the above problem does not need to differentiate first and last names). The following will handle both forms:

def parseName(lexer):
    # A name is either "first" or "first last"; any other token is pushed back.
    type, firstName = lexer.next_token()
    if type != Lexer.STRING: raise ParseError()
    type, lastName = lexer.next_token()
    if type == Lexer.STRING:  # first-name last-name
        return firstName + ' ' + lastName
    else:
        lexer.keep(type, lastName)
        return firstName

Finally, you can process forms 1-5 using something like this:

def parseBirthday(lexer):
    type, data = lexer.next_token()
    if type == Lexer.DATE:  # forms 1, 3 & 4
        date = data
        type, data = lexer.next_token()
        if type == Lexer.COLON or type == Lexer.COMMA:  # 1 & 4
            person = parseNameInPluralForm(lexer)
            type, data = lexer.next_token()
            if type != Lexer.BIRTHDAY: raise ParseError()
        elif type == Lexer.BIRTHDAY:  # 3
            type, data = lexer.next_token()
            if type != Lexer.OF: raise ParseError()
            person = parseName(lexer)
        else:
            raise ParseError()
    elif type == Lexer.STRING:  # 2 & 5
        lexer.keep(type, data)
        person = parseNameInPluralForm(lexer)
        type, data = lexer.next_token()
        if type != Lexer.BIRTHDAY: raise ParseError()
        type, data = lexer.next_token()
        if type == Lexer.COMMA:  # 2
            type, data = lexer.next_token()
        if type != Lexer.DATE: raise ParseError()
        date = data
    else:
        raise ParseError()
    return person, date
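
Assuming the sketches above, and dates already reduced to a single numeric token (for example by a pre-pass like the normalize_date sketch earlier), a quick driver might look like this. Note that form 3 is given without the comma, to match the date 'birthday' 'of' string string pattern listed above:

samples = [
    "24/9/2010 , Johns Birthday",      # form 1, date already normalized
    "John's Birthday , 24/9/10",       # form 2
    "24/9/2010 Birthday of John Doe",  # form 3, without the comma
    "24-9-2010 : John Does Birthday",  # form 4
    "John Does Birthday 24/9/2010",    # form 5, date already normalized
]
for text in samples:
    person, date = parseBirthday(Lexer(text))
    print(person, '->', date)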