Lexical analysis or a series of regular expressions to parse unstructured text into a structured form

Posted on 2024-09-07 17:47:35

I am trying to write some code that will function like Google Calendar's Quick Add feature. You know, the one where you can input any of the following:
1) 24th sep 2010 , Johns Birthday
2) John's Birthday , 24/9/10
3) 24 September 2010 , Birthday of John Doe
4) 24-9-2010 : John Does Birthday
5) John Does Birthday 24th of September 2010

And it can figure out that we want an event on the date 24/9/2010, with the rest of the material as the event text.

I want to do this in Python.

I am thinking of a design where I write regular expressions that match all of the cases listed above and extract the date. But I am sure there is a smarter way to approach this problem, since I clearly am not trained in lexical analysis or the many styles of parsers. I am looking for a good way to approach this problem.

Comments (1)

耀眼的星火 2024-09-14 17:47:35

NOTE: The Python code here is a rough sketch of how it might look, not a tested implementation.

Regular Expressions are good at finding and extracting data from text in a fixed format (e.g. a DD/MM/YYYY date).
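As a small illustration of the fixed-format case, here is a minimal sketch that pulls a DD/MM/YYYY date out of a line of text. The names `DATE_RE` and `find_date` are made up for this example, and the pattern assumes slash separators and a four-digit year.

```python
import re

# Assumed format: day/month/year with slashes and a 4-digit year.
DATE_RE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{4})\b')

def find_date(text):
    """Return (day, month, year) for the first DD/MM/YYYY date in text, or None."""
    m = DATE_RE.search(text)
    if m is None:
        return None
    day, month, year = (int(g) for g in m.groups())
    return day, month, year

print(find_date("John's Birthday, 24/9/2010"))  # (24, 9, 2010)
```

This works well as long as the date always looks the same. Once the format starts to vary, the number of patterns grows quickly.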

A lexer/parser pair is good at processing data in a structured, but somewhat variable format. A lexer splits text into tokens. These tokens are units of information of a given type (number, string, etc.). A parser takes this series of tokens and does something depending on the order of the tokens.
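To make the split between the two roles concrete, here is a toy sketch using a single regular expression with named groups as the lexer. The token type names (`NUMBER`, `WORD`, `PUNCT`) and the functions `tokenize` and `starts_with_number` are assumptions for illustration only, not part of the design described further down.

```python
import re

# One alternative per token type; SKIP matches whitespace, which is discarded.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<WORD>[A-Za-z']+)
  | (?P<PUNCT>[,:])
  | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Lexer: turn text into a list of (type, value) pairs."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
    return tokens

def starts_with_number(tokens):
    """Toy 'parser': make a decision from the order of the token types."""
    return bool(tokens) and tokens[0][0] == 'NUMBER'

tokens = tokenize("John's Birthday, 24 September 2010")
print(tokens)
print(starts_with_number(tokens))
```

The parser never looks at raw characters, only at token types and their order, which is what keeps it manageable when the input format varies.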

Looking at the data, you have a basic (subject, verb, object) structure in different combinations for the relation (person, 'birthday', date):

I would handle 24/9/10 and 24-9-2010 each as a single token using a regex, returning it as a date type. You could probably do the same for the other dates, with a map to convert September and sep to 9.
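A possible shape for that date token, as a hedged sketch: one pattern for numeric dates, one for dates with a month name, and a map from month abbreviations to numbers. The names `MONTHS`, `NUMERIC_RE`, `NAMED_RE` and `match_date` are hypothetical. The sketch assumes day-first order, treats two-digit years as 20xx, and recognises a month by its first three letters, which is a loose check.

```python
import re
from datetime import date

# Map three-letter month abbreviations to month numbers (jan -> 1, ..., dec -> 12).
MONTHS = {name: i for i, name in enumerate(
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
     'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], start=1)}

# 24/9/10, 24-9-2010
NUMERIC_RE = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})\b')
# 24th sep 2010, 24 September 2010, 24th of September 2010
NAMED_RE = re.compile(
    r'(\d{1,2})(?:st|nd|rd|th)?\s+(?:of\s+)?([A-Za-z]+)\s+(\d{4})\b')

def match_date(text, pos=0):
    """Try to read a date starting at text[pos]; return (date, end_pos) or None."""
    m = NUMERIC_RE.match(text, pos)
    if m:
        day, month, year = (int(g) for g in m.groups())
    else:
        m = NAMED_RE.match(text, pos)
        if m is None:
            return None
        month = MONTHS.get(m.group(2)[:3].lower())
        if month is None:
            return None
        day, year = int(m.group(1)), int(m.group(3))
    if year < 100:
        year += 2000  # assumption: two-digit years mean 20xx
    return date(year, month, day), m.end()

print(match_date("24th of September 2010"))
```

Note that `date(...)` raises `ValueError` for impossible dates such as 32/13/2010, so a real lexer would need to decide how to report that.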

You could then return everything else as strings (separated by whitespace).

You then have:

  1. date ',' string 'birthday'
  2. string 'birthday' ',' date
  3. date ',' 'birthday' 'of' string string
  4. date ':' string string 'birthday'
  5. string string 'birthday' date

NOTE: 'birthday', ',', ':' and 'of' here are keywords, so:

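import re

# Class-level state keeps the Lexer.next_token() call style; call Lexer.feed(text) first.
class Lexer:
    DATE = 1
    STRING = 2
    COMMA = 3
    COLON = 4
    BIRTHDAY = 5
    OF = 6
    END = 7  # end of input (added so the sketch can terminate)

    keywords = {'birthday': BIRTHDAY, 'of': OF, ',': COMMA, ':': COLON}
    # Numeric dates only (24/9/10, 24-9-2010); month names need extra patterns.
    date_re = re.compile(r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}')
    # A word (apostrophes allowed) or a single punctuation character.
    word_re = re.compile(r"[\w']+|[^\w\s]")
    text, pos, saved = '', 0, None  # input, read offset, pushed-back token

    @classmethod
    def feed(cls, text):
        cls.text, cls.pos, cls.saved = text, 0, None

    @classmethod
    def next_token(cls):
        if cls.saved is not None:  # return a token pushed back by keep()
            token, cls.saved = cls.saved, None
            return token
        while cls.pos < len(cls.text) and cls.text[cls.pos].isspace():
            cls.pos += 1
        if cls.pos >= len(cls.text):
            return cls.END, None
        m = cls.date_re.match(cls.text, cls.pos)
        kind = cls.DATE
        if m is None:
            m = cls.word_re.match(cls.text, cls.pos)
            kind = cls.keywords.get(m.group().lower(), cls.STRING)
        cls.pos = m.end()
        return kind, m.group()

    @classmethod
    def keep(cls, kind, value):
        cls.saved = (kind, value)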

All except 3 use the possessive form of the person ('s in form 2; a bare trailing s with the apostrophe dropped in forms 1, 4 and 5). This can be tricky, as 'Alexis' could be a name in its own right or the possessive of 'Alexi', but since you are restricting where possessive forms can appear, it is easy to detect:

def parseNameInPluralForm():
    name = parseName()
    # Strings are immutable, so strip the possessive suffix by slicing.
    if name.endswith("'s"): name = name[:-2]
    elif name.endswith("s"): name = name[:-1]
    return name

Now, name can either be first-name or first-name last-name (yes, I know Japan swaps these around, but from a processing perspective, the above problem does not need to differentiate first and last names). The following will handle these two forms:

class ParseError(Exception):
    """Raised when the tokens do not match any expected form."""

def parseName():
    kind, firstName = Lexer.next_token()
    if kind != Lexer.STRING: raise ParseError()
    kind, lastName = Lexer.next_token()
    if kind == Lexer.STRING:  # first-name last-name
        return firstName + ' ' + lastName
    else:
        Lexer.keep(kind, lastName)  # not part of the name: push it back
        return firstName

Finally, you can process forms 1-5 using something like this:

def parseBirthday():
    kind, data = Lexer.next_token()
    if kind == Lexer.DATE:  # 1, 3 & 4
        date = data
        kind, data = Lexer.next_token()
        if kind == Lexer.COLON or kind == Lexer.COMMA:  # separator after the date
            kind, data = Lexer.next_token()
        if kind == Lexer.BIRTHDAY:  # 3
            kind, data = Lexer.next_token()
            if kind != Lexer.OF: raise ParseError()
            person = parseName()
        else:  # 1 & 4
            Lexer.keep(kind, data)
            person = parseNameInPluralForm()
            kind, data = Lexer.next_token()
            if kind != Lexer.BIRTHDAY: raise ParseError()
    elif kind == Lexer.STRING:  # 2 & 5
        Lexer.keep(kind, data)
        person = parseNameInPluralForm()
        kind, data = Lexer.next_token()
        if kind != Lexer.BIRTHDAY: raise ParseError()
        kind, data = Lexer.next_token()
        if kind == Lexer.COMMA:  # 2
            kind, data = Lexer.next_token()
        if kind != Lexer.DATE: raise ParseError()
        date = data
    else:
        raise ParseError()
    return person, date