NLTK custom tokenizer and tagger

Posted on 2024-09-28 02:14:32


Here is my requirement. I want to tokenize and tag a paragraph in a way that achieves the following:

1. Identify dates and times in the paragraph and tag them as DATE and TIME.
2. Identify known phrases in the paragraph and tag them as CUSTOM.
3. Tokenize and tag the rest of the content with NLTK's default word_tokenize and pos_tag functions.

For example, the following sentence

"They all like to go there on 5th November 2010, but I am not interested."

should be tokenized and tagged as follows when the custom phrase is "I am not interested":

[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'), 
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','), 
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]

Any suggestions would be useful.


淡看悲欢离合 2024-10-05 02:14:32


The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that's too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here:

import re
from nltk import pos_tag, word_tokenize

s = "They all like to go there on 5th November 2010, but I am not interested."

# '...' stands for the remaining month names, elided here; fill them in before use.
# The month and year are optional so that every prefix of a date ("5th",
# "5th November") matches, which the loop below relies on to extend a date.
DATE = re.compile(r'^[1-9][0-9]?(th|st|nd|rd)?( (January|...)( [12][0-9][0-9][0-9])?)?$')

def custom_tagger(sentence):
    """Yield (token, tag) pairs, merging each date into a single DATE token."""
    tagged = pos_tag(word_tokenize(sentence))
    phrase = []          # tokens of the candidate date collected so far
    date_found = False   # True once the collected tokens have matched DATE

    i = 0
    while i < len(tagged):
        (w, t) = tagged[i]
        phrase.append(w)
        in_date = DATE.match(' '.join(phrase))
        date_found |= bool(in_date)
        if date_found and not in_date:             # end of date found
            yield (' '.join(phrase[:-1]), 'DATE')
            phrase = []                            # re-examine this token next
            date_found = False
        elif date_found and i == len(tagged) - 1:  # sentence ends inside a date
            yield (' '.join(phrase), 'DATE')
            return
        else:
            i += 1
            if not in_date:
                yield (w, t)
                phrase = []
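
For comparison, here is a minimal sketch of the same longest-match idea using a different technique: instead of growing a phrase token by token, it tries the widest window of tokens first at each position. The full month list, the names MONTHS, DATE_FULL and merge_dates, and the max_len parameter are all assumptions made for illustration, since the pattern above elides the months. The function takes already tagged (word, tag) pairs, so it can be tried without NLTK.

```python
import re

# Assumption: a full English month list; the pattern above elides it with "...".
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Day with optional ordinal suffix, a month, and an optional four-digit year.
# A month is required, so a lone ordinal such as "5th" is never a date here.
DATE_FULL = re.compile(
    r"[1-9][0-9]?(st|nd|rd|th)? (" + MONTHS + r")( [12][0-9]{3})?")


def merge_dates(tagged, max_len=3):
    """Merge the longest run of tokens that forms a date into one DATE token."""
    out = []
    i = 0
    while i < len(tagged):
        # Try the widest window first so that the longest match wins.
        for n in range(min(max_len, len(tagged) - i), 0, -1):
            text = " ".join(w for w, _ in tagged[i:i + n])
            if DATE_FULL.fullmatch(text):
                out.append((text, "DATE"))
                i += n
                break
        else:
            # No window starting here is a date: keep the token as tagged.
            out.append(tagged[i])
            i += 1
    return out
```

In practice the input would be the output of pos_tag(word_tokenize(sentence)). The window search costs up to max_len regex checks per token, which is negligible for sentences but keeps the control flow simpler than tracking a partially matched phrase.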

Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
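
The CUSTOM phrase step from the to-do list could be sketched roughly as follows. This is only an illustration under stated assumptions: merge_phrases is a hypothetical helper, and it assumes each known phrase splits on whitespace into the same tokens the tokenizer produces (true for "I am not interested", but not for phrases containing punctuation or contractions). It works on already tagged (word, tag) pairs and prefers the longest phrase when several match at the same position.

```python
def merge_phrases(tagged, phrases, label="CUSTOM"):
    """Replace each occurrence of a known phrase with one (phrase, label) token."""
    # Assumption: whitespace splitting of a phrase matches the tokenizer's output.
    # Empty phrases are dropped; longer phrases are tried first.
    patterns = sorted((p.split() for p in phrases if p.split()),
                      key=len, reverse=True)
    words = [w for w, _ in tagged]
    out = []
    i = 0
    while i < len(tagged):
        for pat in patterns:
            if words[i:i + len(pat)] == pat:
                out.append((" ".join(pat), label))
                i += len(pat)  # skip past the merged tokens
                break
        else:
            # No phrase starts at this token: keep it with its POS tag.
            out.append(tagged[i])
            i += 1
    return out


# Hand-written tags stand in for the output of pos_tag(word_tokenize(...)).
example = [("but", "CC"), ("I", "PRP"), ("am", "VBP"), ("not", "RB"),
           ("interested", "JJ"), (".", ".")]
print(merge_phrases(example, ["I am not interested"]))
# [('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
```

Matching here is exact and case-sensitive; lowercasing both sides before comparing would be one way to relax that.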
