NLTK custom tokenizer and tagger
Here is my requirement. I want to tokenize and tag a paragraph in a way that achieves the following:
- Date and time expressions in the paragraph should be identified and tagged as DATE and TIME
- Known phrases in the paragraph should be identified and tagged as CUSTOM
- The rest of the content should be tokenized and tagged by NLTK's default word_tokenize and pos_tag functions
For example, the following sentence
"They all like to go there on 5th November 2010, but I am not interested."
should be tokenized and tagged as follows when the custom phrase is "I am not interested".
[('They', 'PRP'), ('all', 'VBP'), ('like', 'IN'), ('to', 'TO'), ('go', 'VB'),
('there', 'RB'), ('on', 'IN'), ('5th November 2010', 'DATE'), (',', ','),
('but', 'CC'), ('I am not interested', 'CUSTOM'), ('.', '.')]
Any suggestions would be useful.
Comments (2)
The proper answer is to compile a large dataset tagged the way you want, then train a machine-learned chunker on it. If that is too time-consuming, the easy way is to run the POS tagger and post-process its output using regular expressions. Getting the longest match is the hard part here.
Todo: expand the DATE re, insert code to search for CUSTOM phrases, make this more sophisticated by matching POS tags as well as tokens, and decide whether 5th on its own should count as a date. (Probably not, so filter out dates of length one that only contain an ordinal number.)
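The post-processing idea and the Todo items above can be sketched roughly as follows. This is not the answerer's code: DATE_RE, MAX_SPAN, merge_spans and tag_sentence are hypothetical names, and the date pattern is an assumption that only covers forms such as "5th November 2010". At each position the sketch tries the longest run of tokens first, so the longest match wins, and a lone ordinal such as 5th is never turned into a DATE.

```python
import re

# Assumed, deliberately narrow date pattern: "5th November 2010", "5 Nov 2010".
DATE_RE = re.compile(
    r"\d{1,2}(?:st|nd|rd|th)? "
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* "
    r"\d{4}"
)

MAX_SPAN = 6  # longest span, in tokens, that we try to merge (an assumption)


def merge_spans(tagged, custom_phrases, max_span=MAX_SPAN):
    """Collapse DATE and CUSTOM spans in a list of (token, tag) pairs."""
    custom = {phrase.lower() for phrase in custom_phrases}
    out, i = [], 0
    while i < len(tagged):
        # Try the longest candidate first so that the longest match wins.
        for n in range(min(max_span, len(tagged) - i), 0, -1):
            text = " ".join(tok for tok, _ in tagged[i:i + n])
            if DATE_RE.fullmatch(text):
                label = "DATE"
            elif text.lower() in custom:
                label = "CUSTOM"
            else:
                continue
            out.append((text, label))
            i += n
            break
        else:
            # No DATE or CUSTOM span starts here: keep the original POS tag.
            out.append(tagged[i])
            i += 1
    return out


def tag_sentence(sentence, custom_phrases):
    """Tokenize and POS-tag with NLTK, then merge DATE and CUSTOM spans."""
    # Imported lazily; needs NLTK plus its tokenizer and tagger data.
    from nltk import pos_tag, word_tokenize
    return merge_spans(pos_tag(word_tokenize(sentence)), custom_phrases)
```

Calling tag_sentence(sentence, ["I am not interested"]) on the sentence from the question should produce single DATE and CUSTOM entries, with every other token keeping whatever tag pos_tag assigned. Spans are compared as tokens joined by single spaces, so a custom phrase that word_tokenize splits differently (a contraction, for instance) would need to be tokenized the same way before matching. TIME expressions are not handled here and would need their own pattern.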
You should probably do chunking with the nltk.RegexpParser to achieve your objective.
Reference:
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html#code-chunker1
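A minimal sketch of what chunking with nltk.RegexpParser could look like for the DATE case. The grammar is an assumption made for illustration, and chunk_dates is a hypothetical helper. Because the grammar matches POS tags rather than words, any adjective-or-number, proper noun, number sequence would be labelled DATE, so a real grammar would likely need to be stricter or combined with a check on the tokens themselves.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Assumed grammar: an ordinal/number (JJ or CD), a proper noun, then a number.
DATE_GRAMMAR = r"DATE: {<JJ|CD><NNP><CD>}"
date_chunker = RegexpParser(DATE_GRAMMAR)


def chunk_dates(tagged):
    """Chunk a list of (token, tag) pairs and flatten DATE chunks to one entry."""
    flat = []
    for node in date_chunker.parse(tagged):
        if isinstance(node, Tree):
            # A chunk: join its tokens and use the chunk label as the tag.
            flat.append((" ".join(tok for tok, _ in node.leaves()), node.label()))
        else:
            # An unchunked (token, tag) pair passes through unchanged.
            flat.append(node)
    return flat
```

The input is the output of pos_tag, so this fits after the default tokenizing and tagging step described in the question. CUSTOM phrases are defined by their words rather than their tags, so they are probably easier to handle with a token-level match than with a tag grammar.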