I am starting to write code that captures parts of a sentence by "type" and, if they match a criterion, starts a specific Python script that deals with that "type". I am finding :) that findall works better for what I am doing, hence:
import re
m = re.compile(r'([0-9] days from now)')
# match() only looks at the start of the string, so this prints None
print(m.match("i think maybe 7 days from now i hope"))
# findall() scans the whole string and returns a list of matches
f = m.findall("i think maybe 7 days from now i hope")
print(f[0])
# 7 days from now
This seems to give me the part of the sentence I was looking for. I can then pass it to, for example, the pyparsing module, using its example datetime conversion script, which returns a datetime from a similar natural-language statement (I know there are other modules, but they are rigid about the input statements they can handle).
Then I could do a DB insert into my online diary, for example, or into a hosted web app, if other parts of the sentence matched another "type", i.e. appointments, deadlines, etc.
I am just tinkering here, but slowly I am building something useful. Is this structure/process logical, or are there better methods? That is what I am asking myself now. Any feedback is appreciated.
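One way the flow described above could be laid out is a table that pairs each sentence "type" with a pattern and a handler. This is only a rough sketch: the handler names, the appointment pattern, and the use of `timedelta` in place of the pyparsing conversion script are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

def handle_relative_days(match, now):
    # Turn "N days from now" into a concrete datetime.
    return now + timedelta(days=int(match.group(1)))

def handle_appointment(match, now):
    # Placeholder: a real handler might insert a row into the diary database.
    return "appointment with " + match.group(1)

# Each sentence "type" pairs a compiled pattern with the handler to run.
# Both patterns are hypothetical examples.
HANDLERS = [
    ("relative_days", re.compile(r'([0-9]+) days from now'), handle_relative_days),
    ("appointment", re.compile(r'appointment with (\w+)'), handle_appointment),
]

def dispatch(sentence, now=None):
    """Run every handler whose pattern is found anywhere in the sentence."""
    now = now or datetime.now()
    results = {}
    for name, pattern, handler in HANDLERS:
        match = pattern.search(sentence)
        if match:
            results[name] = handler(match, now)
    return results

fixed_now = datetime(2024, 1, 1, 9, 0)
out = dispatch("i think maybe 7 days from now i have an appointment with bob", fixed_now)
print(out)
```

Adding a new "type" then means adding one entry to the table rather than another branch in the matching code.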
Comments (2)
The reason why m.match() fails is that it expects the match to start at the beginning of the string. findall() makes sense if you expect more than one (non-overlapping) match in your string. Otherwise, use the search() method (which will return the first match it finds). This is all well covered in the docs.
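A minimal sketch of the difference between the three calls. The sample sentence is extended with a second phrase, and the pattern uses `[0-9]+` so multi-digit counts also match; both changes are my assumptions, not part of the question's code.

```python
import re

# Same idea as the question's pattern; [0-9]+ is an assumed tweak for multi-digit numbers.
pattern = re.compile(r'[0-9]+ days from now')
text = "i think maybe 7 days from now i hope, or 10 days from now"

# match() only succeeds if the pattern matches at position 0.
at_start = pattern.match(text)

# search() scans the string and returns the first match object, or None.
first = pattern.search(text)

# findall() returns every non-overlapping match as a list of strings.
every = pattern.findall(text)

print(at_start)        # None
print(first.group(0))  # 7 days from now
print(every)           # ['7 days from now', '10 days from now']
```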
From my knowledge of search interfaces, it seems like you'd need an awful lot of regular expressions to capture the great variety of ways in which people express themselves. For a feeling for just how many, see this writeup on "the vocabulary problem".
So, if you're just doing date/time stuff, and you're tying very specific actions to it that it would suck to get wrong, then regular expressions seem like a good way to go. On the other hand, if you're just trying to detect a "date" expression vs. e.g. an "email" expression or a "note" expression, then it might be worth trying to POS-tag the sentences using NLTK and match patterns at the part-of-speech level.
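To give a feel for matching at the part-of-speech level, here is a small sketch that looks for a tag sequence instead of literal words. The (token, tag) pairs are written by hand as a stand-in for tagger output and the Penn Treebank-style tags are assumed; in practice they would come from a tagger such as NLTK's `nltk.pos_tag`, which may tag some words differently. The helper name `find_tag_pattern` is hypothetical.

```python
# Hand-written (token, tag) pairs standing in for a POS tagger's output.
# The tags are assumed for illustration, not produced by NLTK.
tagged = [("i", "PRP"), ("think", "VBP"), ("maybe", "RB"),
          ("7", "CD"), ("days", "NNS"), ("from", "IN"), ("now", "RB"),
          ("i", "PRP"), ("hope", "VBP")]

def find_tag_pattern(tagged_tokens, tag_pattern):
    """Return the token spans whose tag sequence equals tag_pattern."""
    tags = [tag for _, tag in tagged_tokens]
    width = len(tag_pattern)
    spans = []
    for start in range(len(tags) - width + 1):
        if tags[start:start + width] == tag_pattern:
            window = tagged_tokens[start:start + width]
            spans.append(" ".join(tok for tok, _ in window))
    return spans

# number + plural noun + preposition + adverb, e.g. "7 days from now"
date_like = find_tag_pattern(tagged, ["CD", "NNS", "IN", "RB"])
print(date_like)  # ['7 days from now']
```

Because the pattern is expressed in tags, the same rule would also pick up phrases like "3 weeks from today" if the tagger labels them the same way, without a separate regular expression for each wording.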