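Removing non-English words from a sentence in Python
I wrote some code that sends a query to Google and returns the results. I extract the snippets (summaries) from these results for further processing. However, these snippets sometimes contain non-English words that I do not want. For example: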
/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/
I only want the word "unstressed" from this sentence. How can I do that? Thanks.
Comments (3)
PyEnchant might be a simple option for you. I do not know about its speed, but you can use it to check whether each word is in an English dictionary.
PyEnchant has a tutorial, and it can also return suggestions, which you can use again for another query or something similar. In addition, you can check whether your result is in latin-1 (is_utf8() exists; I do not know whether is_latin-1() does too), or use something like Enca, which detects the encoding of text files on the basis of knowledge of their language.
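A minimal sketch of the dictionary check described above, assuming the `pyenchant` package and an `en_US` dictionary are installed. The helper names (`candidate_words`, `keep_english`, `enchant_checker`) are made up for illustration; only `enchant.Dict(...).check(...)` comes from PyEnchant. Tokens containing non-ASCII characters are dropped before the dictionary is consulted.

```python
import string


def candidate_words(text):
    """Split on whitespace, strip surrounding punctuation, keep ASCII-alphabetic tokens."""
    tokens = (t.strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t and t.isascii() and t.isalpha()]


def keep_english(text, is_english):
    """Return the candidate words of `text` that `is_english` accepts."""
    return [w for w in candidate_words(text) if is_english(w)]


def enchant_checker(tag="en_US"):
    """Build a checker backed by PyEnchant (imported lazily, so it is only needed here)."""
    import enchant  # assumption: pyenchant is installed
    return enchant.Dict(tag).check


# Usage, assuming pyenchant is available:
# snippet = "/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/"
# print(keep_english(snippet, enchant_checker()))
```

Passing the checker in as a function keeps the filtering logic independent of the spell-checking backend, so the same code works with any other word test.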
You can compare the words you receive with a dictionary of English words, for example /usr/share/dict/words on a BSD system.
I would guess that Google's results are, for the most part, grammatically correct, but if not, you might have to look into stemming in order to match against your dictionary.
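A rough sketch of the word-list comparison, assuming a plain-text list with one word per line at /usr/share/dict/words (the path differs or is absent on some systems). The function names are hypothetical, and no stemming is attempted, so inflected forms missing from the list are dropped.

```python
import string


def load_vocabulary(path="/usr/share/dict/words"):
    """Read a one-word-per-line file into a lowercase set for fast lookups."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def english_words(text, vocabulary):
    """Keep whitespace-separated tokens whose punctuation-stripped form is in the vocabulary."""
    kept = []
    for token in text.split():
        word = token.strip(string.punctuation)
        if word and word.lower() in vocabulary:
            kept.append(word)
    return kept


# Usage, assuming the word list exists on this machine:
# vocabulary = load_vocabulary()
# print(english_words("/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/", vocabulary))
```

Loading the list into a set once makes each lookup constant time, which matters when many snippets are filtered.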
You can use PyWordNet, a Python interface to WordNet. Just split your sentence on whitespace and check whether each word is in the dictionary.
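A small sketch of the split-and-look-up idea. It does not use PyWordNet itself; it substitutes NLTK's WordNet reader, assuming `nltk` is installed and the `wordnet` corpus has been downloaded with `nltk.download("wordnet")`. The function names are hypothetical.

```python
def wordnet_checker():
    """Return a function that reports whether WordNet has any entry for a word."""
    from nltk.corpus import wordnet  # assumption: nltk and its wordnet corpus are installed
    return lambda word: bool(wordnet.synsets(word))


def words_in_dictionary(sentence, has_entry):
    """Split on whitespace and keep the words for which `has_entry` is true."""
    return [word for word in sentence.split() if has_entry(word)]


# Usage, assuming the WordNet corpus is available:
# print(words_in_dictionary("w\u025bn unstressed w\u0259n", wordnet_checker()))
```

WordNet covers nouns, verbs, adjectives, and adverbs only, so function words such as "the" or "of" would be filtered out by this check.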