Remove non-English words from a sentence in Python

Posted 2024-09-29 11:31:39, 225 characters, 10 views, 0 comments

I have written code that sends queries to Google and returns the results. I extract the snippets (summaries) from these results for further processing. However, these snippets sometimes contain non-English words that I do not want. For example:

/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/

I only want the word "unstressed" from this sentence.
How can I do that?
Thanks.

水染的天色ゝ 2024-10-06 11:31:40

PyEnchant might be a simple option for you. I do not know about its speed, but you can do things like:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

PyEnchant has a tutorial, and it can also return spelling suggestions, which you can use again for another query. In addition, you could check whether your result is in Latin-1 (is_utf8() exists; I do not know whether is_latin-1() does as well), or use something like Enca, which detects the encoding of text files on the basis of knowledge of their language.
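
To tie this back to the question, here is a minimal sketch of filtering a snippet word by word. It is not part of the answer above: the function name keep_english_words and its is_word parameter are made up for illustration. It assumes an "English" word consists only of ASCII letters, which is enough to drop the phonetic symbols in the example; the optional is_word callback is where a dictionary check such as d.check from the session above could be plugged in.

```python
import re


def keep_english_words(text, is_word=None):
    """Return the tokens of `text` that look like English words.

    A token is kept if it consists only of ASCII letters and, when
    `is_word` is given, that callable also returns True for it.
    (Hypothetical helper, written for this example only.)
    """
    kept = []
    for token in text.split():
        # Strip surrounding punctuation such as "/" or ",".
        word = token.strip("/.,;:!?()[]\"'")
        # Reject anything that is not purely A-Z / a-z (IPA symbols, digits, ...).
        if not re.fullmatch(r"[A-Za-z]+", word):
            continue
        # Optional dictionary check, e.g. enchant.Dict("en_US").check
        if is_word is not None and not is_word(word):
            continue
        kept.append(word)
    return kept


snippet = "/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/"
print(keep_english_words(snippet))
```

If PyEnchant is installed, passing is_word=enchant.Dict("en_US").check should additionally reject ASCII-only tokens that are not in the English dictionary. The ASCII test alone cannot do that, so it will let through, for example, unaccented words from other languages.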