How do I filter Asian languages out of an RSS feed?

Posted on 2024-07-29 19:34:57

I like to keep track of the delicious.com/popular RSS feed. Lately, however, more and more of the items are Asian-language pages. Since I do not understand any Asian languages, I would like to filter them out of the feed and save myself some time.

I've been trying to cook up something using Yahoo Pipes, but have not been able to get it working.

Does anyone have any ideas on how to make this work?

Comments (2)

说不完的你爱 2024-08-05 19:34:57

I've had some luck at http://pipes.yahoo.com/pipes/pipe.info?_id=yJh1aRp_3hGaPi23tPvyrQ

The source of the pipe has all the info, but the key bit is running a filter with the regex `^[A-Za-z 0-9 \.,\?'""!@#\$%\^&\*\(\)-_=\+;:<>\/\\\|\}\{\[\]~]+$`.

This will filter out any feed items that use anything but standard ASCII in the title. Unfortunately, this means it will also filter out titles containing words like "résumé", but it should be pretty easy for you to adjust the regex to include common non-English characters from the languages you know.
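For anyone who would rather do this outside Yahoo Pipes, here is a minimal Python sketch of the same idea. It is not the pipe's actual implementation: the pattern is a simplified stand-in that accepts any printable ASCII character, and the function name and sample titles are made up for illustration.

```python
import re

# Printable ASCII only (space through tilde). This is a simplified
# stand-in for the longer character class above, not a copy of it.
ASCII_TITLE = re.compile(r"[\x20-\x7E]+")

def keep_ascii_titles(titles):
    """Return only the titles made up entirely of printable ASCII."""
    return [t for t in titles if ASCII_TITLE.fullmatch(t)]

# Hypothetical sample titles, for illustration only.
sample = ["10 CSS tricks", "日本語のページ", "Write a better résumé"]
print(keep_ascii_titles(sample))
```

As noted, the accented title is dropped along with the Japanese one; widening the character class (for example with `\u00C0-\u00FF`) would let Latin-1 accented letters through.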

如果没有你 2024-08-05 19:34:57

You probably want to skip titles where more than X% of the characters are NOT from the Unicode blocks assigned to the scripts of the languages you can understand. For example, if you can't read Greek, Russian, Arabic, Hebrew, Armenian, Chinese, Japanese, Korean, Indic languages, etc., reject titles where more than (say) 10% of the characters fall outside the range U+0000 to U+0233. This leaves you with the Latin alphabet. The point of leaving a margin like 10% is to allow for punctuation marks; technical articles may also use symbols that are not in the base alphabet.
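A rough Python sketch of that threshold check, assuming the U+0000 to U+0233 range and the 10% margin suggested above. The function names and defaults are illustrative, and every character in the title, including spaces, counts toward the total.

```python
def non_latin_fraction(title, upper=0x0233):
    """Fraction of characters whose code point lies above `upper`."""
    if not title:
        return 0.0
    outside = sum(1 for ch in title if ord(ch) > upper)
    return outside / len(title)

def keep_title(title, max_fraction=0.10):
    """Keep a title unless too many characters fall outside the range."""
    return non_latin_fraction(title) <= max_fraction

# Hypothetical sample titles, for illustration only.
for title in ["Write a better résumé", "Python → Rust: a guide", "日本語のページ"]:
    print(title, keep_title(title))
```

Unlike a strict ASCII regex, this keeps accented Latin titles and tolerates the occasional symbol, while still rejecting titles written mostly in another script.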
