Removing ads from an RSS feed

Posted 2024-12-02 02:52:08

I am developing a local intranet site on which I want to display some RSS feeds from other sites. It is currently built on the Concrete5 CMS, and I am using an RSS displayer plugin to display the feeds. The plugin uses SimplePie to parse the feed. By default, the plugin displays the entire RSS content. I've tweaked the plugin (SimplePie) to display only a title with link, the date, and the first image in each post/entry.
I found these functions, which I pass $item->get_content() to in order to get the first image's source:

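function getFirstImage($text) {
    // Decode entities so escaped markup (&lt;img ...&gt;) becomes real tags.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $pattern = '/<img[^>]+>/i';
    // Return the first <img> tag, or an empty string if there is none.
    if (preg_match($pattern, $text, $matches)) {
        return $matches[0];
    }
    return '';
}

function scrapeImage($text) {
    // Capture the value of the src attribute, quoted or unquoted.
    $pattern = '/src=[\'"]?([^\'" >]+)[\'" >]/';
    if (preg_match($pattern, $text, $link)) {
        return urldecode($link[1]);
    }
    return '';
}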

This works fine. The problem is that some of the feeds contain ads, which are sometimes placed before the actual post content, so the function returns the URL of an ad. These RSS ads are obviously targeted at people who use RSS readers, but they are very annoying when displayed on a site.

If I try to target specific tags besides <img> in preg_match() (for example, matching only images inside <p> tags), I suspect it will only work for the particular feed I took the markup from.

How can I get the first image from the actual post, rather than from an ad, without having to change the code for each feed I want to display?

Comments (1)

红衣飘飘貌似仙 2024-12-09 02:52:08

I'm not sure if this would work for your situation, but ad images usually come from a different domain or sub-domain than the regular content. You could try filtering out images whose URL has a domain or sub-domain different from that of the RSS feed.
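
A minimal sketch of that idea, under some assumptions: the function name firstSameHostImage and its parameters are hypothetical (they are not part of SimplePie or the plugin), the feed content has already been entity-decoded, and an image counts as "regular content" when its host equals the feed's host (ignoring a leading "www.") or is a sub-domain of it. It uses PHP's DOMDocument rather than a regex to walk the <img> tags. Feeds that serve post images from an unrelated CDN would need a looser rule.

```php
<?php
// Hypothetical helper, not part of SimplePie or the RSS displayer plugin.
// Returns the src of the first <img> hosted on the feed's own site,
// or null when no such image is found.
function firstSameHostImage(string $html, string $feedUrl): ?string
{
    $feedHost = parse_url($feedUrl, PHP_URL_HOST);
    if (!is_string($feedHost) || $feedHost === '' || trim($html) === '') {
        return null;
    }
    // Assumption: "www.example.com" and "example.com" are the same site.
    $feedHost = preg_replace('/^www\./', '', strtolower($feedHost));
    $suffix = '.' . $feedHost;

    $doc = new DOMDocument();
    // Feed HTML is often malformed, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        $host = parse_url($src, PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // relative or unparsable URL: skipped in this sketch
        }
        $host = strtolower($host);
        // Accept the feed's host itself or any sub-domain of it.
        if ($host === $feedHost || substr($host, -strlen($suffix)) === $suffix) {
            return $src;
        }
    }
    return null;
}
```

In the plugin, this could be called with $item->get_content() as the first argument and the feed's site URL as the second. A null result means no image passed the filter, so the entry can be shown without one.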
