Return articles with a specific title keyword using BeautifulSoup
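I'm trying to create a web scraper that returns articles from an RSS feed (XML format) only if the title contains a certain keyword. However, whenever I run the code it returns nothing, even though the titles themselves print correctly. For example, a title prints properly on its own, but when I ask for output only if the word "said" is in the title, nothing is returned even when "said" is in fact in the title.
Code: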
import webbrowser
import winsound  # Windows-only standard-library module

import requests
from bs4 import BeautifulSoup

xml_text = requests.get('https://nypost.com/feed/').text
soup = BeautifulSoup(xml_text, 'xml')
ny_rss_search = soup.find_all("Mark")
ny_rss_title3 = soup.find_all('title')
ny_rss_url3 = soup.find_all('link')
ny_rss_summary3 = soup.find_all('description')
ny_rss_url_compact3 = ny_rss_url3[2].text.strip()
if 'Guide' in ny_rss_title3:
    webbrowser.open(ny_rss_url_compact3, new=2)
    print(f'NY Post Article Title: {ny_rss_title3[1].text.strip()}\n')
    print(f"NY Post Article URL: {ny_rss_url3[2].text.strip()}\n")
    print(f'NY Post Article Summary: {ny_rss_summary3[1].text.strip()}\n')
    winsound.PlaySound("notify.wav", winsound.SND_ALIAS)
Sample XML text:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
xmlns:georss="http://www.georss.org/georss"
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:media="http://search.yahoo.com/mrss/"
>
<channel>
<title>New York Post</title>
<atom:link href="https://nypost.com/feed/" rel="self" type="application/rss+xml" />
<link>https://nypost.com</link>
<description>Your source for breaking news, news about New York, sports, business, entertainment, opinion, real estate, culture, fashion, and more.</description>
<lastBuildDate>Tue, 05 Jul 2022 14:06:44 +0000</lastBuildDate>
<language>en-US</language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<generator>https://wordpress.org/?v=5.9.3</generator>
<item>
<title>Blue Jays coach Mark Budzinski’s daughter Julia died in boating accident</title>
<comments>https://nypost.com/2022/07/05/mark-budzinskis-daughter-julia-17-died-in-boating-accident/#respond</comments>
<pubDate>Tue, 05 Jul 2022 10:01:06 -0400</pubDate>
<link>https://nypost.com/2022/07/05/mark-budzinskis-daughter-julia-17-died-in-boating-accident/</link>
<dc:creator>Associated Press</dc:creator>
<guid isPermaLink="false">https://nypost.com/?post_type=article&p=22918233</guid>
<description><![CDATA[Pearson said no foul play is suspected and alcohol was not a factor. “It was a terrible accident,” she said.]]></description>
<content:encoded><![CDATA[Pearson said no foul play is suspected and alcohol was not a factor. “It was a terrible accident,” she said.]]></content:encoded>
<enclosure url="https://nypost.com/wp-content/uploads/sites/2/2022/07/Julia-Budzinski.jpg?quality=90&strip=all" type="image/jpeg" />
<slash:comments>0</slash:comments>
<media:content url="https://nypost.com/wp-content/uploads/sites/2/2022/07/Julia-Budzinski.jpg?w=1024" medium="image">
<media:title type="html">The Blue Jays held a moment of silence for first base coach Mark Budzinski's daughter Julia on Sunday.</media:title>
</media:content>
<media:content url="https://nypost.com/wp-content/uploads/sites/2/2022/07/Mark-Budzinski.jpg?w=1024" medium="image">
<media:title type="html">Mark Budzinski</media:title>
</media:content>
Comments (1)
You have to iterate over the items of the feed and check whether each item's title text contains your term. Example:
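A minimal sketch of that approach, not a definitive implementation. The check `'Guide' in ny_rss_title3` compares the string against each Tag object in the result list rather than against the tags' text, so it is never true. The sketch below instead walks each `<item>` and tests the text of its own `<title>`. The function name `find_matching_items`, the `SAMPLE_FEED` string (a trimmed-down stand-in for the feed, with a made-up second item), and the case-insensitive comparison are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup  # the "xml" parser additionally requires lxml

# Hypothetical, trimmed-down stand-in for the real feed (illustration only).
SAMPLE_FEED = """<rss version="2.0">
<channel>
<title>New York Post</title>
<link>https://nypost.com</link>
<description>Your source for breaking news.</description>
<item>
<title>Blue Jays coach Mark Budzinski's daughter Julia died in boating accident</title>
<link>https://nypost.com/2022/07/05/mark-budzinskis-daughter-julia-17-died-in-boating-accident/</link>
<description>Pearson said no foul play is suspected.</description>
</item>
<item>
<title>Hypothetical second headline</title>
<link>https://example.com/second</link>
<description>Placeholder summary.</description>
</item>
</channel>
</rss>"""


def find_matching_items(xml_text, keyword):
    """Return (title, url, summary) for each <item> whose title contains keyword."""
    soup = BeautifulSoup(xml_text, "xml")
    matches = []
    # Iterate over items so each title stays paired with its own link and summary.
    for item in soup.find_all("item"):
        title_tag = item.find("title")
        if title_tag is None:
            continue
        title = title_tag.get_text(strip=True)
        # Compare against the title *text*; case-insensitive by assumption.
        if keyword.lower() in title.lower():
            link_tag = item.find("link")
            summary_tag = item.find("description")
            url = link_tag.get_text(strip=True) if link_tag else ""
            summary = summary_tag.get_text(strip=True) if summary_tag else ""
            matches.append((title, url, summary))
    return matches


for title, url, summary in find_matching_items(SAMPLE_FEED, "Mark"):
    print(f"NY Post Article Title: {title}")
    print(f"NY Post Article URL: {url}")
    print(f"NY Post Article Summary: {summary}")
```

With the live feed, `xml_text` would come from `requests.get('https://nypost.com/feed/').text` as in the question, and the `webbrowser.open(...)` and `winsound.PlaySound(...)` calls would go inside the loop over the matches. Looking up `title` and `link` per item also avoids the index offsets (`[1]`, `[2]`) that the channel-level `<title>` and `<link>` tags force on the flat `find_all` lists.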