How to scrape YouTube video descriptions with Beautiful Soup

Posted on 2025-02-01 04:15:46

I am trying to scrape a list of YouTube videos and want to collect each video's description. However, I have been unsuccessful and do not understand why. Any help is much appreciated. (YouTube video in question: https://www.youtube.com/watch?v=57Tjvv_pCXg&t=55s)

element_titles = driver.find_elements_by_id("video-title")
result = requests.get(element_titles[1].get_attribute("href"))
soup = BeautifulSoup(result.content)
description = str(soup.find("div", {"class": "style-scope yt-formatted-string"}))

The resulting description is None.

Note: I understand that a YouTube API exists; however, you must pay for an API key, and I am not interested in doing so.

Comments (1)

鹤仙姿 2025-02-08 04:15:46

To extract the description you can use either Selenium or BeautifulSoup. The latter is faster; here is the code:

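import re

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.youtube.com/watch?v=57Tjvv_pCXg').content, 'html.parser')
# The description is stored in an embedded JSON blob, between these two keys
pattern = re.compile(r'(?<=shortDescription":").*(?=","isCrawlable)')
description = pattern.findall(str(soup))[0].replace('\\n', '\n')
print(description)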

If you run print(soup.prettify()) and search for a fragment of the video description, say "know this is just my", you will see that the complete description sits inside a large JSON structure:

...,"isOwnerViewing":false,"shortDescription":"Listen: https://quellechris360.bandcamp.com/album/deathfame\n\nQuelle Chris delivers what might be his most challengi...bla bla...ABSTRACT HIP HOP\n\n7/10\n\nY'all know this is just my opinion, right?","isCrawlable":true,"thumbnail":{...

In particular, the description lies between shortDescription":" and ","isCrawlable, so we can use a regex to extract the substring between these two strings. The pattern that matches every character (.*) between them is (?<=shortDescription":").*(?=","isCrawlable): a lookbehind for the opening marker, then any run of characters, then a lookahead for the closing marker.
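The captured text is the body of a JSON string, so it can contain escapes other than \n (for example \" or \u0026) that the replace call above leaves untouched. The following is a minimal sketch of one way to handle that, not part of the original answer: it makes the match non-greedy and decodes the capture with json.loads. The helper name extract_description and the sample string are hypothetical, and the sketch assumes the page still places shortDescription immediately before isCrawlable, as in the excerpt above.

```python
import json
import re

# Same lookbehind/lookahead markers as in the answer, with a non-greedy body
# so the match stops at the first closing marker.
PATTERN = re.compile(r'(?<=shortDescription":").*?(?=","isCrawlable)')


def extract_description(page_source):
    """Hypothetical helper: return the decoded description, or None if absent."""
    match = PATTERN.search(page_source)
    if match is None:
        return None
    # The capture is the inside of a JSON string literal; re-wrap it in quotes
    # and let the JSON parser resolve \n, \" and \uXXXX escapes.
    return json.loads('"' + match.group(0) + '"')


# Made-up stand-in for the page source, shaped like the excerpt above.
sample = (
    '...,"isOwnerViewing":false,"shortDescription":"Listen: '
    'https://example.com/a?x=1\\u0026y=2\\n\\nY\'all know this is '
    '\\"just\\" my opinion, right?","isCrawlable":true,"thumbnail":{...'
)

print(extract_description(sample))
```

With a live page, the same function could be given requests.get(url).text instead of the sample string. Returning None when the markers are missing avoids the IndexError that findall(...)[0] would raise if the page layout changes.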
