How to scrape YouTube video descriptions with Beautiful Soup

Posted 2025-02-01 04:15:46

I am trying to scrape a list of YouTube videos and want to collect each video's description. However, I have been unsuccessful and do not understand why. Any help is much appreciated. (YouTube video in question: https://www.youtube.com/watch?v=57Tjvv_pCXg&t=55s)

element_titles = driver.find_elements_by_id("video-title")
result = requests.get(element_titles[1].get_attribute("href"))
soup = BeautifulSoup(result.content)
description = str(soup.find("div", {"class": "style-scope yt-formatted-string"}))

The resulting description is None.

Note: I understand that a YouTube API exists; however, you must pay for an API key, and I am not interested in doing so.

鹤仙姿 2025-02-08 04:15:46

To extract the description you can use either Selenium or BeautifulSoup. The latter is faster; here is the code:

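import re
import requests
from bs4 import BeautifulSoup
# Fetch the watch page; the parser is named explicitly to avoid a bs4 warning
soup = BeautifulSoup(requests.get('https://www.youtube.com/watch?v=57Tjvv_pCXg').content, 'html.parser')
pattern = re.compile('(?<=shortDescription":").*(?=","isCrawlable)')
# Take the first match and turn the escaped \n sequences into real newlines
description = pattern.findall(str(soup))[0].replace('\\n','\n')
print(description)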

If you run print(soup.prettify()) and search for part of the video description, say "know this is just my", you will see that the complete description sits inside a large JSON structure:

...,"isOwnerViewing":false,"shortDescription":"Listen: https://quellechris360.bandcamp.com/album/deathfame\n\nQuelle Chris delivers what might be his most challengi...bla bla...ABSTRACT HIP HOP\n\n7/10\n\nY'all know this is just my opinion, right?","isCrawlable":true,"thumbnail":{...

In particular, the description sits between shortDescription":" and ","isCrawlable, so we can use a regex to extract the substring between these two strings. The pattern that matches every character (.*) between them is (?<=shortDescription":").*(?=","isCrawlable)
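As a possible refinement, here is a minimal sketch that wraps the same lookaround idea in a function and runs it on a shortened copy of the JSON fragment quoted above, so it needs no network access. The function name extract_description and the sample string are my own, not part of the original code. Two things differ from the pattern above, and both are assumptions about what you might want: the quantifier is non-greedy (.*?), so the match stops at the first ","isCrawlable, and the captured text is decoded with json.loads instead of replace, on the assumption that it is the inside of a valid JSON string literal.

```python
import json
import re

# Same lookbehind/lookahead as above, but non-greedy so the match
# ends at the first '","isCrawlable' rather than the last one.
PATTERN = re.compile(r'(?<=shortDescription":").*?(?=","isCrawlable)')


def extract_description(page_text):
    """Return the decoded description, or None if the marker is absent.

    Hypothetical helper: assumes the captured text is the body of a
    JSON string literal, so json.loads can undo escapes such as \\n.
    """
    match = PATTERN.search(page_text)
    if match is None:
        return None
    # Re-wrap the captured body in quotes to make it a JSON string again.
    return json.loads('"' + match.group(0) + '"')


# Shortened version of the fragment shown above (not real page output).
sample = (
    '...,"isOwnerViewing":false,"shortDescription":"Listen: '
    'https://quellechris360.bandcamp.com/album/deathfame\\n\\n7/10\\n\\n'
    'Y\'all know this is just my opinion, right?","isCrawlable":true,'
    '"thumbnail":{...'
)

if __name__ == "__main__":
    print(extract_description(sample))
```

With a live page you would pass str(soup) or the response text instead of sample. Returning None when nothing matches avoids the IndexError that findall(...)[0] raises if YouTube changes the page layout; json.loads can still raise an error if the captured text is not valid JSON, so treat this as a sketch.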