Crawling YouTube user information
I'm trying to crawl Youtube to retrieve information about a group of users (approx. 200 people). I'm interested in looking for relationships between the users:
- contacts
- subscribers
- subscriptions
- what videos they commented on
- etc
I've managed to get contact information with the following source:
    import gdata.youtube
    import gdata.youtube.service
    from gdata.service import RequestError
    from pub_author import KEY, NAME_REGEX


    def get_details(name):
        """Return the titles of the contacts listed for a YouTube user."""
        yt_service = gdata.youtube.service.YouTubeService()
        yt_service.developer_key = KEY
        contact_feed = yt_service.GetYouTubeContactFeed(username=name)
        contacts = [e.title.text for e in contact_feed.entry]
        return contacts
I can't seem to get the other bits of information I need. The reference guide says that I can grab the XML feed from http://gdata.youtube.com/feeds/api/users/username/subscriptions?v=2 (for some arbitrary user). However, if I try to get other users' subscriptions, I get a 403 error with the following message:
User must be logged in to access these subscriptions.
If I use the gdata API:
    sub_feed = yt_service.GetYouTubeSubscriptionFeed(username=name)
    sub = [e.title.text for e in sub_feed.entry]
then I get the same error.
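To separate the permission problem from the gdata client itself, the raw feed request can be reproduced with the standard library. This is only a minimal sketch, written for Python 3 rather than the Python 2 gdata client: `subscriptions_feed_url` and `fetch_subscriptions` are hypothetical helper names, the URL is the one quoted from the reference guide, and the `opener` parameter is an assumption added so the function can be exercised without network access.

```python
import urllib.error
import urllib.parse
import urllib.request

# Feed URL as quoted from the reference guide; {name} is the YouTube username.
FEED_TEMPLATE = "http://gdata.youtube.com/feeds/api/users/{name}/subscriptions?v=2"


def subscriptions_feed_url(name):
    """Build the subscriptions feed URL for one user (hypothetical helper)."""
    return FEED_TEMPLATE.format(name=urllib.parse.quote(name, safe=""))


def fetch_subscriptions(name, opener=urllib.request.urlopen):
    """Return (status, body) for the feed request.

    HTTP errors such as the 403 described above are returned as a status
    code and response body instead of being raised.
    """
    try:
        response = opener(subscriptions_feed_url(name))
        return response.getcode(), response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()
```

Calling `fetch_subscriptions(name)` for each user and collecting the names whose status is 403 would show which accounts keep their subscriptions private.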
How can I get these subscriptions without logging in? It should be possible, as you can access this information without logging in to the Youtube web-site.
Also, there seems to be no feed for the subscribers of a particular user. Is this information available through the API?
EDIT
So, it appears this can't be done through the API. I had to do this the quick and dirty way:
    for f in `cat users.txt`; do wget "www.youtube.com/profile?user=$f&view=subscriptions" --output-document subscriptions/$f.html; done
Then use this script to get out the usernames from the downloaded HTML files:
    """Extract usernames from a Youtube profile using regex"""
    import re
    import sys


    def main():
        with open(sys.argv[1]) as html_file:
            lines = html_file.read().split('\n')
        # The HTML file has two <a href="..."> tags for each user: one for an
        # image thumbnail and one for a text link, so a set removes the duplicate.
        users = set()
        for line in lines:
            match = re.search(r'<a href="/user/(?P<name>[^"]+)" onmousedown', line)
            if match:
                users.add(match.group('name'))
        print(sorted(users))


    if __name__ == '__main__':
        main()
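The line-by-line regex depends on the exact attribute order in the page markup. As a possible alternative, the same extraction can be sketched with Python 3's standard `html.parser` module. `UserLinkParser` and `extract_users` are hypothetical names, and the sketch assumes that every `<a>` tag whose `href` starts with `/user/` points at a subscription; unlike the regex, it does not require the `onmousedown` attribute.

```python
from html.parser import HTMLParser


class UserLinkParser(HTMLParser):
    """Collect usernames from <a href="/user/NAME"> links."""

    PREFIX = "/user/"

    def __init__(self):
        super().__init__()
        self.users = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(self.PREFIX):
            # Keep only the username, dropping any trailing path or query.
            name = href[len(self.PREFIX):].split("/")[0].split("?")[0]
            if name:
                self.users.add(name)


def extract_users(html):
    """Return the sorted, de-duplicated usernames linked from the page."""
    parser = UserLinkParser()
    parser.feed(html)
    parser.close()
    return sorted(parser.users)
```

Because the names are gathered into a set, the thumbnail link and the text link for the same user still yield a single entry.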
Comments (1)
In order to access a user's subscriptions feed without the user being logged in, the user must check the "Subscribe to a channel" checkbox under his Account Sharing settings.
Currently, there is no direct way to get a channel's subscribers through the gdata API. In fact, there has been an outstanding feature request for it that has remained open for over 3 years! See "Retrieving a list of a user's subscribers?".