Crawling YouTube user information

Posted on 2024-11-13 05:25:53

I'm trying to crawl YouTube to retrieve information about a group of users (approx. 200 people). I'm interested in finding relationships between the users:

  • contacts
  • subscribers
  • subscriptions
  • what videos they commented on
  • etc

I've managed to get contact information with the following source:

import gdata.youtube
import gdata.youtube.service
from gdata.service import RequestError
# KEY (the developer key) and NAME_REGEX come from my own module.
from pub_author import KEY, NAME_REGEX

def get_details(name):
    # Return the titles of the entries in the given user's contact feed.
    yt_service = gdata.youtube.service.YouTubeService()
    yt_service.developer_key = KEY
    contact_feed = yt_service.GetYouTubeContactFeed(username=name)
    contacts = [ e.title.text for e in contact_feed.entry ]
    return contacts

I can't seem to get the other bits of information I need. The reference guide says that I can grab the XML feed from http://gdata.youtube.com/feeds/api/users/username/subscriptions?v=2 (for some arbitrary user). However, if I try to get other users' subscriptions, I get a 403 error with the following message:

User must be logged in to access these subscriptions.

If I use the gdata API:

sub_feed = yt_service.GetYouTubeSubscriptionFeed(username=name)
sub = [ e.title.text for e in sub_feed.entry ]

then I get the same error.
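
For a crawl over about 200 users, it may help to treat the 403 as "this user's subscriptions are private" and move on instead of aborting. The following is only a rough sketch using the Python 3 standard library rather than the gdata client. The helper names `subscriptions_feed_url` and `fetch_subscriptions_xml` and the `opener` parameter are hypothetical, and the URL is simply the one quoted above, which may no longer be served.

```python
import urllib.error
import urllib.parse
import urllib.request

# Feed URL pattern as quoted in the question (assumption: still valid).
FEED_TEMPLATE = (
    "http://gdata.youtube.com/feeds/api/users/{username}/subscriptions?v=2"
)


def subscriptions_feed_url(username):
    """Build the subscriptions feed URL for one user (hypothetical helper)."""
    return FEED_TEMPLATE.format(username=urllib.parse.quote(username, safe=""))


def fetch_subscriptions_xml(username, opener=urllib.request.urlopen):
    """Return the raw feed bytes, or None if the server answers 403.

    `opener` is injectable so the function can be exercised offline.
    """
    try:
        with opener(subscriptions_feed_url(username)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 403:
            # The user has not made their subscriptions public.
            return None
        raise  # Any other HTTP error is unexpected; let the caller see it.
```

Users for whom this returns `None` can be collected in a separate list and handled by another route, such as the HTML download described in the edit below.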

How can I get these subscriptions without logging in? It should be possible, as you can access this information without logging in to the YouTube website.

Also, there seems to be no feed for the subscribers of a particular user. Is this information available through the API?

EDIT

So, it appears this can't be done through the API. I had to do this the quick and dirty way:

mkdir -p subscriptions
for f in $(cat users.txt); do wget "http://www.youtube.com/profile?user=$f&view=subscriptions" --output-document "subscriptions/$f.html"; done

Then use this script to extract the usernames from the downloaded HTML files:

"""Extract usernames from a YouTube profile page using a regex."""
import re
import sys


def main():
    # Read the downloaded HTML file named on the command line.
    with open(sys.argv[1]) as f:
        lines = f.read().split('\n')
    #
    # Each HTML file has two <a href="..."> tags for each user: one for an
    # image thumbnail and one for a text link. A set removes the duplicates.
    #
    users = set()
    for line in lines:
        match = re.search(r'<a href="/user/(?P<name>[^"]+)" onmousedown', line)
        if match:
            users.add(match.group('name'))
    print(sorted(users))


if __name__ == '__main__':
    main()
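
Since the wget loop leaves one HTML file per user in the subscriptions directory, the same regex could be applied to the whole directory in one pass. This is a minimal sketch under that assumption. The function names `extract_usernames` and `subscriptions_by_user` are made up for illustration, and unlike the script above it scans the whole page with `finditer`, so it also picks up several links that share a line.

```python
import os
import re

# Same pattern as the script above, compiled once.
USER_LINK = re.compile(r'<a href="/user/(?P<name>[^"]+)" onmousedown')


def extract_usernames(html):
    """Return the sorted, de-duplicated usernames linked from one page."""
    return sorted({m.group('name') for m in USER_LINK.finditer(html)})


def subscriptions_by_user(directory):
    """Map each crawled user (file name minus .html) to their subscriptions."""
    result = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.html'):
            continue  # Ignore anything that is not a downloaded page.
        path = os.path.join(directory, filename)
        with open(path, encoding='utf-8', errors='replace') as f:
            user = filename[:-len('.html')]
            result[user] = extract_usernames(f.read())
    return result
```

Calling `subscriptions_by_user('subscriptions')` would then give a dictionary keyed by the names from users.txt, which is a convenient starting point for looking at relationships between the users.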

Comments (1)

风轻花落早 2024-11-20 05:25:53

In order to access a user's subscriptions feed without the user being logged in, the user must check the "Subscribe to a channel" checkbox under his Account Sharing settings.

Currently, there is no direct way to get a channel's subscribers through the gdata API. In fact, a feature request for it has remained open for over 3 years! See the feature request "Retrieving a list of a user's subscribers?".
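
If the goal is only the relationships inside a fixed group of users, one possible workaround is to invert the subscription lists: within the group, a user's subscribers are the group members who subscribe to them. The sketch below assumes a dictionary mapping each crawled user to the channels they subscribe to. The function name `subscribers_within_group` is hypothetical, and the result can only cover users whose subscriptions are visible.

```python
from collections import defaultdict


def subscribers_within_group(subscriptions):
    """Invert {user: [channels]} into {channel: [subscribers]}.

    Only channels that are themselves keys of `subscriptions` (that is,
    members of the crawled group) are kept.
    """
    group = set(subscriptions)
    subscribers = defaultdict(set)
    for user, channels in subscriptions.items():
        for channel in channels:
            if channel in group:
                subscribers[channel].add(user)
    # Sort for stable, readable output.
    return {channel: sorted(users) for channel, users in subscribers.items()}
```

This does not recover subscribers from outside the group, so it is not a substitute for a real subscribers feed.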
