YouTube video_id from Firefox bookmark.html source [nearly there]



bookmarks.html looks like this:

<DT><A HREF="http://www.youtube.com/watch?v=Gg81zi0pheg" ADD_DATE="1320876124" LAST_MODIFIED="1320878745" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABEElEQVQ4jaWSPU7DQBCF3wGo7DZp4qtEQqJJlSu4jLgBNS0dRDR0QARSumyPRJogkIJiWiToYhrSEPJR7Hptx/kDRhrNm93nb3ZXFsD4psPRwR4AzbjHxd0ru4a8kEgvbf1NePfzbQdJfro52UcS8fHQDyjWCuDz7QpJTOYLZk5nH0zmi/8Dzg/DEqgCmL1fW/PXNwADf4V7AMbuis24txrw1xBJAlHgMizoLdkI4CVBREEZWTKGK9bKqa3G3QDO2G7Z2ljqAYyxPmPgI4XHpw2A7ES+d/unZzlwM2BNnwEKb1QFGJNPabdg9GB1v2993W71BNOamNZEWpfXy5nW8/3U9UQBkqTyy677F6rrkvQDptjzJ/PSbekAAAAASUVORK5CYII=">http://www.youtube.com/watch?v=Gg81zi0pheg</A>
<DT><A HREF="http://www.youtube.com/watch?v=pP9VjGmmhfo" ADD_DATE="1320876156" LAST_MODIFIED="1320878756" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABEElEQVQ4jaWSPU7DQBCF3wGo7DZp4qtEQqJJlSu4jLgBNS0dRDR0QARSumyPRJogkIJiWiToYhrSEPJR7Hptx/kDRhrNm93nb3ZXFsD4psPRwR4AzbjHxd0ru4a8kEgvbf1NePfzbQdJfro52UcS8fHQDyjWCuDz7QpJTOYLZk5nH0zmi/8Dzg/DEqgCmL1fW/PXNwADf4V7AMbuis24txrw1xBJAlHgMizoLdkI4CVBREEZWTKGK9bKqa3G3QDO2G7Z2ljqAYyxPmPgI4XHpw2A7ES+d/unZzlwM2BNnwEKb1QFGJNPabdg9GB1v2993W71BNOamNZEWpfXy5nW8/3U9UQBkqTyy677F6rrkvQDptjzJ/PSbekAAAAASUVORK5CYII=">http://www.youtube.com/watch?v=pP9VjGmmhfo</A>
<DT><A HREF="http://www.youtube.com/watch?v=yTA1u6D1fyE" ADD_DATE="1320876163" LAST_MODIFIED="1320878762" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABEElEQVQ4jaWSPU7DQBCF3wGo7DZp4qtEQqJJlSu4jLgBNS0dRDR0QARSumyPRJogkIJiWiToYhrSEPJR7Hptx/kDRhrNm93nb3ZXFsD4psPRwR4AzbjHxd0ru4a8kEgvbf1NePfzbQdJfro52UcS8fHQDyjWCuDz7QpJTOYLZk5nH0zmi/8Dzg/DEqgCmL1fW/PXNwADf4V7AMbuis24txrw1xBJAlHgMizoLdkI4CVBREEZWTKGK9bKqa3G3QDO2G7Z2ljqAYyxPmPgI4XHpw2A7ES+d/unZzlwM2BNnwEKb1QFGJNPabdg9GB1v2993W71BNOamNZEWpfXy5nW8/3U9UQBkqTyy677F6rrkvQDptjzJ/PSbekAAAAASUVORK5CYII=">http://www.youtube.com/watch?v=yTA1u6D1fyE</A>
<DT><A HREF="http://www.youtube.com/watch?v=4v8HvQf4fgE" ADD_DATE="1320876186" LAST_MODIFIED="1320878767" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABEElEQVQ4jaWSPU7DQBCF3wGo7DZp4qtEQqJJlSu4jLgBNS0dRDR0QARSumyPRJogkIJiWiToYhrSEPJR7Hptx/kDRhrNm93nb3ZXFsD4psPRwR4AzbjHxd0ru4a8kEgvbf1NePfzbQdJfro52UcS8fHQDyjWCuDz7QpJTOYLZk5nH0zmi/8Dzg/DEqgCmL1fW/PXNwADf4V7AMbuis24txrw1xBJAlHgMizoLdkI4CVBREEZWTKGK9bKqa3G3QDO2G7Z2ljqAYyxPmPgI4XHpw2A7ES+d/unZzlwM2BNnwEKb1QFGJNPabdg9GB1v2993W71BNOamNZEWpfXy5nW8/3U9UQBkqTyy677F6rrkvQDptjzJ/PSbekAAAAASUVORK5CYII=">http://www.youtube.com/watch?v=4v8HvQf4fgE</A>
<DT><A HREF="http://www.youtube.com/watch?v=e9zG20wQQ1U" ADD_DATE="1320876195" LAST_MODIFIED="1320878773" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABEElEQVQ4jaWSPU7DQBCF3wGo7DZp4qtEQqJJlSu4jLgBNS0dRDR0QARSumyPRJogkIJiWiToYhrSEPJR7Hptx/kDRhrNm93nb3ZXFsD4psPRwR4AzbjHxd0ru4a8kEgvbf1NePfzbQdJfro52UcS8fHQDyjWCuDz7QpJTOYLZk5nH0zmi/8Dzg/DEqgCmL1fW/PXNwADf4V7AMbuis24txrw1xBJAlHgMizoLdkI4CVBREEZWTKGK9bKqa3G3QDO2G7Z2ljqAYyxPmPgI4XHpw2A7ES+d/unZzlwM2BNnwEKb1QFGJNPabdg9GB1v2993W71BNOamNZEWpfXy5nW8/3U9UQBkqTyy677F6rrkvQDptjzJ/PSbekAAAAASUVORK5CYII=">http://www.youtube.com/watch?v=e9zG20wQQ1U</A>
<DT><A HREF="http://www.youtube.com/watch?v=khL4s2bvn-8" ADD_DATE="1320876203" LAST_MODIFIED="1320878782" ICON_URI="http://s.ytimg.com/yt/favicon-vflZlzSbU.ico" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABEElEQVQ4jaWSPU7DQBCF3wGo7DZp4qtEQqJJlSu4jLgBNS0dRDR0QARSumyPRJogkIJiWiToYhrSEPJR7Hptx/kDRhrNm93nb3ZXFsD4psPRwR4AzbjHxd0ru4a8kEgvbf1NePfzbQdJfro52UcS8fHQDyjWCuDz7QpJTOYLZk5nH0zmi/8Dzg/DEqgCmL1fW/PXNwADf4V7AMbuis24txrw1xBJAlHgMizoLdkI4CVBREEZWTKGK9bKqa3G3QDO2G7Z2ljqAYyxPmPgI4XHpw2A7ES+d/unZzlwM2BNnwEKb1QFGJNPabdg9GB1v2993W71BNOamNZEWpfXy5nW8/3U9UQBkqTyy677F6rrkvQDptjzJ/PSbekAAAAASUVORK5CYII=">http://www.youtube.com/watch?v=khL4s2bvn-8</A>
<DT><A HREF="http://www.youtube.com/watch?v=XTndQ7bYV0A" ADD_DATE="1320876271" LAST_MODIFIED="1320876271">Paramore - Walmart Soundcheck 6-For a pessimist(HQ)</A>
<DT><A HREF="http://www.youtube.com/watch?v=xTT2MqgWRRc" ADD_DATE="1320876284" LAST_MODIFIED="1320876284">Paramore - Walmart Soundcheck 5-Pressure(HQ)</A>
<DT><A HREF="http://www.youtube.com/watch?v=J2ZYQngwSUw" ADD_DATE="1320876291" LAST_MODIFIED="1320876291">Paramore - Wal-Mart Soundcheck Interview</A>
<DT><A HREF="http://www.youtube.com/watch?v=9RZwvg7unrU" ADD_DATE="1320878207" LAST_MODIFIED="1320878207">Paramore - 08 - Interview [ Wal-Mart Soundcheck ]</A>
<DT><A HREF="http://www.youtube.com/watch?v=vz3qOYWwm10" ADD_DATE="1320878295" LAST_MODIFIED="1320878295">Paramore - 04 - That's What You Get [ Wal-Mart Soundcheck ]</A>
<DT><A HREF="http://www.youtube.com/watch?v=yarv52QX_Yw" ADD_DATE="1320878301" LAST_MODIFIED="1320878301">Paramore - 05 - Pressure [ Wal-Mart Soundcheck ]</A>
<DT><A HREF="http://www.youtube.com/watch?v=LRREY1H3GCI" ADD_DATE="1320878317" LAST_MODIFIED="1320878317">Paramore - Walmart Promo</A>

It's a standard bookmarks export file from Firefox.

I feed it into bookmarks.py which looks like this:

#!/usr/bin/env python

import sys
import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup 

url_list = sys.argv[1]
urls = [tag['href'] for tag in 
BeautifulSoup(open(url_list)).findAll('a')] 

print urls

This returns a much cleaner list of URLs:

[u'http://www.youtube.com/watch?v=Gg81zi0pheg', u'http://www.youtube.com/watch?v=pP9VjGmmhfo', u'http://www.youtube.com/watch?v=yTA1u6D1fyE', u'http://www.youtube.com/watch?v=4v8HvQf4fgE', u'http://www.youtube.com/watch?v=e9zG20wQQ1U', u'http://www.youtube.com/watch?v=khL4s2bvn-8', u'http://www.youtube.com/watch?v=XTndQ7bYV0A', u'http://www.youtube.com/watch?v=xTT2MqgWRRc', u'http://www.youtube.com/watch?v=J2ZYQngwSUw', u'http://www.youtube.com/watch?v=9RZwvg7unrU', u'http://www.youtube.com/watch?v=vz3qOYWwm10', u'http://www.youtube.com/watch?v=yarv52QX_Yw', u'http://www.youtube.com/watch?v=LRREY1H3GCI']

My next step is to feed each of the YouTube URLs into video_info.py:

#!/usr/bin/python

import urlparse
import sys
import gdata.youtube
import gdata.youtube.service
import re
import urllib2

youtube_url = sys.argv[1]
url_data = urlparse.urlparse(youtube_url)
query = urlparse.parse_qs(url_data.query)
youtube_id = query["v"][0]

print youtube_id

yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'AI39si4yOmI0GEhSTXH0nkiVDf6tQjCkqoys5BBYLKEr-PQxWJ0IlwnUJAcdxpocGLBBCapdYeMLIsB7KVC_OA8gYK0VKV726g'

entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)

print 'Video title: %s' % entry.media.title.text
print 'Video view count: %s' % entry.statistics.view_count

When given the URL "http://www.youtube.com/watch?v=aXrgwC1rsw4", the output looks like this:

aXrgwC1rsw4
Video title: OneRepublic   Good Life  Live Walmart Soundcheck
Video view count: 202
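
One thing worth noting about the id extraction: urlparse.parse_qs returns a dict mapping each query parameter to a list of values, so the same query["v"][0] lookup keeps working when a URL carries extra parameters. A quick sketch (the feature=related URL is made up for illustration):

import urlparse

url_data = urlparse.urlparse('http://www.youtube.com/watch?v=aXrgwC1rsw4&feature=related')
query = urlparse.parse_qs(url_data.query)
# query == {'v': ['aXrgwC1rsw4'], 'feature': ['related']}
print query['v'][0]   # prints: aXrgwC1rsw4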

How do I feed the list of urls from bookmarks.py into video_info.py?

*Extra points for output in CSV format, and extra extra points for checking bookmarks.html for duplicates before passing the data to video_info.py.*

Thanks for all your help, guys. Because of Stack Overflow I've gotten this far.

David

Update:

So combined I now have:

#!/usr/bin/env python

import urlparse
import gdata.youtube
import gdata.youtube.service
import re
import urllib2
import sys
import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup 

yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'AI39si4yOmI0GEhSTXH0nkiVDf6tQjCkqoys5BBYLKEr-PQxWJ0IlwnUJAcdxpocGLBBCapdYeMLIsB7KVC_OA8gYK0VKV726g'

url_list = sys.argv[1]
urls = [tag['href'] for tag in 
BeautifulSoup(open(url_list)).findAll('a')] 

print urls

youtube_url = urls
url_data = urlparse.urlparse(youtube_url)
query = urlparse.parse_qs(url_data.query)
youtube_id = query["v"][0]

#list(set(my_list))

entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)

myyoutubes = []
myyoutubes.append(", ".join([youtube_id, entry.media.title.text,entry.statistics.view_count]))
print "\n".join(myyoutubes)

How do I pass the list of URLs to the youtube_url variable? They need to be cleaned up further and passed one at a time, I believe.

I've got it down to this now:

#!/usr/bin/env python

import urlparse
import gdata.youtube
import gdata.youtube.service
import re
import urllib2
import sys
import BeautifulSoup as bs
from BeautifulSoup import BeautifulSoup 

yt_service = gdata.youtube.service.YouTubeService()
yt_service.developer_key = 'AI39si4yOmI0GEhSTXH0nkiVDf6tQjCkqoys5BBYLKEr-PQxWJ0IlwnUJAcdxpocGLBBCapdYeMLIsB7KVC_OA8gYK0VKV726g'

url_list = sys.argv[1]
urls = [tag['href'] for tag in 
    BeautifulSoup(open(url_list)).findAll('a')] 

for url in urls:
    youtube_url = url

url_data = urlparse.urlparse(youtube_url)
query = urlparse.parse_qs(url_data.query)
youtube_id = query["v"][0]

#list(set(my_list))

entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)

myyoutubes = []
myyoutubes.append(", ".join([youtube_id, entry.media.title.text,entry.statistics.view_count]))
print "\n".join(myyoutubes)

I can pass bookmarks.html to combined.py but it only returns the first line.

How do I loop through each line of youtube_url?
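
My guess at what's going wrong: the for loop above only rebinds youtube_url, so everything after it runs once, on whichever URL the loop left behind. A minimal sketch of the fix, I believe, is to indent the per-URL work into the loop body and create the list once before it:

myyoutubes = []
for youtube_url in urls:
    url_data = urlparse.urlparse(youtube_url)
    query = urlparse.parse_qs(url_data.query)
    if 'v' not in query:
        continue  # not a watch url; skip it
    youtube_id = query['v'][0]
    entry = yt_service.GetYouTubeVideoEntry(video_id=youtube_id)
    myyoutubes.append(", ".join([youtube_id, entry.media.title.text,
                                 entry.statistics.view_count]))
print "\n".join(myyoutubes)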


Answer from 吻泪:


You should provide a string to BeautifulSoup:

# parse bookmarks.html
with open(sys.argv[1]) as bookmark_file:
    soup = BeautifulSoup(bookmark_file.read())

# extract youtube video urls
video_url_regex = re.compile('http://www.youtube.com/watch')
urls = [link['href'] for link in soup('a', href=video_url_regex)]

Separate the very fast URL parsing from the much slower download of the stats:

# extract video ids from the urls
ids = [] # you could use `set()` and `ids.add()` to avoid duplicates
for video_url in urls:
    url = urlparse.urlparse(video_url)
    video_id = urlparse.parse_qs(url.query).get('v')
    if not video_id: continue # no video_id in the url
    ids.append(video_id[0])

You don't need to authenticate for read-only requests:

# get some statistics for the videos
yt_service = YouTubeService()
yt_service.ssl = True #NOTE: it works for readonly requests
yt_service.debug = True # show requests

Save the statistics to a CSV file given on the command line. Don't stop if some video causes an error:

writer = csv.writer(open(sys.argv[2], 'wb')) # save to csv file
for video_id in ids:
    try:
        entry = yt_service.GetYouTubeVideoEntry(video_id=video_id)
    except Exception, e:
        print >>sys.stderr, "Failed to retrieve entry video_id=%s: %s" %(
            video_id, e)
    else:
        title = entry.media.title.text
        print "Title:", title
        view_count = entry.statistics.view_count
        print "View count:", view_count
        writer.writerow((video_id, title, view_count)) # write it

Here's the full script, reassembled from the pieces above (the original answer linked to a recorded playback of how it was written).
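
(A sketch of that assembly, stitched together from the snippets above; the seen-set is my reading of the "you could use set()" comment in the id-extraction step, everything else is the answer's own code.)

#!/usr/bin/env python
import csv
import re
import sys
import urlparse

from BeautifulSoup import BeautifulSoup
from gdata.youtube.service import YouTubeService

# parse bookmarks.html
with open(sys.argv[1]) as bookmark_file:
    soup = BeautifulSoup(bookmark_file.read())

# extract youtube video urls
video_url_regex = re.compile('http://www.youtube.com/watch')
urls = [link['href'] for link in soup('a', href=video_url_regex)]

# extract video ids from the urls, skipping duplicates
ids = []
seen = set()
for video_url in urls:
    url = urlparse.urlparse(video_url)
    video_id = urlparse.parse_qs(url.query).get('v')
    if not video_id:
        continue # no video_id in the url
    if video_id[0] not in seen:
        seen.add(video_id[0])
        ids.append(video_id[0])

# get some statistics for the videos
yt_service = YouTubeService()
yt_service.ssl = True # works for readonly requests

writer = csv.writer(open(sys.argv[2], 'wb')) # save to csv file
for video_id in ids:
    try:
        entry = yt_service.GetYouTubeVideoEntry(video_id=video_id)
    except Exception, e:
        print >>sys.stderr, "Failed to retrieve entry video_id=%s: %s" % (
            video_id, e)
    else:
        title = entry.media.title.text
        print "Title:", title
        view_count = entry.statistics.view_count
        print "View count:", view_count
        writer.writerow((video_id, title, view_count)) # write it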

Output

$ python download-video-stats.py neudorfer.html out.csv
send: u'GET https://gdata.youtube.com/feeds/api/videos/Gg81zi0pheg HTTP/1.1\r\nAcc
ept-Encoding: identity\r\nHost: gdata.youtube.com\r\nContent-Type: application/ato
m+xml\r\nUser-Agent: None GData-Python/2.0.15\r\n\r\n'                           
reply: 'HTTP/1.1 200 OK\r\n'
header: X-GData-User-Country: RU
header: Content-Type: application/atom+xml; charset=UTF-8
header: Expires: Thu, 10 Nov 2011 19:31:23 GMT
header: Date: Thu, 10 Nov 2011 19:31:23 GMT
header: Cache-Control: private, max-age=300, no-transform
header: Vary: *
header: GData-Version: 1.0
header: Last-Modified: Wed, 02 Nov 2011 08:58:11 GMT
header: Transfer-Encoding: chunked
header: X-Content-Type-Options: nosniff
header: X-Frame-Options: SAMEORIGIN
header: X-XSS-Protection: 1; mode=block
header: Server: GSE
Title: Paramore - Let The Flames Begin [Wal-Mart Soundcheck]
View count: 27807

out.csv

Gg81zi0pheg,Paramore - Let The Flames Begin [Wal-Mart Soundcheck],27807
pP9VjGmmhfo,Paramore: Wal-Mart Soundcheck,1363078
yTA1u6D1fyE,Paramore-Walmart Soundcheck 7-CrushCrushCrush(HQ),843
4v8HvQf4fgE,Paramore-Walmart Soundcheck 4-That's What You Get(HQ),1429
e9zG20wQQ1U,Paramore-Walmart Soundcheck 8-Interview(HQ),1306
khL4s2bvn-8,Paramore-Walmart Soundcheck 3-Emergency(HQ),796
XTndQ7bYV0A,Paramore-Walmart Soundcheck 6-For a pessimist(HQ),599
xTT2MqgWRRc,Paramore-Walmart Soundcheck 5-Pressure(HQ),963
J2ZYQngwSUw,Paramore - Wal-Mart Soundcheck Interview,10261
9RZwvg7unrU,Paramore - 08 - Interview [Wal-Mart Soundcheck],1674
vz3qOYWwm10,Paramore - 04 - That's What You Get [Wal-Mart Soundcheck],1268
yarv52QX_Yw,Paramore - 05 - Pressure [Wal-Mart Soundcheck],1296
LRREY1H3GCI,Paramore - Walmart Promo,523
Answer from 漫漫岁月:


Why not just combine the two files? Also, you may want to break it up into methods to make it easier to understand later.

Also, for CSV you're going to want to accumulate your data. So maybe keep a list, and each time through the loop append a CSV line of YouTube info:

myyoutubes = []
...
myyoutubes.append(", ".join([youtubeid, entry.media.title.text,entry.statistics.view_count]))
...
"\n".join(myyoutubes)

For duplicates, I normally do this:
list(set(my_list))
Sets only have unique elements.
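
One caveat, for what it's worth: list(set(my_list)) doesn't preserve the original bookmark order. A small order-preserving sketch (unique_in_order is a made-up helper name, not from either script above):

def unique_in_order(items):
    # keep the first occurrence of each item, in the order seen
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

urls = unique_in_order(urls)  # dedupe before the lookup loop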
