Finding a specific link in a web page with BeautifulSoup
from BeautifulSoup import BeautifulSoup
import urllib2
import re
user = raw_input('begin here!: ')
base = ("http://1337x.org/search/")
print (base + user)
add_on = "/0/"
total_link = (base + user + add_on)
html_data = urllib2.urlopen(total_link, 'r').read()
soup = BeautifulSoup(html_data)
announce = soup.find('a', attrs={'href': re.compile("^/announcelist")})
print announce
I am attempting to retrieve a torrent link from a page, preferably the first non-sponsored link, and then print it. I am rather new to coding, so as much detail as you can give would be perfect! Thank you so much for the help!
Comments (1)
The problem is in your regular expression. You are trying to use the `^` character to negate the regex, but it does not work that way here. `^` only negates a set of characters (the characters inside `[]`), and even then only when it is the first character of the set. For example, `[^aeiou]` means "any character except `a`, `e`, `i`, `o` and `u`".

When you use `^` outside a character set, it matches the beginning of the string (or of each line in multiline mode). For example, `^aeiou` matches a string that starts with `aeiou`.

So, how would you negate a regex? The best way I see is a negative lookahead, which is a group that starts with `(?!` and ends with `)`. For your case it is pretty easy: `(?!/announcelist)`.

So, replace `re.compile("^/announcelist")` with `re.compile("^(?!/announcelist)")` and it should work. Keep the `^` in front of the lookahead so that it is checked only at the start of the `href`. Without the anchor, a search can still succeed at a later position in the string, so the pattern would accept almost any link, including the ones you want to skip.
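To make the difference between these patterns concrete, here is a minimal sketch that compares `^/announcelist`, the unanchored lookahead `(?!/announcelist)`, and the anchored form `^(?!/announcelist)`. It is written for Python 3 and uses only the standard library (`html.parser` instead of BeautifulSoup) so it runs without extra packages. The sample HTML, the `HrefCollector` class and the `first_match` helper are all hypothetical names made up for illustration, and the sketch assumes the pattern is applied with `search`, as BeautifulSoup is understood to do for compiled patterns.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup, invented for this example; not taken from the real page.
SAMPLE_HTML = """
<a href="/announcelist_123">tracker list</a>
<a href="/torrent/42/example/">example torrent</a>
<a href="/torrent/43/other/">other torrent</a>
"""

starts_with = re.compile(r"^/announcelist")    # href begins with /announcelist
unanchored = re.compile(r"(?!/announcelist)")  # lookahead may be tried at any position
anchored = re.compile(r"^(?!/announcelist)")   # href does NOT begin with /announcelist


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def first_match(pattern, hrefs):
    """Return the first href on which pattern.search succeeds, else None."""
    for href in hrefs:
        if pattern.search(href):
            return href
    return None


collector = HrefCollector()
collector.feed(SAMPLE_HTML)

print(first_match(starts_with, collector.hrefs))  # the /announcelist link
print(first_match(unanchored, collector.hrefs))   # still the /announcelist link
print(first_match(anchored, collector.hrefs))     # first link that is not /announcelist
```

The unanchored lookahead still returns the `/announcelist` link because the search fails at position 0 but succeeds one character later, where the remaining text no longer starts with `/announcelist`. Anchoring with `^` removes that escape route.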