404 urllib2.urlopen() error
I'm trying to scrape a website using urllib2. However, I get a 404 Page not found error. Here is my code:
import re
import urllib2

rec_text = 'Genesis 1:1'
my_text = rec_text.strip()
book = my_text.split()[0]
chapter_verse = my_text.split()[1]
chapter = chapter_verse.split(':')[0]
verse = chapter_verse.split(':')[1]
webpage = urllib2.urlopen('http://bible.cc/'+book+'/'+chapter+'-'+verse+'.htm').read()
stuffToSearch = ""
for line in webpage:
    stuffToSearch += line
search_for = re.compile(r'<a href="http://kingjbible.com/'+book+'/'+chapter+'.htm">King James Bible</a></span><br>(.*)<p><span class="versiontext"><a href="http://kjv.us/'+book+'/'+chapter+'.htm">')
search_it = re.search(search_for, stuffToSearch)
print(search_it.group(1))
Comments (2)
Looking at the bible.cc site, it appears that capitalization matters. You need genesis and not Genesis, which you can get by changing the line to book = my_text.split()[0].lower().
Edit: The rest of this doesn't actually relate to the error, but here are some other tips.
You can streamline your code a bit by using multiple assignment wherever one operation produces two or more values.
There's also a way to join a list of strings together without a for loop. Use join: the string it is called on becomes the separator between the elements of the list (basically the opposite of split).
I guess there's nothing wrong with the page retrieval, though I'd imagine readlines would be slightly more efficient than read. The same goes for the regular expression: you don't need to compile it if you're only using it once. You could probably come up with an expression that's independent of the book and chapter and can be reused, however.
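The tips in this answer could be sketched roughly as follows. This is only an illustration: parse_reference is a hypothetical helper name, the sample strings are made up, and the unpacking assumes a single-word book name such as Genesis (a reference like "1 John 1:1" would need different handling).

```python
import re


def parse_reference(rec_text):
    """Split a reference such as 'Genesis 1:1' into (book, chapter, verse)."""
    # Multiple assignment unpacks each two-element split in one step.
    book, chapter_verse = rec_text.strip().split()
    chapter, verse = chapter_verse.split(':')
    # Lowercase the book name, since the site appears to be case-sensitive.
    return book.lower(), chapter, verse


book, chapter, verse = parse_reference('Genesis 1:1')
url = 'http://bible.cc/' + book + '/' + chapter + '-' + verse + '.htm'

# join is the inverse of split: the string it is called on is the separator.
pieces = ['In the beginning', ' God created']  # made-up sample data
text = ''.join(pieces)

# re.search accepts a pattern string directly, so a one-off pattern
# does not have to go through re.compile first.
sample_html = '<br>In the beginning<p>'  # made-up stand-in for the page
match = re.search(r'<br>(.*)<p>', sample_html)
```

The sample pattern is deliberately much shorter than the one in the question; it only shows the call shape, not a pattern known to match the real page.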
The process is right; it's just that the URL you build might not be correct.
Why don't you assign
'http://bible.cc/'+book+'/'+chapter+'-'+verse+'.htm'
to a variable and print it before passing it to urlopen?
That way you can verify that the URL is formed correctly.
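A minimal sketch of that suggestion, assuming the same URL scheme as in the question. build_url and fetch_verse_page are hypothetical helper names, and the import fallback is only there so the snippet also loads on Python 3, where urllib2 no longer exists.

```python
try:
    from urllib2 import urlopen  # Python 2, as used in the question
except ImportError:
    from urllib.request import urlopen  # Python 3 equivalent


def build_url(book, chapter, verse):
    # Same concatenation as in the question, kept in one place for inspection.
    return 'http://bible.cc/' + book + '/' + chapter + '-' + verse + '.htm'


def fetch_verse_page(book, chapter, verse):
    url = build_url(book, chapter, verse)
    print(url)  # inspect this (or paste it into a browser) if the request fails
    return urlopen(url).read()


# Printing the URL built from the question's input, without fetching it.
url = build_url('Genesis', '1', '1')
print(url)
```

With the question's input this prints http://bible.cc/Genesis/1-1.htm, which makes the capitalized book name in the path easy to spot.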