404 error with urllib2.urlopen()

Posted on 2024-11-28 17:50:26

I'm trying to scrape a website using urllib2, but I get a 404 Page Not Found error. Here is my code:

import re
import urllib2  # Python 2 only

rec_text = 'Genesis 1:1'
my_text = rec_text.strip()
book = my_text.split()[0]
chapter_verse = my_text.split()[1]
chapter = chapter_verse.split(':')[0]
verse = chapter_verse.split(':')[1]
# This request is the one that fails with the 404 error
webpage = urllib2.urlopen('http://bible.cc/'+book+'/'+chapter+'-'+verse+'.htm').read()
stuffToSearch = ""
for line in webpage:
    stuffToSearch += line
search_for = re.compile(r'<a href="http://kingjbible.com/'+book+'/'+chapter+'.htm">King James Bible</a></span><br>(.*)<p><span class="versiontext"><a href="http://kjv.us/'+book+'/'+chapter+'.htm">')
search_it = re.search(search_for, stuffToSearch)
print(search_it.group(1))


昇り龍 2024-12-05 17:50:27

Looking at the bible.cc site, it appears that the URL path is case-sensitive. You need genesis, not Genesis, which you can get by changing the line to book = my_text.split()[0].lower().
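As a minimal sketch of that fix, the URL can be built from a lowercased reference, and the HTTP error can be caught so that a wrong path reports its status code instead of a traceback. The helper names build_url and fetch are made up for illustration, the URL layout is copied from the question, and the import fallback is only there so the same code runs on Python 2 (urllib2) and Python 3 (urllib.request).

```python
try:
    # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError
except ImportError:
    # Python 3 moved the same names into urllib.request / urllib.error
    from urllib.request import urlopen
    from urllib.error import HTTPError


def build_url(reference):
    """Turn a reference such as 'Genesis 1:1' into a bible.cc URL.

    Hypothetical helper; the path layout is copied from the question.
    """
    book, chapter_verse = reference.strip().lower().split()
    chapter, verse = chapter_verse.split(':')
    return 'http://bible.cc/%s/%s-%s.htm' % (book, chapter, verse)


def fetch(url):
    """Return the page body, or None if the server answers with an HTTP error."""
    try:
        return urlopen(url).read()
    except HTTPError as err:
        # err.code holds the HTTP status, e.g. 404 for a missing page
        print('Request for %s failed with HTTP %d' % (url, err.code))
        return None
```

With this, build_url('Genesis 1:1') gives 'http://bible.cc/genesis/1-1.htm', which is the lowercase path described above.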

Edit: The rest of this doesn't actually relate to the error, but has some other tips.

You can streamline your code a bit by using multiple assignment where you have two or more values being output from one operation.

rec_text = 'Genesis 1:1'
my_text = rec_text.strip().lower()
book, chapter_verse = my_text.split()
chapter, verse = chapter_verse.split(':')
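
One caveat with unpacking: it raises ValueError when the number of pieces doesn't match the number of names, which would happen for a book name containing a space. The sketch below is a variant that splits from the right, so only the last run of whitespace separates the book from chapter:verse. parse_reference is a hypothetical helper, and how bible.cc spells multi-word books in its URLs is an open question that this does not answer.

```python
def parse_reference(reference):
    """Split 'Book C:V' into lowercase (book, chapter, verse) strings."""
    # rsplit(None, 1) splits once, on the last run of whitespace
    book, chapter_verse = reference.strip().lower().rsplit(None, 1)
    chapter, verse = chapter_verse.split(':')
    return book, chapter, verse
```

A reference with no chapter:verse part still raises ValueError, which is probably what you want for malformed input.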

There's also a way to join a list of strings together without having to use a for loop. Use join where the string calling it will be used as the separator between the elements of the list (basically the opposite of split).

stuffToSearch = "".join(webpage)
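
To make the join/split relationship concrete, here is a small self-contained sketch with made-up sample strings. It also shows that joining a plain string (which is what read() returns) just gives the same string back, so in the question's code the page could be searched directly.

```python
# Made-up sample, shaped like what readlines() might return
lines = ['<html>\n', '<body>\n', '</body>\n']

# An empty separator concatenates the pieces without adding anything
joined = ''.join(lines)
assert joined == '<html>\n<body>\n</body>\n'

# split and join undo each other when the same separator is used
words = 'In the beginning'.split(' ')
assert ' '.join(words) == 'In the beginning'

# Iterating over a str yields single characters, so joining them
# reproduces the original string
page = 'abc'
assert ''.join(page) == page
```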

There's nothing wrong with the page retrieval itself. Note that read() already returns the whole page as a single string, so it can be searched directly; the loop or join is only needed if you fetch the page with readlines(), which returns a list of lines. As for the regular expression, you don't need to compile it if you're only using it once. You could probably come up with an expression that's independent of the book and chapter and can be reused, however.
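
A reusable expression along those lines might look like the sketch below. The pattern is an assumption derived from the markup quoted in the question, not checked against the live site; VERSE_RE, extract_kjv and the sample HTML are made up for illustration.

```python
import re

# Book and chapter are matched generically instead of being baked into the pattern.
# Literal dots are escaped, and the capture is non-greedy so it stops at the first <p>.
VERSE_RE = re.compile(
    r'<a href="http://kingjbible\.com/[^/"]+/\d+\.htm">King James Bible</a></span><br>'
    r'(.*?)<p><span class="versiontext">'
)


def extract_kjv(html):
    """Return the King James text found in html, or None if the pattern is absent."""
    match = VERSE_RE.search(html)
    return match.group(1) if match else None


# Made-up snippet shaped like the markup in the question
sample = (
    '<a href="http://kingjbible.com/genesis/1.htm">King James Bible</a></span><br>'
    'In the beginning God created the heaven and the earth.'
    '<p><span class="versiontext"><a href="http://kjv.us/genesis/1.htm">'
)
print(extract_kjv(sample))
```

Returning None when nothing matches avoids the AttributeError that search_it.group(1) raises in the question's code if the page layout differs. On Python 3, read() returns bytes, so the page would need to be decoded to str before it is passed to extract_kjv.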
