404 error with urllib2.urlopen()

Posted on 2024-11-28 17:50:26

I'm trying to scrape a website using urllib2, but I get a 404 Not Found error. Here is my code:

import re
import urllib2

rec_text = 'Genesis 1:1'
my_text = rec_text.strip()
book = my_text.split()[0]
chapter_verse = my_text.split()[1]
chapter = chapter_verse.split(':')[0]
verse = chapter_verse.split(':')[1]
webpage = urllib2.urlopen('http://bible.cc/'+book+'/'+chapter+'-'+verse+'.htm').read()
stuffToSearch = ""
for line in webpage:
    stuffToSearch += line
search_for = re.compile(r'<a href="http://kingjbible.com/'+book+'/'+chapter+'.htm">King James Bible</a></span><br>(.*)<p><span class="versiontext"><a href="http://kjv.us/'+book+'/'+chapter+'.htm">')
search_it = re.search(search_for, stuffToSearch)
print(search_it.group(1))

Comments (2)

昇り龍 2024-12-05 17:50:27

Looking at the bible.cc site, it appears that capitalization matters. You need genesis and not Genesis, which you can get by changing the line to book = my_text.split()[0].lower().

Edit: The rest of this doesn't actually relate to the error, but has some other tips.

You can streamline your code a bit by using multiple assignment wherever a single operation returns two or more values.

rec_text = 'Genesis 1:1'
my_text = rec_text.strip().lower()
book, chapter_verse = my_text.split()
chapter, verse = chapter_verse.split(':')

There's also a way to join a list of strings together without a for loop. Call join on the string you want as the separator and pass it the list; the separator is placed between the elements (basically the opposite of split).

stuffToSearch = "".join(webpage)
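
As a small illustration of join being the inverse of split, here is a minimal sketch. The sample lines and variable names are made up for the example. Note that read() already returns the whole page as one string, so joining its result with an empty separator just gives the same string back; join only does real work on a list such as the one readlines() returns.

```python
# Hypothetical sample data standing in for the result of readlines().
lines = ['<html>\n', '<body>\n', '</body>\n', '</html>\n']

# An empty separator glues the pieces together unchanged.
page = ''.join(lines)

# split() and join() undo each other when the separator matches.
words = 'Genesis 1:1'.split()   # ['Genesis', '1:1']
rejoined = ' '.join(words)      # 'Genesis 1:1'

# Joining a string iterates over its characters, so nothing changes.
same = ''.join(page)
```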

I guess there's nothing wrong with the page retrieval, though I'd imagine readlines would be slightly more efficient than read. Same with the regular expression; you don't need to compile it if you're only using it once. You could probably easily come up with an expression that's independent of the book and chapter that can be used repeatedly, however.
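
One possible shape for such a reusable expression is sketched below. It assumes the page markup looks the way the regex in the question expects; the SAMPLE string is a made-up fragment built from that regex, not a copy of the real page, and the names KJV_PATTERN and extract_kjv are only illustrative.

```python
import re

# The book and chapter parts of the link are matched generically, so the
# same compiled pattern can be reused for any reference.
KJV_PATTERN = re.compile(
    r'<a href="http://kingjbible\.com/[^/"]+/[^/"]+\.htm">King James Bible</a>'
    r'</span><br>(.*?)<p><span class="versiontext">',
    re.DOTALL,
)


def extract_kjv(html):
    """Return the text after the King James Bible link, or None if absent."""
    match = KJV_PATTERN.search(html)
    return match.group(1).strip() if match else None


# Hypothetical fragment shaped like the markup the question's regex targets.
SAMPLE = (
    '<span class="versiontext">'
    '<a href="http://kingjbible.com/genesis/1.htm">King James Bible</a></span><br>'
    'In the beginning God created the heaven and the earth.'
    '<p><span class="versiontext"><a href="http://kjv.us/genesis/1.htm">'
)

print(extract_kjv(SAMPLE))
```

Checking for None before calling group(1) also avoids the AttributeError that the original code would raise when the pattern does not match.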

隔纱相望 2024-12-05 17:50:27

The process is right; the URL you build may just be incorrect.

Why not assign 'http://bible.cc/'+book+'/'+chapter+'-'+verse+'.htm' to a variable and print it before passing it to urlopen?

That way you can verify that the URL is formed correctly.
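
A minimal sketch of that debugging step follows. It is written for Python 3, where urllib2.urlopen lives at urllib.request.urlopen; that port, the lowercasing of the book name, and the helper names build_verse_url and fetch are assumptions added for illustration. It also assumes a one-word book name, so a reference like '1 Samuel 1:1' would need extra parsing.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen


def build_verse_url(reference):
    """Turn a reference such as 'Genesis 1:1' into a bible.cc-style URL."""
    book, chapter_verse = reference.strip().lower().split()
    chapter, verse = chapter_verse.split(':')
    return 'http://bible.cc/' + book + '/' + chapter + '-' + verse + '.htm'


def fetch(url):
    """Print the URL, then try to download it; return None on failure."""
    print(url)  # inspect the exact URL before the request is made
    try:
        return urlopen(url).read()
    except HTTPError as err:
        # The server answered with an error status such as 404.
        print('HTTP error', err.code, 'for', url)
    except URLError as err:
        # The server could not be reached at all.
        print('Could not reach', url, '-', err.reason)
    return None
```

Keeping the URL construction in its own function means it can be checked without touching the network, for example by printing build_verse_url('Genesis 1:1') and pasting the result into a browser.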
