BeautifulSoup 4 .find_all() suddenly stopped working

Posted 2025-02-06 17:05:43

I am trying to build an automated scientific literature collector that uses Google Scholar.
Everything was going well and I was getting the results I wanted, but suddenly something broke: although the data still goes into soup, everything comes back empty from the first .find_all() onward. Strangely enough, this does not happen when I use a pre-downloaded .htm file.

My code:

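import requests
from bs4 import BeautifulSoup as bs

# url holds the Google Scholar results URL (not shown in the question)
site=requests.get(url)
site1=site.text
soup=bs(site1, 'html.parser')
ri=soup.find_all("div", class_='gs_ri')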

Previously, ri contained 10 pieces of HTML from which later steps extracted everything I needed. This morning, for reasons beyond my comprehension, it started coming back empty, and the same happens with the previous version of the script, which I did not touch. I can follow the pipeline up until

soup=bs(site1, 'html.parser')

but not afterwards. soup itself still contains the full page, as expected.

Any help would be much appreciated, thanks in advance.
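
Since the saved .htm file parses fine while the live request does not, one way to narrow this down is to check what the live response actually contains before calling .find_all(). The sketch below is only an illustration: summarize_page is a hypothetical helper, the two HTML strings are made-up stand-ins for a saved results page and a live response, and the idea that the server might send the script a different page than a browser gets is an assumption, not something the question confirms.

```python
from bs4 import BeautifulSoup


def summarize_page(status_code, html, css_class="gs_ri"):
    """Hypothetical helper: collect a few facts about a fetched page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {
        "status_code": status_code,
        "title": title,
        "length": len(html),
        # Number of <div class="gs_ri"> blocks the parser can see
        "matches": len(soup.find_all("div", class_=css_class)),
    }


# Made-up pages standing in for a saved results file and a live response.
saved_page = (
    "<html><head><title>Results</title></head><body>"
    '<div class="gs_ri">A</div><div class="gs_ri">B</div>'
    "</body></html>"
)
live_page = (
    "<html><head><title>Please verify</title></head>"
    "<body><form></form></body></html>"
)

print(summarize_page(200, saved_page))
print(summarize_page(429, live_page))
```

With the code from the question, the call would be summarize_page(site.status_code, site.text). If matches is 0 while the saved file gives 10, the difference is in the HTML that was downloaded rather than in .find_all() itself.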

想你的星星会说话 2025-02-13 17:05:43

Use requests.get(URL) to fetch the page, then pass the response's content (the raw bytes), not its text, to BeautifulSoup.

Like this:

import requests
from bs4 import BeautifulSoup
# URL is the page to fetch (not shown in the answer)
page = requests.get(URL, timeout=5)
soup = BeautifulSoup(page.content, 'html.parser')
ri = soup.find_all("div", class_='gs_ri')

Edit: Use content because it passes the raw byte stream to BeautifulSoup, which can then detect the encoding itself. page.text is a string that requests has already decoded using an encoding it guessed, which can cause problems if that guess is wrong.
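
To make the bytes-versus-text distinction concrete, here is a small offline sketch. It assumes a made-up HTML snippet in place of a real response (html_text and html_bytes are illustrative names, playing the roles of page.text and page.content) and shows that BeautifulSoup accepts both forms, but only performs its own encoding detection when given bytes.

```python
from bs4 import BeautifulSoup

# Made-up page standing in for a downloaded response body.
html_text = (
    '<html><head><meta charset="utf-8"></head><body>'
    '<div class="gs_ri"><h3>Résumé of a paper</h3></div>'
    '<div class="other">x</div>'
    "</body></html>"
)
html_bytes = html_text.encode("utf-8")  # what page.content would hold

soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")
soup_from_text = BeautifulSoup(html_text, "html.parser")

hits_bytes = soup_from_bytes.find_all("div", class_="gs_ri")
hits_text = soup_from_text.find_all("div", class_="gs_ri")

# Both inputs are parsed; the difference is who decoded the bytes.
print(len(hits_bytes), len(hits_text))
# Set when BeautifulSoup decoded the bytes itself
print(soup_from_bytes.original_encoding)
# None when the input was already a str
print(soup_from_text.original_encoding)
```

When the text is decoded correctly, both routes find the same elements, so switching to content mainly helps in cases where the encoding was guessed wrongly.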
