Debugging a Python for loop with breakpoints


I'm writing a Python script for a scraping project, using bs4.

  1. After a few pages were scraped, the script raised an "IndexError". To work around it, I modified the code and put an incremental variable in a "for loop", specifically at the higher-level category that contains the pages. This cut the runtime by at least half.

  2. Even with this "bypass solution", I still have to wait a long time.

  3. That brought me to breakpoints and debugging. But here is the main problem: how do I debug a for loop from the beginning without losing the incremental variable? (A checkpoint sketch follows the code below.)

Consider that I'm using PyCharm/Spyder. So far I have:

  1. Read a lot on this topic, without finding any solution.

  2. Watched a lot of YouTube video tutorials.

  3. Put and removed breakpoints in the gutter (see the conditional-breakpoint sketch right after this list).
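
One way to cut down the waiting inside the debugger is a conditional breakpoint: in PyCharm you can right-click a gutter breakpoint and fill in its Condition field (Spyder offers the same through its breakpoint settings), so execution only pauses once the loop reaches the iteration you care about. The same idea works in plain Python via the built-in `breakpoint()`; a minimal sketch, where the loop shape and the `151` threshold are illustrative assumptions rather than values from the script below:

```python
# Minimal sketch: pause only at the iteration of interest instead of
# stepping through every pass. The offsets below are illustrative.
for count in range(51, 501, 50):   # hypothetical pagination offsets
    # ... fetch and parse the page at this offset ...
    if count >= 151:               # condition for the page to inspect
        breakpoint()               # drops into pdb; `count` keeps its current value
```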

```python
from bs4 import BeautifulSoup
import requests
from fake_headers import Headers

# result buckets for the fields scraped from each product card
branked_c0, link_ranked_c0, q_a_c0, condition_soldnumber_c0, price_tag_c0, link_image_c0, link_product_c0, number_sales_seller_c0, name_product_c0 = [], [], [], [], [], [], [], [], []

original_link = "https://mercadolibre.cl/categorias#menu=categories"
req = requests.get(original_link, headers=Headers().generate())
soup = BeautifulSoup(req.text, 'html.parser')
main = soup.select(".categories__container")[12]  # category block
sub_links = []
end_ = "_Desde_{}"  # pagination suffix with the item-offset placeholder

for sbc in main:
    title_main = sbc.nextSibling
    print(type(title_main))
    print(len(title_main.contents))
    for i in range(len(title_main.contents)):
        print(i)
        subaru = title_main.contents[i]
        print(subaru)
        for xo in subaru("a"):
            nombre_subcat = xo.string
            link_subcat = xo.get("href")
            print(link_subcat)
            sub_links.append(link_subcat)
            print(sub_links)
            for link in sub_links:
                req = requests.get(link.format(0), headers=Headers().generate())
                print(str(link) + str(nombre_subcat) + " = link without page number")
                soup = BeautifulSoup(req.text, 'html.parser')
                try:
                    page_count = soup.select(".andes-pagination__page-count")[0]
                    total_count = page_count.text.split(" ")[-1]
                except IndexError:
                    # no pagination widget on this page; default both counters
                    # so int(total_count) below does not raise a NameError
                    page_count = 0
                    total_count = 0
                print(str(page_count) + " = subcategory page count")
                # print(str(total_count) + " = total number of pages in the category")
                count = 51  # MercadoLibre offsets result pages in steps of 50 items
                for page_no in range(int(total_count)):
                    req = requests.get(link.format(count), headers=Headers().generate())
                    soup = BeautifulSoup(req.text, 'html.parser')
                    sbc = soup.select(".ui-search-item__group.ui-search-item__group--title")
                    count += 50  # advance the offset; without this every iteration refetched the same page
```
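
As for the main question, restarting the loop without losing the incremental variable: one common approach is to checkpoint the counters to disk after each page, so a fresh run, including one relaunched under the PyCharm or Spyder debugger, resumes where the previous one stopped instead of from zero. A minimal sketch under stated assumptions: the checkpoint file name is made up, and the placeholder list and fixed page limit stand in for `sub_links` and `total_count` from the script above.

```python
import json
import os

CHECKPOINT = "scrape_state.json"  # hypothetical file holding the saved loop state

def load_state():
    # Resume from the last saved counters, or start fresh on the first run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"link_index": 0, "offset": 51}

def save_state(link_index, offset):
    # Persist the counters after every page so a crash, an IndexError, or a
    # debugger restart can pick up exactly where the run stopped.
    with open(CHECKPOINT, "w") as f:
        json.dump({"link_index": link_index, "offset": offset}, f)

sub_links = ["https://example.com/listing_Desde_{}"]  # placeholder; use the real list
state = load_state()

for idx in range(state["link_index"], len(sub_links)):
    link = sub_links[idx]
    # On the resumed link continue from the saved offset; later links start over.
    offset = state["offset"] if idx == state["link_index"] else 51
    for _ in range(5):  # illustrative page limit; the real script would use total_count
        # ... requests/BeautifulSoup work for link.format(offset) goes here ...
        save_state(idx, offset)
        offset += 50
```

Deleting `scrape_state.json` resets the run; leaving it in place means every restart continues from the saved counters, which also makes it cheap to reproduce the state right before the IndexError under a breakpoint.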

