Debugging a Python for loop with breakpoints


I'm writing a Python script for a scraping project, using bs4.

  1. After a few pages were scraped, the script raised an "IndexError". To work around it, I modified the code and put an incremental variable in a "for loop", specifically at the higher-level category that contains the pages. This cut the runtime by at least half.

  2. Even with this "bypass solution", I still have to wait a long time.

  3. That brought me to breakpoints and debugging. But here is the main problem: how do I debug a for loop from the beginning without losing the incremental variable? (A checkpoint sketch follows the code below.)

Consider that I'm using PyCharm/Spyder. So far I have:

  1. Read a lot on this topic, without finding any solution.

  2. Watched a lot of YouTube video tutorials.

  3. Put and removed breakpoints in the gutter (see the conditional-breakpoint sketch right after this list).
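
One way to cut down the waiting inside the debugger is a conditional breakpoint: in PyCharm you can right-click a gutter breakpoint and fill in its Condition field (Spyder offers the same through its breakpoint settings), so execution only pauses once the loop reaches the iteration you care about. The same idea works in plain Python via the built-in `breakpoint()`; a minimal sketch, where the loop shape and the `151` threshold are illustrative assumptions rather than values from the script below:

```python
# Minimal sketch: pause only at the iteration of interest instead of
# stepping through every pass. The offsets below are illustrative.
for count in range(51, 501, 50):   # hypothetical pagination offsets
    # ... fetch and parse the page at this offset ...
    if count >= 151:               # condition for the page to inspect
        breakpoint()               # drops into pdb; `count` keeps its current value
```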

```python
from bs4 import BeautifulSoup
import requests
from fake_headers import Headers

# result buckets for the fields scraped from each product card
branked_c0, link_ranked_c0, q_a_c0, condition_soldnumber_c0, price_tag_c0, link_image_c0, link_product_c0, number_sales_seller_c0, name_product_c0 = [], [], [], [], [], [], [], [], []

original_link = "https://mercadolibre.cl/categorias#menu=categories"
req = requests.get(original_link, headers=Headers().generate())
soup = BeautifulSoup(req.text, 'html.parser')
main = soup.select(".categories__container")[12]  # category block
sub_links = []
end_ = "_Desde_{}"  # pagination suffix with the item-offset placeholder

for sbc in main:
    title_main = sbc.nextSibling
    print(type(title_main))
    print(len(title_main.contents))
    for i in range(len(title_main.contents)):
        print(i)
        subaru = title_main.contents[i]
        print(subaru)
        for xo in subaru("a"):
            nombre_subcat = xo.string
            link_subcat = xo.get("href")
            print(link_subcat)
            sub_links.append(link_subcat)
            print(sub_links)
            for link in sub_links:
                req = requests.get(link.format(0), headers=Headers().generate())
                print(str(link) + str(nombre_subcat) + " = link without page number")
                soup = BeautifulSoup(req.text, 'html.parser')
                try:
                    page_count = soup.select(".andes-pagination__page-count")[0]
                    total_count = page_count.text.split(" ")[-1]
                except IndexError:
                    # no pagination widget on this page; default both counters
                    # so int(total_count) below does not raise a NameError
                    page_count = 0
                    total_count = 0
                print(str(page_count) + " = subcategory page count")
                # print(str(total_count) + " = total number of pages in the category")
                count = 51  # MercadoLibre offsets result pages in steps of 50 items
                for page_no in range(int(total_count)):
                    req = requests.get(link.format(count), headers=Headers().generate())
                    soup = BeautifulSoup(req.text, 'html.parser')
                    sbc = soup.select(".ui-search-item__group.ui-search-item__group--title")
                    count += 50  # advance the offset; without this every iteration refetched the same page
```
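
As for the main question, restarting the loop without losing the incremental variable: one common approach is to checkpoint the counters to disk after each page, so a fresh run, including one relaunched under the PyCharm or Spyder debugger, resumes where the previous one stopped instead of from zero. A minimal sketch under stated assumptions: the checkpoint file name is made up, and the placeholder list and fixed page limit stand in for `sub_links` and `total_count` from the script above.

```python
import json
import os

CHECKPOINT = "scrape_state.json"  # hypothetical file holding the saved loop state

def load_state():
    # Resume from the last saved counters, or start fresh on the first run.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"link_index": 0, "offset": 51}

def save_state(link_index, offset):
    # Persist the counters after every page so a crash, an IndexError, or a
    # debugger restart can pick up exactly where the run stopped.
    with open(CHECKPOINT, "w") as f:
        json.dump({"link_index": link_index, "offset": offset}, f)

sub_links = ["https://example.com/listing_Desde_{}"]  # placeholder; use the real list
state = load_state()

for idx in range(state["link_index"], len(sub_links)):
    link = sub_links[idx]
    # On the resumed link continue from the saved offset; later links start over.
    offset = state["offset"] if idx == state["link_index"] else 51
    for _ in range(5):  # illustrative page limit; the real script would use total_count
        # ... requests/BeautifulSoup work for link.format(offset) goes here ...
        save_state(idx, offset)
        offset += 50
```

Deleting `scrape_state.json` resets the run; leaving it in place means every restart continues from the saved counters, which also makes it cheap to reproduce the state right before the IndexError under a breakpoint.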

