Python Selenium Chrome Webdriver error: invalid session id



I am applying a function that uses Selenium to scrape a URL to each row of a pandas DataFrame. I am scraping many websites (on the order of 10^4). After 50 or so websites are scraped successfully, I get an InvalidSessionIdException. I only close the driver explicitly at the end of my computation, so I am confused why I am getting this error.

Here is the code sample for reference:

This is the code that scrapes each individual website:

from selenium.common.exceptions import WebDriverException

def scrape_all_text(url, keyword, wd):
  try:
    print(str(url))
    # Prepend a scheme if the URL doesn't already have one
    if str(url).startswith("http://") or str(url).startswith("https://"):
      wd.get(str(url))
    else:
      wd.get("http://" + str(url))
    # Grab all visible text from the page body, flattened onto one line
    text = wd.find_element_by_tag_name("body").text.replace('\n', ' ')
    print(f"KEYWORD: {keyword}, TEXT: {text}")
    return text
  except WebDriverException as e:
    print(f"KEYWORD: {keyword}, TEXT: {None}, EXCEPTION: {e}")
    return None
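
Side note: find_element_by_tag_name is the old Selenium 3 API, which matches the Python 3.7 environment visible in the traceback below. Under Selenium 4 the equivalent lookup inside scrape_all_text would be:

from selenium.webdriver.common.by import By

# Selenium 4 equivalent of wd.find_element_by_tag_name("body")
text = wd.find_element(By.TAG_NAME, "body").text.replace('\n', ' ')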

This is the generator that is supposed to scrape my websites one subset at a time and yield each scraped subset back to me:

import math

import numpy as np
from selenium import webdriver

def split_and_scrape(split_percent, df, col_to_add, scrape_func):
  # With split_percent=0.001 this gives ceil(1/0.001) = 1000 splits
  num_splits = math.ceil(np.reciprocal(split_percent))
  entries_per_split = int(len(df.index) * split_percent)
  split_df_list = np.array_split(df, num_splits)
  for i, split in enumerate(split_df_list):
    # A fresh driver is created for every split
    wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
    wd.set_page_load_timeout(20)
    print(f"Running on {entries_per_split*i}th - {entries_per_split*(i+1)}th entries")
    split[col_to_add] = split.apply(lambda x: scrape_func(x['guess_site_url'], x['keyword'], wd), axis=1)
    wd.close()
    yield split
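
For reference, the chrome_options object used above is not defined anywhere in the snippets. A typical headless-Chrome configuration for Colab (assumed here for completeness, not taken from the post) looks like:

from selenium import webdriver

# Assumed Colab-style headless setup; not part of the original post
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

The --no-sandbox and --disable-dev-shm-usage flags are commonly needed for Chrome to run stably in containerized environments such as Colab.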

This is the error I run into after a while:

InvalidSessionIdException                 Traceback (most recent call last)
<ipython-input-90-55b98c96b157> in <module>()
      2 # wd.set_page_load_timeout(20)
      3 # merged['page_contents'] = merged.apply(lambda x: scrape_all_text(x['guess_site_url'], x['keyword'], wd), axis=1) #next put in function where merged saves every few entries
----> 4 for i, split in enumerate(split_and_scrape(0.001, merged, 'page_contents', scrape_all_text)):
      5   split.to_csv(f"page_contents_{i}.csv")

3 frames
/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    245                 alert_text = value['alert'].get('text')
    246             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 247         raise exception_class(message, screen, stacktrace)
    248 
    249     def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:

InvalidSessionIdException: Message: invalid session id
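
An invalid session id typically means the browser process backing the session is gone (for example, because Chrome crashed), so every further command on that driver fails. Purely as an illustrative sketch, not from the original post, one could catch the exception and recreate the driver:

from selenium.common.exceptions import InvalidSessionIdException

try:
  wd.get(url)
except InvalidSessionIdException:
  # The Chrome process backing this session has died; the old session id
  # is no longer valid, so start a fresh driver before retrying
  wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
  wd.set_page_load_timeout(20)
  wd.get(url)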

For context, I am running this code in Google Colab, although I am not sure whether this is relevant to the error I'm getting.
