Python Selenium Chrome Webdriver error: invalid session id



I am applying a function that uses Selenium to scrape a URL to each row of a pandas DataFrame. I am scraping many websites (on the order of 10^4). After 50 or so websites are scraped successfully, I get an InvalidSessionIdException. I only close the driver explicitly at the end of my computation, so I am confused why I am getting this error.

Here is the code sample for reference:

This is the code that scrapes each individual website:

from selenium.common.exceptions import WebDriverException

def scrape_all_text(url, keyword, wd):
  try:
    print(str(url))
    # Prepend a scheme if the URL doesn't already have one
    if str(url).startswith("http://") or str(url).startswith("https://"):
      wd.get(str(url))
    else:
      wd.get("http://" + str(url))
    # Grab all visible text from the page body, flattened onto one line
    text = wd.find_element_by_tag_name("body").text.replace('\n', ' ')
    print(f"KEYWORD: {keyword}, TEXT: {text}")
    return text
  except WebDriverException as e:
    print(f"KEYWORD: {keyword}, TEXT: {None}, EXCEPTION: {e}")
    return None
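
Side note: find_element_by_tag_name is the old Selenium 3 API, which matches the Python 3.7 environment visible in the traceback below. Under Selenium 4 the equivalent lookup inside scrape_all_text would be:

from selenium.webdriver.common.by import By

# Selenium 4 equivalent of wd.find_element_by_tag_name("body")
text = wd.find_element(By.TAG_NAME, "body").text.replace('\n', ' ')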

This is the generator that is supposed to scrape my websites one subset at a time and yield each scraped subset back to me:

import math

import numpy as np
from selenium import webdriver

def split_and_scrape(split_percent, df, col_to_add, scrape_func):
  # With split_percent=0.001 this gives ceil(1/0.001) = 1000 splits
  num_splits = math.ceil(np.reciprocal(split_percent))
  entries_per_split = int(len(df.index) * split_percent)
  split_df_list = np.array_split(df, num_splits)
  for i, split in enumerate(split_df_list):
    # A fresh driver is created for every split
    wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
    wd.set_page_load_timeout(20)
    print(f"Running on {entries_per_split*i}th - {entries_per_split*(i+1)}th entries")
    split[col_to_add] = split.apply(lambda x: scrape_func(x['guess_site_url'], x['keyword'], wd), axis=1)
    wd.close()
    yield split
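
For reference, the chrome_options object used above is not defined anywhere in the snippets. A typical headless-Chrome configuration for Colab (assumed here for completeness, not taken from the post) looks like:

from selenium import webdriver

# Assumed Colab-style headless setup; not part of the original post
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

The --no-sandbox and --disable-dev-shm-usage flags are commonly needed for Chrome to run stably in containerized environments such as Colab.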

This is the error I run into after a while:

InvalidSessionIdException                 Traceback (most recent call last)
<ipython-input-90-55b98c96b157> in <module>()
      2 # wd.set_page_load_timeout(20)
      3 # merged['page_contents'] = merged.apply(lambda x: scrape_all_text(x['guess_site_url'], x['keyword'], wd), axis=1) #next put in function where merged saves every few entries
----> 4 for i, split in enumerate(split_and_scrape(0.001, merged, 'page_contents', scrape_all_text)):
      5   split.to_csv(f"page_contents_{i}.csv")

3 frames
/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    245                 alert_text = value['alert'].get('text')
    246             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 247         raise exception_class(message, screen, stacktrace)
    248 
    249     def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:

InvalidSessionIdException: Message: invalid session id
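
An invalid session id typically means the browser process backing the session is gone (for example, because Chrome crashed), so every further command on that driver fails. Purely as an illustrative sketch, not from the original post, one could catch the exception and recreate the driver:

from selenium.common.exceptions import InvalidSessionIdException

try:
  wd.get(url)
except InvalidSessionIdException:
  # The Chrome process backing this session has died; the old session id
  # is no longer valid, so start a fresh driver before retrying
  wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
  wd.set_page_load_timeout(20)
  wd.get(url)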

For context, I am running this code in Google Colab, although I am not sure whether this is relevant to the error I'm getting.
