Python Selenium Chrome Webdriver error: invalid session id
I am applying a function that scrapes a URL using Selenium to a pandas dataframe. I am scraping many websites (on the order of 10^4). After 50 or so websites are scraped successfully, I get an InvalidSessionIdException. I only close the driver explicitly at the end of my computation, so I am confused about why I am getting this error.
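As far as I understand, this exception means the browser session has ended while the Python driver object lives on. If that is right, something like the following should reproduce it in isolation (a minimal illustration, not my actual code; the exact exception raised may depend on the chromedriver version):

from selenium import webdriver
from selenium.common.exceptions import InvalidSessionIdException

wd = webdriver.Chrome('chromedriver')
wd.close()                        # closing the only window ends the session
try:
    wd.get("http://example.com")  # chromedriver is still running, but the session id is dead
except InvalidSessionIdException:
    print("invalid session id")   # the same message I see mid-scrape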
Here are the code samples for reference.
This is the code that scrapes each individual website:
from selenium.common.exceptions import WebDriverException

def scrape_all_text(url, keyword, wd):
    try:
        print(str(url))
        # Prepend a scheme if the URL doesn't already have one
        if (str(url).startswith("http://") or str(url).startswith("https://")):
            wd.get(str(url))
        else:
            wd.get("http://" + str(url))
        # Collect all visible text on the page, flattened to one line
        text = wd.find_element_by_tag_name("body").text.replace('\n', ' ')
        print(f"KEYWORD: {keyword}, TEXT: {text}")
        return text
    except WebDriverException as e:
        print(f"KEYWORD: {keyword}, TEXT: {None}, EXCEPTION: {e}")
        return None
This is the generator that is supposed to scrape my websites one subset at a time and yield each scraped subset back to me:
import math
import numpy as np
from selenium import webdriver

def split_and_scrape(split_percent, df, col_to_add, scrape_func):
    # Split the dataframe into roughly 1/split_percent chunks
    num_splits = math.ceil(np.reciprocal(split_percent))
    entries_per_split = int(len(df.index) * split_percent)
    split_df_list = np.array_split(df, num_splits)
    for i, split in enumerate(split_df_list):
        # Fresh driver for each chunk, closed once the chunk is scraped
        wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
        wd.set_page_load_timeout(20)
        print(f"Running on {entries_per_split*i}th - {entries_per_split*(i+1)}th entries")
        split[col_to_add] = split.apply(lambda x: scrape_func(x['guess_site_url'], x['keyword'], wd), axis=1)
        wd.close()
        yield split
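For completeness, chrome_options is defined earlier in the notebook. It is essentially the standard headless setup for Colab; the exact flags below are a sketch from memory and may not match my notebook verbatim:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')               # Colab has no display
chrome_options.add_argument('--no-sandbox')             # the Colab runtime runs as root
chrome_options.add_argument('--disable-dev-shm-usage')  # /dev/shm is small in containers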
This is the error I run into after a while:
InvalidSessionIdException Traceback (most recent call last)
<ipython-input-90-55b98c96b157> in <module>()
2 # wd.set_page_load_timeout(20)
3 # merged['page_contents'] = merged.apply(lambda x: scrape_all_text(x['guess_site_url'], x['keyword'], wd), axis=1) #next put in function where merged saves every few entries
----> 4 for i, split in enumerate(split_and_scrape(0.001, merged, 'page_contents', scrape_all_text)):
5 split.to_csv(f"page_contents_{i}.csv")
3 frames
/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
245 alert_text = value['alert'].get('text')
246 raise exception_class(message, screen, stacktrace, alert_text) # type: ignore[call-arg] # mypy is not smart enough here
--> 247 raise exception_class(message, screen, stacktrace)
248
249 def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:
InvalidSessionIdException: Message: invalid session id
For context, I am running this code in Google Colab, although I am not sure whether that is relevant to the error I'm getting.
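One workaround I have been sketching (untested) is to probe the driver before each chunk and rebuild it if the session has died. get_fresh_driver below is a hypothetical helper, not something from my current code:

from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def get_fresh_driver(wd):
    # Hypothetical helper: return wd if its session still responds, else replace it
    try:
        _ = wd.current_url  # any round-trip to chromedriver will do
        return wd
    except WebDriverException:
        # Session is dead (e.g. Chrome crashed), so start a new driver as in split_and_scrape
        wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
        wd.set_page_load_timeout(20)
        return wd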