Selenium browser interference when using ThreadPoolExecutor
Apologies if this code has become convoluted. I don't have much experience with threading, and I've been throwing everything at the wall to see what sticks.
My goal is to run two parallel instances of a recursive, Selenium-based web crawler script. Each crawler runs a separate instance of ChromeDriver. The browsers each launch separately, but as they start crawling, each browser instance begins to crawl the other's links, switching back and forth throughout the duration.
I've tried adding locks at various points, but these don't seem to help. Ultimately, I want each browser to run only one crawler instance and to close once its crawl completes. Is there a way to run these separately without cross-interference?
This problem doesn't occur when using multiprocessing, but that route brings its own complications: the crawler class holds unpicklable objects, such as an SQLite connection used for logging (or so the problem would seem to be).
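To illustrate the pickling issue, here is a minimal repro, not the actual crawler class (LoggingCrawler is a hypothetical stand-in for any object holding an SQLite connection):

    import multiprocessing as mp
    import sqlite3

    class LoggingCrawler:
        """Stand-in for the real crawler; holds an SQLite connection for logging."""
        def __init__(self):
            self.log_db = sqlite3.connect('crawl_log.db')

    def run(crawler):
        pass  # crawl would happen here

    if __name__ == '__main__':
        crawler = LoggingCrawler()
        p = mp.Process(target=run, args=(crawler,))
        # Under the 'spawn' start method (the default on Windows and macOS),
        # args are pickled, and this raises:
        # TypeError: cannot pickle 'sqlite3.Connection' object
        p.start()
        p.join()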
For additional context, I'm running these crawlers from a tkinter GUI which displays the current URL that each crawler is on as a status/progress indicator.
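The GUI side polls the status queue with after(), roughly like this (a simplified sketch; poll_status and the labels dict mapping each config path to a tk.Label are illustrative, not the exact code):

    import queue
    import tkinter as tk

    def poll_status(root, status_q, labels):
        # Drain the queue and update the label for whichever crawler reported.
        # Each queue item is [config_path, URL], matching what Crawl() puts.
        try:
            while True:
                config_path, url = status_q.get_nowait()
                labels[config_path].config(text=url)
        except queue.Empty:
            pass
        root.after(200, poll_status, root, status_q, labels)  # poll again in 200 ms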
Any help or insight here would be much appreciated.
Crawler class:
# Imports used by both classes below
from queue import Queue
import json
import re
import threading
import time

from selenium import webdriver


class Crawler:
    def __init__(self, config, browser):
        self.config_path = config['path']
        self.name = config['name']
        self.startURIs = config['startURIs']
        self.URIs = config['URIs']
        self.maxDepth = config['maxDepth']
        self.new_content = 0
        self.regex_query = self.CreateURIPattern()  # defined elsewhere in the class (not shown)
        self.options = webdriver.ChromeOptions()
        self.download_dir = config['download_dir']
        self.browser = browser
        # self.links and self.crawled_links are initialized elsewhere (not shown)

    def Crawl(self, URL, maxDepth, q):
        queue = Queue()
        # Note: this lock is created fresh on every call, so acquiring it
        # never actually blocks any other thread.
        lock = threading.Lock()
        print(self.config_path, URL, threading.currentThread().getName())
        lock.acquire()
        queue.put(URL)
        browser = self.browser
        q.put([self.config_path, URL])  # status update for the GUI
        self.crawled_links.append(URL)
        self.browser.get(queue.get())
        raw_links = browser.find_elements_by_tag_name('a')
        for link in raw_links:
            href = link.get_attribute('href')
            if href is not None and href not in list(self.links.keys()):
                # Some anchors hide the real URL in an onClick handler.
                if href.endswith('#') and link.get_attribute('onClick') is not None:
                    link = re.search(
                        r'(?:https*:\/\/[\w_-]+(?:(?:\.[\w_-]+)+)[\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])',
                        link.get_attribute('onClick'), re.X).group()
                else:
                    link = href
                if re.search(self.regex_query, link) is not None and re.search(r'.*#', link) is None:
                    self.links[link] = {'Referring URL': URL, 'Depth level': maxDepth - 1}
        # Recurse into any collected links that haven't been crawled yet.
        for key, value in list(self.links.items()):
            if key not in self.crawled_links and value['Depth level'] > 0:
                try:
                    self.Crawl(key, value['Depth level'], q)
                except Exception as e:
                    print(e)
        lock.release()
ThreadPool runner:
class Runner:
    def __init__(self, config, browser):
        self.config = config
        self.browser = browser

    def run_crawler(self, browser, q):
        with open(self.config, 'rb') as f:
            data = json.load(f)
            data['path'] = '/'.join(self.config.split('/')[-3:-1])
        time.sleep(3)
        try:
            c = Crawler(data, browser)
            for URI in c.startURIs:
                c.Crawl(URI, c.maxDepth, q)
            # self.timer and t1 are defined elsewhere (not shown)
            done_message = (f'\nCRAWLING COMPLETE: {c.name}.\n'
                            f'{c.new_content} files added.\n'
                            f'Crawler took {self.timer(t1, time.time())}.\n')
            print(done_message)
            c.browser.quit()
        except Exception as e:
            print(e)
            try:
                c.browser.quit()
            except Exception:
                pass
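The two runners are kicked off roughly like this (a simplified sketch; the config paths and driver setup are placeholders, not the exact launch code):

    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue

    from selenium import webdriver

    status_q = Queue()  # consumed by the tkinter GUI
    configs = ['configs/bcmj/config.json', 'configs/smw/config.json']  # placeholder paths

    with ThreadPoolExecutor(max_workers=2) as executor:
        for config in configs:
            browser = webdriver.Chrome()  # one ChromeDriver instance per crawler
            runner = Runner(config, browser)
            executor.submit(runner.run_crawler, browser, status_q)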
EDIT:
Sample output displaying the class instance, Chrome session ID, thread name, and URL. In the fourth row, you can see that Thread-192084 has begun picking up links intended for Thread-192083 (Swiss Medical Weekly):
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://bcmj.org/past-issues
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://bcmj.org/cover/januaryfebruary-2022
Swiss Medical Weekly <selenium.webdriver.chrome.webdriver.WebDriver (session="9f0b6f8e2ba9401da74629ef36284316")> Thread-192083 https://smw.ch/archive
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://smw.ch/issue-1/edn/smw.2022.0910
Swiss Medical Weekly <selenium.webdriver.chrome.webdriver.WebDriver (session="9f0b6f8e2ba9401da74629ef36284316")> Thread-192083 https://smw.ch/issue-1/edn/smw.2022.0708