Selenium browser interference when using ThreadPoolExecutor
Apologies if this code has become convoluted. I don't have much experience with threading, and I've been throwing everything at the wall to see what sticks.
My goal is to run two parallel instances of a recursive, Selenium-based web crawler script. Each crawler runs a separate instance of ChromeDriver. The browsers each launch separately, but as they start crawling, each browser instance begins to crawl the other's links, switching back and forth throughout the duration.
I've tried adding locks at various points, but these don't seem to help. Ultimately, I want each browser to run only one crawler instance and to close once its crawl completes. Is there a way to run these separately without cross-interference?
This problem doesn't occur when using multiprocessing, but that route brings its own complications: the crawler class holds unpicklable objects, such as an SQLite connection used for logging (or so the problem would seem to be).
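To illustrate the pickling issue, here is a minimal repro, not the actual crawler class (LoggingCrawler is a hypothetical stand-in for any object holding an SQLite connection):

    import multiprocessing as mp
    import sqlite3

    class LoggingCrawler:
        """Stand-in for the real crawler; holds an SQLite connection for logging."""
        def __init__(self):
            self.log_db = sqlite3.connect('crawl_log.db')

    def run(crawler):
        pass  # crawl would happen here

    if __name__ == '__main__':
        crawler = LoggingCrawler()
        p = mp.Process(target=run, args=(crawler,))
        # Under the 'spawn' start method (the default on Windows and macOS),
        # args are pickled, and this raises:
        # TypeError: cannot pickle 'sqlite3.Connection' object
        p.start()
        p.join()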
For additional context, I'm running these crawlers from a tkinter GUI which displays the current URL that each crawler is on as a status/progress indicator.
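The GUI side polls the status queue with after(), roughly like this (a simplified sketch; poll_status and the labels dict mapping each config path to a tk.Label are illustrative, not the exact code):

    import queue
    import tkinter as tk

    def poll_status(root, status_q, labels):
        # Drain the queue and update the label for whichever crawler reported.
        # Each queue item is [config_path, URL], matching what Crawl() puts.
        try:
            while True:
                config_path, url = status_q.get_nowait()
                labels[config_path].config(text=url)
        except queue.Empty:
            pass
        root.after(200, poll_status, root, status_q, labels)  # poll again in 200 ms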
Any help or insight here would be much appreciated.
Crawler class:
# Imports used by both classes below
from queue import Queue
import json
import re
import threading
import time

from selenium import webdriver


class Crawler:
    def __init__(self, config, browser):
        self.config_path = config['path']
        self.name = config['name']
        self.startURIs = config['startURIs']
        self.URIs = config['URIs']
        self.maxDepth = config['maxDepth']
        self.new_content = 0
        self.regex_query = self.CreateURIPattern()  # defined elsewhere in the class (not shown)
        self.options = webdriver.ChromeOptions()
        self.download_dir = config['download_dir']
        self.browser = browser
        # self.links and self.crawled_links are initialized elsewhere (not shown)

    def Crawl(self, URL, maxDepth, q):
        queue = Queue()
        # Note: this lock is created fresh on every call, so acquiring it
        # never actually blocks any other thread.
        lock = threading.Lock()
        print(self.config_path, URL, threading.currentThread().getName())
        lock.acquire()
        queue.put(URL)
        browser = self.browser
        q.put([self.config_path, URL])  # status update for the GUI
        self.crawled_links.append(URL)
        self.browser.get(queue.get())
        raw_links = browser.find_elements_by_tag_name('a')
        for link in raw_links:
            href = link.get_attribute('href')
            if href is not None and href not in list(self.links.keys()):
                # Some anchors hide the real URL in an onClick handler.
                if href.endswith('#') and link.get_attribute('onClick') is not None:
                    link = re.search(
                        r'(?:https*:\/\/[\w_-]+(?:(?:\.[\w_-]+)+)[\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])',
                        link.get_attribute('onClick'), re.X).group()
                else:
                    link = href
                if re.search(self.regex_query, link) is not None and re.search(r'.*#', link) is None:
                    self.links[link] = {'Referring URL': URL, 'Depth level': maxDepth - 1}
        # Recurse into any collected links that haven't been crawled yet.
        for key, value in list(self.links.items()):
            if key not in self.crawled_links and value['Depth level'] > 0:
                try:
                    self.Crawl(key, value['Depth level'], q)
                except Exception as e:
                    print(e)
        lock.release()
ThreadPool runner:
class Runner:
    def __init__(self, config, browser):
        self.config = config
        self.browser = browser

    def run_crawler(self, browser, q):
        with open(self.config, 'rb') as f:
            data = json.load(f)
            data['path'] = '/'.join(self.config.split('/')[-3:-1])
        time.sleep(3)
        try:
            c = Crawler(data, browser)
            for URI in c.startURIs:
                c.Crawl(URI, c.maxDepth, q)
            # self.timer and t1 are defined elsewhere (not shown)
            done_message = (f'\nCRAWLING COMPLETE: {c.name}.\n'
                            f'{c.new_content} files added.\n'
                            f'Crawler took {self.timer(t1, time.time())}.\n')
            print(done_message)
            c.browser.quit()
        except Exception as e:
            print(e)
            try:
                c.browser.quit()
            except Exception:
                pass
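The two runners are kicked off roughly like this (a simplified sketch; the config paths and driver setup are placeholders, not the exact launch code):

    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue

    from selenium import webdriver

    status_q = Queue()  # consumed by the tkinter GUI
    configs = ['configs/bcmj/config.json', 'configs/smw/config.json']  # placeholder paths

    with ThreadPoolExecutor(max_workers=2) as executor:
        for config in configs:
            browser = webdriver.Chrome()  # one ChromeDriver instance per crawler
            runner = Runner(config, browser)
            executor.submit(runner.run_crawler, browser, status_q)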
EDIT:
Sample output displaying the class instance, Chrome session ID, thread name, and URL. In the fourth row, you can see that Thread-192084 has begun picking up links intended for Thread-192083 (Swiss Medical Weekly):
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://bcmj.org/past-issues
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://bcmj.org/cover/januaryfebruary-2022
Swiss Medical Weekly <selenium.webdriver.chrome.webdriver.WebDriver (session="9f0b6f8e2ba9401da74629ef36284316")> Thread-192083 https://smw.ch/archive
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://smw.ch/issue-1/edn/smw.2022.0910
Swiss Medical Weekly <selenium.webdriver.chrome.webdriver.WebDriver (session="9f0b6f8e2ba9401da74629ef36284316")> Thread-192083 https://smw.ch/issue-1/edn/smw.2022.0708