如何锁定 MySQL 表的读/写,以便我可以选择然后插入,而无需其他程序读/写数据库?

发布于 2024-11-18 22:46:42 字数 1001 浏览 4 评论 0 原文

我正在并行运行网络爬虫的许多实例。

每个爬虫从表中选择一个域,将该 URL 和开始时间插入到日志表中,然后开始爬行该域。

其他并行爬虫在选择自己的域进行爬行之前,会检查日志表以了解哪些域已经被爬行。

我需要防止其他爬网程序选择刚刚被另一个爬网程序选择但还没有日志条目的域。我对如何执行此操作的最佳猜测是,当一个爬网程序选择一个域并在日志表中插入一行(两个查询)时,锁定数据库以防止所有其他读/写操作。

到底是怎么做到这一点的呢?恐怕这非常复杂并且依赖于许多其他事情。请帮助我开始。


这段代码似乎是一个很好的解决方案(但是请参阅下面的错误):

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT companies.id FROM companies
        LEFT OUTER JOIN crawlLog
        ON companies.id = crawlLog.companyId
        WHERE crawlLog.companyId IS NULL
        LIMIT 1
    ),
    now()
)

但我不断收到以下 mysql 错误:

You can't specify target table 'crawlLog' for update in FROM clause

有没有办法在没有此问题的情况下完成相同的操作?我尝试了几种不同的方法。包括这个:

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT id
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
    ),
    now()
)

I am running many instances of a webcrawler in parallel.

Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.

Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.

I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).

How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.


This code seems like a good solution (see the error below, however):

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT companies.id FROM companies
        LEFT OUTER JOIN crawlLog
        ON companies.id = crawlLog.companyId
        WHERE crawlLog.companyId IS NULL
        LIMIT 1
    ),
    now()
)

but I keep getting the following mysql error:

You can't specify target table 'crawlLog' for update in FROM clause

Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:

INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
    (
        SELECT id
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
    ),
    now()
)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

遗忘曾经 2024-11-25 22:46:42

您可以使用 MySQL LOCK TABLES 命令锁定表,如下所示:

LOCK TABLES tablename WRITE;

# Do other queries here

UNLOCK TABLES;

请参阅:

http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html

You can lock tables using the MySQL LOCK TABLES command like this:

LOCK TABLES tablename WRITE;

# Do other queries here

UNLOCK TABLES;

See:

http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html

倥絔 2024-11-25 22:46:42

好吧,表锁是解决这个问题的一种方法;但这使得并行请求变得不可能。如果表是 InnoDB,您可以使用 SELECT 强制行锁定...用于事务中的更新

BEGIN;

SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE

# do whatever you have to do

COMMIT;

请注意,您需要在 domainname (或您在 WHERE 子句中使用的任何列)上建立索引才能使其正常工作,但这通常是有意义的,我认为无论如何您都会拥有它。

Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.

BEGIN;

SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE

# do whatever you have to do

COMMIT;

Please note that you will need an index on domainname (or whatever column you use in the WHERE-clause) for this to work, but this makes sense in general and I assume you will have that anyway.

自演自醉 2024-11-25 22:46:42

您可能不想锁定表。如果你这样做,你将不得不担心当其他爬虫尝试写入数据库时​​捕获错误 - 这就是你在说“......非常复杂并且依赖于许多其他事情”时所想的。

相反,您可能应该将查询组包装在 MySQL 事务中(请参阅 http: //dev.mysql.com/doc/refman/5.0/en/commit.html)像这样:

START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;

或者类似的东西。

[编辑]我刚刚意识到 - 您可能可以在单个查询中完成您需要的所有操作,甚至不必担心事务。像这样的东西:

INSERT INTO loggingtable (url) SELECT url FROM tablewithurls u LEFT JOIN loggingtable l ON l.url=t.url WHERE {some criterion used to pick the url to work on} AND l.url IS NULL.

You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."

Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:

START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;

Or something close to that.

[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:

INSERT INTO loggingtable (url) SELECT url FROM tablewithurls u LEFT JOIN loggingtable l ON l.url=t.url WHERE {some criterion used to pick the url to work on} AND l.url IS NULL.
最好是你 2024-11-25 22:46:42

我从@Eljakim的回答中得到了一些灵​​感,并开始 这个新线程我想出了一个很棒的技巧。它不涉及锁定任何东西并且非常简单。

INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
    SELECT companyId
    FROM crawlLog AS crawlLogAlias
)
LIMIT 1

I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.

INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
    SELECT companyId
    FROM crawlLog AS crawlLogAlias
)
LIMIT 1
若相惜即相离 2024-11-25 22:46:42

我不会使用锁定或事务。

最简单的方法是在日志记录表中插入一条记录(如果该记录尚不存在),然后检查该记录。

假设您有包含爬虫的 tblcrawels (cra_id) 和包含 URL 的 tblurl (url_id),以及一个表 tbllogging (log_cra_id, log_url_id ) 作为您的日志文件。

如果爬虫 1 要开始爬取 url 2,您将运行以下查询:

INSERT INTO tbllogging (log_cra_id, log_url_id) 
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging on url_id=log_url 
WHERE url_id=2 AND log_url_id IS NULL;

下一步是检查该记录是否已插入。

SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1

如果得到任何结果,则爬虫 1 可以爬取此 url。如果您没有得到任何结果,这意味着另一个爬网程序已插入同一行并且已经在爬网。

I wouldn't use locking, or transactions.

The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.

Assume you have tblcrawels (cra_id) that is filled with your crawlers and tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.

You would run the following query if crawler 1 wants to start crawling url 2:

INSERT INTO tbllogging (log_cra_id, log_url_id) 
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging on url_id=log_url 
WHERE url_id=2 AND log_url_id IS NULL;

The next step is to check whether this record has been inserted.

SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1

If you get any results then crawler 1 can crawl this url. If you don't get any results this means that another crawler has inserted in the same line and is already crawling.

空宴 2024-11-25 22:46:42

最好使用行锁或基于事务的查询,以便其他并行请求上下文可以访问该表。

It's better to use row lock or transactional based query so that other parallel request context can access the table.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文