我正在并行运行网络爬虫的许多实例。
每个爬虫从表中选择一个域,将该 URL 和开始时间插入到日志表中,然后开始爬行该域。
其他并行爬虫在选择自己的域进行爬行之前,会检查日志表以了解哪些域已经被爬行。
我需要防止其他爬网程序选择刚刚被另一个爬网程序选择但还没有日志条目的域。我对如何执行此操作的最佳猜测是,当一个爬网程序选择一个域并在日志表中插入一行(两个查询)时,锁定数据库以防止所有其他读/写操作。
到底是怎么做到这一点的呢?恐怕这非常复杂并且依赖于许多其他事情。请帮助我开始。
这段代码似乎是一个很好的解决方案(但是请参阅下面的错误):
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT companies.id FROM companies
LEFT OUTER JOIN crawlLog
ON companies.id = crawlLog.companyId
WHERE crawlLog.companyId IS NULL
LIMIT 1
),
now()
)
但我不断收到以下 mysql 错误:
You can't specify target table 'crawlLog' for update in FROM clause
有没有办法在没有此问题的情况下完成相同的操作?我尝试了几种不同的方法。包括这个:
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT id
FROM companies
WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
),
now()
)
I am running many instances of a webcrawler in parallel.
Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.
Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.
I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).
How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.
This code seems like a good solution (see the error below, however):
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT companies.id FROM companies
LEFT OUTER JOIN crawlLog
ON companies.id = crawlLog.companyId
WHERE crawlLog.companyId IS NULL
LIMIT 1
),
now()
)
but I keep getting the following mysql error:
You can't specify target table 'crawlLog' for update in FROM clause
Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT id
FROM companies
WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
),
now()
)
发布评论
评论(6)
您可以使用 MySQL
LOCK TABLES
命令锁定表,如下所示:请参阅:
http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
You can lock tables using the MySQL
LOCK TABLES
command like this:See:
http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
好吧,表锁是解决这个问题的一种方法;但这使得并行请求变得不可能。如果表是 InnoDB,您可以使用 SELECT 强制行锁定...用于事务中的更新。
请注意,您需要在
domainname
(或您在 WHERE 子句中使用的任何列)上建立索引才能使其正常工作,但这通常是有意义的,我认为无论如何您都会拥有它。Well, table locks are one way to deal with that; but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.
Please note that you will need an index on
domainname
(or whatever column you use in the WHERE-clause) for this to work, but this makes sense in general and I assume you will have that anyway.您可能不想锁定表。如果你这样做,你将不得不担心当其他爬虫尝试写入数据库时捕获错误 - 这就是你在说“......非常复杂并且依赖于许多其他事情”时所想的。
相反,您可能应该将查询组包装在 MySQL 事务中(请参阅 http: //dev.mysql.com/doc/refman/5.0/en/commit.html)像这样:
或者类似的东西。
[编辑]我刚刚意识到 - 您可能可以在单个查询中完成您需要的所有操作,甚至不必担心事务。像这样的东西:
You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."
Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:
Or something close to that.
[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:
我从@Eljakim的回答中得到了一些灵感,并开始 这个新线程我想出了一个很棒的技巧。它不涉及锁定任何东西并且非常简单。
I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.
我不会使用锁定或事务。
最简单的方法是在日志记录表中插入一条记录(如果该记录尚不存在),然后检查该记录。
假设您有包含爬虫的
tblcrawels (cra_id)
和包含 URL 的tblurl (url_id)
,以及一个表tbllogging (log_cra_id, log_url_id )
作为您的日志文件。如果爬虫 1 要开始爬取 url 2,您将运行以下查询:
下一步是检查该记录是否已插入。
如果得到任何结果,则爬虫 1 可以爬取此 url。如果您没有得到任何结果,这意味着另一个爬网程序已插入同一行并且已经在爬网。
I wouldn't use locking, or transactions.
The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.
Assume you have
tblcrawels (cra_id)
that is filled with your crawlers andtblurl (url_id)
that is filled with the URLs, and a tabletbllogging (log_cra_id, log_url_id)
for your logfile.You would run the following query if crawler 1 wants to start crawling url 2:
The next step is to check whether this record has been inserted.
If you get any results then crawler 1 can crawl this url. If you don't get any results this means that another crawler has inserted in the same line and is already crawling.
最好使用行锁或基于事务的查询,以便其他并行请求上下文可以访问该表。
It's better to use row lock or transactional based query so that other parallel request context can access the table.