How do I store an entire webpage for later parsing?

Posted on 2024-09-07 18:00:43

I've been doing a lot of parsing of webpages lately and my process usually looks something like this:

  1. Obtain list of links to Parse
  2. Import list into database
  3. Download Entire Webpage for each link and store into mysql
  4. Add Index for each scraping session
  5. Scrape relevant sections (content, metas, whatever)
  6. Steps 4,5 -- Rinse/Repeat -- as it is common to want to scrape diff. content from the same page later on or modify your xpath or scrub said content or whatever.
  7. Export Scraping Database to Real Database and Remove Webpage column and scraping indexes
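
As a rough sketch, steps 2-3 above might look something like this in Python (the pages table, its columns, the file names and the credentials are all hypothetical, and it assumes the requests and mysql-connector-python packages):

# Sketch of steps 2-3: import the link list, then download each page and store
# the raw HTML alongside it. Table/column names and credentials are made up.
import mysql.connector
import requests

conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
cur = conn.cursor()

with open("links.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Step 2: import the link list into the database
cur.executemany("INSERT INTO pages (url) VALUES (%s)", [(u,) for u in urls])
conn.commit()

# Step 3: download each page and store the entire HTML next to its link
for url in urls:
    html = requests.get(url, timeout=30).text
    cur.execute("UPDATE pages SET html = %s WHERE url = %s", (html, url))
    conn.commit()

cur.close()
conn.close()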

Now, the easiest answer is of course do the scraping at the same time that you are downloading the webpage but I don't think this lends itself to modular design very well as I'd like to be able to grow this process a bit more.

Let me give you some examples of the problems I keep running into: For 50k pages (rows) I have around a 6gig database. Remember, we are storing the ENTIRE webpage into one column and extracting relevant data from it and storing that into a different column.

Throwing an index on the table can take 7-10 minutes on a quad core with 6 gig of ram. God forbid you screw up on something and watch mysqld jump to 70% cpu and eat ALL of your ram. This is why I have step 4 -- for every operation I do, I'll throw an index on the column before I do it -- so if I want to grab metas, I'd throw an index on, say, the title column and then update each row where title is not null.
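
For concreteness, one of those index-then-update passes might look roughly like this (a sketch only; the pages table, the title and meta_description columns and the xpath are hypothetical stand-ins for whatever the real schema uses):

# Sketch of the step 4/5 pattern for a "grab the metas" pass: index the column
# the pass filters on, then update only the matching rows. Names are hypothetical.
import mysql.connector
from lxml import html as lxml_html

conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
read_cur = conn.cursor()
write_cur = conn.cursor()

# Step 4: add the index for this scraping session
read_cur.execute("CREATE INDEX idx_pages_title ON pages (title)")

# Step 5: pull the meta description out of the stored HTML for matching rows
read_cur.execute("SELECT id, html FROM pages WHERE title IS NOT NULL")
for page_id, raw in read_cur.fetchall():
    doc = lxml_html.fromstring(raw)
    meta = doc.xpath('string(//meta[@name="description"]/@content)')
    write_cur.execute("UPDATE pages SET meta_description = %s WHERE id = %s",
                      (meta, page_id))
conn.commit()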

I should state that I do NOT do all rows in one go -- that tends to really screw me over bad -- as it should -- you are loading 6gig into memory. ;)

I suppose the solution to this problem is to grab a total count and then iterate through an offset of 100 or so at a time.
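
That batching might look roughly like this (a sketch with hypothetical names; on a large table, seeking by primary key -- WHERE id > last_id -- scales better than big OFFSETs, but the idea is the same):

import mysql.connector

def process(page_id, raw_html):
    # placeholder for whatever extraction this particular pass performs
    pass

conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
cur = conn.cursor()

# grab a total count, then walk the table in small slices instead of all at once
cur.execute("SELECT COUNT(*) FROM pages")
(total,) = cur.fetchone()

batch = 100
for offset in range(0, total, batch):
    cur.execute("SELECT id, html FROM pages ORDER BY id LIMIT %s OFFSET %s",
                (batch, offset))
    for page_id, raw in cur.fetchall():
        process(page_id, raw)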

Still -- I think there are some storage problems here as well. Should I be storing the original webpages on the filesystem instead? I have thought about storing the pages into a document oriented database such as mongo or couch.

EDIT
Just to be clear here -- any solution presented should take into account the fact that 50k pages is just ONE BATCH by ONE USER. I'm not trying to have multiple users quite yet but I would like the ability to store more than a couple of batches at a time.

Comments (4)

国产ˉ祖宗 2024-09-14 18:00:43

Why don't you add the index to the table BEFORE inserting your data? This way the index is built as the rows are added to the table.
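
For example (a sketch with hypothetical names), declaring the index in the CREATE TABLE means it is maintained incrementally as rows are inserted instead of being built in one long pass over a 6 GB table later:

# Sketch: define the table with its index up front so the index grows with the data.
import mysql.connector

conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id    INT AUTO_INCREMENT PRIMARY KEY,
        url   VARCHAR(2048) NOT NULL,
        html  LONGTEXT,
        title VARCHAR(512),
        INDEX idx_pages_title (title)  -- built as rows arrive, not afterwards
    )
""")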

吾性傲以野 2024-09-14 18:00:43

If you have more hardware to throw at the problem, you can start distributing your database over multiple servers via sharding.

I would also suggest you consider removing useless information from the webpages you're capturing (e.g. page structure tags, JavaScript, styling, etc), and perhaps compressing the results if appropriate.
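
A sketch of that idea using lxml and zlib (the exact cleaning rules are up to you; this just drops scripts, styles and comments, then compresses what is left):

import zlib
from lxml import etree, html

def shrink(raw_html: str) -> bytes:
    # strip tags that are useless for later content extraction, then compress
    doc = html.fromstring(raw_html)
    etree.strip_elements(doc, "script", "style", with_tail=False)
    etree.strip_tags(doc, etree.Comment)
    cleaned = html.tostring(doc, encoding="unicode")
    return zlib.compress(cleaned.encode("utf-8"))

def restore(blob: bytes) -> str:
    # reverse of shrink(): decompress back to the cleaned HTML
    return zlib.decompress(blob).decode("utf-8")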

萧瑟寒风 2024-09-14 18:00:43

You could use an existing web crawler such as wget or one of the many others. This can download the files to the hard disk and then you can parse the files afterwards and store information about them in the database.
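
For example, driving wget from a script and parsing the saved files afterwards (a sketch; links.txt and pages/ are hypothetical, while --input-file and --directory-prefix are standard wget options):

# Sketch: let wget do the downloading; parse the files under pages/ later.
import subprocess

subprocess.run(
    ["wget", "--input-file=links.txt", "--directory-prefix=pages/",
     "--adjust-extension", "--wait=1"],
    check=True,
)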

烟燃烟灭 2024-09-14 18:00:43

Thanks for helping me think this out everyone!

I'm going to try a hybrid approach here:

1) Pull down pages to a tree structure on the filesystem.

2) Put content into a generic content table that does not contain any full webpages (this means that our average 63k column is now maybe a 1/10th of a k).

THE DETAILS

1) My tree structure for housing the webpages will look like this:

-- usr_id1k
|   |-- user1
|   |   |-- job1
|   |   |   |-- pg_id1k
|   |   |   |   |-- p1
|   |   |   |   |-- p2
|   |   |   |   `-- p3
|   |   |   |-- pg_id2k
|   |   |   `-- pg_id3k
|   |   |-- job2
|   |   `-- job3
|   |-- user2
|   `-- user3
|-- usr_id2k
`-- usr_id3k
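
A sketch of how pages might be written into that layout (the bucketing of ids into usr_idXk / pg_idXk directories of a thousand is just to keep any one directory small, and the .html suffix is an assumption):

from pathlib import Path

def page_path(root: Path, user_id: int, job_id: int, page_id: int) -> Path:
    # mirror the tree above: user bucket / user / job / page bucket / page file
    user_bucket = f"usr_id{user_id // 1000 + 1}k"
    page_bucket = f"pg_id{page_id // 1000 + 1}k"
    return (root / user_bucket / f"user{user_id}" / f"job{job_id}"
            / page_bucket / f"p{page_id}.html")

def store_page(root: Path, user_id: int, job_id: int, page_id: int, raw_html: str) -> Path:
    path = page_path(root, user_id, job_id, page_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(raw_html, encoding="utf-8")
    return path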

2) Instead of creating a table for each 'job' and then exporting it we'll have a couple different tables -- the primary one being a 'content' table.

content_type, Integer # fkey to content_types table
user_id, Integer # fkey to users table
content, Text # actual content, no full webpages

.... other stuff like created_at, updated_at, perms, etc...
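
Spelled out as DDL, that content table might look something like this (a sketch; the exact types and the referenced content_types / users tables are whatever the app already has):

import mysql.connector

conn = mysql.connector.connect(user="scraper", password="secret", database="scraping")
conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS content (
        id           INT AUTO_INCREMENT PRIMARY KEY,
        content_type INT NOT NULL,   -- fkey to content_types table
        user_id      INT NOT NULL,   -- fkey to users table
        content      TEXT,           -- actual content, no full webpages
        created_at   DATETIME,
        updated_at   DATETIME,
        INDEX idx_content_user (user_id),
        INDEX idx_content_type (content_type)
    )
""")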
