Web scraping techniques with PHP or Python

Posted 2024-10-09 19:32:55

I need to scrape about 100 websites that provide very similar content.

My first question: is it possible to write one generic script to scrape all 100 websites, or do scraping techniques only allow scripts written for particular websites? (Possibly a dumb question.) I suppose what I should really ask is which approach is easier, because writing 100 different scripts, one per website, would be hard.

Second question: my primary language is PHP, but after searching here on Stack Overflow I found that one of the most advanced scrapers is "Beautiful Soup" in Python. Is it possible to call Python's "Beautiful Soup" from PHP, or would it be better to write the whole script in Python?

Give me some clues on how I should proceed.

Sorry for my weak English.

Best Regards,

4 Answers

微暖i 2024-10-16 19:32:55

Because I prefer PHP to Python, I once used phpQuery to scrape data from websites. It worked pretty well, and I came up with a scraper pretty quickly, using CSS selectors (with the help of SelectorGadget) to select elements and get their ->text().

But I found it to be a bit slow (since I had to scrape thousands of pages), so in the end I switched to regular expressions to extract the data instead. D:
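
This answer used PHP (not shown); as a rough illustration of the same regex idea, here is a minimal sketch in Python. The URL and the markup pattern are invented for the example, and, as noted in a later answer, regexes break as soon as the markup changes:

```python
import re
import urllib.request

# Hypothetical page; the pattern below assumes markup like
# <h2 class="title">Some product</h2> and will break if it changes.
html = urllib.request.urlopen("https://example.com/products").read().decode("utf-8")

titles = re.findall(r'<h2 class="title">(.*?)</h2>', html, re.DOTALL)
for title in titles:
    print(title.strip())
```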

哆啦不做梦 2024-10-16 19:32:55

1.) One scraper for 100 sites? It depends on your requirements. If you only need specific information, you'll need to account for 100 different websites and their layouts. Some generic functionality could be shared, though.

2.) BeautifulSoup is an HTML/XML parser, not a screen scraper per se. It would be a top choice for the task if the scraper were written in Python. Calling Python from PHP can be done, but it is certainly not as clean as a single-language solution, which is why I'd suggest you look into Python and BeautifulSoup for the sake of a prototype.
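
A minimal sketch of such a prototype might look like this (the URL and CSS selectors are made up; swap in whatever the real sites use):

```python
import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical listing page, purely to sketch the idea.
html = urllib.request.urlopen("https://example.com/listing").read()
soup = BeautifulSoup(html, "html.parser")

# CSS selectors, much like phpQuery's.
for item in soup.select("div.item"):
    name = item.select_one("h2").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    print(name, price)
```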

Sidenote: http://scrapy.org/ is another python library, especially designed

to crawl websites and extract structured data from their pages.
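
For reference, a bare-bones Scrapy spider might look something like this (site and selectors invented for the example):

```python
import scrapy  # pip install scrapy

class ProductSpider(scrapy.Spider):
    # Hypothetical target; replace start_urls and selectors per site.
    name = "products"
    start_urls = ["https://example.com/listing"]

    def parse(self, response):
        for item in response.css("div.item"):
            yield {
                "name": item.css("h2::text").get(),
                "price": item.css("span.price::text").get(),
            }
```

Run with something like scrapy runspider product_spider.py -o items.json to collect the yielded items.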

韵柒 2024-10-16 19:32:55

I've done this a few ways.

1: With grep, sed, and awk. This is about the same as 2: regex. These methods are very direct, but they fail whenever the HTML structure of the site changes.

3: PHP's XML/HTML parser, DOMDocument. This is far more reliable than regex, but I found it annoying to work with (I hate the mixture of PHP arrays and objects). If you want to use PHP, phpQuery is probably a good solution, as the first answer suggested.

4: Python and BeautifulSoup. I can't say enough good things about BeautifulSoup, and this is the method I recommend. I found my code feels cleaner in Python, and BeautifulSoup was very easy and efficient to work with. Good documentation, too.

You will have to specialize your script for each site. It depends on what sort of information you wish to extract. If it were something standard like the body title, of course you wouldn't have to change anything, but the info you want is likely more specific.
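
One way to keep that per-site specialization small, sketched here with invented site names and selectors, is to drive a single generic extractor from a table of per-site CSS selectors:

```python
import urllib.request
from bs4 import BeautifulSoup

# Hypothetical per-site configuration: the scraping logic stays generic,
# and only these selectors change from site to site.
SITE_CONFIGS = {
    "example.com": {"title": "h1.headline", "body": "div.article"},
    "example.org": {"title": "h2.post-title", "body": "section.content"},
}

def scrape(url, config):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return {field: soup.select_one(selector).get_text(strip=True)
            for field, selector in config.items()}
```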

萌逼全场 2024-10-16 19:32:55

We do something sort of like this with RSS feeds using Python -- we use ElementTree, since RSS is usually guaranteed to be well-formed. Beautiful Soup is probably better suited to parsing HTML.

As far as dealing with 100 different sites goes, try to write an abstraction that works on most of them and transforms each page into a common data structure you can work with. Then override parts of the abstraction to handle the individual sites that differ from the norm, as in the sketch below.
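
A rough sketch of that override pattern (class names, selectors, and the og:title fallback are all hypothetical):

```python
from bs4 import BeautifulSoup

class BaseScraper:
    # Generic logic that covers most of the sites.
    title_selector = "h1"

    def parse(self, html):
        soup = BeautifulSoup(html, "html.parser")
        return {"title": self.extract_title(soup)}

    def extract_title(self, soup):
        return soup.select_one(self.title_selector).get_text(strip=True)

class OddSiteScraper(BaseScraper):
    # A site that differs from the norm overrides only what it must.
    def extract_title(self, soup):
        return soup.find("meta", attrs={"property": "og:title"})["content"]
```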

Scrapers are usually I/O-bound -- look into coroutine libraries like eventlet or gevent to exploit some I/O parallelism and speed up the whole process.
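
A minimal gevent sketch of that idea (URLs are placeholders; monkey-patching makes the standard library's sockets cooperative so the greenlets can overlap their downloads):

```python
from gevent import monkey
monkey.patch_all()  # make urllib's blocking sockets yield to the event loop

import urllib.request
from gevent.pool import Pool

URLS = ["https://example.com/page/%d" % i for i in range(1, 101)]  # placeholders

def fetch(url):
    return url, urllib.request.urlopen(url).read()

# A pool of 10 greenlets keeps 10 downloads in flight at once.
pool = Pool(10)
for url, body in pool.imap_unordered(fetch, URLS):
    print(url, len(body))
```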
