Scraping images from one site to another

Posted 2024-10-28 15:45:15

I'm new here, and quite new to web development in general. My background is in 3D modeling and design, but I recently started a project that I think could be a nice resource for the 3D community.

I've got the page mostly designed and coded here: The Top Row, but I'm just about at the end of my knowledge. The upper section, and lower (artist spotlight) section will be manually updated, so I'm not worried about those.

The portion I'm having trouble with is the middle section under the "Best of the Rest" heading. What I want to do is scrape images (and links) from seven prominent CG forums and display them in the content areas I have laid out. Each of the forums has a section at the top of its page that displays five or six featured images.

If you look at CGSociety, for example, they have a top row with six featured pieces. I want to take the three newest and display them in my CGSociety content box with links to the original threads. It's important I get the links too, since the whole point of the website is to generate exposure for artists who deserve it.

The images are always in the same locations and always have a predictable path all the way up to the image name:

i.e: http://features.cgsociety.org/cgtalk/plugs/"featured image".jpg

I don't know if it's relevant, but the xpath for the images is reliable too. For CGSociety, the image is basically determined by the number contained in the final set of brackets.

/x:html/x:body/x:div[4]/x:div/x:div/x:table[1]/x:tbody/x:tr/x:td[1]/x:a/x:img

I've read so many different Stack Overflow threads, but so much of it is going over my head. I don't have much programming experience, but I suspect what I'm trying to do isn't really all that complicated.

So here are my main questions:

  1. What's the best (easiest) method for this kind of scraping? I keep seeing Python with Beautiful Soup or lxml mentioned, but someone else recommended PHP with cURL and xPath.

  2. Is there a particular method that will put the least possible strain on the source forums? These forums all have membership in the tens (or hundreds) of thousands, so this probably isn't a huge concern, but I'd love to do this without directly hotlinking if possible.

  3. Am I even headed in the right direction?

Also: I know scraping is a legal grey area. I plan on asking permission from each of the forums involved, but I want to have a working model to show them when I ask.

Any help will be very very much appreciated. I think this could be a cool site if I can get it working.

Comments (2)

彻夜缠绵 2024-11-04 15:45:15

I refreshed my lxml knowledge a bit and wrote you some code that scrapes what you wanted from that page:

import lxml.html

images = []

# Fetch and parse the CGSociety forum front page.
html = lxml.html.parse("http://forums.cgsociety.org/")

# The featured block is the first table under the first child div of the 'page' div.
table = html.xpath("//div[@class='page']/div[1]/table[1]")[0]

# Each table cell holds one featured piece: a link wrapping a thumbnail image.
for cell in table.iterfind(".//td"):
    image = {}
    image['img_url'] = cell.find('a/img').get('src')    # thumbnail source
    image['link_url'] = cell.find('a').get('href')      # link to the original thread
    images.append(image)

images now contains:

[{'img_url': 'http://features.cgsociety.org/cgtalk/plugs/meind_p.jpg',
  'link_url': 'http://forums.cgsociety.org/showthread.php?s=&threadid=975814&utm_medium=plugblock&utm_source=cgtalk'},
 {'img_url': 'http://features.cgsociety.org/cgtalk/plugs/plugimg.jpg',
  'link_url': 'http://forums.cgsociety.org/showthread.php?s=&threadid=975032&utm_medium=plugblock&utm_source=cgtalk'},
 {'img_url': 'http://features.cgsociety.org/cgtalk/plugs/cg_portfolio_elmoooo.jpg',
  'link_url': 'http://elmoooo.cgsociety.org/gallery/?z=0&utm_medium=plugblock&utm_source=cgtalk'},
 {'img_url': 'http://features.cgsociety.org/cgtalk/plugs/suck_p.jpg',
  'link_url': 'http://forums.cgsociety.org/showthread.php?s=&threadid=973971&utm_medium=plugblock&utm_source=cgtalk'},
 {'img_url': 'http://features.cgsociety.org/cgtalk/plugs/cry_p.jpg',
  'link_url': 'http://forums.cgsociety.org/showthread.php?s=&threadid=972537&utm_medium=plugblock&utm_source=cgtalk'},
 {'img_url': 'http://features.cgsociety.org/cgtalk/plugs/gerrard_p.jpg',
  'link_url': 'http://forums.cgsociety.org/showthread.php?s=&threadid=972012&utm_medium=plugblock&utm_source=cgtalk'}]
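
If the featured block is ordered newest-first (an assumption worth checking against the live page), the three most recent pieces the question asks for are simply the first three entries:

newest = images[:3]  # assumes newest-first ordering in the featured block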

Feel free to send me an email (you can find it on my profile) if you'd like some more help.

笛声青案梦长安 2024-11-04 15:45:15

These images are straightforward to scrape, so use whichever language you are more experienced with. Using the XPath is a good approach.

Make sure to download the images to your server rather than loading them from the source website, otherwise some websites will block your IP.
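
A minimal sketch of that step in Python, reusing the images list from the first answer and an illustrative plugs/ directory, with a pause between downloads to keep the load on the source site low:

import os
import time
import urllib.request

os.makedirs("plugs", exist_ok=True)  # local folder for cached copies (illustrative name)

for image in images[:3]:
    filename = os.path.basename(image['img_url'])
    local_path = os.path.join("plugs", filename)
    if not os.path.exists(local_path):  # only fetch files we don't already have
        urllib.request.urlretrieve(image['img_url'], local_path)
        time.sleep(2)  # be gentle with the source server
    image['local_url'] = local_path  # serve this copy from your own server instead of hotlinking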

If you don't find learning about web scraping interesting and value your time, then probably best to hire someone experienced to do it for you.
