How does Bing video search extract videos from so many different websites?
Are they decompiling the Flash or something like that? I can't imagine how they have done it.
Comments (1)
Just speculation, but they could look at what the Flash SWF file connects to (i.e. find the FLV URL from the HTTP requests the SWF file makes). Once they have that, they could do one of two things:
1) Queue the URL to a process which: i) downloads the FLV, ii) trims the FLV to 10 seconds, iii) adds a fade in/fade out, iv) saves the result.
or
2) Connect directly to the FLV each time using the original URL and play only 10 seconds. They could then add effects like the fade in/fade out on top of the video.
I doubt they'd use the second method, though, as it could cause annoying traffic spikes on other people's servers and could increase lag. The first method lets the Bing servers cache the videos and host them in one reliable location dedicated to video streaming.
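To make the first method concrete, here is a minimal sketch of the trim-and-fade step. It assumes the ffmpeg command-line tool is used, which is purely my assumption (nothing suggests Bing does this), and the function name `build_preview_command` is made up for illustration. It only builds the command; downloading the FLV and saving the result are left out.

```python
def build_preview_command(src, dst, length=10.0, fade=1.0):
    """Build an ffmpeg argument list that cuts `src` to `length` seconds
    and fades the video and audio in and out over `fade` seconds."""
    if length <= 2 * fade:
        raise ValueError("preview must be longer than the two fades combined")
    fade_out_start = length - fade
    # fade/afade options: t = type (in/out), st = start time, d = duration
    video_filter = (
        f"fade=t=in:st=0:d={fade:g},"
        f"fade=t=out:st={fade_out_start:g}:d={fade:g}"
    )
    audio_filter = (
        f"afade=t=in:st=0:d={fade:g},"
        f"afade=t=out:st={fade_out_start:g}:d={fade:g}"
    )
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,             # input file (the downloaded FLV)
        "-t", f"{length:g}",   # keep only the first `length` seconds
        "-vf", video_filter,
        "-af", audio_filter,
        dst,
    ]
```

If ffmpeg is installed, the list can be passed to `subprocess.run(cmd, check=True)`. A queue worker would call this once per FLV URL after downloading the file.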
Update
Come to think of it, there's another method to do this:
I know that in PHP you can decompile a compiled SWF on the fly. It's rather quick, and it would be an easy way of extracting any URLs. Of course, Microsoft wouldn't be using PHP, but I'm pretty sure they have an equivalent library written in C++ (I'm fairly sure they use C++).
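As a rough illustration of pulling URLs out of a SWF, here is a sketch in Python. Note that this is not a real decompiler: it only undoes the zlib compression that "CWS" files use and scans the raw bytes for strings that look like FLV URLs. That works only when the URL is stored as a plain string in the file, which is an assumption; players often build the URL at runtime or read it from a config file. LZMA-compressed "ZWS" files are not handled, and the function names are made up.

```python
import re
import zlib

# Printable ASCII without spaces, lazily matched up to the first ".flv"
FLV_URL = re.compile(rb"https?://[\x21-\x7e]+?\.flv", re.IGNORECASE)


def swf_body(data: bytes) -> bytes:
    """Return the SWF contents after the 8-byte header, decompressed if needed."""
    signature = data[:3]
    if signature == b"FWS":   # uncompressed SWF
        return data[8:]
    if signature == b"CWS":   # everything after the header is zlib-compressed
        return zlib.decompress(data[8:])
    raise ValueError("not an FWS/CWS file (ZWS/LZMA is not handled here)")


def find_flv_urls(data: bytes) -> list:
    """Return the distinct FLV-looking URLs found in the SWF, in order."""
    found = []
    for match in FLV_URL.finditer(swf_body(data)):
        url = match.group().decode("ascii")
        if url not in found:
            found.append(url)
    return found
```

Anything after `.flv`, such as a query string, is cut off by this pattern, so a real extractor would need something smarter.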
But even if they were looking for HTTP requests to an FLV, they'd probably have a crawler running in a lightweight "browser". The browser would need to render the Flash so that it makes the HTTP request, and it would then log all outbound requests. This isn't too difficult a task if you're running your own server: you can have a background process that sits there and scours the logs for FLV requests.
Creating your own browser to do this may sound daunting, but it's actually pretty simple(ish). In C# you could make an HTTP request to a URL (e.g. with HttpWebRequest), scan the document for links, queue the links, request each one, and loop that way (making sure you don't revisit links you've already visited). In PHP, you could fetch the URL with cURL and do the same. Any time you find a SWF link, you add it to a different queue, one whose worker renders the Flash (or decompiles it) and finds any links to FLV URLs, and then you queue those as needed.
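The crawl loop described above can be sketched roughly like this in Python (rather than C# or PHP, to keep it short). The names `crawl` and `LinkCollector` are made up, and the `fetch` callable is an assumption: it takes a URL and returns the page's HTML as a string, or `None` on failure. SWF links land in their own queue for a separate worker; rendering or decompiling the Flash is not shown.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Which attribute carries the URL for each tag we care about
SOURCES = {"a": "href", "embed": "src", "object": "data"}


class LinkCollector(HTMLParser):
    """Collect raw link targets from <a>, <embed> and <object> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = SOURCES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; return the SWF URLs found."""
    page_queue = deque([start_url])
    swf_queue = []           # handed to a separate SWF worker
    visited = {start_url}    # never queue the same URL twice
    fetched = 0
    while page_queue and fetched < max_pages:
        url = page_queue.popleft()
        fetched += 1
        html = fetch(url)
        if html is None:
            continue
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            # Resolve relative links and drop any #fragment
            absolute, _ = urldefrag(urljoin(url, link))
            if absolute in visited:
                continue
            visited.add(absolute)
            if absolute.lower().split("?")[0].endswith(".swf"):
                swf_queue.append(absolute)
            elif absolute.startswith(("http://", "https://")):
                page_queue.append(absolute)
    return swf_queue
```

For a real crawl, `fetch` could wrap `urllib.request.urlopen`; a polite crawler would also need robots.txt handling and rate limiting, which this leaves out.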