We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 9 years ago.
The only URLs that can be classified even somewhat reliably are those that point directly to a media file (e.g. http://foo.com/foo.jpg is almost certainly an image). Otherwise, you must analyze the content of the page.
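To illustrate the easy case, here is a minimal sketch that classifies a URL purely from its file extension using Python's standard `mimetypes` module. The function name `classify_url` and the set of media categories are my own choices for illustration, not part of any existing library, and a `None` result only means the URL alone is not enough and you would have to look at the page content.

```python
import mimetypes
from urllib.parse import urlparse

# Media categories we care about (an assumption made for this sketch).
MEDIA_TYPES = {"image", "video", "audio"}


def classify_url(url):
    """Return 'image', 'video' or 'audio' based on the URL's extension.

    Returns None when the extension is missing or does not map to a
    media type, i.e. when the page content would have to be analyzed.
    """
    # Use only the path so that query strings do not hide the extension.
    path = urlparse(url).path
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        return None
    top_level = mime_type.split("/", 1)[0]
    return top_level if top_level in MEDIA_TYPES else None


print(classify_url("http://foo.com/foo.jpg"))
print(classify_url("http://foo.com/watch?v=abc"))
```

Keep in mind that this is only a guess from the name: a server is free to return anything for a URL ending in `.jpg`, so treat the result as a hint rather than a guarantee.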
This can be tricky: a Flash object may contain a photo, a video, or neither, without exposing any searchable clue about its content. With enough effort this can obviously be overcome (Google does it!), but I'm not aware of any open source resource that provides a library of media-related domains. Such data is the product of countless programmer-hours, and that kind of effort typically seeks a return on investment (ROI). Case in point: ClueWeb09 is just a dataset of downloaded pages used to test search algorithms; it is not really sorted or categorized.
"Sometimes no help is the answer."