MapReduce programming

Posted on 2024-11-07 19:28:33

I've written this code in Java; it collects various information about photos and writes the results to a text file. I would like to convert this program to work with the MapReduce model. I am a newbie at MapReduce programming. Any help would be very much appreciated! Thank you.

import java.io.FileWriter;
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.SAXException;

import com.aetrion.flickr.photos.Photo;
import com.aetrion.flickr.photos.PhotoList;
import com.aetrion.flickr.photos.PhotosInterface;
import com.aetrion.flickr.photos.SearchParameters;
import com.aetrion.flickr.*;

public class example2 {

    public example2() {
    }

    /**
     * @param args
     * @throws FlickrException
     * @throws SAXException
     * @throws IOException
     * @throws ParserConfigurationException
     */
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException, SAXException,
            FlickrException, ParserConfigurationException {

        FileWriter out = new FileWriter("photos.txt");

        // Set the API key and host
        String key = "apikey";               // placeholder API key
        String svr = "www.flickr.com";
        REST rest = new REST();
        rest.setHost(svr);

        // Initialize the Flickr object with the key and REST transport
        Flickr flickr = new Flickr(key, rest);
        Flickr.debugStream = false;

        // Initialize the SearchParameters object, which holds the search criteria
        SearchParameters searchParams = new SearchParameters();
        searchParams.setSort(SearchParameters.INTERESTINGNESS_DESC);
        searchParams.setGroupId("group_id"); // placeholder group ID

        // Initialize the PhotosInterface object
        PhotosInterface photosInterface = flickr.getPhotosInterface();
        // Execute the search (up to 500 results, page 1)
        PhotoList photoList = photosInterface.search(searchParams, 500, 1);

        if (photoList != null) {
            // Walk through the search results
            for (int i = 0; i < photoList.size(); i++) {
                Photo photo = (Photo) photoList.get(i);

                System.out.print(photo.getId() + "\t");
                out.write(photo.getId() + "\t");

                System.out.print(photo.getOwner().getId() + "\t");
                out.write(photo.getOwner().getId() + "\t");

                // Fetch the full photo record to get its geo data
                Photo photo1 = photosInterface.getPhoto(photo.getId());

                if (photo1.getGeoData() != null) {
                    System.out.print("latitude=" + photo1.getGeoData().getLatitude() + "\t");
                    out.write(photo1.getGeoData().getLatitude() + "\t");

                    System.out.print("longitude=" + photo1.getGeoData().getLongitude() + "\t");
                    out.write(photo1.getGeoData().getLongitude() + "\t");
                } else {
                    // No geo data: write null placeholders
                    System.out.print(photo1.getGeoData() + "\t");
                    out.write(photo1.getGeoData() + "\t\t" + photo1.getGeoData());
                }
                System.out.println();
                out.write("\n");
            }
        }
        out.close();
    }
}



Comments (2)

清醇 2024-11-14 19:28:33


I'm not sure this is a good use case for Hadoop, unless you have tons of search results to process, and the processing accounts for a significant portion of the overall program. The search itself can't be performed in parallel: only the processing in your for loop.

If you want to process one search in parallel in Hadoop, you will first have to perform the search outside Hadoop** and output the results to a text file--a list of IDs, for instance. Then, write a mapper that takes an ID, fetches the photo, and does the processing you currently do in your for loop, emitting the string with your fetched attributes (which you are currently printing to System.out). Hadoop will run this mapper individually for every ID in your list of results.
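For instance, such a mapper might look roughly like this (a minimal sketch: one photo ID per input line, the Flickr setup copied from the code in the question, and the hard-coded "apikey" string just a placeholder):

    // Sketch only: input is one photo ID per line; output is photoId -> tab-separated attributes.
    // Needs the flickrj jars on the task classpath plus the usual org.apache.hadoop imports.
    public static class PhotoFetchMapper extends Mapper<LongWritable, Text, Text, Text> {

        private PhotosInterface photosInterface;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                REST rest = new REST();
                rest.setHost("www.flickr.com");
                photosInterface = new Flickr("apikey", rest).getPhotosInterface(); // placeholder key
            } catch (Exception e) { // new REST() may throw ParserConfigurationException in this flickrj version
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String photoId = value.toString().trim();
            try {
                Photo photo = photosInterface.getPhoto(photoId);
                String geo = photo.getGeoData() != null
                        ? photo.getGeoData().getLatitude() + "\t" + photo.getGeoData().getLongitude()
                        : "null\tnull";
                context.write(new Text(photoId),
                        new Text(photo.getOwner().getId() + "\t" + geo));
            } catch (Exception e) {
                // Skip photos that fail to fetch; a counter makes the failures visible.
                context.getCounter("flickr", "fetch_errors").increment(1);
            }
        }
    }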

I don't imagine this is going to be worth it, unless there is some other processing you are planning on doing. Some alternatives to consider:

  • Use map-reduce to perform lots of different searches in parallel. Your program would be essentially unchanged--it would just run inside a map function instead of the main() loop. Or your search could happen in the mapper, emitting the results, and your processing could happen in the reducer. Your input would be a list of search terms.

  • Forget about map-reduce, and just run the processing in parallel using a thread pool. Check out the various Executors in java.util.concurrent (a sketch of this follows just below).
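
A minimal sketch of that thread-pool alternative, keeping the fetch-and-write logic from the question and only parallelizing the per-photo work (the pool size and timeout are arbitrary, and it assumes Java 8+ for the lambda):

    // Replaces the for loop in main(); needs the java.util.concurrent imports.
    // main() would also have to declare (or catch) InterruptedException for awaitTermination.
    ExecutorService pool = Executors.newFixedThreadPool(8);   // pool size is arbitrary
    for (int i = 0; i < photoList.size(); i++) {
        final Photo photo = (Photo) photoList.get(i);
        pool.submit(() -> {
            try {
                // If PhotosInterface turns out not to be thread-safe, create one per task instead.
                Photo detail = photosInterface.getPhoto(photo.getId());
                String geo = detail.getGeoData() != null
                        ? detail.getGeoData().getLatitude() + "\t" + detail.getGeoData().getLongitude()
                        : "null\tnull";
                synchronized (out) {                           // FileWriter is not thread-safe
                    out.write(photo.getId() + "\t" + photo.getOwner().getId() + "\t" + geo + "\n");
                }
            } catch (Exception e) {
                e.printStackTrace();                           // skip photos that fail to fetch
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    out.close();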


** An alternative, hackish way to make the whole thing run "inside" Hadoop would be to run the search inside a map function, emitting the results one by one. Use an input file that has one line of text--a dummy value--so your mapper just runs once. Then do the image fetching in a reducer instead of the mapper.


Update:

If you have a bunch of different Group IDs to search, then you can use the first "alternative" approach.

Like you suggested, use the Group IDs and API keys as input. Put one on each line, separated by a tab or something that you can easily parse. You will also have to split them up into different files if you want them to run in different mappers. If you only have as many Group IDs as nodes, you will probably just want to put one line in each file. Use TextInputFormat for your Hadoop job. The line with the Group ID and API key will be the value--use value.toString().split("\t") to separate it into the two parts.

Then, run your whole search inside the mapper. For each result, use context.write(key, value) (or output.collect(key, value), depending on your version) to write a photo ID as the key and the string with your attributes as the value. Both of these will have to be converted into Hadoop's Text objects.
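Roughly, that mapper could look like the sketch below (under the assumptions above: each input line is "groupId<TAB>apiKey", the search runs inside the mapper, and the 500/1 search arguments are copied from the question):

    // Sketch: one "groupId<TAB>apiKey" line per map() call; emits photoId -> tab-separated attributes.
    // Needs the flickrj jars on the task classpath plus the usual org.apache.hadoop imports.
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");     // [0] = group ID, [1] = API key
            try {
                REST rest = new REST();
                rest.setHost("www.flickr.com");
                PhotosInterface photos = new Flickr(parts[1], rest).getPhotosInterface();

                SearchParameters params = new SearchParameters();
                params.setSort(SearchParameters.INTERESTINGNESS_DESC);
                params.setGroupId(parts[0]);

                PhotoList results = photos.search(params, 500, 1);
                for (int i = 0; i < results.size(); i++) {
                    Photo photo = (Photo) results.get(i);
                    Photo detail = photos.getPhoto(photo.getId());
                    String geo = detail.getGeoData() != null
                            ? detail.getGeoData().getLatitude() + "\t" + detail.getGeoData().getLongitude()
                            : "null\tnull";
                    context.write(new Text(photo.getId()),
                            new Text(photo.getOwner().getId() + "\t" + geo));
                }
            } catch (Exception e) {
                throw new IOException("Flickr search failed for line: " + value, e);
            }
        }
    }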

I'm not going to give wholesale code for this--it should be very easy to just adapt the Hadoop MapReduce tutorial. The only real differences:

  • Use job.setOutputValueClass(Text.class), and change where it says IntWritable in the
    mapper class signature:

    public static class Map
        extends Mapper<LongWritable, Text, Text, Text> {
    
  • Just disable the reducer. Take out the reducer class, and change this:

    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);
    

    into this:

    job.setMapperClass(Map.class);
    job.setNumReduceTasks(0);
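
Put together with the tutorial's WordCount driver, the resulting map-only setup might look roughly like this (a minimal sketch; the job name, class names, and paths are illustrative):

    // Driver sketch, adapted from the standard MapReduce tutorial; needs the usual
    // org.apache.hadoop conf/fs/io/mapreduce imports.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "flickr group search");  // or new Job(conf, ...) on older Hadoop
        job.setJarByClass(FlickrSearch.class);            // whatever class holds your Map class

        job.setMapperClass(Map.class);
        job.setNumReduceTasks(0);                         // map-only: mapper output is written directly

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // files of groupId<TAB>apiKey lines
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }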
    

If you have specific questions about getting this to work, feel free to ask. Do put some research effort into it first, though.

初吻给了烟 2024-11-14 19:28:33


I don't agree with Tim Yates' answer. Your search can be parallelized very well.
My approach would be the following:

  1. Implement a Mapper that takes a chunk of your search queries as input (you have to chunk them yourself, because these things aren't sequence files), then runs the queries and writes the results to the filesystem. The key would be the ID and the value would be your additional information.
  2. Implement a reducer that does whatever you want with the data (the output of the first step).

I've already done this with the YouTube API, so it parallelizes well. BUT you have to watch out for quota limits. You can handle them quite well with Thread.sleep(PERIOD_OF_TIME).
This only applies if you have a bunch of search queries; if a user is entering a single search string, MapReduce isn't the (optimal) way to go.
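
Since the question here is about Flickr rather than YouTube, a rough sketch of that idea against the flickrj API from the question might look like this (one search keyword per input line; the setTags call, placeholder API key, and delay value are all illustrative):

    // Sketch: each input line is one search keyword; the mapper runs the query, emits the results,
    // and sleeps between queries to stay under the API's request quota.
    public static class ThrottledSearchMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final long PERIOD_OF_TIME = 1000L;   // illustrative delay in ms; tune to your quota

        private PhotosInterface photos;

        @Override
        protected void setup(Context context) throws IOException {
            try {
                REST rest = new REST();
                rest.setHost("www.flickr.com");
                photos = new Flickr("apikey", rest).getPhotosInterface();   // placeholder key
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                SearchParameters params = new SearchParameters();
                params.setTags(new String[] { value.toString().trim() });   // one keyword per line
                PhotoList results = photos.search(params, 500, 1);
                for (int i = 0; i < results.size(); i++) {
                    Photo photo = (Photo) results.get(i);
                    // Key = photo ID, value = whatever additional information you need.
                    context.write(new Text(photo.getId()), new Text(photo.getOwner().getId()));
                }
            } catch (Exception e) {
                throw new IOException(e);
            }
            Thread.sleep(PERIOD_OF_TIME);   // crude throttle between queries
        }
    }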
