Apache Nutch does not read a new configuration file when run with the job file

Posted on 2025-02-06 20:09:22


I have configured Apache Nutch 1.x for web crawling. There is a requirement to add some extra information to the Solr document for each domain that is indexed. The configuration is a JSON file. I have developed the following code for this and tested it successfully in local mode. I have updated the index-basic plugin. The code snippet is as follows:

this.enable_extra_domain = conf.getBoolean("domain.extraInfo.enable", false);
if (this.enable_extra_domain) {
    String domainExtraInfo = conf.get("domain.extraInfo.file", "conf/domain-extra.json");
    readDomainFile(domainExtraInfo);
    LOG.info("domain.extraInfo.enable is enabled. Using " + domainExtraInfo + " for input.");
} else {
    LOG.info("domain.extraInfo.enable is disabled.");
}
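For reference, the two properties read above would be declared in nutch-site.xml in the standard Hadoop property format. The property names are taken from the snippet; the values shown here are only an illustration (the file path matches the code's default):

```xml
<!-- Hypothetical nutch-site.xml entries matching the snippet above -->
<property>
  <name>domain.extraInfo.enable</name>
  <value>true</value>
  <description>Enable adding per-domain extra info to indexed documents.</description>
</property>
<property>
  <name>domain.extraInfo.file</name>
  <value>conf/domain-extra.json</value>
  <description>JSON file containing the per-domain extra info.</description>
</property>
```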

The function that reads the file is shown below:

private void readDomainFile(String domainExtraInfo) {
    // Map from domain name to its extra info
    website_records = new HashMap<String, List<Object>>();

    JSONParser jsonParser = new JSONParser();
    try (FileReader reader = new FileReader(domainExtraInfo)) {
        Object obj = jsonParser.parse(reader);
        JSONArray domainList = (JSONArray) obj;
        domainList.forEach(domain -> parseDomainObject((JSONObject) domain));
    } catch (IOException | ParseException e) {
        LOG.error("Failed to read domain extra-info file " + domainExtraInfo, e);
    }
}

This code works successfully when I run it in local mode. But when I run Nutch with the .job file on EMR (or another Hadoop cluster), I get a java.io.FileNotFoundException. Where is the problem? In local mode my new configuration file is in the conf folder, while in deploy mode it is added to the .job file.


Comments (1)

赤濁 2025-02-13 20:09:22


I have my new configuration file in conf folder in local mode while in deploy, it is added in .job file

In distributed mode the file needs to be read from the job file deployed to the Hadoop cluster nodes. The easiest way is to use the methods provided by the Hadoop Configuration class, for example getConfResourceAsReader(String name). Note: the argument "name" is the file name without the directory part ("domain-extra.json"). You'll find many examples in the Nutch source code, e.g., in one of the URL filters.
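As a sketch of why this fix works: Hadoop's Configuration.getConfResourceAsReader resolves the bare file name on the classpath, and in distributed mode the .job file's conf entries are on the task's classpath. The stand-alone demo below imitates that lookup with a plain classloader resource search plus a local-mode filesystem fallback, so it runs without the Hadoop jars. The class name, the fallback logic, and the sample file contents are illustrative assumptions, not Nutch or Hadoop code.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ConfResourceDemo {

    // Resolve a conf file the way Configuration.getConfResourceAsReader does:
    // by bare file name on the classpath (which, on a cluster, includes the
    // contents of the .job file). Falls back to the local conf directory.
    static Reader openConfResource(String name) throws IOException {
        InputStream in = ConfResourceDemo.class.getClassLoader()
                .getResourceAsStream(name);
        if (in == null) {
            // Local-mode fallback: plain filesystem lookup under conf/
            in = new FileInputStream("conf/" + name);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo setup: write a sample conf/domain-extra.json, then open it
        // through the resource-style lookup and read it back. In the plugin,
        // the returned Reader would be handed to jsonParser.parse(...)
        // instead of a FileReader.
        Files.createDirectories(Paths.get("conf"));
        Files.write(Paths.get("conf/domain-extra.json"),
                "[{\"domain\": \"example.org\"}]".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader r = new BufferedReader(openConfResource("domain-extra.json"))) {
            System.out.println(r.readLine());
        }
    }
}
```

Note that the lookup is by bare name ("domain-extra.json", no "conf/" prefix), exactly as the answer describes for getConfResourceAsReader.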
