Apache Nutch does not read a new configuration file when run with the job file

Posted on 2025-02-06 20:09:22


I have configured Apache Nutch 1.x for web crawling. There is a requirement to add some extra information to the Solr document for each domain that is indexed. The configuration is a JSON file. I have developed the following code for this and tested it successfully in local mode. I have updated the index-basic plugin. The code snippet is as follows:

this.enable_extra_domain = conf.getBoolean("domain.extraInfo.enable", false);
if (this.enable_extra_domain) {
    String domainExtraInfo = conf.get("domain.extraInfo.file", "conf/domain-extra.json");
    readDomainFile(domainExtraInfo);
    LOG.info("domain.extraInfo.enable is enabled. Using " + domainExtraInfo + " for input.");
} else {
    LOG.info("domain.extraInfo.enable is disabled.");
}
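For reference, the two properties read above would be declared in nutch-site.xml in the standard Hadoop property format. The property names are taken from the snippet; the values shown here are only an illustration (the file path matches the code's default):

```xml
<!-- Hypothetical nutch-site.xml entries matching the snippet above -->
<property>
  <name>domain.extraInfo.enable</name>
  <value>true</value>
  <description>Enable adding per-domain extra info to indexed documents.</description>
</property>
<property>
  <name>domain.extraInfo.file</name>
  <value>conf/domain-extra.json</value>
  <description>JSON file containing the per-domain extra info.</description>
</property>
```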

The function that reads the file is shown below:

private void readDomainFile(String domainExtraInfo) {
    // Map from domain name to its extra info
    website_records = new HashMap<String, List<Object>>();

    JSONParser jsonParser = new JSONParser();
    try (FileReader reader = new FileReader(domainExtraInfo)) {
        Object obj = jsonParser.parse(reader);
        JSONArray domainList = (JSONArray) obj;
        domainList.forEach(domain -> parseDomainObject((JSONObject) domain));
    } catch (IOException | ParseException e) {
        LOG.error("Failed to read domain extra-info file " + domainExtraInfo, e);
    }
}

This code works successfully when I run it in local mode. But when I run Nutch with the .job file on EMR (or another Hadoop cluster), I get a java.io.FileNotFoundException. Where is the problem? In local mode my new configuration file is in the conf folder, while in deploy mode it is added to the .job file.


Comments (1)

赤濁 2025-02-13 20:09:22


I have my new configuration file in conf folder in local mode while in deploy, it is added in .job file

In distributed mode the file needs to be read from the job file deployed to the Hadoop cluster nodes. The easiest way is to use the methods provided by the Hadoop Configuration class, for example getConfResourceAsReader(String name). Note: the argument "name" is the file name without the directory part ("domain-extra.json"). You'll find many examples in the Nutch source code, e.g., in one of the URL filters.
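As a sketch of why this fix works: Hadoop's Configuration.getConfResourceAsReader resolves the bare file name on the classpath, and in distributed mode the .job file's conf entries are on the task's classpath. The stand-alone demo below imitates that lookup with a plain classloader resource search plus a local-mode filesystem fallback, so it runs without the Hadoop jars. The class name, the fallback logic, and the sample file contents are illustrative assumptions, not Nutch or Hadoop code.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ConfResourceDemo {

    // Resolve a conf file the way Configuration.getConfResourceAsReader does:
    // by bare file name on the classpath (which, on a cluster, includes the
    // contents of the .job file). Falls back to the local conf directory.
    static Reader openConfResource(String name) throws IOException {
        InputStream in = ConfResourceDemo.class.getClassLoader()
                .getResourceAsStream(name);
        if (in == null) {
            // Local-mode fallback: plain filesystem lookup under conf/
            in = new FileInputStream("conf/" + name);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demo setup: write a sample conf/domain-extra.json, then open it
        // through the resource-style lookup and read it back. In the plugin,
        // the returned Reader would be handed to jsonParser.parse(...)
        // instead of a FileReader.
        Files.createDirectories(Paths.get("conf"));
        Files.write(Paths.get("conf/domain-extra.json"),
                "[{\"domain\": \"example.org\"}]".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader r = new BufferedReader(openConfResource("domain-extra.json"))) {
            System.out.println(r.readLine());
        }
    }
}
```

Note that the lookup is by bare name ("domain-extra.json", no "conf/" prefix), exactly as the answer describes for getConfResourceAsReader.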
