If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?
The context of this question is that I am trying to use the maxmind java api in a pig script that I have written... I do not think that knowing about either is necessary to answer the question, however.
The maxmind API has a constructor which requires a path to a file called GeoIP.dat, which is a comma delimited file which has the information it needs.
I have a jar file which contains the API, as well as a wrapping class which instantiates the class and uses it. My idea is to package the GeoIP.dat file into the jar, and then access it as a resource in the jar file. The issue is that I do not know how to construct a path that the constructor can use.
Looking at the API, this is how they load the file:
public LookupService(String databaseFile) throws IOException {
    this(new File(databaseFile));
}

public LookupService(File databaseFile) throws IOException {
    this.databaseFile = databaseFile;
    this.file = new RandomAccessFile(databaseFile, "r");
    init();
}
I only paste that because I am not averse to editing the API itself to make this work, if necessary, but I do not know how I could replicate such functionality myself. Ideally I'd like to get it into file form, though, or else editing the API will be quite a chore.
Is this possible?
Comments (6)
Try:
dump your data to a temp file, and feed the temp file to it.
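A minimal sketch of that approach, assuming the legacy MaxMind API (com.maxmind.geoip.LookupService) and that GeoIP.dat sits at the root of the jar; the class name and resource path are illustrative, not from the original answer:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

import com.maxmind.geoip.LookupService;

public class GeoIpFromJar {
    public static LookupService load() throws IOException {
        // Read the bundled database off the classpath.
        try (InputStream in = GeoIpFromJar.class.getResourceAsStream("/GeoIP.dat")) {
            if (in == null) {
                throw new IOException("GeoIP.dat not found on the classpath");
            }
            // Dump it to a temp file that the File-based constructor can open.
            File tmp = File.createTempFile("GeoIP", ".dat");
            tmp.deleteOnExit();
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            // Feed the temp file to the API.
            return new LookupService(tmp);
        }
    }
}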
One recommended way is to use the Distributed Cache rather than trying to bundle it into a jar.
If you zip GeoIP.dat and copy it to hdfs://host:port/path/GeoIP.dat.zip, then add these options to the Pig command:
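Assuming the classic MapReduce distributed-cache properties (mapred.cache.archives and mapred.create.symlink, where the #GeoIP.dat fragment controls the name of the symlink created in each task's working directory), the invocation would look roughly like:

pig -Dmapred.cache.archives=hdfs://host:port/path/GeoIP.dat.zip#GeoIP.dat \
    -Dmapred.create.symlink=yes \
    your_script.pig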
And
LookupService lookupService = new LookupService("./GeoIP.dat");
should work in your UDF as the file will be present locally to the tasks on each node.
This works for me.
Assuming you have a package org.foo.bar.util that contains GeoLiteCity.dat.
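A sketch of one way to pick that resource up, assuming the legacy com.maxmind.geoip.LookupService API; the factory class name here is made up for illustration:

import java.io.IOException;
import java.net.URL;

import com.maxmind.geoip.LookupService;

public class GeoLookupFactory {
    public static LookupService fromClasspath() throws IOException {
        // The resource path mirrors the package org.foo.bar.util above.
        URL url = GeoLookupFactory.class.getResource("/org/foo/bar/util/GeoLiteCity.dat");
        // Caveat: url.getFile() is only a usable filesystem path while the
        // classpath is unpacked (IDE, exploded build directory). From inside
        // a jar it is not, so fall back to the temp-file approach above.
        return new LookupService(url.getFile());
    }
}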
Use the classloader.getResource(...) method to do the file lookup in the classpath, which will pull it from the JAR file. This means you will have to alter the existing code to override the loading. The details on how to do that depend heavily on your existing code and environment. In some cases subclassing and registering the subclass with the framework might work. In other cases, you might have to determine the ordering of class loading along the classpath and place an identically signed class "earlier" in the classpath.
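For instance, a hedged sketch of the subclassing route: a wrapper that stages the classpath resource to a temp file before delegating to the existing File-based constructor. All names here are illustrative, and this assumes LookupService is not final:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

import com.maxmind.geoip.LookupService;

public class ClasspathLookupService extends LookupService {
    public ClasspathLookupService(String resourceName) throws IOException {
        // super(...) must come first, so the staging happens in a static helper.
        super(stageToTempFile(resourceName));
    }

    private static File stageToTempFile(String resourceName) throws IOException {
        try (InputStream in = ClasspathLookupService.class.getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException(resourceName + " not found on the classpath");
            }
            File tmp = File.createTempFile("geoip", ".dat");
            tmp.deleteOnExit();
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }
}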
Here's how we use the maxmind geoIP:

We put the GeoIPCity.dat file into the cloud and use the cloud location as an argument when we launch the process. The code where we get the GeoIPCity.dat file and create a new LookupService is:
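Roughly, as a sketch, assuming the old mapred-era distributed-cache API (org.apache.hadoop.filecache.DistributedCache) and the legacy MaxMind LookupService; the mapper class and key/value types are illustrative:

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.maxmind.geoip.LookupService;

public class GeoIpMapper extends Mapper<LongWritable, Text, Text, Text> {
    private LookupService lookupService;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files shipped with -files land in the task-local distributed cache.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cached != null) {
            for (Path path : cached) {
                if ("GeoIPCity.dat".equals(path.getName())) {
                    lookupService = new LookupService(new File(path.toUri().getPath()));
                }
            }
        }
    }
}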
Here is an abbreviated version of the command we use to run our process:
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar
The critical components of this for running the MaxMind component are the -files and -libjars options. These are generic options in the GenericOptionsParser.

-files <comma separated list of files>: specify comma-separated files to be copied to the map-reduce cluster.
-libjars <comma separated list of jars>: specify comma-separated jar files to include in the classpath.
I'm assuming that Hadoop uses the GenericOptionsParser because I can find no reference to it anywhere in my project. :)

If you put the GeoIPCity.dat on the cloud and specify it using the -files argument, it will be put into the local cache, which the mapper can then pick up in its setup function. It doesn't have to be in setup, but it only needs to be done once per mapper, so that is an excellent place to put it.

Then use the -libjars argument to specify the geoiplookup.jar (or whatever you've called yours) and it will be able to use it. We don't put the geoiplookup.jar on the cloud. I'm rolling with the assumption that hadoop will distribute the jar as it needs to.

I hope that all makes sense. I am getting fairly familiar with hadoop/mapreduce, but I didn't write the pieces that use the maxmind geoip component in the project, so I've had to do a little digging to understand it well enough to do the explanation I have here.
EDIT: Additional description for the -files and -libjars options
-files The files argument is used to distribute files through the Hadoop Distributed Cache. In the example above, we are distributing the MaxMind geo-ip data file through the Hadoop Distributed Cache. We need access to the MaxMind geo-ip data file to map the user's IP address to the appropriate country, region, city, and timezone. The API requires that the data file be present locally, which is not feasible in a distributed processing environment (we cannot guarantee which nodes in the cluster will process the data). To distribute the appropriate data to the processing nodes, we use the Hadoop Distributed Cache infrastructure. The GenericOptionsParser and the ToolRunner automatically facilitate this using the -files argument. Please note that the file we distribute should be available in the cloud (HDFS).

-libjars The -libjars option is used to distribute any additional dependencies required by the map-reduce jobs. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will be run. The GenericOptionsParser and the ToolRunner automatically facilitate this using the -libjars argument.