Java pattern: engineering data flow for data mining tasks

Posted 2024-12-14 22:23:17

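I am a data miner, and as such, I spend a lot of time transforming raw data in various ways to enable consumption by predictive models. For instance: read a file in a certain format, tokenize, gram-ify, and project into some numeric representation. Over the years I have developed a rich set of methods to do most of the data processing tasks I can think of, but I don't have a nice way of configuring these components in all but the most rudimentary ways. Typically what I do is make a lot of calls to specific methods in source code that depends on a specific task. I'm now trying to refactor my libraries into something much nicer, but I'm not too sure what that is.

My current thinking is to have a list of function objects, each defining some method (say, operate( ... )), that are called in sequence, each either processing the contents of some data flow by reference or consuming the output of the previous function object. This is close to what I want, but because the types of data being input and output will vary, using generics becomes very difficult. To use my example above, I'd like to pass something through this "pipeline" that processes data like: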

input: string filename
filename -> collection of strings
collection<string> -> (stemming, stopword removal) -> collection of strings
collection<string> -> (tokenize) -> collection of string arrays
collection<string[]> -> (gram-ify) -> augment individual token strings with n-grams -> collection of string arrays
collection<string[]> -> projection into numeric vectors -> collection< double[] >
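
As a rough illustration of the flow above, the last three stages can be typed with java.util.function.Function so that the compiler checks each hand-off. This is only a sketch under assumptions: the class name, the method names, the "a_b" bigram format, and the toy projection (token count and total characters) are all hypothetical and stand in for real components.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: names and the toy projection are illustrative assumptions.
class TypedFlowSketch {

    // collection<string> -> (tokenize) -> collection<string[]>
    static List<String[]> tokenize(List<String> lines) {
        List<String[]> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.trim().split("\\s+"));
        }
        return out;
    }

    // collection<string[]> -> (gram-ify) -> collection<string[]>
    // Keeps the original tokens and appends bigrams of the form "a_b".
    static List<String[]> addBigrams(List<String[]> docs) {
        List<String[]> out = new ArrayList<>();
        for (String[] tokens : docs) {
            List<String> augmented = new ArrayList<>(Arrays.asList(tokens));
            for (int i = 0; i + 1 < tokens.length; i++) {
                augmented.add(tokens[i] + "_" + tokens[i + 1]);
            }
            out.add(augmented.toArray(new String[0]));
        }
        return out;
    }

    // collection<string[]> -> projection -> collection<double[]>
    // Toy projection: [number of tokens, total characters across tokens].
    static List<double[]> project(List<String[]> docs) {
        List<double[]> out = new ArrayList<>();
        for (String[] tokens : docs) {
            double chars = 0;
            for (String t : tokens) {
                chars += t.length();
            }
            out.add(new double[] {tokens.length, chars});
        }
        return out;
    }

    // Each andThen only compiles if the output type of one stage
    // matches the input type of the next.
    static Function<List<String>, List<double[]>> buildFlow() {
        Function<List<String>, List<String[]>> tokenize = TypedFlowSketch::tokenize;
        Function<List<String[]>, List<String[]>> gramify = TypedFlowSketch::addBigrams;
        Function<List<String[]>, List<double[]>> project = TypedFlowSketch::project;
        return tokenize.andThen(gramify).andThen(project);
    }

    public static void main(String[] args) {
        List<double[]> vectors = buildFlow().apply(Arrays.asList("a b c"));
        System.out.println(Arrays.toString(vectors.get(0))); // prints [5.0, 9.0]
    }
}
```

Wiring stages this way keeps full static typing, but the chain is fixed in source code, which is the configuration limitation the question is about.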

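This is a simple example, but imagine I have hundreds of such components that I'd like to add to some data flow. This meets my easy-to-configure requirement: I could easily build a pipeline factory that reads some YAML file and builds this out. However, the design pattern for the components has been stumping me for a while. What do the appropriate interfaces look like? It seems the only easy way to do things here is to have Objects get passed, essentially doing away with generics (or to pass some context object that has an Object as a member variable), then check for compatibility at input and throw runtime exceptions. Both options seem equally bad. However, I feel like I'm close to a really nice and flexible system here. Can you guys help me push this over the fence?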
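
For illustration only, here is one way the "pass Objects and check compatibility" option might look if the check runs once, when the pipeline is wired (for example by a config-driven factory), rather than each time data arrives. Stage, Pipeline, and demo are hypothetical names, not part of any existing library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: all names here are illustrative assumptions.
class CheckedPipelineSketch {

    /** A stage that declares its input and output types so wiring can be checked. */
    static final class Stage<I, O> {
        final String name;
        final Class<I> in;
        final Class<O> out;
        final Function<I, O> fn;

        Stage(String name, Class<I> in, Class<O> out, Function<I, O> fn) {
            this.name = name;
            this.in = in;
            this.out = out;
            this.fn = fn;
        }

        // The only place where an untyped Object is cast back to a typed value.
        Object run(Object input) {
            return fn.apply(in.cast(input));
        }
    }

    static final class Pipeline {
        private final List<Stage<?, ?>> stages = new ArrayList<>();

        // Rejects a stage whose declared input cannot accept the previous output.
        Pipeline add(Stage<?, ?> next) {
            if (!stages.isEmpty()) {
                Stage<?, ?> prev = stages.get(stages.size() - 1);
                if (!next.in.isAssignableFrom(prev.out)) {
                    throw new IllegalArgumentException(
                        prev.name + " produces " + prev.out.getSimpleName()
                        + " but " + next.name + " expects " + next.in.getSimpleName());
                }
            }
            stages.add(next);
            return this;
        }

        Object run(Object input) {
            Object current = input;
            for (Stage<?, ?> stage : stages) {
                current = stage.run(current);
            }
            return current;
        }
    }

    // Example wiring: String -> String[] -> double[] (token lengths).
    static Pipeline demo() {
        return new Pipeline()
            .add(new Stage<String, String[]>("tokenize", String.class, String[].class,
                s -> s.trim().split("\\s+")))
            .add(new Stage<String[], double[]>("lengths", String[].class, double[].class,
                tokens -> {
                    double[] v = new double[tokens.length];
                    for (int i = 0; i < tokens.length; i++) {
                        v[i] = tokens[i].length();
                    }
                    return v;
                }));
    }
}
```

A mis-wired pipeline then fails while it is being built, before any data is processed. One known limit of this sketch: because of type erasure, a Class token cannot tell List<String> from List<String[]>, so collection-valued stages would need wrapper types or some other type descriptor.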

王权女流氓 2024-12-21 22:23:17

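The Apache foundation has a project called pipelines: https://commons.apache.org/sandbox/pipeline/. It might be useful. I think there are more pipeline-based projects out there, so browsing that site may be worthwhile.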

浅蓝的眸勾画不出的柔情 2024-12-21 22:23:17

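I think a more flexible tool for tying your libraries together would be a good approach. For example, one of the newer dynamic languages would be a good fit for this.

Clojure is well suited thanks to built-in tools such as map, pmap, reduce, filter, and so on. Clojure's collections all implement the java.util collection interfaces, so you can apply higher-level Clojure functions to data from your existing Java code, or pass Clojure data structures directly to Java code (as long as the Java code doesn't expect to modify them).

The lightweight and dynamic nature of the language makes it easy to put things together quickly without much overhead.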
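
For readers staying in Java, a similar map/filter/reduce style is available through java.util.stream. The sketch below is a Java analogue and not Clojure code; the class name, method names, and stopword handling are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: a Java streams analogue of map / filter / reduce.
class StreamFlowSketch {

    // map: line -> lowercase tokens; filter: drop stopwords from each token array.
    static List<String[]> tokenizeWithoutStopwords(List<String> lines, Set<String> stopwords) {
        return lines.stream()
            .map(line -> line.toLowerCase().trim().split("\\s+"))
            .map(tokens -> Arrays.stream(tokens)
                .filter(t -> !stopwords.contains(t))
                .toArray(String[]::new))
            .collect(Collectors.toList());
    }

    // reduce: sum the token counts across all documents.
    static int totalTokens(List<String[]> docs) {
        return docs.stream()
            .map(tokens -> tokens.length)
            .reduce(0, Integer::sum);
    }
}
```

This keeps the composition inside one typed expression, though it does not by itself address loading the chain from configuration.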

分分钟 2024-12-21 22:23:17

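I might be reading your example too literally, which means this solution might not be applicable to your real problem.

import java.util.List;

// Transforms a list of strings into another list of strings.
public interface Interface1 {
  List<String> operate(List<String> list);
}

// Bridges from a list of strings to a list of string lists (e.g. tokenizing).
public interface InterfaceBridge {
  List<List<String>> operate(List<String> list);
}

// Transforms a list of string lists into another list of string lists.
public interface Interface2 {
  List<List<String>> operate(List<List<String>> list);
}

You should obviously pick better interface names. You can then compose them with:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Interface1Composite implements Interface1 {
  private final List<Interface1> components = new ArrayList<>();

  public Interface1Composite(Interface1... components) {
    this.components.addAll(Arrays.asList(components));
  }

  // Applies each component in order, feeding each output into the next.
  @Override
  public List<String> operate(List<String> list) {
    for (Interface1 i1 : components)
      list = i1.operate(list);
    return list;
  }
}

I guess it's pretty much what you are already doing. I just simplified by having 3 types of interfaces instead of trying to use generics. But as I said earlier, I don't know if you can apply that to your problem.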
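
A possible usage sketch for the three interfaces follows. Since each has a single method, lambdas can implement them. The interfaces and the composite are repeated inside a wrapper class only so the example compiles on its own; the concrete stages (lowercase, dropBlank, tokenize, addBigrams) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical usage sketch; the stage implementations are illustrative assumptions.
class CompositeUsageSketch {

    interface Interface1 { List<String> operate(List<String> list); }
    interface InterfaceBridge { List<List<String>> operate(List<String> list); }
    interface Interface2 { List<List<String>> operate(List<List<String>> list); }

    static class Interface1Composite implements Interface1 {
        private final List<Interface1> components = new ArrayList<>();

        Interface1Composite(Interface1... components) {
            this.components.addAll(Arrays.asList(components));
        }

        @Override
        public List<String> operate(List<String> list) {
            for (Interface1 i1 : components) {
                list = i1.operate(list);
            }
            return list;
        }
    }

    static List<List<String>> run(List<String> lines) {
        // String -> String stages, grouped into one composite.
        Interface1 lowercase = list -> list.stream()
            .map(s -> s.toLowerCase())
            .collect(Collectors.toList());
        Interface1 dropBlank = list -> list.stream()
            .filter(s -> !s.trim().isEmpty())
            .collect(Collectors.toList());
        Interface1 cleaning = new Interface1Composite(lowercase, dropBlank);

        // Bridge: each line becomes a list of tokens.
        InterfaceBridge tokenize = list -> list.stream()
            .map(s -> Arrays.asList(s.trim().split("\\s+")))
            .collect(Collectors.toList());

        // Token-list stage: append bigrams of the form "a_b" to each document.
        Interface2 addBigrams = docs -> docs.stream()
            .map(tokens -> {
                List<String> out = new ArrayList<>(tokens);
                for (int i = 0; i + 1 < tokens.size(); i++) {
                    out.add(tokens.get(i) + "_" + tokens.get(i + 1));
                }
                return out;
            })
            .collect(Collectors.toList());

        return addBigrams.operate(tokenize.operate(cleaning.operate(lines)));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hello World", "  ", "A b C")));
        // prints [[hello, world, hello_world], [a, b, c, a_b, b_c]]
    }
}
```

The compiler enforces the order cleaning, then bridge, then token-list stage, but each new data shape needs its own interface and bridge, which is the cost of avoiding generics here.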
