在数据库之前处理数据的应用程序设计

发布于 2024-07-06 17:20:13 字数 844 浏览 15 评论 0原文

我在 Excel 文件(和 csv 文件)中收集了大量数据。 数据需要放入数据库(mysql)中。 然而,在进入数据库之前,需要对其进行处理。例如,如果第 1 列小于第 3 列,则将 4 添加到第 2 列。在持久化信息之前,必须遵循很多规则。

完成这项任务需要遵循什么好的设计? (使用java)

附加说明

该过程需要自动化。 从某种意义上说,我不必手动进入并更改数据。 我们讨论的是数千行数据,每行 15 列信息。

目前,我已经建立了一种责任链设计。 每个规则一个类(Java)。 当一条规则完成后,它会调用下一条规则。

更多信息

通常每个数据表大约有 5000 行。 速度并不是一个大问题,因为 这么大的输入并不经常发生。

我考虑过流口水,但我不确定这项任务对于流口水来说是否足够复杂。

规则示例:

  1. 所有货币(特定列中的数据)不得包含货币符号。

  2. 类别名称必须统一(例如,书柜 = 书柜)

  3. 参赛日期不能是将来的日期

  4. 文本输入只能包含[AZ 0-9 \s]

等..
此外,如果任何信息列无效,则需要在以下情况下报告: 处理完成 (或者可能停止处理)。

我当前的解决方案有效。 不过我认为还有改进的空间,所以我正在寻找 关于如何改进它以及/或其他人如何处理类似的理想 情况。

我考虑过(非常简单地)使用流口水,但我不确定这项工作是否复杂到足以利用流口水。

I have a large collection of data in an excel file (and csv files). The data needs to be placed into a database (mysql). However, before it goes into the database it needs to be processed..for example if columns 1 is less than column 3 add 4 to column 2. There are quite a few rules that must be followed before the information is persisted.

What would be a good design to follow to accomplish this task? (using java)

Additional notes

The process needs to be automated. In the sense that I don't have to manually go in and alter the data. We're talking about thousands of lines of data with 15 columns of information per line.

Currently, I have a sort of chain of responsibility design set up. One class(Java) for each rule. When one rule is done, it calls the following rule.

More Info

Typically there are about 5000 rows per data sheet. Speed isn't a huge concern because
this large input doesn't happen often.

I've considered drools, however I wasn't sure the task was complicated enough for drols.

Example rules:

  1. All currency (data in specific columns) must not contain currency symbols.

  2. Category names must be uniform (e.g. book case = bookcase)

  3. Entry dates can not be future dates

  4. Text input can only contain [A-Z 0-9 \s]

etc..
Additionally if any column of information is invalid it needs to be reported when
processing is complete
(or maybe stop processing).

My current solution works. However I think there is room for improvement so I'm looking
for ideals as to how it can be improved and or how other people have handled similar
situations.

I've considered (very briefly) using drools but I wasn't sure the work was complicated enough to take advantage of drools.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

桃酥萝莉 2024-07-13 17:20:13

如果我不想一步完成此操作(正如 Oli 提到的),我可能会使用 管道和过滤器设计。 由于您的规则相对简单,我可能会做几个基于委托的类。 例如(C# 代码,但 Java 应该非常相似...也许有人可以翻译?):

interface IFilter {
   public IEnumerable<string> Filter(IEnumerable<string> file) {
   }
}

class PredicateFilter : IFilter {
   public PredicateFilter(Predicate<string> predicate) { }

   public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         if (this.Predicate(s)) {
            yield return s;
         }
      }
   }
}

class ActionFilter : IFilter {
  public ActionFilter(Action<string> action) { }

  public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         this.Action(s);
         yield return s;
      }
  }
}

class ReplaceFilter : IFilter {
  public ReplaceFilter(Func<string, string> replace) { }

  public IEnumerable<string> Filter(IEnumerable<string> file) {
     foreach (string s in file) {
        yield return this.Replace(s);
     }
  }
}

从那里,您可以直接使用委托过滤器,或者根据具体情况对它们进行子类化。 然后,将它们注册到管道中,管道将让它们通过每个过滤器。

If I didn't care to do this in 1 step (as Oli mentions), I'd probably use a pipe and filters design. Since your rules are relatively simple, I'd probably do a couple delegate based classes. For instance (C# code, but Java should be pretty similar...perhaps someone could translate?):

interface IFilter {
   public IEnumerable<string> Filter(IEnumerable<string> file) {
   }
}

class PredicateFilter : IFilter {
   public PredicateFilter(Predicate<string> predicate) { }

   public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         if (this.Predicate(s)) {
            yield return s;
         }
      }
   }
}

class ActionFilter : IFilter {
  public ActionFilter(Action<string> action) { }

  public IEnumerable<string> Filter(IEnumerable<string> file) {
      foreach (string s in file) {
         this.Action(s);
         yield return s;
      }
  }
}

class ReplaceFilter : IFilter {
  public ReplaceFilter(Func<string, string> replace) { }

  public IEnumerable<string> Filter(IEnumerable<string> file) {
     foreach (string s in file) {
        yield return this.Replace(s);
     }
  }
}

From there, you could either use the delegate filters directly, or subclass them for the specifics. Then, register them with a Pipeline that will pass them through each filter.

呢古 2024-07-13 17:20:13

我觉得你的方法还可以。 特别是如果您在每个处理器上使用相同的接口。

您还可以考虑称为 Drules 的东西,目前是 Jboss 规则。 前一段时间,我将其用于我的应用程序的规则密集部分,我喜欢它的是,业务逻辑可以用电子表格或 DSL 来表达,然后编译为 java(运行时,我认为有也是一个编译时选项)。 它使规则更加简洁,因此更具可读性。 它也很容易学习(2天左右)。

这是开源 Jboss-rules 的链接。 在 jboss.com,如果更符合您公司的口味,您无疑可以购买官方维护的版本。

I think your method is OK. Especially if you use the same interface on every processor.

You could also look to somethink called Drules, currently Jboss-rules. I used that some time ago for a rule-heavy part of my app and what I liked about it is that the business logic can be expressed in for instance a spreadsheet or DSL which then get's compiled to java (run-time and I think there's also a compile-time option). It makes rules a bit more succint and thus readable. It's also very easy to learn (2 days or so).

Here's a link to the opensource Jboss-rules. At jboss.com you can undoubtedly purchase an offically maintained version if that's more to your companies taste.

债姬 2024-07-13 17:20:13

只需创建一个函数来执行每个规则,并为每个值调用每个适用的函数。 我不明白这如何需要任何异国情调的架构。

Just create a function to enforce each rule, and call every applicable function for each value. I don't see how this requires any exotic architecture.

风和你 2024-07-13 17:20:13

每条规则一个类? 真的吗?也许我不理解这些规则的数量或复杂性,但我会(半伪代码):

public class ALine {
    private int col1;
    private int col2;
    private int coln;
    // ...

    public ALine(string line) {
         // read row into private variables
         // ...

         this.Process();
         this.Insert();
    }

    public void Process() {
         // do all your rules here working with the local variables
    }

    public void Insert() {
        // write to DB
    }
}

foreach line in csv
    new ALine(line);

A class for each rule? Really? Perhaps I'm not understanding the quantity or complexity of these rules, but I would (semi-pseudo-code):

public class ALine {
    private int col1;
    private int col2;
    private int coln;
    // ...

    public ALine(string line) {
         // read row into private variables
         // ...

         this.Process();
         this.Insert();
    }

    public void Process() {
         // do all your rules here working with the local variables
    }

    public void Insert() {
        // write to DB
    }
}

foreach line in csv
    new ALine(line);
心舞飞扬 2024-07-13 17:20:13

您为每个规则使用类的方法听起来确实有点重,但它的优点是在新规则出现时易于修改和扩展。

至于加载数据,批量加载是要走的路。 我读过一些信息,表明它可能比使用插入语句加载快 3 个数量级。 您可以在此处找到一些相关信息

Your methodology of using classes for each rule does sound a bit heavy weight but it has the advantage of being easy to modify and expand should new rules come along.

As for loading the data bulk loading is the way to go. I have read some informaiton which suggests it may be as much as 3 orders of magnitude faster than loading using insert statements. You can find some information on it here

大海や 2024-07-13 17:20:13

将数据批量加载到临时表中,然后使用 sql 应用您的规则。
使用临时表作为插入实际表的基础。
删除临时表。

Bulk load the data into a temp table, then use sql to apply your rules.
use the temp table, as a basis for the insert into real table.
drop the temp table.

笙痞 2024-07-13 17:20:13

你可以看到,所有不同的答案都来自他们自己的经验和观点。

由于我们对系统的复杂性和行数了解不多,因此我们倾向于根据之前所做的事情提供建议。

如果您想将实施范围缩小到 1/2 解决方案,请尝试提供更多详细信息。

祝你好运

you can see that all the different answers are coming from their own experience and perspective.

Since we don't know much about the complexity and number of rows in your system, we tend to give advice based on what we have done earlier.

If you want to narrow down to a 1/2 solutions for your implementation, try giving more details.

Good luck

走过海棠暮 2024-07-13 17:20:13

这可能不是您想听到的,无论如何这都不是“有趣的方式”,但是有一种更简单的方法可以做到这一点。

只要您的数据是逐行评估的...您可以在 Excel 文件中设置另一个工作表,并使用电子表格样式函数进行必要的转换,引用原始数据表中的数据。 对于更复杂的函数,您可以使用 Excel 中嵌入的 vba 来编写自定义操作。

我已经多次使用这种方法并且效果非常好; 只是不太性感。

It may not be what you want to hear, it isn't the "fun way" by any means, but there is a much easier way to do this.

So long as your data is evaluated line by line... you can setup another worksheet in your excel file and use spreadsheet style functions to do the necessary transforms, referencing the data from the raw data sheet. For more complex functions you can use the vba embedded in excel to write out custom operations.

I've used this approach many times and it works really well; its just not very sexy.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文