可定制字符串过滤器的设计

发布于 2024-11-07 01:21:10 字数 1803 浏览 1 评论 0原文

假设我在 my_dir/my_subdir 中有大量文件名,并以某种方式格式化:

data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00

例如 data11_7TeVdata_type00179691 运行编号,NTUP_PHOTON 数据格式。

我想编写一个接口来做这样的事情:

dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");                     
// don't to the filtering, be lazy
cout << dataset.count();                          // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS");         // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();

我想要惰性行为,例如在 countget_list 等任何操作之前不需要进行真正的过滤。如果过滤已经完成,我不想重做。 我刚刚学习一些有关设计模式的知识,我想我可以使用:

  • 一个抽象基类 AbstractFilter ,它实现 filter* 方法
  • factory 来从每次调用 filterdecorator 使用的被调用方法决定
  • *方法我返回一个装饰类,例如:

AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg);  // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}

  • proxy 构建一个正则表达式来过滤文件名,但不这样做过滤

我也在学习 jQuery 并且我正在使用类似的链接机制。

有人可以给我一些提示吗?有没有地方可以解释这样的设计?设计必须非常灵活,特别是要处理文件名中的新格式。

Suppose I've tons of filenames in my_dir/my_subdir, formatted in a some way:

data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00

For example data11_7TeV is the data_type, 00179691 the run number, NTUP_PHOTON the data format.

I want to write an interface to do something like this:

dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");                     
// don't to the filtering, be lazy
cout << dataset.count();                          // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS");         // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();

I want lazy behaviour, for example no real filtering has to be done before any action like count or get_list. I don't want to redo the filtering if it is already done.
I'm just learning something about design pattern, and I think I can use:

  • an abstract base class AbstractFilter that implement filter* methods
  • factory to decide from the called method which decorator use
  • every time I call a filter* method I return a decorated class, for example:

AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg);  // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}

  • proxy that build a regex to filter the filenames, but don't do the filtering

I'm also learning jQuery and I'm using a similar chaining mechanism.

Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular to handle new format in the filenames.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

简单气质女生网名 2024-11-14 01:21:10

我相信您使设计模式方面过于复杂化并掩盖了潜在的匹配/索引问题。从磁盘获取完整的目录列表预计比在 RAM 中过滤它返回的文件名要昂贵几个数量级,并且前者需要在执行 count() 之前完成或任何数据集上的get_list()(尽管您可以对数据集进行一些更惰性的迭代器操作)。

如前所述,真正的功能挑战可能在于对文件名建立索引,以便您可以快速重复地找到匹配项。但是,即使这也不太可能,因为您可能从获取文件名数据集到实际打开这些文件,这又慢了几个数量级。因此,索引的优化可能不会对整个程序的性能产生任何明显的影响。

但是,假设您将所有匹配的目录条目读取到数组 A 中。

现在,对于过滤,似乎您的要求通常可以使用 std::multimap find()lower_bound()upper_bound()。最通用的处理方法是为数据类型、运行编号、数据格式、p 值、m 值、tid 提供单独的多重映射等等,映射到 A 中的索引列表。然后,您可以使用现有的 STL 算法来查找各个过滤器结果所共有的索引。

如果您碰巧对您的数据和过滤需求有未明确的见解/限制(这很可能),则可以进行很多优化。例如:

  • 如果您知道某个特定的过滤器将始终被使用,并立即将潜在匹配项减少到可管理的数量(例如 <~100),那么您可以首先使用它并诉诸强力搜索以进行后续过滤。

另一种可能性是将各个文件名的属性提取到结构中: std::string data_type; std::vector; p; 等,然后编写一个支持诸如“p include 924 and data_type == 'XYZ'”之类的谓词的表达式求值器,尽管它本身适合强力比较,而不是更快的基于索引的匹配。

我知道您说过您不想使用外部库,但是如果您的需求确实属于更复杂的范围,那么内存数据库和类似 SQL 的查询功能可能会为您省去很多麻烦。

I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).

As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But, even that's unlikely as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact to your overall program's performance.

But, lets say you read all the matching directory entries into an array A.

Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.

There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:

  • if you know a particular filter will always be used, and immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could use it first and resort to brute force searches for subsequent filtering.

Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.

I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.

白况 2024-11-14 01:21:10

我会使用策略模式。您的 DataManager 正在构造一个 DataSet 类型,并且该 DataSet 分配了一个 FilteringPolicy。默认值可以是 NullFilteringPolicy,这意味着没有过滤器。如果调用DataSet成员函数filter_type(string t),它就会用一个新的过滤策略类替换掉过滤策略类。新的可以通过 filter_type 参数在工厂构建。像filter_run()这样的方法可以用来在FilterPolicy上添加过滤条件。在 NullFilterPolicy 情况下,它只是无操作。这对我来说似乎很简单,我希望这会有所帮助。

编辑:
要解决方法链接问题,您只需 return *this; 即可。例如,返回对 DataSet 类的引用。这意味着您可以将 DataSet 方法链接在一起。这就是当你实现operator>>时c++ iostream库所做的事情。或运算符<<。

I would use a strategy pattern. Your DataManager is constructing a DataSet type, and the DataSet has a FilteringPolicy assigned. The default can be a NullFilteringPolicy which means no filters. If the DataSet member function filter_type(string t) is called, it swaps out the filter policy class with a new one. The new one can be factory constructed via the filter_type param. Methods like filter_run() can be used to add filtering conditions onto the FilterPolicy. In the NullFilterPolicy case it's just no-ops. This seems straghtforward to me, I hope this helps.

EDIT:
To address the method chaining you simply need to return *this; e.g. return a reference to the DataSet class. This means you can chain DataSet methods together. It's what the c++ iostream libraries do when you implement operator>> or operator<<.

稍尽春風 2024-11-14 01:21:10

首先,我认为您的设计非常聪明,并且非常适合您尝试建模的行为。

无论如何,我的理解是,您正在尝试构建一种“领域特定语言”,通过该语言您可以链接表示动作的“动词”(各种过滤方法),或连接“实体”(其中可变性由不同的表示)可能存在的命名格式,尽管您没有对此说什么)。

在这方面,Martin Flowler 的书“领域特定语言”中有一个非常有趣的讨论。只是为了让您了解一下它的含义,在这里您可以找到有关“方法链”模式的有趣讨论,定义为:

“使修饰符方法返回宿主对象,以便可以在单个表达式中调用多个修饰符。”

正如您所看到的,此模式描述了您在设计中放置的链接机制。

这里列出了在定义此类 DSL 时发现的所有有趣模式。同样,您会很容易地发现有几种专门的模式,您也在设计中暗示它们或将其描述为更通用的模式(如装饰器)。其中包括:正则表达式表词法分析器、方法链接、表达式生成器等。还有更多可以帮助您进一步指定设计的工具。

总而言之,我可以说我在您的规范中看到了“命令处理器”模式的位置,但我非常有信心通过部署 Fowler 提出的强大抽象,您将能够提出具有更加具体和精确的设计,涵盖了目前被 GoF 模式集的“通用性”隐藏的问题的各个方面。

确实,对于您所描述的问题来说,这可能是“矫枉过正”,但作为面向模式设计的练习,它可能非常有洞察力。

First of all, I think that your design is pretty smart and lends itself well to the kind of behavior you are trying to model.

Anyway, my understanding is that you are trying and building a sort of "Domain Specific Language", whereby you can chain "verbs" (the various filtering methods) representing actions on, or connecting "entities" (where the variability is represented by different naming formats that could exist, although you do not say anything about this).

In this respect, a very interesting discussion is found in Martin Flowler's book "Domain Specific Languages". Just to give you a taste of what it is about, here you can find an interesting discussion about the "Method Chaining" pattern, defined as:

“Make modifier methods return the host object so that multiple modifiers can be invoked in a single expression.”

As you can see, this pattern describes the very chaining mechanism you are positing in your design.

Here you have a list of all the patterns that were found interesting in defining such DSLs. Again, you will be easily find there several specialized patterns that you are also implying in your design or describing as way of more generic patterns (like the decorator). A few of them are: Regex Table Lexer, Method Chaining, Expression Builder, etc. And many more that could help you further specify your design.

All in all, I could add my grain of salt by saying that I see a place for a "command processor" pattern in your specificaiton, but I am pretty confident that by deploying the powerful abstractions that Fowler proposes you will be able to come up with a much more specific and precise design, covering aspect of the problem that right now are simply hidden by the "generality" of the GoF pattern set.

It is true that this could be "overkill" for a problem like the one you are describing, but as an exercise in pattern oriented design it can be very insightful.

笑着哭最痛 2024-11-14 01:21:10

我建议从 boost 迭代器库开始 - 例如 过滤器迭代器

(当然,boost 包含一个非常好的正则表达式库。)

I'd suggest starting with the boost iterator library - eg the filter iterator.

(And, of course, boost includes a very nice regex library.)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文