Design of a customizable string filter

Posted on 2024-11-07 01:21:10

Suppose I have tons of filenames in my_dir/my_subdir, formatted in some way:

data11_7TeV.00179691.physics_Egamma.merge.NTUP_PHOTON.f360_m796_p541_tid319627_00
data11_7TeV.00180400.physics_Egamma.merge.NTUP_PHOTON.f369_m812_p541_tid334757_00
data11_7TeV.00178109.physics_Egamma.merge.D2AOD_DIPHO.f351_m765_p539_p540_tid312017_00

For example, data11_7TeV is the data_type, 00179691 is the run number, and NTUP_PHOTON is the data format.

I want to write an interface to do something like this:

dataset = DataManager("my_dir/my_subdir").filter_type("data11_7TeV").filter_run("> 00179691").filter_tag("m = 796");
// don't do the filtering yet; be lazy
cout << dataset.count();                          // count is an action, do the filtering
vector<string> dataset_list = dataset.get_list(); // don't repeat the filtering
dataset.save_filter("file.txt", "ALIAS");         // save the filter (not the filenames), for example save the regex
dataset2 = DataManagerAlias("file.txt", "ALIAS"); // get the saved filter
cout << dataset2.filter_tag("p = 123").count();

I want lazy behaviour: no real filtering should happen before an action such as count or get_list, and I don't want to redo the filtering if it has already been done.
I'm just learning about design patterns, and I think I can use:

an abstract base class AbstractFilter that implements the filter* methods
a factory that decides, from the called method, which decorator to use
every time I call a filter* method I return a decorated class, for example:

AbstractFilter::filter_run(string arg) {
    decorator = factory.get_decorator_run(arg);  // if arg is "> 00179691" returns FilterRunGreater(00179691)
    return decorator(this);
}

a proxy that builds a regex to filter the filenames, but doesn't do the filtering

I'm also learning jQuery and I'm using a similar chaining mechanism.

Can someone give me some hints? Is there some place where a design like this is explained? The design must be very flexible, in particular so that it can handle new formats in the filenames.

简单气质女生网名 2024-11-14 01:21:10

I believe you're over-complicating the design-pattern aspect and glossing over the underlying matching/indexing issues. Getting the full directory listing from disk can be expected to be orders of magnitude more expensive than the in-RAM filtering of filenames it returns, and the former needs to have completed before you can do a count() or get_list() on any dataset (though you could come up with some lazier iterator operations over the dataset).

As presented, the real functional challenge could be in indexing the filenames so you can repeatedly find the matches quickly. But even that's unlikely, as you presumably proceed from getting the dataset of filenames to actually opening those files, which is again orders of magnitude slower. So, optimisation of the indexing may not make any appreciable impact on your overall program's performance.

But, let's say you read all the matching directory entries into an array A.

Now, for filtering, it seems your requirements can generally be met using std::multimap find(), lower_bound() and upper_bound(). The most general way to approach it is to have separate multimaps for data type, run number, data format, p value, m value, tid etc. that map to a list of indices in A. You can then use existing STL algorithms to find the indices that are common to the results of your individual filters.
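
The multimap idea above can be sketched roughly as follows. This is only an illustration under assumptions: the Index alias and the helper names greater_than, equal_to and intersect are hypothetical, the keys are assumed to be integers already parsed from the filenames, and the stored values are positions in A.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical index type: field value (e.g. run number) -> position in A.
using Index = std::multimap<int, std::size_t>;

// Sort positions and drop duplicates so they can be fed to set_intersection.
std::vector<std::size_t> sorted_unique(std::vector<std::size_t> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Positions whose key is strictly greater than `bound` (e.g. "run > N").
std::vector<std::size_t> greater_than(const Index& index, int bound) {
    std::vector<std::size_t> out;
    for (auto it = index.upper_bound(bound); it != index.end(); ++it)
        out.push_back(it->second);
    return sorted_unique(out);
}

// Positions whose key equals `key` (e.g. "m = 796").
std::vector<std::size_t> equal_to(const Index& index, int key) {
    std::vector<std::size_t> out;
    auto range = index.equal_range(key);
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return sorted_unique(out);
}

// Positions common to two sorted result lists.
std::vector<std::size_t> intersect(const std::vector<std::size_t>& a,
                                   const std::vector<std::size_t>& b) {
    std::vector<std::size_t> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

With one such index per field, a query like "run > 00179691 and m = 796" becomes intersect(greater_than(run_index, 179691), equal_to(m_index, 796)), and the resulting positions are looked up in A.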

There are a lot of optimisations possible if you happen to have unstated insights / restrictions re your data and filtering needs (which is very likely). For example:

If you know a particular filter will always be used, and that it immediately cuts the potential matches down to a manageable number (e.g. < ~100), then you could apply it first and resort to brute-force searches for subsequent filtering.

Another possibility is to extract properties of individual filenames into a structure: std::string data_type; std::vector<int> p; etc., then write an expression evaluator supporting predicates like "p includes 924 and data_type == 'XYZ'", though by itself that lends itself to brute-force comparisons rather than faster index-based matching.
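
A minimal sketch of the structure-plus-predicate approach might look like this. It assumes the six dot-separated fields seen in the example filenames, and the names FileInfo, parse_filename, filter and has_value are hypothetical. A plain C++ lambda stands in for the expression evaluator, which would have to be written separately.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Properties extracted from one filename (hypothetical layout).
struct FileInfo {
    std::string data_type;
    int run = 0;
    std::string format;
    std::vector<int> m;
    std::vector<int> p;
};

std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, sep)) parts.push_back(part);
    return parts;
}

bool is_digits(const std::string& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Returns false if the name does not have the assumed six-field layout.
bool parse_filename(const std::string& name, FileInfo& out) {
    const std::vector<std::string> fields = split(name, '.');
    if (fields.size() != 6 || !is_digits(fields[1])) return false;
    FileInfo info;
    info.data_type = fields[0];
    info.run = std::stoi(fields[1]);
    info.format = fields[4];
    // Tags look like f360_m796_p541_...; only the m and p tags are kept here.
    for (const std::string& tag : split(fields[5], '_')) {
        if (tag.size() < 2 || !is_digits(tag.substr(1))) continue;
        if (tag[0] == 'm') info.m.push_back(std::stoi(tag.substr(1)));
        if (tag[0] == 'p') info.p.push_back(std::stoi(tag.substr(1)));
    }
    out = info;
    return true;
}

bool has_value(const std::vector<int>& values, int x) {
    return std::find(values.begin(), values.end(), x) != values.end();
}

using Predicate = std::function<bool(const FileInfo&)>;

// Brute-force filtering: test every entry against the predicate.
std::vector<FileInfo> filter(const std::vector<FileInfo>& files,
                             const Predicate& keep) {
    std::vector<FileInfo> out;
    std::copy_if(files.begin(), files.end(), std::back_inserter(out), keep);
    return out;
}
```

A predicate such as "p includes 540 and data_type == 'data11_7TeV'" would then be written as a lambda calling has_value(f.p, 540) and comparing f.data_type, and passed to filter.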

I know you said you don't want to use external libraries, but an in-memory database and SQL-like query ability may save you a lot of grief if your needs really are at the more elaborate end of the spectrum.

白况 2024-11-14 01:21:10

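I would use the Strategy pattern. Your DataManager constructs a DataSet, and the DataSet has a FilteringPolicy assigned to it. The default can be a NullFilteringPolicy, which means no filters. If the DataSet member function filter_type(string t) is called, it swaps the current filtering policy for a new one, which a factory can build from the filter_type argument. Methods such as filter_run() can then add filtering conditions to the FilteringPolicy; in the NullFilteringPolicy case they are simply no-ops. This seems straightforward to me; I hope it helps.

EDIT:
To address method chaining, you simply return *this;, i.e. return a reference to the DataSet class. This means you can chain DataSet methods together. It is what the C++ iostream library does when you implement operator>> or operator<<.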
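
A rough sketch of this idea, combined with the lazy evaluation asked for in the question, could look like the following. Everything here is an assumption made for illustration: TypeFilteringPolicy, the cached_ flag and the passes() counter are hypothetical, and the policy is built directly rather than through a factory to keep the example short.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Strategy interface: decides whether a single filename is kept.
struct FilteringPolicy {
    virtual ~FilteringPolicy() = default;
    virtual bool accept(const std::string& name) const = 0;
};

// Default policy: no filtering, every name is kept.
struct NullFilteringPolicy : FilteringPolicy {
    bool accept(const std::string&) const override { return true; }
};

// Hypothetical policy: keeps names whose first dot-separated field is `type`.
class TypeFilteringPolicy : public FilteringPolicy {
public:
    explicit TypeFilteringPolicy(std::string type) : type_(std::move(type)) {}
    bool accept(const std::string& name) const override {
        return name.substr(0, name.find('.')) == type_;
    }
private:
    std::string type_;
};

class DataSet {
public:
    explicit DataSet(std::vector<std::string> names)
        : names_(std::move(names)),
          policy_(std::make_shared<NullFilteringPolicy>()) {}

    // Swaps the policy and returns *this so that calls can be chained.
    // No filtering happens here; the cached result is only invalidated.
    DataSet& filter_type(const std::string& type) {
        policy_ = std::make_shared<TypeFilteringPolicy>(type);
        cached_ = false;
        return *this;
    }

    // Actions: the first one after a policy change runs the filter once.
    std::size_t count() { return get_list().size(); }

    const std::vector<std::string>& get_list() {
        if (!cached_) {
            result_.clear();
            for (const std::string& name : names_)
                if (policy_->accept(name)) result_.push_back(name);
            cached_ = true;
            ++passes_;
        }
        return result_;
    }

    // Number of filtering passes actually performed (for illustration only).
    int passes() const { return passes_; }

private:
    std::vector<std::string> names_;
    std::shared_ptr<FilteringPolicy> policy_;
    std::vector<std::string> result_;
    bool cached_ = false;
    int passes_ = 0;
};
```

Calling count() and then get_list() on the same DataSet filters only once, because the second call reuses the cached result. Supporting several stacked conditions (filter_run, filter_tag) would require the policy to hold a list of conditions instead of being replaced outright.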
