How can I quickly compare a set of booleans against many other sets of booleans (order matters)?
I'm running into a problem with a project I'm working on in my spare time. I'm using Google App Engine (Java version), but this question is not specific to that platform, and I would consider other languages/platforms if they could solve the problem.
The following illustrates the problem:
Suppose I have a datastore with thousands of recipes, and the ingredients for each recipe. (For the sake of this illustration, forget about measurements.) I want to be able to enter a list of ingredients that I have on hand, and then quickly retrieve all recipes for which I have at least XX% of the ingredients (let's say 75%). I'm willing to sacrifice some accuracy and some results for speed, but do want a certain degree of accuracy. I can do a more thorough comparison after I get the "quick results."
My attempt at a solution: Analyzing the database of recipes, I compile a list of, say, 200 common food ingredients (eggs, flour, salt, sugar, rosemary, etc). Almost all the ingredients for the recipes are contained within this master list:
Common Food Ingredients: [ eggs , flour , salt , sugar , cinnamon ... ]
Then, I go through each individual recipe and compare the ingredients to this master list, and end up with a set of 200 booleans for each recipe:
Recipe #106: [ T , T , F , T , F ... ]
Recipe #107: [ F , T , T , T , F ... ]
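The encoding step described above could be sketched in Java roughly as follows. This is only an illustration: the `RecipeEncoder` class, the `encode` helper, and the five-item master list are hypothetical stand-ins for the real 200-item list, and the result is held in a `java.util.BitSet` rather than a boolean array.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class RecipeEncoder {
    // Hypothetical master list; the real one would hold about 200 entries.
    static final List<String> MASTER =
            Arrays.asList("eggs", "flour", "salt", "sugar", "cinnamon");

    // Sets bit i when the i-th master ingredient appears in the given list.
    static BitSet encode(List<String> ingredients) {
        BitSet bits = new BitSet(MASTER.size());
        for (String name : ingredients) {
            int i = MASTER.indexOf(name);
            if (i >= 0) {
                bits.set(i); // ingredients outside the master list are ignored
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Recipe #106 from the question: [ T , T , F , T , F ... ]
        BitSet recipe106 = encode(Arrays.asList("eggs", "flour", "sugar"));
        System.out.println(recipe106); // prints {0, 1, 3}
    }
}
```

The same `encode` call would be used for the on-hand ingredient list, so both sides share one bit position per ingredient.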
I would store this information with the recipes. (Up to this point, it's all data prep work, which I have all the time in the world to do.)
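For storing the flags alongside each recipe, one option is to pack them into a byte array, which for 200 flags takes at most 25 bytes. The sketch below assumes the flags are kept in a `java.util.BitSet`; the `RecipeBitsStorage` class and its helper names are made up for illustration, and how the bytes are actually persisted depends on the datastore.

```java
import java.util.BitSet;

class RecipeBitsStorage {
    // Packs the flags into a compact byte array suitable for a binary column or property.
    static byte[] toBytes(BitSet bits) {
        return bits.toByteArray();
    }

    // Rebuilds the flags from the stored bytes.
    static BitSet fromBytes(byte[] stored) {
        return BitSet.valueOf(stored);
    }

    public static void main(String[] args) {
        // Recipe #107 from the question: [ F , T , T , T , F ... ]
        BitSet recipe107 = new BitSet(200);
        recipe107.set(1);
        recipe107.set(2);
        recipe107.set(3);

        byte[] stored = toBytes(recipe107);
        BitSet restored = fromBytes(stored);
        System.out.println(restored.equals(recipe107)); // prints true
    }
}
```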
Now, I enter my list of ingredients on hand. I would do the same comparison with the master list:
My ingredients on hand: [ F , F , T , T , F ... ]
And this is where I'm stuck. How can I quickly compare this set of booleans against the sets for the recipes so I can identify recipes for which I have at least 75% of the ingredients?
Or (and this would be the holy grail), during the data preparation, instead of storing the set of booleans themselves with each recipe, is there a calculation I can perform that will give me a single value I can later filter on? (E.g., "SELECT * FROM recipes WHERE master_list_boolean_metric <= 29")
Or am I going about this the wrong way? (Any guidance, general or specific, would be appreciated.) What I want to avoid is doing a slow comparison, ingredient by ingredient, between each recipe and my list of "on-hand" ingredients.
Or... perhaps it isn't possible to do this quickly?
Comments (1)
Use a BitSet.
Store each ingredient as one bit, AND each recipe's bits with the bits for the ingredients you have, and then filter on the cardinality() of the result relative to the recipe's own cardinality().
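A minimal sketch of that suggestion, assuming every recipe and the on-hand list have already been encoded as `java.util.BitSet` values over the same master list. The `RecipeMatcher` class, its helper methods, and the sample on-hand data are hypothetical; treating recipes with no master-list ingredients as non-matches is also an assumption.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecipeMatcher {
    // True when at least `percent` % of the recipe's ingredients are on hand.
    static boolean matches(BitSet recipe, BitSet onHand, int percent) {
        int needed = recipe.cardinality();
        if (needed == 0) {
            return false; // assumption: skip recipes with no master-list ingredients
        }
        BitSet common = (BitSet) recipe.clone(); // and() mutates, so work on a copy
        common.and(onHand);
        return common.cardinality() * 100 >= needed * percent;
    }

    // Returns the names of all recipes that pass the threshold.
    static List<String> filter(Map<String, BitSet> recipes, BitSet onHand, int percent) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : recipes.entrySet()) {
            if (matches(e.getValue(), onHand, percent)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Small helper for building sample bit sets.
    static BitSet bits(int... indexes) {
        BitSet b = new BitSet();
        for (int i : indexes) {
            b.set(i);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> recipes = new LinkedHashMap<>();
        recipes.put("Recipe #106", bits(0, 1, 3)); // eggs, flour, sugar
        recipes.put("Recipe #107", bits(1, 2, 3)); // flour, salt, sugar
        BitSet onHand = bits(1, 2, 3);             // hypothetical: flour, salt, sugar
        System.out.println(filter(recipes, onHand, 75)); // prints [Recipe #107]
    }
}
```

This still visits every recipe, but each comparison is a handful of 64-bit word operations (four words for 200 ingredients), so scanning a few thousand recipes in memory should be cheap. Storing each recipe's cardinality during data prep would avoid recomputing `needed` on every query.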