Algorithms for programmatically classifying recipes
I'm interested in classifying recipes programmatically based on a statistical analysis of various properties of the recipe. In other words, I want to classify a recipe as Breakfast, Lunch, Dinner or Dessert without any user input.
The properties I have available are:
- The recipe title (such as chicken salad)
- The recipe description (arbitrary text describing the recipe)
- The cooking method (the steps involved in preparing this recipe)
- Prep and cook times
- Each ingredient in the recipe, and its amount
The good news is that I have a sample set of about 10,000 recipes that are already classified, and I can use these data to teach my algorithm. My idea is to look for patterns, such as whether the word "syrup" appears statistically more often in breakfast recipes, or whether any recipe that calls for more than 1 cup of sugar is 90% likely to be a dessert. I figure that if I analyze the recipe across several dimensions and then tweak the weights as appropriate, I can get something that's decently accurate.
What would be some good algorithms to investigate while approaching this problem? Would something like k-NN be helpful, or are there ones better suited to this task?
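To make the "syrup" idea concrete, here is a minimal sketch that measures, for each word, the fraction of recipes in each category that contain it. The function name `word_rates_by_category` and the `(text, category)` input shape are assumptions made for illustration, not part of any existing code.

```python
from collections import Counter, defaultdict

def word_rates_by_category(recipes):
    """recipes: iterable of (text, category) pairs.

    Returns {word: {category: fraction of that category's recipes
    whose text contains the word}}.
    """
    totals = Counter()                # number of recipes per category
    containing = defaultdict(Counter) # word -> category -> recipe count
    for text, category in recipes:
        totals[category] += 1
        # Use a set so a word repeated within one recipe counts once.
        for word in set(text.lower().split()):
            containing[word][category] += 1
    return {
        word: {c: counts[c] / totals[c] for c in totals}
        for word, counts in containing.items()
    }
```

A word whose rate is much higher in one category than in the others is a candidate feature; with 10,000 labeled recipes these rates should be reasonably stable for common words.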
Comments (3)
If I were to do it, I would try the approach suggested by LiKao. I would first focus on the ingredients. I would establish a dictionary of the words appearing in the Ingredients sections of the recipes, and clean up the list in a supervised way to remove non-ingredient terms such as quantities and units.
Then I would resort to Bayes' theorem: your database lets you compute the probability of finding Eggs in a Breakfast, in a Dinner, and so on; you would precompute those probabilities in advance. Then, given an unknown recipe containing both Eggs and Marmalade, you can compute the a posteriori probability that the meal is a Breakfast.
You can later enrich the model with other terms and/or take quantities into account (number of Eggs per person)...
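A minimal sketch of that idea, assuming each recipe has already been reduced to a set of cleaned ingredient names. It is a simplified naive Bayes that only scores the ingredients present in the recipe and uses add-one smoothing so unseen combinations do not get zero probability. The names `train_naive_bayes` and `posterior` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(recipes):
    """recipes: iterable of (ingredient_set, category) pairs."""
    category_counts = Counter()
    ingredient_counts = defaultdict(Counter)  # category -> ingredient -> count
    vocabulary = set()
    for ingredients, category in recipes:
        category_counts[category] += 1
        for ingredient in set(ingredients):
            ingredient_counts[category][ingredient] += 1
            vocabulary.add(ingredient)
    return category_counts, ingredient_counts, vocabulary

def posterior(ingredients, model):
    """Return {category: P(category | ingredients)} under the naive assumption."""
    category_counts, ingredient_counts, vocabulary = model
    total = sum(category_counts.values())
    present = set(ingredients) & vocabulary  # ignore never-seen ingredients
    log_scores = {}
    for category, n in category_counts.items():
        score = math.log(n / total)  # prior P(category)
        for ingredient in present:
            # Smoothed estimate of P(ingredient | category).
            p = (ingredient_counts[category][ingredient] + 1) / (n + 2)
            score += math.log(p)
        log_scores[category] = score
    # Normalize in log space to avoid underflow with many ingredients.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}
```

For example, after training on a handful of labeled recipes, `posterior({"eggs", "marmalade"}, model)` returns one probability per category, and the largest one is the predicted meal type.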
Try various well-known machine learning algorithms. I would suggest starting with a Bayesian classifier, since it is easy to implement and often works fairly well. If this does not work, then try something more complex, e.g. neural nets or SVMs.
The main problem will be deciding on a set of features as input to your method. For this you should look at which information is unique. For example, if you have a recipe titled "Chicken Salad", the "chicken" part will not be of much interest because it is also present in the ingredients and simpler to gather from there. So you should try to find a set of keywords that give new information (i.e. the "salad" part). Try to find a good set of keywords for this. This can probably be automated somehow, but you will more likely be better off doing it by hand, since it only needs to be done once.
The same goes for the description. Finding the correct set of features is always the hardest part for such a task.
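As a rough starting point for the keyword list, the title words that do not already occur in the ingredient list can be extracted automatically and then reviewed by hand. This is only a sketch; `tokenize` and `novel_title_words` are hypothetical helper names, and the tokenizer is deliberately naive (lowercase ASCII letters only, no stemming).

```python
import re

def tokenize(text):
    """Split text into lowercase alphabetic words, dropping digits and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def novel_title_words(title, ingredients):
    """Return title words that do not appear in any ingredient line."""
    ingredient_words = set()
    for ingredient in ingredients:
        ingredient_words.update(tokenize(ingredient))
    return [word for word in tokenize(title) if word not in ingredient_words]

print(novel_title_words(
    "Chicken Salad",
    ["2 chicken breasts", "1 cup mayonnaise", "celery"],
))  # prints ['salad']
```

Counting these leftover words across all 10,000 recipes gives a candidate list that is short enough to prune manually.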
Once you have your set of features, just train your algorithm on them and see how well it does. If you do not have much experience with machine learning, have a look at the different methods for correctly testing an ML algorithm (e.g. leave-N-out testing).
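One common way to do leave-N-out testing is k-fold cross-validation: split the labeled recipes into k groups, hold each group out in turn, train on the rest, and score the predictions on the held-out group. The sketch below assumes samples are `(features, label)` pairs; `k_fold_accuracy` and the majority-class baseline are illustrative names, and any classifier with matching `train`/`predict` functions can be plugged in.

```python
import random
from collections import Counter

def k_fold_accuracy(samples, train, predict, k=5, seed=0):
    """Estimate accuracy by holding out each of k folds exactly once."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    correct = 0
    for fold in range(k):
        held_out = shuffled[fold::k]
        training = [s for i, s in enumerate(shuffled) if i % k != fold]
        model = train(training)
        correct += sum(
            predict(model, features) == label for features, label in held_out
        )
    return correct / len(shuffled)

# Baseline: always predict the most common label in the training data.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]

def predict_majority(model, features):
    return model
```

Running the majority baseline first is useful because it shows the accuracy a classifier has to beat before it is learning anything from the features.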
I think a full NN is probably overkill for this. I would try classifying with a single-perceptron "network" for each type of meal (Breakfast, Dinner, etc.), letting it go over the input and adjust its weight vector. Every meaningful word found in the dataset can be an input to the network. I would expect that to be enough for your needs. I have used this method successfully to classify text before.
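A minimal sketch of that setup, assuming each recipe is represented as a set of words: one perceptron per category, trained one-vs-rest with the classic mistake-driven update. The names `train_perceptrons` and `classify`, the fixed epoch count, and the unit learning rate are illustrative choices, not something specified above.

```python
from collections import defaultdict

def train_perceptrons(samples, categories, epochs=10):
    """samples: list of (word_set, label). Trains one perceptron per category."""
    weights = {c: defaultdict(float) for c in categories}
    bias = {c: 0.0 for c in categories}
    for _ in range(epochs):
        for words, label in samples:
            for c in categories:
                activation = bias[c] + sum(weights[c][w] for w in words)
                predicted = activation > 0
                target = label == c
                if predicted != target:
                    # Move the weights toward the target on a mistake.
                    delta = 1.0 if target else -1.0
                    bias[c] += delta
                    for w in words:
                        weights[c][w] += delta
    return weights, bias

def classify(words, model):
    """Pick the category whose perceptron gives the highest activation."""
    weights, bias = model
    return max(
        weights,
        key=lambda c: bias[c] + sum(weights[c].get(w, 0.0) for w in words),
    )
```

Taking the highest activation across the per-category perceptrons resolves cases where several of them fire, or none do.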