Algorithm for classifying a list of products? Take 2

Posted on 2024-07-16 20:16:58


I asked a question similar to this one a couple of weeks ago, but I did not ask it correctly. So I am re-asking the question here with more details, and I would like to get a more AI-oriented answer.

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.

  1. Seagate Hard Drive 500Go
  2. Seagate Hard Drive 120Go for laptop
  3. Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
  4. New and shiny 500Go hard drive from Seagate
  5. Seagate Barracuda 7200.12
  6. Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail
  7. GE Spacemaker Laundry
  8. Mazda3 2010
  9. Mazda3 2009 2.3L

For a human being, hard drives 3 and 5 are the same. We could go a little bit further and suppose that products 1, 3, 4, and 5 are the same, and put products 2 and 6 in other categories.

In my previous question, someone suggested that I use feature extraction. It works very well when we have a small dataset of predefined descriptions (all hard drives), but what about all the other kinds of descriptions? I don't want to start writing regex-based feature extractors for every description my application could face; it doesn't scale. Is there any machine learning algorithm that could help me achieve this? The range of descriptions I can get is very wide: on one line it could be a fridge, and on the next, a hard drive. Should I take the neural-network path? What should my inputs be?
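For concreteness, whatever learning algorithm is chosen, the raw titles above first need some fixed representation. A minimal, hypothetical sketch (the tokenizer and function names here are made up for illustration, not a recommendation) would be a simple bag-of-words count per description:

```python
import re
from collections import Counter

def tokenize(description: str) -> list[str]:
    """Lowercase the title and split it into word-like tokens.

    Keeps digits and dots so model numbers like '7200.12' survive intact.
    """
    return re.findall(r"[a-z0-9.]+", description.lower())

def bag_of_words(description: str) -> Counter:
    """Token counts: one minimal candidate input for a classifier or clusterer."""
    return Counter(tokenize(description))

products = [
    "Seagate Hard Drive 500Go",
    "Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive",
]
for p in products:
    print(bag_of_words(p))
```

Each description becomes a sparse vector of word counts, regardless of whether it describes a fridge or a hard drive, which is exactly the property the question asks for.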

Thank you for the help!


Comments (2)

蘸点软妹酱 2024-07-23 20:16:58


I would look at some Bayesian classification methods. It would involve training the classifier to recognize particular words as indicating probability that a product belongs to one of your classes. For example, after being trained, it could recognize that if a product description has "Seagate" in it, there's a 99% chance that it's a hard drive, whereas if it has "Mazda" there's a 97% chance it's a car. A word like "new" probably would end up not contributing much to any classification, which is the way you want it to work.

The downside to this is that it typically requires a fairly large corpus of training data before it starts to work well, but you can set it up so that it continues to adjust its percentages while in production (if you notice that it has classified something incorrectly), and it will eventually become very effective.

Bayesian techniques have been used quite heavily recently in spam-filtering applications, so it might be good to do some reading on how they are used there.
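A minimal sketch of the idea in plain Python (the class, the Laplace smoothing, and the tiny three-example training set are illustrative assumptions; a real system would use a proper library and far more training data):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes over whitespace tokens, with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # class label -> word -> count
        self.class_counts = Counter()            # class label -> number of examples
        self.vocab = set()

    def train(self, description: str, label: str) -> None:
        words = description.lower().split()
        self.word_counts[label].update(words)
        self.class_counts[label] += 1
        self.vocab.update(words)

    def classify(self, description: str) -> str:
        words = description.lower().split()
        total_docs = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label in self.class_counts:
            # log prior: how common this class is overall
            score = math.log(self.class_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for w in words:
                # add-one smoothing so unseen words don't zero out the product
                score += math.log((self.word_counts[label][w] + 1) / (n + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best

nb = NaiveBayes()
nb.train("Seagate Hard Drive 500Go", "hard drive")
nb.train("Seagate Barracuda 7200.12", "hard drive")
nb.train("Mazda3 2010", "car")
print(nb.classify("New 500Go hard drive from Seagate"))  # prints "hard drive"
```

As the answer notes, words like "new" contribute roughly equally to every class and so wash out, while discriminative words like "seagate" or "mazda3" dominate the score.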

总以为 2024-07-23 20:16:58


You should look at both clustering and classification. Your categories seem open-ended and thus suggest that clustering may fit the problem better.
As for input representation, you can try your luck with extracting word and character n-grams. Your similarity measure may be the count of common n-grams, or something more sophisticated. You may need to label the resulting clusters manually.
