AI multi-label classification for recognizing individual products

Posted on 2025-02-07 09:22:58

I'm working on an AI project for recognizing text in PDF documents. I want to label examples to train the AI model, but I am at a crossroads and don't know which method to choose. Here is some background on the use case.

Each PDF document consists of multiple pages; from now on I will call it a packet. The individual pages represent the products contained in the packet. The layout of a given product is always the same, but the labels (how and which data is saved in the source system) can differ a lot between products. Example: the products house, car, motor, scooter, and boat can all exist in one packet. The information that needs to be saved differs per product: a license number for car, motor, and scooter, but the area in m2 for house.

There are over 350 different products, so there are too many possible combinations to cover. For this project I only want to recognize 7 different products. Is it better to label the packets as a whole and train the model on that, or is it better to first split the packet into individual products and then pass each product to its corresponding model?

  • A = Don't split the packet into individual products. Train one model on the whole packet.
  • B = Split the packet into individual products. Each product gets its own model.

There is an image to help clarify the text above:

Option A or Option B visualization


小鸟爱天空丶 2025-02-14 09:22:58

I would approach this problem differently.

I assume that pages for the same product can be parsed in the same way. For example, car pages always have the registration year in the same spot, whether that is after some keyword or at fixed (x, y) coordinates.

First, write the parsing rules for each product page type to extract the information you need. There are libraries for extracting text from PDFs, including for Python.
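As a minimal sketch of what such per-product parsing rules could look like, assume the page text has already been extracted from the PDF into a plain string. The product names, field names, and keywords (`License number:`, `Surface:`) below are hypothetical and only serve as an illustration; real pages would need their own patterns.

```python
import re

# Hypothetical rules: product -> field name -> regex with one capture group.
# The keywords are assumptions for illustration, not taken from real pages.
PARSING_RULES = {
    "car": {
        "license_number": re.compile(r"License number:\s*([A-Z0-9-]+)"),
    },
    "house": {
        "surface_m2": re.compile(r"Surface:\s*(\d+(?:\.\d+)?)\s*m2"),
    },
}


def parse_page(product, page_text):
    """Apply the rules of one product type to the text of one page."""
    fields = {}
    for field_name, pattern in PARSING_RULES[product].items():
        match = pattern.search(page_text)
        # Fields that are not found are stored as None so gaps are visible.
        fields[field_name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    print(parse_page("car", "Product: car\nLicense number: AB-123-C\n"))
    print(parse_page("house", "Product: house\nSurface: 120 m2\n"))
```

Keeping the rules in a dictionary means that supporting another product only requires adding an entry, not another function.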

Then split the packets into individual pages and train a single machine learning model to classify each page: "what product is it?".

The full pipeline will look like this:

  1. Split the packet into pages.
  2. Classify each product page into its category.
  3. Apply the corresponding parser.
  4. Combine the results back together (if that's what you intend).
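The four steps can be sketched end to end as follows. This is an illustration under stated assumptions, not a finished implementation: it assumes the packet text is already extracted with pages separated by form-feed characters, it uses a keyword lookup as a stand-in for the trained classifier, and all function names, product names, and keywords are hypothetical.

```python
import re


def first_match(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def split_packet(packet_text):
    """Step 1: split a packet into pages (assumes form-feed page breaks)."""
    return [page.strip() for page in packet_text.split("\f") if page.strip()]


def classify_page(page_text):
    """Step 2: placeholder classifier; a trained model would go here."""
    lowered = page_text.lower()
    for product in ("car", "house"):
        if f"product: {product}" in lowered:
            return product
    return "unknown"


# Step 3: one hypothetical parser per product category.
PARSERS = {
    "car": lambda text: {
        "license_number": first_match(r"License number:\s*(\S+)", text)
    },
    "house": lambda text: {
        "surface_m2": first_match(r"Surface:\s*(\d+)", text)
    },
}


def process_packet(packet_text):
    """Step 4: run every page through steps 1-3 and combine the results."""
    results = []
    for page_number, page in enumerate(split_packet(packet_text), start=1):
        product = classify_page(page)
        parser = PARSERS.get(product)
        fields = parser(page) if parser else {}
        results.append(
            {"page": page_number, "product": product, "fields": fields}
        )
    return results


if __name__ == "__main__":
    packet = (
        "Product: car\nLicense number: AB-123-C"
        "\fProduct: house\nSurface: 120 m2"
    )
    for row in process_packet(packet):
        print(row)
```

Pages whose category has no parser are kept in the output with empty fields, so the combined result still accounts for every page of the packet.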


For the classifier, I would choose either something simple, such as a decision tree or random forest on keywords, or something more complex, such as a text-based neural network.
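To make the "simple model on keywords" idea concrete, here is a dependency-free sketch. Note that it swaps in a small multinomial naive Bayes over word counts rather than the decision tree or random forest mentioned above, purely to keep the example self-contained; the class name `KeywordNaiveBayes` and the training texts are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class KeywordNaiveBayes:
    """Tiny multinomial naive Bayes over word counts (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.page_counts = Counter()             # label -> number of pages
        self.vocabulary = set()

    def fit(self, pages, labels):
        for text, label in zip(pages, labels):
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.page_counts[label] += 1
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_pages = sum(self.page_counts.values())
        vocab_size = len(self.vocabulary)
        known_words = [w for w in tokenize(text) if w in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, n_pages in self.page_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(n_pages / total_pages)
            for word in known_words:
                score += math.log(
                    (counts[word] + 1) / (total_words + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = KeywordNaiveBayes().fit(
        [
            "Product car license number AB-123-C registration year 2019",
            "Product scooter license number XY-9 helmet included",
            "Product house surface 120 m2 address Main Street",
        ],
        ["car", "scooter", "house"],
    )
    print(model.predict("license number ZZ-1 registration year 2020"))
```

In practice the model would be trained on many labeled pages per product, and pages from the roughly 350 products outside the 7 of interest would need their own "other" class or a confidence threshold.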
