Analyzing, categorizing and indexing metadata
I have a large (~2.5M records) database of image metadata. Each record represents an image and has a unique ID, a description field, a comma-separated list of keywords (say 20-30 keywords per image), and some other fields. There's no real database schema, and I have no way of knowing which keywords exist in the database without iterating over every image and counting them. Also, the metadata comes from several different suppliers, who each have their own ideas about how to fill out the different fields.
There are some things I would like to do with this metadata, but since I'm totally new to this kind of algorithm I don't even know where to begin looking.
- Some of these images have certain usage restrictions on them (given as text), but each supplier phrases them differently, and there is no way to guarantee consistency. I'd like to have a simple test I could apply to an image that indicates whether that image is free from restrictions or not. It doesn't have to be perfect, just 'good enough'. I suspect I could use some kind of Bayesian filter for this, right? I could train the filter with a corpus of images that I know are either restricted or restriction-free, and then the filter would be able to make predictions for the rest of the images? Or are there better ways?
- I would also like to be able to index these images according to 'keyword likeness', so that if I have one image, I could quickly tell which other images it shares the most keywords with. Ideally, the algorithm would also take into account that some keywords are more significant than others and weight them differently. I don't even know where to start looking here, and would be very glad for any pointers :)
I'm working primarily in Java, but language choice is irrelevant here. I'm more interested in learning what approaches would be best for me to start reading up on. Thanks in advance :)
Comments (2)
You should definitely start by turning your 'list of keywords' field into a real tagging scheme. The easiest one is a table of tags plus a many-to-many relationship with the image table (that is, a third table where each record has a foreign key to an image and another foreign key to a keyword). That makes it really fast to find all images with a certain set of keywords.
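To make the three-table layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`image`, `keyword`, `image_keyword`), the column names, and the helpers `add_image` and `images_with_all` are all made up for illustration, and lower-casing keywords is an assumption about how you might want to normalize supplier data.

```python
import sqlite3

# Hypothetical schema: the answer only describes the shape (tags table plus link table).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE image_keyword (
    image_id INTEGER NOT NULL REFERENCES image(id),
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    PRIMARY KEY (image_id, keyword_id)
);
CREATE INDEX idx_image_keyword_keyword ON image_keyword(keyword_id);
""")

def add_image(image_id, description, keyword_field):
    """Split the comma-separated keyword field and store one link row per tag."""
    conn.execute("INSERT INTO image (id, description) VALUES (?, ?)",
                 (image_id, description))
    # Assumption: trimming and lower-casing is enough normalization for a first pass.
    names = {k.strip().lower() for k in keyword_field.split(",") if k.strip()}
    for name in names:
        conn.execute("INSERT OR IGNORE INTO keyword (name) VALUES (?)", (name,))
        (keyword_id,) = conn.execute(
            "SELECT id FROM keyword WHERE name = ?", (name,)).fetchone()
        conn.execute(
            "INSERT INTO image_keyword (image_id, keyword_id) VALUES (?, ?)",
            (image_id, keyword_id))

def images_with_all(keywords):
    """Return the ids of images tagged with every keyword in the given collection."""
    wanted = sorted({k.strip().lower() for k in keywords})
    placeholders = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"""SELECT ik.image_id FROM image_keyword ik
            JOIN keyword k ON k.id = ik.keyword_id
            WHERE k.name IN ({placeholders})
            GROUP BY ik.image_id
            HAVING COUNT(*) = ?
            ORDER BY ik.image_id""",
        (*wanted, len(wanted))).fetchall()
    return [r[0] for r in rows]

add_image(1, "Beach at dusk", "beach, sunset, Sea")
add_image(2, "Harbour", "sea, boat, sunset")
add_image(3, "Forest path", "forest, trees")
print(images_with_all(["sunset", "sea"]))
```

With the link table in place, a `GROUP BY keyword_id` count over `image_keyword` would also tell you which keywords exist and how often, without iterating over every image in application code.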
The Bayesian filter for detecting restriction phrasing is interesting. I'd say go for it, unless you're pressed for time. If that's the case, a few simple pattern matches should pick up 90-95% of cases or more, and the rest could be quickly finished by hand by a couple of operators.
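For the quick pattern-matching route, something along these lines might be enough to start with. The phrases below are invented examples, not real supplier wording, so you would have to collect the actual phrasings from your data. The function name `classify_restriction` and the three-way result are likewise just one possible design: anything the patterns cannot decide is returned as 'unknown' so it can go to a human.

```python
import re

# Illustrative phrases only; real supplier wording has to come from the data.
RESTRICTION_PATTERNS = [
    r"\beditorial use only\b",
    r"\bnot (?:available )?for commercial use\b",
    r"\bno (?:commercial|advertising) use\b",
    r"\brestrict(?:ed|ions?)\b",
    r"\bpermission required\b",
]
NO_RESTRICTION_PATTERNS = [
    r"\bno restrictions?\b",
    r"\bunrestricted\b",
    r"\bfree of restrictions?\b",
]
RESTRICTED_RE = re.compile("|".join(RESTRICTION_PATTERNS), re.IGNORECASE)
UNRESTRICTED_RE = re.compile("|".join(NO_RESTRICTION_PATTERNS), re.IGNORECASE)

def classify_restriction(text):
    """Return 'restricted', 'unrestricted', or 'unknown' (leave for manual review)."""
    text = (text or "").strip()
    if not text:
        return "unknown"  # assumption: an empty field tells us nothing either way
    cleared = UNRESTRICTED_RE.search(text) is not None
    # Remove phrases like "no restrictions" first so the word "restrictions"
    # inside them is not counted as a restriction.
    remainder = UNRESTRICTED_RE.sub(" ", text)
    restricted = RESTRICTED_RE.search(remainder) is not None
    if restricted and cleared:
        return "unknown"  # contradictory wording, better checked by hand
    if restricted:
        return "restricted"
    if cleared:
        return "unrestricted"
    return "unknown"

for sample in ["Editorial use only.", "No restrictions", "Contact agency"]:
    print(sample, "->", classify_restriction(sample))
```

Counting how many records come back as 'unknown' gives a rough idea of how much manual work is left, and of whether the Bayesian filter is worth the extra effort.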
(1) Looks like a classification problem with words in your text as features, and "Restricted" and "Not Restricted" as your labels. Bayesian filtering or any classification algorithm should do the trick.
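As a rough idea of what the Bayesian option involves, here is a small multinomial naive Bayes classifier with add-one smoothing, written from scratch so it has no dependencies. The class name `NaiveBayes`, the tokenizer, and the six training phrases are all invented for illustration; a real corpus would be hand-labeled restriction texts from your own database.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()               # label -> number of training texts
        self.word_counts = defaultdict(Counter)   # label -> word -> occurrences
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood of the known words, per label."""
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for word in words:
                count = self.word_counts[label][word]
                # Add-one (Laplace) smoothing so a missing word never gives log(0).
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Made-up training examples, standing in for a hand-labeled corpus.
nb = NaiveBayes()
for phrase in ["Editorial use only",
               "Not for commercial use without permission",
               "No advertising use, editorial only"]:
    nb.train(phrase, "restricted")
for phrase in ["No restrictions",
               "Free for any use",
               "Royalty free, unrestricted use"]:
    nb.train(phrase, "unrestricted")

print(nb.predict("For editorial use only, no commercial use"))
```

Since the result only has to be 'good enough', the gap between the two log scores could serve as a crude confidence value, with close calls sent to manual review.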
(2) Looks like a clustering problem. First you want to come up with a good similarity function that returns a similarity score for two images based on their keywords. Cosine similarity might be a good starting point, since you are comparing keywords. From there you can compute a similarity matrix and just remember a list of 'nearest neighbors' for each image in your dataset, or you can go further and use a clustering algorithm to come up with actual clusters of images.
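One way this could look in practice is sketched below: each image becomes a vector of keyword weights, and cosine similarity is computed only against images that share at least one keyword, using an inverted index. The IDF-style weighting (rarer keywords count more) is my own assumption about how to make some keywords 'more significant'; the function names `build_index` and `most_similar` and the sample data are made up.

```python
import math
from collections import Counter, defaultdict

def build_index(images):
    """images: dict of image_id -> set of keywords.
    Returns (unit-length weight vectors, inverted index keyword -> [(id, weight)])."""
    n = len(images)
    doc_freq = Counter(k for kws in images.values() for k in kws)
    # Assumed weighting: rare keywords get a higher weight than common ones.
    idf = {k: math.log(n / c) + 1.0 for k, c in doc_freq.items()}
    vectors = {}
    postings = defaultdict(list)
    for image_id, kws in images.items():
        norm = math.sqrt(sum(idf[k] ** 2 for k in kws))
        vec = {k: idf[k] / norm for k in kws} if norm else {}
        vectors[image_id] = vec
        for k, w in vec.items():
            postings[k].append((image_id, w))
    return vectors, postings

def most_similar(image_id, vectors, postings, top_n=5):
    """Cosine similarity against every image sharing at least one keyword."""
    scores = defaultdict(float)
    for k, w in vectors[image_id].items():
        for other_id, other_w in postings[k]:
            if other_id != image_id:
                scores[other_id] += w * other_w  # dot product of unit vectors
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

images = {
    1: {"beach", "sunset", "sea"},
    2: {"sea", "boat", "sunset", "harbour"},
    3: {"forest", "trees"},
    4: {"beach", "sea", "sunset", "sand"},
}
vectors, postings = build_index(images)
print(most_similar(1, vectors, postings))
```

Because only images that share a keyword are ever scored, this avoids building the full similarity matrix. Very common keywords still produce long posting lists, so at 2.5M records it may be worth capping or dropping them.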
Since you have so many records, you might want to skip computing the entire similarity matrix, and just compute clusters for a small, random sample of your dataset. You can then add the other data points to the appropriate clusters. If you want to preserve more similarity information you can look into soft clustering.
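A minimal sketch of the sample-then-assign idea follows. It swaps in two simplifications that are my assumptions, not part of the suggestion above: plain Jaccard overlap as the similarity function, and one-pass 'leader' clustering (the first image of each cluster acts as its representative) instead of a more elaborate algorithm. The threshold value and sample data are arbitrary.

```python
import random

def jaccard(a, b):
    """Shared keywords divided by all keywords of the two images."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_clusters(sample, threshold):
    """One-pass leader clustering: an image starts a new cluster
    unless it is similar enough to an existing leader."""
    leaders = []
    for kws in sample:
        if not any(jaccard(kws, leader) >= threshold for leader in leaders):
            leaders.append(kws)
    return leaders

def assign(kws, leaders):
    """Index of the most similar leader (ties, including all-zero, go to the lowest index)."""
    return max(range(len(leaders)), key=lambda i: jaccard(kws, leaders[i]))

images = {
    1: {"beach", "sea", "sunset"},
    2: {"beach", "sea", "sand"},
    3: {"forest", "trees", "path"},
    4: {"forest", "trees", "moss"},
    5: {"sea", "sunset", "boat"},
}
rng = random.Random(0)                       # fixed seed so the sketch is repeatable
sample_ids = rng.sample(sorted(images), 3)   # cluster only a small random sample
leaders = leader_clusters([images[i] for i in sample_ids], threshold=0.3)
# Every image, sampled or not, is then attached to its closest cluster.
clusters = {image_id: assign(kws, leaders) for image_id, kws in images.items()}
print(clusters)
```

The expensive part (comparing images with each other) only runs on the sample; each remaining image costs one comparison per cluster. An image that overlaps with no leader still lands in cluster 0 here, so a real version would probably want a 'no cluster' outcome.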
Hopefully that will get you started.