如果名称模糊,如何确保正确的列链接?在Python
我有400k记录的.txt文件 - 从收据中读取OCR。我专注于2列:store_id and address_store(附加图片中的表)。在现实世界中,每个store_id都应链接到单个商店地址,但是有一些OCR错误。
我注意到的是:
- 大多数ID已正确链接;
- ID链接中有3种类型的错误(Mispell,999和空白)
- 另外的地址是模糊的名称。 表
在这里,哪种算法/模型是什么样的算法? 不幸的是,我没有任何正确名称的dcitionary。
如果我使用了错误的术语,请纠正我。
I have .txt file of 400k records - read in OCR from receipts. I focus on 2 columns: store_id and address_store (table in attached pic). Inthe real world each store_id should be linked to a single store address, but there were some OCR errors.
What, I have noticed:
- most ids are correctly linked;
- there are 3 types of errors in id linkin (mispell,999, and blank)
- additionally addresses are fuzzy names.
table
What kind of algorithm / model whould be the best solution here?
Unfortunately, I dont have any dcitionary for correct names.
If I used wrong terminology, please correct me.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我认为没有一个简单的答案可以让您清理这些数据,而且我想不出一种没有迭代性的方法。鉴于您提供的桌子,我假设所有其中的所有8个都应
我也假设重要的信息是store_id-然后将其与其他事物匹配 - 所以关键的结果是将每个记录映射到正确的store_id。
鉴于这些假设是正确的,我的方法是迭代的,老实说,在SQL(我使用数十年的经验)而不是Python(我只使用了几年)中,我会发现许多这些技术很容易。但这可能是熟悉的函数。
我绝对要做的就是扫描Whitespace Anomolies,特别是如果OCR是您的数据源。几乎可以肯定,您会拥有双空间,宽阔的空间等(即2000年至2016年之间的所有Unicode角色)将这些全部转换为ASCII 32。同样,带有破折号和下列。 OCR试图保持准确,但为了进行此练习,这只是噪音。
从那里:
如果OCR来源的质量相当高,现在绘制了大约90%至99%以上的OCR来源。在您的数据集上,仅在“ 1个香蕉干草,天堂”和“ 1 Bananaway Heaven”上不匹配。
如果超过99%,那么您可能会发现剩余的是手动检查(乏味,但可能比大多数代码方法更快)完成。
如果它接近90%,则需要进行以下一项或多项:
在此阶段重复上述内容但是,如果您仍然不落在可以通过检查的列表中,则需要查看一个语音算法(Soundex https://en.wikipedia.org/wiki/Soundex) is widely implemented
从技术角度来看(在Python中):
txt.replace()
andtxt.translate()
是解决Whitespace问题lst.Sort(key) 的强大工具= store_addr)
要允许您从上面的步骤3进行顺序您可以将OCR链接到存储号码,但很难检查。
I don't think there is a simple answer that will allow you to clean up this data, and I can't think of a method that is not somewhat iterative. Given the table you have provided I am assuming all 8 of these should be
I am also assuming the important information is the store_id - that will then be matched to something else - so the key outcome is to map each record to the correct store_id.
Given those assumptions are correct my approach would be iterative, and to be honest a number of these techniques I would find easier in SQL (which I have decades experience using) rather than Python (that I have only been using for a few years) - but that may be a function of familiarity.
The absolute first thing I would do is scan for whitespace anomolies, particularly if OCR is your datasource. You will almost certainly have double spaces, wide spaces etc (i.e. all of the unicode characters between 2000 and 2016) convert these all to ascii 32. Likewise with dashes and underscores. OCR is trying to be accurate but for this exercise it is just noise.
From there:
If the OCR source for this is of fairly high quality, you should find between about 90% and 99%+ are now mapped. On your dataset only '1 Banana Hay, Heaven' and '1 Bananaway Heaven' are not matched.
If it is above 99% then you will probably find the remainder are best done by manual inspection (tedious, but probably quicker than most of the code approaches).
If it is closer to 90% you need to do one or more of the following:
By this stage your list above would be clean, however if you are still not down to a list where it is manageable by inspection, you are needing to look at one of the phonetic algoritms (Soundex https://en.wikipedia.org/wiki/Soundex) is widely implemented Is there a soundex function for python? although there are other more sophisticated ones.) My experience is using Soundex before you are to this stage is probably going to be of limited help - Soundex does a good job finding Lochlan versus Lachlan (which sound basically the same), but is not good where you have OCR where it is the written characters that vary.
From a technical perspective (in Python):
txt.replace()
andtxt.translate()
are powerful tools to fix the whitespace problemlst.sort(key=store_addr)
to allow you to sequenceYou could link OCR to store number but this will be harder to inspect.