当前位置：文江博客话题详情

如果名称模糊，如何确保正确的列链接？在Python

发布于 2025-01-25 16:15:14 字数 400 浏览 5 评论 0原文

我有400k记录的.txt文件 - 从收据中读取OCR。我专注于2列：store_id and address_store（附加图片中的表）。在现实世界中，每个store_id都应链接到单个商店地址，但是有一些OCR错误。

我注意到的是：

大多数ID已正确链接；
ID链接中有3种类型的错误（Mispell，999和空白）
另外的地址是模糊的名称。表

在这里，哪种算法/模型是什么样的算法？不幸的是，我没有任何正确名称的dcitionary。

如果我使用了错误的术语，请纠正我。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小苏打饼 2025-02-01 16:15:14

我认为没有一个简单的答案可以让您清理这些数据，而且我想不出一种没有迭代性的方法。鉴于您提供的桌子，我假设所有其中的所有8个都应

astor_id = 156
store_addr = 1香蕉方式，天堂

我也假设重要的信息是store_id-然后将其与其他事物匹配 - 所以关键的结果是将每个记录映射到正确的store_id。

鉴于这些假设是正确的，我的方法是迭代的，老实说，在SQL（我使用数十年的经验）而不是Python（我只使用了几年）中，我会发现许多这些技术很容易。但这可能是熟悉的函数。

我绝对要做的就是扫描Whitespace Anomolies，特别是如果OCR是您的数据源。几乎可以肯定，您会拥有双空间，宽阔的空间等（即2000年至2016年之间的所有Unicode角色）将这些全部转换为ASCII 32。同样，带有破折号和下列。 OCR试图保持准确，但为了进行此练习，这只是噪音。

从那里：

找到一个唯一的store_address列表，每个store_address的记录计数和独特的store_id，
如果您的OCR质量合理，则少数地址将涵盖大多数记录“ 1香蕉方式，天堂”不超过1个其他记录，还有其他任何商店_address值 - 不是很帕累托，而是以这种方式前进）
对地址列表排序高命中率的地址列表，并寻找较小的失误（例如，'1 Banana Way Heaven '只有1个字符[缺失的逗号]，所以将其匹配为拼写错误）。
为您确定的每个地址确定正确的store_id。
然后可以将正确的store_id分配给126，999＆amp;空白记录
使用4个列表匹配无与伦比的地址的store_id→应该很容易看到“水果商店，1香蕉干草”是“ 1香蕉方式”的
修改“水果商店，1个香蕉干草”的变化，

如果OCR来源的质量相当高，现在绘制了大约90％至99％以上的OCR来源。在您的数据集上，仅在“ 1个香蕉干草，天堂”和“ 1 Bananaway Heaven”上不匹配。

如果超过99％，那么您可能会发现剩余的是手动检查（乏味，但可能比大多数代码方法更快）完成。

如果它接近90％，则需要进行以下一项或多项：

将所有内容移动到上面的上述情况
会消除标点符号并重复上述
消除空白空间，并

在此阶段重复上述内容但是，如果您仍然不落在可以通过检查的列表中，则需要查看一个语音算法（Soundex https://en.wikipedia.org/wiki/Soundex) is widely implemented

从技术角度来看（在Python中）：

txt.replace（） and txt.translate（）是解决Whitespace问题
lst.Sort（key）的强大工具= store_addr）要允许您从上面的步骤3进行顺序
，您需要开始维护2个字典
- link Store_number和有效地址
- 链接OCR地址和有效地址
  您可以将OCR链接到存储号码，但很难检查。

I don't think there is a simple answer that will allow you to clean up this data, and I can't think of a method that is not somewhat iterative. Given the table you have provided I am assuming all 8 of these should be

store_id = 156
store_addr = 1 Banana Way, Heaven

I am also assuming the important information is the store_id - that will then be matched to something else - so the key outcome is to map each record to the correct store_id.

Given those assumptions are correct my approach would be iterative, and to be honest a number of these techniques I would find easier in SQL (which I have decades experience using) rather than Python (that I have only been using for a few years) - but that may be a function of familiarity.

The absolute first thing I would do is scan for whitespace anomolies, particularly if OCR is your datasource. You will almost certainly have double spaces, wide spaces etc (i.e. all of the unicode characters between 2000 and 2016) convert these all to ascii 32. Likewise with dashes and underscores. OCR is trying to be accurate but for this exercise it is just noise.

From there:

Find a unique list of store_address, with a count of records and distinct store_id for each store_address
If your OCR is of reasonable quality a small number of addresses will cover the majority of the records (e.g. on this list of eight 50% are '1 Banana Way, Heaven' no more than 1 other record has any of the other store_address values - not quite pareto but heading that way)
Sort the list of addresses with high hit rate and look for minor mispellings (e.g. '1 Banana Way Heaven' is only 1 character away [the missing comma] so match that as a misspelling).
Decide on the correct store_id for each of the addresses you have identified.
The correct store_id can then be assigned to the 126, 999& blank record
Use the list from 4 to match the store_id of unmatched addresses → should be fairly easy to see 'Fruit store, 1 Banana hay' is a modification of '1 Banana Way, Heaven'
Use the cross match list from 6 to find all the variations of 'Fruit store, 1 Banana hay' and assign them to store 156

If the OCR source for this is of fairly high quality, you should find between about 90% and 99%+ are now mapped. On your dataset only '1 Banana Hay, Heaven' and '1 Bananaway Heaven' are not matched.

If it is above 99% then you will probably find the remainder are best done by manual inspection (tedious, but probably quicker than most of the code approaches).

If it is closer to 90% you need to do one or more of the following:

Shift everything to upercase and repeat the above
Eliminate the punctuation and repeat the above
Eliminate the white space and repeat the above

By this stage your list above would be clean, however if you are still not down to a list where it is manageable by inspection, you are needing to look at one of the phonetic algoritms (Soundex https://en.wikipedia.org/wiki/Soundex) is widely implemented Is there a soundex function for python? although there are other more sophisticated ones.) My experience is using Soundex before you are to this stage is probably going to be of limited help - Soundex does a good job finding Lochlan versus Lachlan (which sound basically the same), but is not good where you have OCR where it is the written characters that vary.

From a technical perspective (in Python):

txt.replace() and txt.translate() are powerful tools to fix the whitespace problem
lst.sort(key=store_addr) to allow you to sequence
From about step 3 above, you will need to start maintaining 2 dictionaries
- link store_number and valid addresses
- link OCR address and valid address
  You could link OCR to store number but this will be harder to inspect.

回复收藏 0 原文

~没有更多了~