PHP: Repairing bad text
This is something I'm working on and I'd like input from the intelligent people here on StackOverflow.
What I'm attempting is a function to repair text based on combining various bad versions of the same text page. Basically this can be used to combine different OCR results into one with greater accuracy than any of them individually.
I start with a dictionary of 600,000 English words. That covers pretty much everything, including legal and medical terms and common names. I have this already.
Then I have 4 versions of the text sample.
Something like this:
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
I am attempting to combine the above to get an output which looks like:
$text = 'First text sample is this line.';
Don't tell me it's impossible, because it is certainly not, just very difficult.
I would very much appreciate any ideas anyone has towards this.
Thank you!
My current thoughts:
Just checking the words against the dictionary will not work, since some of the spaces are in the wrong place and occasionally the word will not be in the dictionary.
The major concern is repairing broken spacing. Once that is fixed, the most commonly occurring dictionary word can be chosen if one exists, or else the most commonly occurring non-dictionary word.
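To make the second step concrete, here is a minimal sketch of the vote for a single word slot. It assumes the spacing has already been repaired so that each version contributes exactly one reading per slot. The function name `pickWord` and the dictionary layout (an array keyed by lowercase word) are illustrative assumptions, not something I already have.

```php
<?php
// Hypothetical helper: choose one word from several OCR readings of the same slot.
// Assumes $dictionary is an array keyed by lowercase word, e.g. ['first' => true].
function pickWord(array $candidates, array $dictionary): string
{
    if ($candidates === []) {
        return '';
    }
    $counts = array_count_values($candidates);
    arsort($counts); // most frequent reading first

    // Return the most frequent reading that is a dictionary word.
    foreach ($counts as $word => $count) {
        $word = (string) $word; // numeric readings come back as int keys
        $bare = strtolower(rtrim($word, '.,;:!?')); // ignore trailing punctuation
        if (isset($dictionary[$bare])) {
            return $word;
        }
    }

    // No dictionary word at all: fall back to the most frequent reading.
    return (string) array_key_first($counts);
}

$dictionary = ['first' => true, 'line' => true];
echo pickWord(['Fir5t', 'Fir5t', 'First', 'First'], $dictionary), "\n"; // First
```

Keying the dictionary by word keeps each lookup constant-time, which matters with 600,000 entries.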
Comments (5)
Have you tried using a longest common subsequence algorithm? These are commonly seen in the "diff" text comparison tools used in source control apps and some text editors. A diff algorithm helps identify changed and unchanged characters in two text samples.
http://en.wikipedia.org/wiki/Diff
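For reference, a longest common subsequence can be computed with the textbook dynamic-programming table. The sketch below is a plain PHP version of that approach, not the code of any particular diff tool; the function name `lcs` is just an illustrative choice.

```php
<?php
// Longest common subsequence of two strings (classic dynamic-programming table).
function lcs(string $a, string $b): string
{
    $n = strlen($a);
    $m = strlen($b);

    // $len[$i][$j] = LCS length of the suffixes starting at $a[$i] and $b[$j].
    $len = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $len[$i][$j] = ($a[$i] === $b[$j])
                ? $len[$i + 1][$j + 1] + 1
                : max($len[$i + 1][$j], $len[$i][$j + 1]);
        }
    }

    // Walk the table from the start to recover one LCS.
    $out = '';
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($a[$i] === $b[$j]) {
            $out .= $a[$i];
            $i++;
            $j++;
        } elseif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $out;
}

echo lcs('Fir5t text', 'First te*t'), "\n"; // Firt tet
```

The characters that fall outside the common subsequence ("5" versus "s", "x" versus "*") are the positions where the two reads disagree, so those are the places worth voting on. Note that the table needs memory proportional to the product of the two lengths, so for whole pages it is probably better to run it line by line.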
Some years ago I worked on an OCR app similar to yours. Rather than applying multiple OCR engines to one image, I used one OCR engine to analyze multiple versions of the same image. Each of the processed images was the result of applying a different denoising technique to the original image: one technique worked better for low contrast, another worked better when the characters were poorly formed. A "voting" scheme that compared OCR results on each image improved the read rate for arbitrary strings of text such as "BQCM10032". Other voting schemes are described in the academic literature for OCR.
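As a rough illustration of voting (not the scheme I actually used back then), the simplest variant takes a majority vote per character position. This sketch assumes the reads are already aligned and equally long; reads whose length differs from the first one are simply ignored. The name `voteChars` is hypothetical.

```php
<?php
// Per-position majority vote over several OCR reads of the same string.
// Assumption: reads are aligned; reads with a different length than the first are skipped.
function voteChars(array $reads): string
{
    if ($reads === []) {
        return '';
    }
    $length = strlen($reads[0]);
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $votes = [];
        foreach ($reads as $read) {
            if (strlen($read) !== $length) {
                continue;
            }
            $ch = $read[$i];
            $votes[$ch] = ($votes[$ch] ?? 0) + 1;
        }
        arsort($votes); // character with the most votes first
        $out .= (string) array_key_first($votes);
    }
    return $out;
}

echo voteChars(['BQCM10032', '8QCM10032', 'BQCM1OO32']), "\n"; // BQCM10032
```

If the OCR engine reports a confidence per character, weighting each vote by that confidence instead of counting 1 per read would be a natural refinement.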
On occasion you may need to match a word for which no combination of OCR results will yield all the letters. For example, a middle letter may be missing, as in either "w rd" or "c tch" (likely "word" and "catch"). In this case it can help to access your dictionary with any of three keys: initial letters, middle letters, and final letters (or letter combinations). Each key is associated with a list of words sorted by frequency of occurrence in the language. (I used this sort of multi-key lookup to improve the speed of a crossword generation app; there may well be better methods out there, but this one is easy to implement.)
To save on memory, you could apply the multi-key method only to the first few thousand common words in the language, and then have only one lookup technique for less common words.
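A minimal sketch of the multi-key idea follows, using only the initial-letter and final-letter keys; a middle-letter key would be built the same way. The key shapes ("i:w", "f:d"), the function names, and the convention that a space marks an unread letter are all assumptions made for illustration.

```php
<?php
// Build an index from a word list that is already sorted by frequency (most frequent first).
// Assumes every word is a non-empty string.
function buildIndex(array $wordsByFrequency): array
{
    $index = [];
    foreach ($wordsByFrequency as $word) {
        $index['i:' . $word[0]][] = $word;                 // initial-letter key
        $index['f:' . $word[strlen($word) - 1]][] = $word; // final-letter key
    }
    return $index;
}

// Look up a partial read in which a space stands for one unread letter.
function lookupPartial(string $partial, array $index): array
{
    if ($partial === '') {
        return [];
    }
    $first = $partial[0];
    $last = $partial[strlen($partial) - 1];

    // Pick a bucket using whichever end letter was actually read.
    if ($first !== ' ') {
        $bucket = $index['i:' . $first] ?? [];
    } elseif ($last !== ' ') {
        $bucket = $index['f:' . $last] ?? [];
    } else {
        return [];
    }

    // Turn the partial read into a regex: each space matches any single character.
    $pieces = array_map(fn($p) => preg_quote($p, '/'), explode(' ', $partial));
    $regex = '/^' . implode('.', $pieces) . '$/';

    return array_values(array_filter($bucket, fn($w) => preg_match($regex, $w) === 1));
}

$index = buildIndex(['word', 'ward', 'catch', 'card', 'wind']);
print_r(lookupPartial('w rd', $index)); // word, ward (still in frequency order)
```

Because each bucket keeps the frequency order of the input list, the first match returned is also the most common candidate.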
There are several online lists of word frequency.
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
If you want to get fancy, you can also rely on prior frequency of occurrence in the text. For example, if "Byrd" appears multiple times, then it may be the better choice if the OCR engine(s) reports either "bird" or "bard" with a low confidence score. You might load a medical dictionary into memory only if there is a statistically unlikely occurrence of medical terms on the same page--otherwise leave medical terms out of your working dictionary, or at least assign them reasonable likelihoods. "Prosthetics" is a common word; "prostatitis" less so.
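One possible way to use prior occurrences is sketched below: for a low-confidence read, look for a word that has already appeared several times in the text and is within a small edit distance. The thresholds (at least 2 prior occurrences, edit distance of at most 1) and the name `preferSeenWord` are assumptions to tune, not recommendations.

```php
<?php
// For a low-confidence read, prefer a near-identical word that already occurs in the text.
// Assumptions: $minCount prior occurrences and a Levenshtein distance <= $maxDistance.
function preferSeenWord(string $read, string $textSoFar, int $minCount = 2, int $maxDistance = 1): string
{
    // Count how often each word has appeared so far (case-sensitive, so "Byrd" stays "Byrd").
    $seen = array_count_values(str_word_count($textSoFar, 1));
    arsort($seen); // most frequent words first

    foreach ($seen as $word => $count) {
        $word = (string) $word;
        if ($count < $minCount) {
            break; // remaining words are even less frequent
        }
        if (levenshtein(strtolower($read), strtolower($word)) <= $maxDistance) {
            return $word;
        }
    }
    return $read; // nothing suitable seen before: keep the read as-is
}

$text = 'Byrd sang. Byrd wrote motets. The bird flew.';
echo preferSeenWord('bard', $text), "\n"; // Byrd
```

This should only be applied to reads the OCR engine itself flags as uncertain; otherwise a correctly read "bird" would also be pulled toward "Byrd".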
If you have experience with image processing techniques such as denoising and morphological operations, you can also try preprocessing the image before passing it to the OCR engine(s). Image processing could also be applied to select areas after your software identifies the words or regions where the OCR engine(s) fared poorly.
Certain letter/letter and letter/numeral substitutions are common. The numeral 0 (zero) can be confused with the letter O, C for O, 8 for B, E for F, P for R, and so on. If a word is found with low confidence, or if there are two common words that could match an incompletely read word, then ad hoc shape-matching rules could help. For example, "bcth" could match either "both" or "bath", but for many fonts (and contexts) "both" is the more likely match since "o" is more similar to "c" in shape. In a long string of words such as a paragraph from a novel or magazine article, "bath" is a better match than "b8th."
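A simple form of such a rule is to try single-character substitutions from a confusion table and keep the results that are dictionary words. In the sketch below the table maps the character that was read to the character that may have been intended. The pairs 5/s and 1/l/i are not from the list above; I added them as assumptions because they match the question's samples ("Fir5t", "l1ne"). The dictionary is again assumed to be keyed by lowercase word.

```php
<?php
// Read character => characters it may really be.
// Pairs from the list above, plus 5/s and 1/l/i as extra assumptions.
const CONFUSABLE = [
    '0' => ['o'],
    'c' => ['o'],
    '8' => ['b'],
    'e' => ['f'],
    'p' => ['r'],
    '5' => ['s'],
    '1' => ['l', 'i'],
];

// Try every single-character substitution and keep candidates found in the dictionary.
function repairBySubstitution(string $word, array $dictionary): array
{
    $word = strtolower($word);
    $matches = [];
    for ($i = 0, $n = strlen($word); $i < $n; $i++) {
        foreach (CONFUSABLE[$word[$i]] ?? [] as $replacement) {
            $candidate = substr_replace($word, $replacement, $i, 1);
            if (isset($dictionary[$candidate])) {
                $matches[] = $candidate;
            }
        }
    }
    return $matches;
}

$dictionary = ['both' => true, 'bath' => true, 'first' => true, 'line' => true];
print_r(repairBySubstitution('bcth', $dictionary)); // both
```

Here "bcth" resolves to "both" and never to "bath", because the table has no rule turning "c" into "a". A real table would depend on the font and the OCR engine.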
Finally, you could probably write a plugin or script to pass the results into a spellcheck engine that checks for noun-verb agreement and other grammar checks. This may catch a few additional errors. Maybe you could try VBA for Word or whatever other script/app combo is popular these days.
Tackling complex algorithms like this by yourself will probably take longer and be more error-prone than using a third-party tool. Unless you really need to program this yourself, you can check the Yahoo Spelling Suggestion API. They allow 5,000 requests per IP per day, I believe.
Others may offer something similar (I think there's a Bing API, too).
UPDATE: Sorry, I just read that they stopped this service in April 2011. They claim to offer a similar service called "Spelling Suggestion YQL table" now.
This is indeed a rather complicated problem.
When I wonder how to spell a word, the direct way is to open a dictionary. But what if it is a short, complex sentence that I'm trying to spell correctly? One of my personal tricks, which works most of the time, is to ask Google. I put my sentence between quotes on Google and count the results. Here is an example: entering "your very smart" on Google gives 13,600k pages. Entering "you're very smart" gives 20,000k pages. So the correct spelling is likely "you're very smart". And... indeed it is ;)
Based on this concept, I guess you have samples which, for the most part, are correctly spelled (well, maybe not if you develop for a teen gaming site...). Can you try to divide the samples into sub-pieces, not going all the way down to words, and match these by frequency? The most frequent piece is the one most likely to be spelled correctly. Before this, you can already run a dictionary spellcheck with your 600,000 terms to increase the chance that small spelling mistakes will already be corrected. This should increase the frequency of correct sub-pieces.
Dividing the sentences in pieces and finding the right "piece-size" is also tricky.
Something else concerns me a little: how do you extract the samples and match them together so that you know they are versions of the same sentence (or very close ones)? Your question seems to assume you have this, which also seems very complex to me.
Well, the above is just a general tip based on my personal, human experience. I don't know if it helps. This is obviously not a real answer and is not meant to be one.
You could try using Google n-grams to achieve this.
If you need to get the right string only by comparing the other strings, then something like this may help.
It's not finished yet, but it already gives some results.