How to develop a program that minimizes human transcription errors for handwritten surveys

Posted 2024-09-04 00:44:07

I need to develop custom software to run surveys. Questions may be multiple choice or, in a very few cases, free text.

I was asked to design a subsystem that checks for errors in the manual data entry of the multiple choice part. We're trying to speed up the data entry process and to minimize differences between the digital forms and the original questionnaires. The surveys are filled in by human interviewers with handwritten marks and text, so some marks may be hard to read, or the user could accidentally select a different value for some question, and we would like to avoid that.

The software must include some automatic control to detect possible typing differences. Each answer of the multiple choice questions has the same probability of being selected.

This question has two parts:

The GUI.

The simplest thing I have in mind is to make the question display as usable as possible: use large, readable fonts and space the choices generously. Is there something else? For faster input, I would like to use drop-down lists (favoring keyboard over mouse). Given that the questions are grouped in sections, I would like to show the answers selected for the questions of that section, but this could slow down the process. Any other ideas?

The error checking subsystem.

What else can I do to minimize or detect human typos in the multiple choice questions? Is this a solvable problem? Is there some statistical methodology to check that the values entered by the users are the same as those on the hand-filled forms? For example, suppose the survey has 5 questions, each with 4 options. Say I have n survey forms filled in on paper by interviewers, ready to be entered into the software. How can I minimize the accidental differences introduced by manually transcribing the n surveys, without having to double-check all 5 questions of all n surveys?

My first suggestion is that at the end of the processing of all the hand-filled forms, the software could choose some forms at random and have their responses double-checked, but on what criteria should I make this selection? Would this validation be enough to cover everything in a statistically significant way?

The actual survey is national and has 56 pages with over 200 questions in total, so there will be a lot of pages handwritten by many people. The intention is to reduce the likelihood of errors and to optimize the speed of the data entry process. The surveys must be filled in on paper first, given the complications of sending interviewers out with laptops or handhelds.


我只土不豪 2024-09-11 00:44:07

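Call me old-fashioned, but I still think the most practical approach is double entry. Two data entry clerks enter their surveys, then swap stacks and enter the other clerk's surveys. Whenever your system detects a discrepancy between the two, it throws up a flag, and the two clerks put their heads together and decide on the correct answer (or maybe it gets reviewed by a more senior researcher, etc.). Combined with some of the other suggestions here (I really like mdma's suggestions for the GUI), this would make for a low-error system.

Yes, this will probably double your data entry time, but it's dead simple and will cut your errors way down. The OMR idea is a great one, but it doesn't sound to me like this project (a national, 52-page survey) is the best case for a lone hacker to try implementing it for the first time. What software would you need? What hardware is available to do it? There would still be a lot of human work involved in identifying the goofy cases, where an interviewer marked all four possible answers and then wrote a note beside them. You would also probably need to hand-check a random sample of surveys to find out what the machine-read error rate is. Even then, all you have is an estimate of the error rate, not corrected data.

Go with a simpler method this time to deliver high-quality results to your employer, then use those results as a pre-validated data set for trialling OMR next time.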
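As a rough illustration of the discrepancy check described above, here is a minimal Python sketch. The data shape (a form id mapped to a list of chosen options) and the name `find_discrepancies` are assumptions made for the example, not part of any existing system.

```python
from typing import Dict, List, Tuple

# Hypothetical data shape: form id -> chosen option for each question, in order.
Entries = Dict[str, List[str]]


def find_discrepancies(first_pass: Entries, second_pass: Entries) -> List[Tuple[str, int, str, str]]:
    """Return (form_id, question_number, first_value, second_value) for every mismatch."""
    flagged = []
    # Only forms keyed by both clerks can be compared.
    for form_id in sorted(first_pass.keys() & second_pass.keys()):
        a, b = first_pass[form_id], second_pass[form_id]
        if len(a) != len(b):
            raise ValueError(f"form {form_id}: the two passes differ in length")
        for question, (x, y) in enumerate(zip(a, b), start=1):
            if x != y:
                flagged.append((form_id, question, x, y))
    return flagged


clerk_a = {"F001": ["A", "C", "B", "D", "A"], "F002": ["B", "B", "A", "C", "D"]}
clerk_b = {"F001": ["A", "C", "B", "D", "A"], "F002": ["B", "D", "A", "C", "D"]}
discrepancies = find_discrepancies(clerk_a, clerk_b)
print(discrepancies)  # [('F002', 2, 'B', 'D')]
```

Each flagged tuple would then go to the two clerks (or a reviewer) to be resolved against the paper form.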


一世旳自豪 2024-09-11 00:44:07

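OCR/OMR is probably the best choice, since you rule out unpredictable human error and replace it with fairly predictable machine error. It may even be possible to filter out the forms that OCR is likely to struggle with and have those corrected to improve scanning accuracy.

But, tackling the original question head on:

Error checking

Include correlated questions, so that essentially the same thing is asked more than once, or asked again in the negative. If the answers to the related questions do not agree, that may indicate an input error.
Deviation from the norm: if there is a pattern in typical responses, then a deviation from these typical responses can be treated as a potential input error. For example, if questions 2 and 3 are answered A, then another question is likely to be C or D. This is a generalization of the correlation idea above. The correlations can be computed dynamically from the data already entered.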
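One way the "deviation from the norm" idea could be sketched in Python is shown below. The function names and the thresholds `min_support` and `max_share` are hypothetical choices for the example, and counting every pair of questions is only practical as-is for short forms; a 200-question survey would need a narrower set of pairs.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_pair_counts(forms: List[List[str]]) -> Dict[Tuple[int, int], Counter]:
    """Count how often each combination of answers occurs for every pair of questions."""
    counts: Dict[Tuple[int, int], Counter] = defaultdict(Counter)
    for answers in forms:
        for i in range(len(answers)):
            for j in range(i + 1, len(answers)):
                counts[(i, j)][(answers[i], answers[j])] += 1
    return counts


def unusual_pairs(answers: List[str], counts: Dict[Tuple[int, int], Counter],
                  min_support: int = 20, max_share: float = 0.02) -> List[Tuple[int, int]]:
    """Return 1-based question pairs whose answer combination is rare in the data so far."""
    flagged = []
    for (i, j), pair_counts in counts.items():
        # Number of earlier forms that gave the same answer to question i.
        given = sum(n for (a, _), n in pair_counts.items() if a == answers[i])
        if given < min_support:
            continue  # too little data to judge this pair
        share = pair_counts[(answers[i], answers[j])] / given
        if share <= max_share:
            flagged.append((i + 1, j + 1))
    return flagged


# Made-up history: question 2 has so far always been the opposite of question 1.
history = [["A", "B"]] * 30 + [["B", "A"]] * 30
counts = build_pair_counts(history)
print(unusual_pairs(["A", "A"], counts))  # [(1, 2)]
print(unusual_pairs(["A", "B"], counts))  # []
```

A flag here only means "worth a second look at the paper form"; an unusual combination can still be a genuine answer.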

GUI

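Have the GUI mimic the paper form, so that what the data entry clerk sees on paper is mirrored on screen. That way there is less chance of entering the answer to a paper question against the wrong question in the GUI.
Give the data entry clerk a visual aid, such as a slider that keeps track of the current question's position on the paper.
A custom input device for entering the data may be easier to use than keyboard navigation and list boxes. For example, a touch screen with all the options spelled out as A B C D. The clerk just taps an option, it is selected, and after a brief pause the next question is shown. If the clerk makes a mistake, they can use prev/next buttons beside each question.
Provide audio feedback on the data entered, so when the clerk enters "A", they hear "A".

EDIT:
If you are considering double entry or implementing an improved GUI, it may be worth running a pilot scheme to assess the effectiveness of the various approaches. Double entry can be expensive (it doubles the cost of the data entry task), which may or may not be justified by the increase in accuracy. A pilot scheme would let you assess the effectiveness of double entry quickly and relatively cheaply. It would also show you the level of error a single data entry clerk makes without any UI changes, which helps you decide whether UI changes or other error-reducing strategies are needed and how much it is worth spending to implement them.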

Related links

A device that inputs data from multiple choice tests
Wikipedia: OMR - Optical Mark Recognition
ReadSoft - Automated Data Entry
Data capture hardware


你与昨日 2024-09-11 00:44:07

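"My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances"

I don't think this would actually yield meaningful results. Presumably the errors are unintentional and random. Random checks will catch systematic errors, but if you double-check 10% of the forms you will only find 10% of the random errors (and if you check 20% of the forms, only 20% of the errors, and so on).

What do the paper surveys look like? If possible, I'd guess that an OCR system that scans the handwritten forms and compares the answers it detects with the answers given by the data entry operator would be a better solution. You may still end up manually double-checking a fair number of surveys, but you will know that the ones you double-check are more likely to contain errors than randomly picked ones.

If you also control what the paper surveys look like, even better: you can design them specifically so that OCR is as accurate as possible.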


清风无影 2024-09-11 00:44:07

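Forgive me for completely sidestepping the question, but yesterday I went on eBay and bought a 7-inch Android tablet for $99. Not the best processor in the world, nor tons of RAM, but certainly enough for filling in user surveys in the field.

I can't believe your organization can't afford $99 per interviewer to make this problem go away.

It's at least worth suggesting to your boss, isn't it?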


伏妖词 2024-09-11 00:44:07

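I second Matt Parker's suggestion of using double entry to reduce errors. I have even seen triple entry used for data entry tasks that are very sensitive to errors.

A benefit of double entry is that it lets you make a rough estimate of the overall error rate, by making a few assumptions (mainly that the error rate is consistent across entries and clerks) and using the rate at which conflicting entries are encountered.

More sophisticated double entry systems can also measure error rates for parts of the data entry task and for individual clerks, so that you can make improvements to reduce the error rate.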
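A back-of-the-envelope version of that estimate, in Python, might look like the sketch below. It rests on assumptions that are not in the answer itself: both clerks mis-key each field independently with the same probability p, and a wrong entry is equally likely to be any of the other k - 1 options. Under those assumptions two entries conflict with probability d = 2p - p^2 * k/(k - 1), which can be solved for p. The function names are hypothetical.

```python
import math


def estimate_error_rate(conflicts: int, fields: int, options: int = 4) -> float:
    """Estimate the per-field error rate p of one clerk from the observed conflict rate.

    Assumes independent errors with equal rates for both clerks, and wrong
    entries spread uniformly over the other options (illustrative assumptions).
    """
    d = conflicts / fields
    c = options / (options - 1)
    discriminant = 1 - c * d
    if discriminant < 0:
        raise ValueError("conflict rate too high for this simple model")
    # Smaller root of c*p^2 - 2*p + d = 0.
    return (1 - math.sqrt(discriminant)) / c


def undetected_rate(p: float, options: int = 4) -> float:
    """Probability that both clerks key the same wrong option, so no conflict is raised."""
    return p * p / (options - 1)


p = estimate_error_rate(conflicts=588, fields=10_000, options=4)
print(round(p, 4))                   # 0.03
print(round(undetected_rate(p), 6))  # 0.0003
```

The second number is the part double entry cannot see: fields where both clerks made the same mistake.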


川水往事 2024-09-11 00:44:07

It sounds like a combined approach is needed, and the actual forms should be suitable for automated processing. You could scan the documents and deal only with the electronic version; if the multiple choice input can be processed automatically, you might get better error ratios by keeping the user out of the loop. Depending on the OCR package, I would guess that you get a value back that tells you how sure the system is about a selection it has made, and depending on that value you will want to have the form verified by a person. Note that I am talking about using OCR on the marks for the multiple choice questions, not on the freeform entries; that is probably an issue by itself.

In parallel you will probably want to do random checks to find the error ratio of the OCR system. This value can then be used to determine the confidence value for the totals of the multiple choice questions.

I think a similar approach would be helpful if you just go with human input. You will probably not get rid of all the errors, because people will make errors, and they will make errors correcting errors, but with a large enough sample size you will probably be able to determine the ratio of errors in the human input. This number can then be used when determining the results of the survey.

As for other UI ideas, you could use the scanned forms and overlay the UI so that each UI checkbox is close to the written checkbox. If you have a couple of known lines at angles, straightening and scaling the form should not be too hard. If the UI input element is close to the pencil marks, chances are you are going to get higher rates of correct classification.

You can also probably use statistical analysis to pick forms that seem out of line, but you might then be skewing the result by non-uniform selection, which might be worse than a uniform random error. Depending on the design of the paper survey, it might be helpful to copy it in the UI; it will be easier for everybody to find errors if the two look similar. If you can't stick to that, maybe some of the references on survey design (like this) might be helpful.

This seems to be a rather large operation, so I am sure there are some statisticians on staff. Talk to them about what they need, what you could do to help them, and what you should avoid doing so as not to skew the results even more.
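The routing by confidence value could be sketched like this in Python. The `MarkReading` record, the 0.0 to 1.0 confidence scale and the 0.90 threshold are all assumptions for the example; a real OCR/OMR package will expose its own fields and scale.

```python
from typing import List, NamedTuple, Tuple


class MarkReading(NamedTuple):
    form_id: str
    question: int
    option: str
    confidence: float  # assumed scale 0.0-1.0, as reported by a hypothetical recognizer


def split_by_confidence(readings: List[MarkReading],
                        threshold: float = 0.90) -> Tuple[List[MarkReading], List[MarkReading]]:
    """Separate readings that can be accepted from those a person should verify."""
    accepted, needs_review = [], []
    for reading in readings:
        if reading.confidence >= threshold:
            accepted.append(reading)
        else:
            needs_review.append(reading)
    return accepted, needs_review


readings = [
    MarkReading("F001", 1, "A", 0.99),
    MarkReading("F001", 2, "C", 0.55),  # smudged or ambiguous mark
    MarkReading("F002", 1, "B", 0.93),
]
accepted, needs_review = split_by_confidence(readings)
print([(r.form_id, r.question) for r in needs_review])  # [('F001', 2)]
```

The random checks mentioned above would then be drawn from the accepted list, to estimate how often a confident reading is still wrong.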


岁月打碎记忆 2024-09-11 00:44:07

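After implementing the best combination of software for this problem, you could also consider programmatically having the transcriptions cross-checked by humans against the originals via Amazon's Mechanical Turk. Other similar projects are reCaptcha (although as far as I know it only works for OCR of printed text), and I just came across Beextra, which seems to be doing things like cataloguing Smithsonian media.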


燃情 2024-09-11 00:44:07

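For detecting transcription errors in the multiple choice answers, my suggestion is to use multiple data entry users and statistical analysis.

A statistician could compare the results and see whether, for any question, the distribution of answers entered by one data entry user differs significantly from the distribution of answers entered by the other users. If so, those questions can be flagged for re-entry from the forms.

Assuming the forms are assigned to the data entry users at random, and each user enters a large enough number of forms, the entered results should have fairly similar answer distributions.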
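One possible form of that comparison is a chi-square test of one clerk's answers to a single question against everyone else's, sketched here in plain Python. The function name is hypothetical; the cutoff 7.815 is the standard 5% critical value for 3 degrees of freedom, which fits a question with 4 options.

```python
from collections import Counter
from typing import List, Sequence


def chi_square_clerk_vs_rest(clerk_answers: List[str], other_answers: List[str],
                             options: Sequence[str]) -> float:
    """Chi-square statistic for a 2 x k table: one clerk versus all other clerks."""
    clerk, rest = Counter(clerk_answers), Counter(other_answers)
    n_clerk, n_rest = len(clerk_answers), len(other_answers)
    total = n_clerk + n_rest
    statistic = 0.0
    for option in options:
        column_total = clerk[option] + rest[option]
        if column_total == 0:
            continue  # option never chosen, contributes nothing
        for observed, row_total in ((clerk[option], n_clerk), (rest[option], n_rest)):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


CRITICAL_5_PERCENT_DF3 = 7.815  # 4 options -> 3 degrees of freedom

others = ["A"] * 250 + ["B"] * 250 + ["C"] * 250 + ["D"] * 250
typical_clerk = ["A"] * 25 + ["B"] * 25 + ["C"] * 25 + ["D"] * 25
skewed_clerk = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10

stat_typical = chi_square_clerk_vs_rest(typical_clerk, others, "ABCD")
stat_skewed = chi_square_clerk_vs_rest(skewed_clerk, others, "ABCD")
print(stat_typical > CRITICAL_5_PERCENT_DF3)  # False
print(stat_skewed > CRITICAL_5_PERCENT_DF3)   # True
```

With 200 questions and many clerks, some flags will appear by chance alone, so a statistician would likely want a stricter cutoff or a multiple-comparison correction.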


-黛色若梦 2024-09-11 00:44:07

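Manual double checking is probably the most popular way to achieve a low number of errors. If you want to speed it up, one person can just count the total number of answers given and write that number at the bottom of the survey (something like a "control sum"). The person entering the data into your application should also fill in that number in a special field, and the system can then count the number of answers entered and compare it with the expected value. This solves the problem of getting the right amount of data, but not the problem of getting the right data.

You can also use some methods from data mining to detect errors in the inserted data. Example: if you ask for age and salary range, you could create a rule: if age < X, it is very likely that the person does not earn more than Y, so raise an alert and ask for a review. These are called association rules.

GUI: it should be a 1:1 representation of the paper form. Some keyboard shortcuts may help to speed up the work.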
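A minimal Python sketch of the control sum check follows. It assumes unanswered questions are stored as `None`; the function name is made up for the example.

```python
from typing import List, Optional


def control_sum_matches(entered_answers: List[Optional[str]], control_sum: int) -> bool:
    """Compare the count of answered questions with the number written on the paper form."""
    answered = sum(1 for answer in entered_answers if answer is not None)
    return answered == control_sum


# Paper form says 4 answers were given; question 2 was left blank.
print(control_sum_matches(["A", None, "C", "D", "B"], 4))  # True
# The clerk skipped an answer while keying, so only 3 were entered.
print(control_sum_matches(["A", None, "C", None, "B"], 4))  # False
```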
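The age and salary rule could be expressed along these lines in Python. The field names, the age limit of 18 and the salary bracket numbering are placeholders standing in for X and Y; real rules would come from the survey's own data or from its statisticians.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    description: str
    is_suspicious: Callable[[Dict[str, int]], bool]


# Hypothetical rule: X = 18 years, Y = salary bracket 2.
RULES = [
    Rule(
        description="age under 18 with salary bracket above 2",
        is_suspicious=lambda record: record["age"] < 18 and record["salary_bracket"] > 2,
    ),
]


def warnings_for(record: Dict[str, int], rules: List[Rule] = RULES) -> List[str]:
    """Return the description of every rule the record trips, for the clerk to review."""
    return [rule.description for rule in rules if rule.is_suspicious(record)]


print(warnings_for({"age": 16, "salary_bracket": 5}))  # ['age under 18 with salary bracket above 2']
print(warnings_for({"age": 40, "salary_bracket": 5}))  # []
```

A tripped rule should prompt a re-check against the paper form rather than block the entry, since unlikely answers can still be real.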


留蓝 2024-09-11 00:44:07

As has been mentioned, key it twice. Yes it's "double the work", but that leads to point 2.

Make the surveys EASY TO KEY.

They should be simple to read for the keyers, with the sections that need their attention well highlighted so they stand out from the noise of the form.

Your "GUI" shouldn't be. A GUI's primary benefit is "discoverability", and these folks shouldn't be "discovering" anything. Keyboard navigation should be the "only" way once they start keying stuff in. One or two hands on the keyboard, one hand for changing survey page == no hands for a mouse. Attention to the screen (for a mouse, or anything really) is attention away from the survey being keyed.

The keyers should be "heads down", and not have to look at the screen at all. If practical, you can use audio prompts to tell the keyers when they've switched pages, to help ensure that what they're keying and what the computer thinks they're keying are basically the same thing. If audio prompts aren't possible, then simply have the entry people key in the page of the survey that they are on. The computer will already "know" it's on page "2", so when the keyer keys in the page number, it can validate that they're at the same spot.

DO use audible prompts for keying errors. Don't let them key in garbage, hit "save" and then correct errors. If you KNOW the data is wrong right away, STOP them and have them fix it immediately. Nothing catches their attention like 5 or 6 "ding ding dings", because they're already keying 3 fields later before they realize the computer stopped them. Auditing a long questionnaire for errors is a waste of time.
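The page number check could be as small as the Python sketch below; `page_in_sync` is a hypothetical name, and the keyed value is assumed to arrive as raw text from the keyboard.

```python
def page_in_sync(expected_page: int, keyed_page: str) -> bool:
    """True if the page number the keyer typed matches the page the software is on."""
    try:
        return int(keyed_page.strip()) == expected_page
    except ValueError:
        return False  # not a number at all, treat as out of sync


print(page_in_sync(2, "2"))   # True
print(page_in_sync(2, "3"))   # False: keyer and software are on different pages
print(page_in_sync(2, "2x"))  # False
```

On a mismatch, the software would stop and sound the error prompt before any further fields are accepted.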

Do NOT "scroll" your data screens. Page back and forth. Scrolling sucks. When you scroll, fields on the screens move. When you don't they're always in the same spot so when the entry person DOES need to look at the screen, they can always look at the same place.

Because of this, drop-down lists of any length suck. The keyers shouldn't be using drop-downs anyway, as they shouldn't be looking at the screen. The form should TELL THEM EXACTLY what they need to key.

Be consistent with the data entry. Use the 10 key as much as possible. If you have more than 10 options, and 0-9 isn't practical for the entire questionnaire, then you should use 00-99. Don't use A-Z for options, as people don't think of keys that way. They don't memorize letters on the keyboard as much as they memorize word patterns on the keyboard. 01-26 is far faster to key than A-Z any day of the week.

Also, the SHIFT key is NOT your friend. But it'll be fine when they're in "typing English" mode.

Finally, organize the survey so all the "typing", "fill in the blank" stuff is in one section (ideally at the end). This lets them 10 key the rest in a blaze, get into a zone, and not have to move their hands back and forth. Many folks will "top key" numbers when typing "English" (i.e. use the top row) and 10 key numbers when not.
