Is Mechanical Turk useful?
I posted the following question on another thread:
"Does anybody know of a good solution, usable from PHP, that will effectively remove contact information such as phone numbers, email addresses and maybe even postal addresses from a document?"
I quickly got told what I suspected... I am asking too much :)
So now I am looking for alternative solutions. One I am considering is using Amazon's Mechanical Turk to do the contact information removal.
So, two questions:
- Would this be a good fit for Mechanical Turk?
- How effective is the service?
Comments (1)
Check out http://www.microtask.com. (I'm not affiliated with this company.)
You might be able to cast a wide net with your regular expressions and then have the human workers sift out the real addresses, phone numbers, and e-mail addresses. Whether "such-and-such" is an address, phone number, or e-mail address is a fairly straightforward question for a human.
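As a rough illustration of the "wide net" idea, here is a minimal Python sketch (the same approach should carry over to PHP's `preg_match_all` with small changes). The patterns, the function names `find_candidates` and `redact`, and the 20-character context window are all assumptions made for this example, not a tested production recipe. The patterns are deliberately loose, so they will flag things like order numbers, and that is exactly what the human reviewers are there to reject.

```python
import re

# Deliberately loose patterns: the goal is high recall, not precision.
# Human reviewers are assumed to reject the false positives afterwards.
CANDIDATE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}


def find_candidates(text, context=20):
    """Return possible contact details, each with a little surrounding text."""
    candidates = []
    for kind, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            candidates.append({
                "kind": kind,
                "text": match.group(),
                "span": match.span(),
                # Short snippet a worker could judge without seeing the whole document.
                "snippet": text[start:end],
            })
    return candidates


def redact(text, confirmed_spans, placeholder="[removed]"):
    """Replace confirmed spans, working right to left so earlier offsets stay valid."""
    for start, end in sorted(confirmed_spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text
```

Each candidate's `snippet` would become one yes/no question for a worker ("is this a phone number, e-mail address, or postal address?"), and only the spans the workers confirm are passed to `redact`. Postal addresses are much harder to catch with a pattern and are left out of this sketch.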
Since they chop the form up (or say they do; I haven't used it), you have less to worry about on the privacy front, or may at least be able to justify the risk. If MicroTask has hundreds of clients, they can take all of the microtasks and throw them into one giant hopper that randomizes which ones each individual worker sees. That way they could virtually guarantee that workers have almost no means of correlating the sensitive information they work on. Each worker would see thousands of independent pieces of information each day. Under these conditions, who would be able to discern that Task 347 on day 1 had the e-mail address that corresponds to Task 1133 on day 3? Even if they could, it's hardly worth it to them. They'll probably make more money just doing what is asked of them.
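To make the "giant hopper" picture concrete, here is a small Python sketch of pooling and shuffling tasks. It is purely illustrative: I don't know how MicroTask actually schedules work, and the names `build_hopper` and `deal`, as well as the round-robin dealing, are assumptions made for this example.

```python
import random


def build_hopper(tasks_by_client, seed=None):
    """Pool every client's microtasks and shuffle them into one queue."""
    rng = random.Random(seed)
    hopper = [
        (client, task)
        for client, tasks in tasks_by_client.items()
        for task in tasks
    ]
    rng.shuffle(hopper)
    return hopper


def deal(hopper, workers):
    """Hand out the shuffled tasks one at a time, rotating through the workers."""
    assignments = {worker: [] for worker in workers}
    for i, item in enumerate(hopper):
        assignments[workers[i % len(workers)]].append(item)
    return assignments
```

With hundreds of clients and many workers, the pieces of any one document end up scattered across the pool, which is the property the argument above relies on.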