What are the best practices for collecting, maintaining, and ensuring the accuracy of a huge data set?
I am posing this question looking for practical advice on how to design a system.
Sites like amazon.com and Pandora build and maintain huge data sets to run their core businesses. For example, Amazon (like every other major e-commerce site) has millions of products for sale, along with images of those products, pricing, specifications, and so on.
Ignoring the data coming in from third-party sellers and the user-generated content, all that "stuff" had to come from somewhere and is maintained by someone. It's also incredibly detailed and accurate. How? How do they do it? Is there just an army of data-entry clerks, or have they devised systems to handle the grunt work?
My company is in a similar situation. We maintain a huge catalog (tens of millions of records) of automotive parts and the cars they fit. We've been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems that to grow the catalog to x items we need to grow the team to y people.
I need to figure out some ways to increase the efficiency of the data team, and hopefully I can learn from the work of others. Any suggestions are appreciated; even better would be links to content I could spend some serious time reading.
Use visitors.
Even if you have one person per item, there will be wrong records, and customers will find them. So let them mark items as "inappropriate" and leave a short comment. But don't forget, they're not your employees, so don't ask too much of them; look at Facebook's "like" button: it's easy to use and demands very little effort from the user. Good performance for the price. If Facebook had a mandatory field asking "Why do you like it?", no one would use the feature.
Visitors also help you implicitly: they visit item pages and they use search (I mean both your internal search engine and external ones like Google). You can mine visitors' activity for information; for example, rank items by visit count, then concentrate more human effort on the top of that list and less on the "long tail".
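To make the prioritization idea concrete, here is a minimal sketch (Python, with made-up fields and an arbitrary flag weight) of ranking items for manual review by combining visit counts with visitor flags:

```python
from dataclasses import dataclass

@dataclass
class Item:
    sku: str
    visits: int   # page views from your analytics
    flags: int    # "inappropriate" clicks from visitors

def review_priority(item: Item, flag_weight: float = 50.0) -> float:
    """Score an item for manual review: heavily visited items with
    user flags float to the top; the long tail sinks."""
    return item.visits + flag_weight * item.flags

items = [
    Item("BRAKE-PAD-123", visits=9_000, flags=2),
    Item("OIL-FILTER-77", visits=120, flags=0),
    Item("SPARK-PLUG-9", visits=4_500, flags=11),
]

for item in sorted(items, key=review_priority, reverse=True):
    print(item.sku, review_priority(item))
```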
Since this is more about managing the team/code/data than about implementation, and since you mentioned Amazon, I think you'll find this useful: http://highscalability.com/amazon-architecture.
In particular, follow the link to the Werner Vogels interview.
Build it right in the first place. Make sure you use every integrity-checking method available in the database you're using, as appropriate to what you're storing. Better for an upload to fail than for bad data to be silently introduced.
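As a small illustration of pushing checks into the database itself (SQLite syntax here; the schema and rules are hypothetical), a constrained schema makes a bad upload fail loudly instead of storing garbage:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.executescript("""
CREATE TABLE manufacturers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE parts (
    part_number     TEXT PRIMARY KEY,
    manufacturer_id INTEGER NOT NULL REFERENCES manufacturers(id),
    price_cents     INTEGER NOT NULL CHECK (price_cents > 0)
);
""")

# Fails with IntegrityError instead of silently storing a part with an
# unknown manufacturer and a negative price.
try:
    conn.execute("INSERT INTO parts VALUES (?, ?, ?)", ("BP-123", 999, -5))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```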
Then figure out what you're going to do in terms of your own integrity checking. Database integrity checks are a good start, but they are rarely all you need. This will also force you to think, from the beginning, about what kind of data you're working with, how it needs to be stored, and how to recognize and flag or reject bad or questionable data.
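A rough sketch of what that application-level layer might look like; the rules below are illustrative placeholders, not a real rule set:

```python
def validate_part(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks clean."""
    problems = []
    if not row.get("part_number", "").strip():
        problems.append("missing part number: reject")
    price = row.get("price_cents", 0)
    if price <= 0:
        problems.append("non-positive price: reject")
    elif price > 1_000_000:  # over $10,000 for a part smells wrong
        problems.append("suspiciously high price: flag for human review")
    years = row.get("fits_years")
    if years and years[0] > years[1]:
        problems.append("fitment year range reversed: reject")
    return problems

row = {"part_number": "BP-123", "price_cents": 2_500_000, "fits_years": (2012, 2008)}
print(validate_part(row))
```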
I can't tell you the amount of pain I've seen from trying to rework (or just work with day to day) old systems full of garbage data. Doing it right and testing it thoroughly up front may seem like a pain, and it can be, but the reward is a system that mostly hums along and needs little to no intervention.
As for a link, if there's anyone who's had to think about and design for scalability, it's Google. You might find this instructive; it has some good things to keep in mind: http://highscalability.com/google-architecture
Master Data Management is another alternative to what has been proposed. Here is Microsoft's article "The What, Why, and How of Master Data Management". Data stewards are given the rights and the responsibility to maintain the accuracy of the enterprise's data.
The ability to scale comes mainly from aligning the technology with the business so that data personnel are not the only people who can manage the information. Tools and processes/procedures enable business owners to help manage enterprise data.
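As a sketch of the stewardship idea (the names and workflow here are my assumptions, not Microsoft's design), proposed edits from business users can go into a queue that a data steward approves or rejects before anything touches the master record:

```python
from dataclasses import dataclass

@dataclass
class ProposedChange:
    """A business user proposes an edit; a data steward decides its fate."""
    record_id: str
    field_name: str
    old_value: str
    new_value: str
    proposed_by: str
    status: str = "pending"        # pending -> approved / rejected
    decided_by: str | None = None

queue: list[ProposedChange] = []

def propose(change: ProposedChange) -> None:
    queue.append(change)

def decide(change: ProposedChange, steward: str, approve: bool) -> None:
    change.status = "approved" if approve else "rejected"
    change.decided_by = steward
    # On approval, the new value would be written to the master record here.

propose(ProposedChange("SKU-42", "price", "19.99", "17.49", proposed_by="sales_team"))
decide(queue[0], steward="data_steward_1", approve=True)
print(queue[0].status, queue[0].decided_by)
```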
Share data with your suppliers. Then the data is entered once.
If it is important, it should be entered once; otherwise, not at all.
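To make "entered once" concrete, here is a minimal sketch of ingesting a supplier feed with an upsert, so re-running a feed updates records rather than duplicating them (the CSV layout and column names are assumptions):

```python
import csv
import sqlite3
from io import StringIO

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE parts (
    part_number TEXT PRIMARY KEY,
    description TEXT,
    price_cents INTEGER
)""")

# Stand-in for a real supplier feed; in practice a file drop or API pull.
feed = StringIO("part_number,description,price_cents\nBP-123,Front brake pad,4999\n")

for row in csv.DictReader(feed):
    # Upsert: the supplier's feed is the single point of entry, so the
    # same feed can be replayed safely.
    conn.execute(
        """INSERT INTO parts (part_number, description, price_cents)
           VALUES (:part_number, :description, :price_cents)
           ON CONFLICT(part_number) DO UPDATE SET
               description = excluded.description,
               price_cents = excluded.price_cents""",
        row,
    )

print(conn.execute("SELECT * FROM parts").fetchall())
```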
I would invest heavily in data mining. Get as many feeds as possible about the products you are trying to sell. Get feeds about the vehicles directly from vendors, as well as from automotive repair companies like Mitchell and Haynes.
Once you know the parts you need, cross-correlate those part numbers with the part numbers available on the internet. Also cross-correlate those part numbers with images, reviews, and articles. Attempt to aggregate as much information as possible on one page, and eventually allow that page to be indexed by Google.
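Cross-correlating part numbers usually means normalizing formatting differences between feeds first. A sketch of the idea, with hypothetical feeds and a deliberately naive normalizer:

```python
import re
from collections import defaultdict

def normalize(part_number: str) -> str:
    """Strip punctuation and case so 'bp 123a' and 'BP-123A' collide.
    A real matcher would also handle manufacturer prefixes and supersessions."""
    return re.sub(r"[^A-Z0-9]", "", part_number.upper())

# Hypothetical raw part numbers from different feeds.
feeds = {
    "supplier": ["BP-123A", "OF-77"],
    "reviews":  ["bp 123a"],
    "images":   ["BP123A", "SP-9"],
}

index: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
for source, numbers in feeds.items():
    for raw in numbers:
        index[normalize(raw)][source].append(raw)

# Keys seen in multiple feeds are candidates for one aggregated page.
for key, sources in index.items():
    if len(sources) > 1:
        print(key, dict(sources))
```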
Based on the results of your data aggregation, assign a series of weights to each product. Based on the value of those weights, either pass the results to an employee and have them negotiate a price with suppliers, create a page as-is and link to the sources (assuming you would receive a commission), or don't sell the part.
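A minimal sketch of that routing step; the signals, coefficients, and thresholds below are placeholders to show the shape of it, not tuned values:

```python
def score(product: dict) -> float:
    """Combine aggregation signals into a single weight."""
    return (2.0 * product["feed_count"]
            + 1.0 * product["image_count"]
            + 0.5 * product["review_count"])

def route(product: dict) -> str:
    s = score(product)
    if s >= 10:
        return "negotiate price with suppliers"
    if s >= 4:
        return "publish page as-is, link to sources for commission"
    return "do not sell"

print(route({"feed_count": 4, "image_count": 3, "review_count": 2}))  # negotiate
print(route({"feed_count": 1, "image_count": 1, "review_count": 0}))  # do not sell
```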
Once you have enough products in one place, you can then support other people who would like to add additional products to your website. The breadth of resources available on Amazon is in large part due to supporting third-party sellers and allowing those sellers to list on Amazon's website.
Especially in the auto industry, I think there is great value in high-quality indexing that is both Google-findable and logically findable by people looking to replace a specific component. You may also want to look into selling or providing location-specific services through IP geolocation, based on the component a visitor is interested in purchasing.
Much of the data managed by sites like Google comes from users. I enter my data and am responsible for its accuracy. The sites have their own data, which is captured from the web; search data is captured from searches. This is likely significantly different from what you are attempting, and there is little need for Google staff to do anything with it.
Working with manufacturers' feeds could make your effort less manpower-intensive. The trade-off is investing in data-transformation software. You may want to capture the source of each cross-reference; this will ease reloads when you get updates.
In my experience, you also have the issue that cross-references may be unidirectional: A can replace B, but B cannot replace A.
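Both points can be modeled as a directed edge that carries its source: supersessions are walked forward only, and tagging each edge with the feed that asserted it makes per-source reloads straightforward. A sketch with hypothetical part numbers and feed names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CrossRef:
    """Directed edge: `successor` can replace `predecessor`, not vice versa.
    `source` records which feed asserted the edge, to simplify reloads."""
    predecessor: str
    successor: str
    source: str

refs = [
    CrossRef("BP-100", "BP-123A", source="acme_feed_2024_06"),
    CrossRef("BP-123A", "BP-123B", source="acme_feed_2024_09"),
]

def replacements(part: str) -> set[str]:
    """Walk the supersession chain forward only; the relation is one-way."""
    found, frontier = set(), {part}
    while frontier:
        frontier = {r.successor for r in refs if r.predecessor in frontier} - found
        found |= frontier
    return found

print(replacements("BP-100"))   # {'BP-123A', 'BP-123B'}
print(replacements("BP-123B"))  # set(): B cannot replace A

# Reloading an updated feed is then just: drop its edges, re-insert.
refs = [r for r in refs if r.source != "acme_feed_2024_06"]
```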
As long as you have manual entry, you will have errors. Anything you can do in your interface to detect these errors is likely worth the effort. Expect input volume to scale only linearly with staff.
Review the research on attention cycles to determine whether you can do something to improve the quality of your input and verification processes. Recent research in security screening suggests that you may want to seed periodic errors into the data being verified.
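A sketch of that seeding idea, analogous to the test threat images injected into baggage-screening queues: mix known-bad "canary" rows into verification batches and track whether reviewers catch them. The 5% rate and the error type below are arbitrary placeholders:

```python
import random

def build_review_batch(real_rows: list[dict], seed_rate: float = 0.05) -> list[dict]:
    """Mix known-bad 'canary' rows into a verification batch. If reviewers
    stop catching the canaries, their attention has probably drifted."""
    batch = [dict(row, is_canary=False) for row in real_rows]
    for _ in range(max(1, int(len(real_rows) * seed_rate))):
        bad = dict(random.choice(real_rows), is_canary=True)
        bad["price_cents"] = -abs(bad.get("price_cents", 100))  # obviously wrong
        batch.append(bad)
    random.shuffle(batch)
    return batch

rows = [{"part_number": f"P-{i}", "price_cents": 1000 + i} for i in range(40)]
batch = build_review_batch(rows)
print(sum(r["is_canary"] for r in batch), "seeded errors among", len(batch), "rows")
```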
As others have noted, making it easier for users to flag errors is a good idea.