Converting a legacy EAV schema to Mongo or Couch
Let's say I have a legacy application that, for various reasons, previous developers decided must have an arbitrarily flexible schema, and they reinvented the Entity-Attribute-Value model yet again. They were actually trying to build a document repository, for which tools like Mongo or Couch would now be a better fit in today's world, but were not available or not known to the previous teams.
To stay competitive, let's say we need to build more powerful methods for querying and analyzing information in our system. Based on the sheer number and variety of attributes, it seems like map/reduce is a better fit for our set of problems than gradually refactoring the system into a more relational schema.
The original source database has millions of documents, but only a small number of distinct document types. There are some commonalities across the distinct document types.
What's an effective strategy for doing a migration from a massive EAV implementation in, say, MySql, to a document-oriented store like Mongo or Couch?
I can certainly imagine an approach to attack this, but I'd really like to see a tutorial or war story to learn from someone who has already attacked this type of problem.
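For concreteness, the rough approach I can imagine is a single pass that folds each entity's attribute rows back into one document per entity and bulk-inserts the results. A sketch of that idea only; the table and column names (`entities`, `attributes`, `entity_values`) are hypothetical stand-ins for the real schema, and `mysql-connector` / `pymongo` are just example drivers:

```python
# Rough sketch: fold EAV rows into one document per entity and bulk-insert.
# Table/column names and connection details are hypothetical placeholders.
import mysql.connector
from pymongo import MongoClient
from itertools import groupby

mysql_conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="legacy_eav"
)
target = MongoClient("mongodb://localhost:27017")["migrated"]["documents"]

cursor = mysql_conn.cursor()
# Pull every attribute/value pair, ordered so each entity's rows arrive together.
cursor.execute("""
    SELECT e.id, e.doc_type, a.name, v.value
    FROM entities e
    JOIN entity_values v ON v.entity_id = e.id
    JOIN attributes a    ON a.id = v.attribute_id
    ORDER BY e.id
""")

batch = []
for (entity_id, doc_type), rows in groupby(cursor, key=lambda r: (r[0], r[1])):
    # Collapse one entity's EAV rows into a single document.
    doc = {"_id": entity_id, "type": doc_type}
    for _, _, attr_name, value in rows:
        doc[attr_name] = value  # assumes attribute names don't collide with _id/type
    batch.append(doc)
    if len(batch) >= 1000:
        target.insert_many(batch)
        batch = []
if batch:
    target.insert_many(batch)
```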
What were some strategies for doing this kind of conversion that worked well? What lessons did you learn? What pitfalls should I avoid? How did you deal with legacy apps that still expect to be able to interact with the existing database?
1 Answer
My first usage of Couch was after I had written a Ruby and Postgres web crawler (directed crawl of mp3 blogs to build a recommendation engine).
The relational schema got deeply gnarly as I tried to record ID3 metadata, audio signatures, and so on, and to detect overlaps and otherwise deduplicate. It worked, but it was so slow that I started caching the JSON API rows onto the corresponding primary ActiveRecord objects as blob fields.
I had a choice: dig in and learn Postgres performance tuning, or move to a horizontal approach. I used Nutch and Hadoop to spider the web, with the PipeMapper parsing pages in Ruby / Hpricot. That let me reuse all my parser code and just change it from saving to a normalized database to saving as JSON. I wrote a little library called CouchRest to handle the JSON and the REST URL endpoints, which I used to save the Hpricot results into CouchDB.
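The underlying move is simpler than it sounds: CouchDB is just an HTTP endpoint that accepts JSON, so saving a parsed result is one PUT per document. A minimal illustration of that idea, in Python with `requests` rather than my actual CouchRest Ruby code, with a made-up database name and fields:

```python
# Minimal illustration of saving a parsed result as a CouchDB document over
# plain HTTP. The database name, document id, and fields are made-up examples.
import requests

couch = "http://localhost:5984"
db = "crawl_results"

# Create the database if it does not exist yet (CouchDB answers 412 if it does).
requests.put(f"{couch}/{db}")

doc = {
    "type": "mp3_post",
    "url": "http://example-blog.invalid/post/123",
    "id3": {"artist": "Some Artist", "title": "Some Track"},
    "fetched_at": "2009-01-01T00:00:00Z",
}

# PUT with an explicit _id makes the write idempotent for re-crawls
# (updating an existing document would also need its current _rev).
resp = requests.put(f"{couch}/{db}/post-123", json=doc)
resp.raise_for_status()
print(resp.json())  # {'ok': True, 'id': 'post-123', 'rev': '1-...'}
```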
For that project I just ran Couch on a single EC2 node, with a small six-node Hadoop cluster populating it. It was only when I got around to building the browsing interface for the spidered data that I really got a good feel for the query capabilities.
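For reference, those query capabilities come from map/reduce views defined in design documents. A sketch of pushing one view over HTTP and querying it (again Python / `requests` purely for illustration, with made-up field and view names):

```python
# Sketch of defining and querying a CouchDB map/reduce view over HTTP.
# The design doc name, view name, and fields are illustrative only.
import requests

couch = "http://localhost:5984"
db = "crawl_results"

design_doc = {
    "language": "javascript",
    "views": {
        "tracks_by_artist": {
            # Map: emit one row per document, keyed by artist.
            "map": "function(doc) { if (doc.id3 && doc.id3.artist) emit(doc.id3.artist, 1); }",
            # Reduce: count documents per artist using the built-in reducer.
            "reduce": "_count",
        }
    },
}
# First-time creation; replacing an existing design doc needs its current _rev.
requests.put(f"{couch}/{db}/_design/crawler", json=design_doc)

# Grouped query: one row per artist with its document count.
resp = requests.get(
    f"{couch}/{db}/_design/crawler/_view/tracks_by_artist",
    params={"group": "true"},
)
for row in resp.json()["rows"]:
    print(row["key"], row["value"])
```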
It turned out to be flexible and especially well suited to OLTP applications. I quickly started using it in all my projects and eventually founded a company around the technology with two of its creators.