Patterns for feeding data to a remote MySQL
I would like to hear from the community about a good pattern for the following problem.
I had a "do-everything" server that acted as web server, MySQL server, and crawler server. Over the last two or three weeks, monitoring tools have shown that whenever my crawlers run, the load average goes over 5 (on a 4-core server, anything up to 4.00 is fine). So I got another server and I want to move the crawlers there. My question is: once the data has been crawled on the crawler server, I have to insert it into my database. I would rather not open a remote connection and insert it directly, because I prefer to go through the Rails framework (I'm using Rails, by the way) to make it easier to create all the relationships and so on.
Problem to be solved:
One server has the crawled data (a bunch of CSV files), and I want to move it to a remote server and insert it into my database using Rails.
Restriction: I don't want to run MySQL replication (master + slave), since that would require a deeper analysis of where most of the write operations happen.
Ideas:
Move the CSVs from the crawler server to the remote server (via ssh/rsync) and import them during the day (see the first sketch after this list).
Write an API on the crawler server that my remote server can pull from (several times a day) to import the data (see the second sketch).
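For the first idea, here is a minimal sketch of the import side as a Rake task; the Item model, the CSV column names, and the /data/incoming path are illustrative assumptions, not from the question. An rsync invocation in the leading comment shows how the crawler box could push the files over.

```ruby
# lib/tasks/crawl_import.rake -- a sketch; Item, the CSV columns, and the
# /data/incoming path are assumptions. A cron entry on the crawler box would
# first push the files over, e.g.:
#   rsync -az --remove-source-files /var/crawl/out/ db-server:/data/incoming/
require "csv"
require "fileutils"

namespace :crawl do
  desc "Import CSV files that rsync dropped into /data/incoming"
  task import: :environment do
    Dir.glob("/data/incoming/*.csv").sort.each do |path|
      CSV.foreach(path, headers: true) do |row|
        # Going through the model keeps Rails validations and associations.
        Item.create!(name: row["name"], url: row["url"], price: row["price"])
      end
      # Move the file aside so a rerun does not import it twice.
      FileUtils.mv(path, path.sub("incoming", "processed"))
    end
  end
end
```

Scheduling `rake crawl:import` from cron on the database server would then complete the pattern.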
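For the second idea, a sketch of the pulling side; the crawler is assumed to expose a hypothetical JSON endpoint (GET /results?since=&lt;unix timestamp&gt; on an internal host), and the Item model and its crawled_at column are again assumptions.

```ruby
# lib/tasks/crawl_pull.rake -- a sketch of the pull variant; the endpoint
# URL, payload layout, Item model, and crawled_at column are assumptions.
require "net/http"
require "json"

namespace :crawl do
  desc "Pull new results from the crawler's API and import them"
  task pull: :environment do
    # Ask only for rows newer than what we already imported.
    since = Item.maximum(:crawled_at) || Time.at(0)
    uri = URI("http://crawler.internal:8080/results?since=#{since.to_i}")
    JSON.parse(Net::HTTP.get(uri)).fetch("results", []).each do |attrs|
      Item.create!(
        name: attrs["name"],
        url: attrs["url"],
        price: attrs["price"],
        crawled_at: Time.at(attrs["crawled_at"])
      )
    end
  end
end
```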
Any other ideas or good patterns around this theme?
With a slight variation on the second pattern you noted, you could have an API on your web-app/db server which the crawler uses to report its data. It could do this in batches, in real time, or only within a specific window of time (day/night time, etc.).
This pattern lets the crawler decide when to report the data, rather than having the web app do the 'polling' for it.
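On the Rails side, this push variant could be a small controller action; CrawlResultsController, the Item model, the X-Api-Token header, and the payload layout below are all assumptions for illustration, not something the answer specifies.

```ruby
# app/controllers/crawl_results_controller.rb -- a sketch of the push API.
# Route (config/routes.rb): post "/crawl_results", to: "crawl_results#create"
class CrawlResultsController < ApplicationController
  # API clients send no CSRF token, so skip that check for this endpoint.
  skip_before_action :verify_authenticity_token
  before_action :authenticate_crawler!

  # The crawler POSTs JSON batches whenever it decides to report in:
  #   { "results": [{ "name": "...", "url": "...", "price": "..." }, ...] }
  def create
    params.require(:results).each do |attrs|
      Item.create!(attrs.permit(:name, :url, :price))
    end
    head :created
  end

  private

  # Shared-secret header so only the crawler box can report data.
  def authenticate_crawler!
    head :unauthorized unless request.headers["X-Api-Token"] == ENV["CRAWLER_TOKEN"]
  end
end
```

Since the crawler drives the timing, it can batch its POSTs and report only in whatever window it prefers (night time, after each crawl run, etc.), which is exactly the flexibility described above.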