High-volume logging with batch saves to the database?
I want to store information about requests to my sites in a quick way that doesn't put additional strain on my database. The goal is to use this information to prevent abuse and to gather information about how users interact with the site (ip, GET/POST, url/action, timestamp).
I'm currently saving a new row to the database on each page request. However, this wastes resources on an extra database call when the server is already logging the same information to the nginx log file.
I want to know how I can handle this better. I have two ideas, but I would like to know whether there are better methods:
- A cron job that parses the access log each day and saves the entries to the database as one batch transaction.
- A RAM cache (Redis/Memcached) that stores data about each request, with a cron job that later saves it to the database.
However, if I use a key-value cache, I'm not sure how to store the data in a way that lets me retrieve all the records and insert them into the database.
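To make that concrete, the flow I have in mind looks roughly like the sketch below (redis-py, with sqlite3 only standing in for my real database; the key, table, and column names are made up), but I don't know whether this is a sensible way to do it:

```python
import json
import sqlite3
import time

import redis  # redis-py client

r = redis.Redis()  # assumes a local Redis instance

def log_request(ip, method, url):
    """On every page request: append one JSON record to a Redis list."""
    record = {"ip": ip, "method": method, "url": url, "ts": int(time.time())}
    r.rpush("request_log", json.dumps(record))

def flush_to_db(db_path="requests.db", batch=1000):
    """In the cron job: drain the list in batches and bulk-insert into the database."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS requests (ip TEXT, method TEXT, url TEXT, ts INTEGER)")
    while True:
        # Read and remove the oldest `batch` entries in one MULTI/EXEC transaction.
        pipe = r.pipeline()
        pipe.lrange("request_log", 0, batch - 1)
        pipe.ltrim("request_log", batch, -1)
        raw, _ = pipe.execute()
        if not raw:
            break
        rows = [(d["ip"], d["method"], d["url"], d["ts"])
                for d in (json.loads(x) for x in raw)]
        db.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
        db.commit()
    db.close()
```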
I also don't know how to parse the access log in a way that avoids re-reading entries I have already processed.
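The only approach I can think of is to remember how far the previous run got (byte offset plus inode, so logrotate can be detected). A rough, untested sketch of what I mean (paths are examples):

```python
import json
import os

ACCESS_LOG = "/var/log/nginx/access.log"  # example path
STATE_FILE = "/var/tmp/access_log.state"  # remembers how far the last run got

def read_new_lines():
    """Yield only the lines appended to the access log since the previous run."""
    try:
        with open(STATE_FILE) as f:
            state = json.load(f)
    except (FileNotFoundError, ValueError):
        state = {"inode": None, "offset": 0}

    st = os.stat(ACCESS_LOG)
    # If the inode changed (logrotate replaced the file), start from the beginning.
    if state["inode"] != st.st_ino:
        state = {"inode": st.st_ino, "offset": 0}

    # Binary mode so tell()/seek() work with plain byte offsets.
    with open(ACCESS_LOG, "rb") as log:
        log.seek(state["offset"])
        for line in log:
            yield line.decode("utf-8", errors="replace")
        state["offset"] = log.tell()

    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```

(This sketch ignores whatever was appended to a rotated-out file after the last run, so I'm not sure it's good enough.)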
How can I record access attempts in an efficient way?
Comments (1)
A common pattern is to have a simple table for plain writes and to move the logs every minute/hour into a main set of tables. The main set can be highly normalized and indexed, alongside a simple de-normalized table (to save space).
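A rough sketch of that move step, assuming invented table names requests_staging and requests_main (sqlite3 here is only a stand-in for whatever database you use; the two statements are the generic pattern):

```python
import sqlite3  # stand-in for the real database

def move_staging_to_main(db_path="requests.db"):
    """Move rows from the plain write table into the indexed main table."""
    db = sqlite3.connect(db_path)
    try:
        with db:  # one transaction: either both statements apply or neither
            db.execute("""
                INSERT INTO requests_main (ip, method, url, ts)
                SELECT ip, method, url, ts FROM requests_staging
            """)
            db.execute("DELETE FROM requests_staging")
    finally:
        db.close()
```

With concurrent writers you would normally capture the current maximum id first and only move/delete rows up to it, so nothing written during the move is lost.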
Another pattern is to have a single simple, big table and run a summary query against it every minute/hour. The simple table can be indexed by date (remember to use a native date/time type).
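A sketch of that summary variant, again with invented names (requests_raw, requests_hourly), rolling the last hour up into per-ip/per-url counts:

```python
import sqlite3  # stand-in for the real database

def summarize_last_hour(db_path="requests.db"):
    """Roll the raw rows from the last hour up into per-hour counts per ip and url."""
    db = sqlite3.connect(db_path)
    with db:
        db.execute("""
            INSERT INTO requests_hourly (hour, ip, url, hits)
            SELECT strftime('%Y-%m-%d %H:00', ts, 'unixepoch'), ip, url, COUNT(*)
            FROM requests_raw
            WHERE ts >= CAST(strftime('%s', 'now', '-1 hour') AS INTEGER)
            GROUP BY strftime('%Y-%m-%d %H:00', ts, 'unixepoch'), ip, url
        """)
    db.close()
```

As written this double-counts an hour if it runs twice, which is exactly what the last tip below addresses.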
A final tip: make the architecture and scripts idempotent (if you run them multiple times, the data is still valid). Blips are very common, and simply re-running the task for a specific minute/hour/day window can quickly fix everything, instead of a massive rebuild.
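One simple way to get that idempotency, using the same invented tables as above: recompute a whole window inside one transaction, deleting it before re-inserting, so re-running the same hour never double-counts:

```python
import sqlite3
import time

def rebuild_window(db_path, start_ts, end_ts):
    """Recompute the rollup for [start_ts, end_ts); safe to run any number of times."""
    def label(ts):
        # Format a unix timestamp the same way the rollup stores the hour column.
        return time.strftime("%Y-%m-%d %H:00", time.gmtime(ts))

    db = sqlite3.connect(db_path)
    try:
        with db:  # one transaction: the old window disappears only if the new one lands
            db.execute(
                "DELETE FROM requests_hourly WHERE hour >= ? AND hour < ?",
                (label(start_ts), label(end_ts)),
            )
            db.execute("""
                INSERT INTO requests_hourly (hour, ip, url, hits)
                SELECT strftime('%Y-%m-%d %H:00', ts, 'unixepoch'), ip, url, COUNT(*)
                FROM requests_raw
                WHERE ts >= ? AND ts < ?
                GROUP BY strftime('%Y-%m-%d %H:00', ts, 'unixepoch'), ip, url
            """, (start_ts, end_ts))
    finally:
        db.close()
```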