Scalable way to log page request data from a PHP application?
A web application I am developing (in PHP) requires the ability to log each page request.
Just like a normal access_log, it will store details like the URL requested, source IP address, and date/time, but I also need it to store the user ID of the logged-in user (which is stored in a PHP session variable).
This data will then be queried to create site-wide or per-user analytics reports as required at a later date - things such as total number of visits/unique visits, page views in a certain time period, geo-locating the IP addresses and looking at locations, most active times of day, most active members, etc.
The obvious thing to do would be to have a MySQL insert statement on each page, but if the application is receiving thousands of requests per second this is going to be a huge bottleneck on the database, so I am looking at alternative, scalable ways of doing this without big infrastructure requirements.
A few of the ideas I've had are:
1) Work on a way for Nginx to be able to log the user_id from the session/application in the normal web server access_log, which can be parsed and loaded into a database periodically (nightly). This feels like a bit of a hack and will need to be done on each web server as the system scales out.
2) Log each page request into Redis, which has high write speeds - the problem with this is the lack of ability to query the data at a later date.
3) Log each page request into either Memcache/Redis acting as a cache (or a message queue) and from there it would be regularly extracted, inserted into MySQL and removed.
4) Would something like MongoDB which has more query capability be suitable?
I'm interested in how you would approach this and if anyone has any experience of a similar application (or has come across anything online).
I'm also interested in thoughts on how the data could be suitably structured to be stored in memcache/redis.
Thanks
It is certainly doable with a variety of methods. I'll address each listed option as well as add some additional commentary.
1) If NGinx can do it, let it. I do it with Apache as well as JBOSS and Tomcat. I then use syslog-ng to collect them centrally and process from there. For this route I'd suggest a delimited log message format such as tab-separated as it makes it easier to parse and read. I don't know about it logging PHP variables, but it can certainly log headers and cookie information. If you are going to use NGinx logging at all I'd recommend this route if possible - why log twice?
2) There is no "lack of ability to query the data at a later date"; more on that below.
3) This is an option but whether or not it is useful depends on how long you want to keep the data and how much cleanup you want to write. More below.
4) MongoDB could certainly work. You will have to write the queries, and they are not simple SQL commands.
Now, on to storing the data in Redis. I currently log things with syslog-ng as noted and use a program destination to parse the data and stuff it into Redis. In my case I've got several grouping criteria such as by vhost and by cluster, so my structures may be a bit different.
The question you need to address first is "what data do I want out of this data"? Some of it will be counters such as traffic rates. Some of it will be aggregates, and still more will be things like "order my pages by popularity".
I'll demonstrate some of the techniques to easily get this into redis (and thus back out).
First, let us consider the traffic over time stats. First decide on the granularity. Do you want per-minute stats or will per-hour stats suffice? Here is one way to track a given URL's traffic:
Store the data in a sorted set using the key "traffic-by-url:URL:YYYY-MM-DD". In this sorted set you'll use the zincrby command and supply the member "HH:MM". For example, in Python, where "r" is your Redis connection:
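(A minimal sketch using the redis-py client rather than the answer's original snippet; note that redis-py 3.x puts the increment amount before the member.)

    import redis

    r = redis.Redis()  # assumes a Redis instance on localhost

    # Increment the "01:04" member of the day's sorted set for /foo.html by 1.
    # redis-py 3.x signature: zincrby(name, amount, value); older releases
    # used zincrby(name, value, amount).
    r.zincrby("traffic-by-url:/foo.html:2011-05-18", 1, "01:04")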
This example increases the counter for the url "/foo.html" on the 18th of May at 1:04 in the morning.
To retrieve data for a specific day, you can call zrange on the key ("traffic-by-url:URL:YYYY-MM-DD") to get the sorted set from least popular to most popular. To get the top 10, for example, you'd use zrevrange and give it the range. zrevrange returns a reverse sort, so the most-hit entries will be at the top. Several more sorted set commands are available that allow you to do nice queries such as pagination, getting a range of results by minimum score, etc.
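As a quick illustration of the retrieval side, continuing the hypothetical /foo.html key from the sketch above:

    # Ten busiest minutes for /foo.html on 2011-05-18, most-hit first.
    top10 = r.zrevrange("traffic-by-url:/foo.html:2011-05-18", 0, 9, withscores=True)
    for minute, hits in top10:
        print(minute.decode(), int(hits))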
You can simply alter or extend your key name to handle different temporal windows. By combining this with zunionstore you can automatically roll up to less granular time periods. For example, you could do a union of all keys in a week or month and store it in a new key like "traffic-by-url:monthly:URL:YYYY-MM". By doing the above on all URLs in a given day you can get daily totals. Of course, you could also have a daily total traffic key and increment that. It mostly depends on when you want the data to be input - offline via logfile import or as part of the user experience.
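A sketch of that roll-up with zunionstore, assuming the per-day keys for the month already exist (key names follow the hypothetical scheme above; missing keys are simply treated as empty):

    # Sum the daily sorted sets for /foo.html into one monthly sorted set.
    daily_keys = ["traffic-by-url:/foo.html:2011-05-%02d" % day for day in range(1, 32)]
    r.zunionstore("traffic-by-url:monthly:/foo.html:2011-05", daily_keys, aggregate="SUM")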
I'd recommend against doing much during the actual user session as it extends the time it takes for your users to experience it (and server load). Ultimately that will be a call based on traffic levels and resources.
As you can imagine, the above storage scheme can be applied to any counter-based stat you want or determine. For example, change URL to userID and you have per-user tracking.
You could also store logs raw in Redis. I do this for some logs, storing them as JSON strings (I have them as key-value pairs). Then I have a second process that pulls them out and does things with the data.
For storing raw hits you could also use a sorted set, using the epoch time as the score, and easily grab a temporal window using the zrange/zrevrange commands. Or store them in a key that is based on the user ID. Plain sets would work for this, as would sorted sets.
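One hedged way that could look, using a per-user sorted set scored by epoch time (the key name and hit fields are illustrative):

    import json
    import time

    hit = {"url": "/foo.html", "ip": "203.0.113.7", "ts": int(time.time())}

    # One sorted set per user, scored by epoch time, so a time window can be
    # pulled back out later with zrangebyscore / zrevrangebyscore.
    r.zadd("hits-by-user:42", {json.dumps(hit): hit["ts"]})

    # All of this user's hits from the last hour.
    recent = r.zrangebyscore("hits-by-user:42", int(time.time()) - 3600, "+inf")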
Another option I've not discussed, but which may be useful for some of your data, is storing it as a hash. This could be useful for storing detailed information about a given session, for example.
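For instance, session details could be kept in a hash along these lines (key and field names are just an assumption):

    # One hash per session id, holding whatever per-session detail you need.
    # redis-py 3.5+ accepts mapping=; older clients use hmset instead.
    r.hset("session:abc123", mapping={
        "user_id": "42",
        "ip": "203.0.113.7",
        "started": "1305680640",
    })
    details = r.hgetall("session:abc123")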
If you really want the data in a database, try using Redis' Pub/Sub feature and have a subscriber that parses it into a delimited format and dumps to a file. Then have an import process that uses the copy command (or equivalent for your DB) to import in bulk. Your DB will thank you.
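A rough sketch of such a subscriber (the channel name, field layout, and output file are assumptions):

    import csv
    import json

    pubsub = r.pubsub()
    pubsub.subscribe("page-hits")

    # Append each published hit as a tab-delimited row, ready for a bulk
    # LOAD DATA INFILE / COPY import into the database.
    with open("hits.tsv", "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for message in pubsub.listen():
            if message["type"] != "message":
                continue
            hit = json.loads(message["data"])
            writer.writerow([hit["ts"], hit["url"], hit["ip"], hit.get("user_id", "")])
            f.flush()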
A final bit of advice here (I've probably taken enough of your mental time already) is to make judicious and liberal use of the expire command. Using Redis 2.2 or newer you can set expirations even on counter keys. The big advantage here is automatic data cleanup. Imagine you follow a scheme like I've outlined above. By using the expire command you can automatically purge old data. Perhaps you want hourly stats for up to 3 months, then only daily stats; daily stats for 6 months, then monthly stats only. Simply expire your hourly keys after three months (86400*90) and your daily keys after six months (86400*180), and you won't need to do cleanup.
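In redis-py terms that could look roughly like this (key names are hypothetical, retention periods as in the example above):

    DAY = 86400

    # Keep hourly stats ~3 months and daily stats ~6 months; Redis drops the
    # keys automatically when the TTL runs out.
    r.expire("traffic-by-url:hourly:/foo.html:2011-05-18", DAY * 90)
    r.expire("traffic-by-url:/foo.html:2011-05-18", DAY * 180)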
For geotagging I do offline processing of the IPs. Imagine a sorted set with this key structure: "traffic-by-ip:YYYY-MM-DD". Using the IP as the element and the zincrby command noted above, you get per-IP traffic data. Now, in your report, you can get the sorted set and do lookups of the IPs. To save traffic when doing the reports, you could set up a hash in Redis that maps the IP to the location you want - for example, "geo:country" as the key, the IP as the hash member, and the country code as the stored value.
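A small sketch of that lookup path (the IP, date, and country values are made up):

    # Per-IP counter for the day, same zincrby pattern as before.
    r.zincrby("traffic-by-ip:2011-05-18", 1, "203.0.113.7")

    # Offline job fills a hash mapping IP -> country code.
    r.hset("geo:country", "203.0.113.7", "GB")

    # Report time: walk the sorted set and join against the hash.
    for ip, hits in r.zrevrange("traffic-by-ip:2011-05-18", 0, -1, withscores=True):
        country = r.hget("geo:country", ip)
        print(ip.decode(), int(hits), country)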
A big caveat I would add is that if your traffic level is very high you may want to run two instances of Redis (or more, depending on traffic). The first would be the write instance; it would not have the bgsave option enabled. If your traffic is pretty high you'll always be doing a bgsave. This is what I recommend the second instance for. It is a slave of the first and it does the saves to disk. You can also run your queries against the slave to distribute load.
I hope that gives you some ideas and things to try out. Play around with the different options to see what works best for your specific needs. I am tracking a lot of stats on a high traffic website (and also MTA log stats) in redis and it performs beautifully - combined with Django and Google's Visualization API I get very nice looking graphs.
When you use MongoDB for logging, the concern is lock contention caused by high write throughput. Although MongoDB's inserts are fire-and-forget style by default, calling a lot of insert()s causes heavy write lock contention. This can affect application performance and prevent readers from aggregating/filtering the stored logs.
One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are meant to be launched on every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so they're a lot more efficient than writing directly from the apps. This link describes how to put the logs into Fluentd from a PHP program.
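As a hedged illustration of the same idea (shown in Python for consistency with the earlier snippets; the linked article covers the PHP client), the fluent-logger package can ship a structured record to a local Fluentd agent, with the tag, port, and field names here being assumptions:

    from fluent import sender

    # Default Fluentd forward port is 24224; "app" becomes the tag prefix.
    logger = sender.FluentSender("app", host="localhost", port=24224)
    logger.emit("page_request", {"url": "/foo.html", "ip": "203.0.113.7", "user_id": 42})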
Here are some tutorials about MongoDB + Fluentd.
Send the logging information to syslog-ng :)