I have a Java module that receives the User-Agent string from an end user's browser and needs to behave slightly differently depending on the browser type, the browser version, and maybe even the operating system.
E.g.: {"FireFox", "7.0", "Win7"}, {"Safari", "3.2", "iOS9"}
I understand that the User-Agent string can vary in format for the exact same configuration, due to different plug-in installations etc.
My questions:
- Is the structure of the User-Agent well defined? If yes, where exactly can I find it? (From my understanding of the RFC, there is not much standardization here.)
- Assuming the answer to #1 is no, is there a proper way to parse it to get the info I need?
- Is there a better way to get the info I need other than the User-Agent string?
Important note: I'm talking about a web app, so my data collection abilities are limited to JavaScript.
Comments (3)
Have a look at the Java library I wrote for this purpose: Yauaa
I made a very simple servlet where you can try it out to see if it gives the answers you are looking for: https://try.yauaa.basjes.nl/
It is Apache 2 licensed and published to Maven, so using it in a Java application is really easy. It is currently used in production on one of the busiest websites in the Netherlands (where I work).
See this blog post about it: https://techlab.bol.com/making-sense-user-agent-string/
For Java, take a look at User-Agent-Utils. It's fairly compact (< 50kB) and has no dependencies.
Note that although the latest release is quite recent (1.21, released 2018-01-24), the library's page states:
And on the GitHub page it says:
No, the structure of a User-Agent string is not standardized, but it is very similar between different agents. Even so, detection still requires multiple patterns.
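To illustrate what "multiple patterns" can look like, here is a minimal sketch using only the Java standard library. The class name `UserAgentSniffer`, the `Browser` holder, and the pattern list are hypothetical and far from exhaustive. The key assumption is that order matters: Chrome-based browsers also advertise `Safari/`, and Edge also advertises `Chrome/`, so the more specific tokens are tried first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: tries an ordered list of patterns against a User-Agent string. */
final class UserAgentSniffer {

    /** Browser family plus the version text exactly as it appeared in the header. */
    static final class Browser {
        public final String name;
        public final String version;

        Browser(String name, String version) {
            this.name = name;
            this.version = version;
        }
    }

    // LinkedHashMap keeps insertion order, so specific tokens are checked before generic ones.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("Edge", Pattern.compile("Edg(?:e|A|iOS)?/(\\d+(?:\\.\\d+)*)"));
        PATTERNS.put("Opera", Pattern.compile("OPR/(\\d+(?:\\.\\d+)*)"));
        PATTERNS.put("Chrome", Pattern.compile("Chrome/(\\d+(?:\\.\\d+)*)"));
        PATTERNS.put("Firefox", Pattern.compile("Firefox/(\\d+(?:\\.\\d+)*)"));
        // Safari reports its own version in the Version/ token, not in Safari/.
        PATTERNS.put("Safari", Pattern.compile("Version/(\\d+(?:\\.\\d+)*).*Safari/"));
    }

    /** Returns the first matching browser, or empty if no pattern matches. */
    static Optional<Browser> detect(String userAgent) {
        if (userAgent == null) {
            return Optional.empty();
        }
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher m = entry.getValue().matcher(userAgent);
            if (m.find()) {
                return Optional.of(new Browser(entry.getKey(), m.group(1)));
            }
        }
        return Optional.empty();
    }
}
```

Calling `UserAgentSniffer.detect(header)` on the request's User-Agent header yields an `Optional<Browser>`. A hand-rolled list like this needs constant maintenance as browsers change their strings, which is the main argument for a maintained library or database.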
You can try the library UADetector. It is a wrapper for the User-Agent-Database of user-agent-string.info.
I would not say it is a better or worse way, but another way to detect user agents is to use JavaScript on the client side to collect information about the User-Agent and submit it to your backend via hidden HTML inputs or XMLHttpRequest. It all depends on what you want to identify. For accurate detection of web crawlers, JavaScript won't be able to help.
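On the Java side, values submitted this way arrive as ordinary form fields. The sketch below is a hedged illustration using only the standard library: it decodes a URL-encoded request body into a map. The class name `ClientHints` and the field names `platform` and `userAgent` are assumptions (they would be whatever names the client-side script uses), and in a servlet container the request's own parameter accessors would normally do this for you.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: decodes an application/x-www-form-urlencoded body. */
final class ClientHints {

    /** Splits "a=1&b=2" into decoded key/value pairs, preserving field order. */
    static Map<String, String> parseForm(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (body == null || body.isEmpty()) {
            return fields;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            // A field without '=' is treated as a key with an empty value.
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            fields.put(decode(key), decode(value));
        }
        return fields;
    }

    private static String decode(String s) {
        // URLDecoder turns '+' into a space and resolves %XX escapes (Java 10+ overload).
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }
}
```

Anything collected in the browser is supplied by the client, so it can be missing or forged just like the User-Agent header itself, and should be treated as a hint.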