How to process whois data
I need to put whois data into a table with fields such as
- registrant,
- created date,
- expiry date, etc.
I have a script that extracts data from whois servers, but the output is different for each domain extension.
For example, for .com domains the registrant details come as a single address block, while for .org domains they come as separate fields: registrant name, street1, street2, street3, etc.
So I'm not able to extract the registrant details as a unit to be put in the db.
I heard somewhere that if we get the data as XML we would be able to extract it. Can somebody help me get around this? Thanks!
Comments (3)
Actually, the problem is a bit larger than that.
The WHOIS service is defined by RFC 3912. It is a very basic request protocol that does not define the format of the response at all. So the answers often reflect the format of the database containing the data, and you may get a different syntax for each database. Since WHOIS can be used for whatever content you want, you cannot make many assumptions about the format of the answer you will get. Hopefully, however, you can expect to receive parseable content, and similarly formatted answers for each request.
So you need to develop parsing logic for each server, which you will have to do in a very empirical manner.
However, here are a few tips for your development that come from the RFC:
- you need to send the request over TCP port 43, as a single line terminated by the ASCII characters CR+LF
- you must treat the end of the TCP connection, and only that, as meaning the answer is finished.
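The two tips above can be sketched in Python roughly as follows. This is a minimal sketch, not a complete client: the function names are made up for illustration, the query is assumed to be plain ASCII, and the answer is returned as raw bytes because the protocol does not say how it is encoded.

```python
import socket

WHOIS_PORT = 43  # TCP port assigned to WHOIS (RFC 3912)


def build_query(name: str) -> bytes:
    """Return the request: a single line terminated by CR+LF."""
    return name.encode("ascii") + b"\r\n"


def read_until_close(conn: socket.socket) -> bytes:
    """Collect the answer until the peer closes the connection."""
    chunks = []
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # empty read: the other side closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


def whois_query(server: str, name: str, timeout: float = 10.0) -> bytes:
    """Send one query to `server` and return the raw, undecoded answer."""
    with socket.create_connection((server, WHOIS_PORT), timeout=timeout) as conn:
        conn.sendall(build_query(name))
        return read_until_close(conn)
```

A call would look something like `whois_query("whois.verisign-grs.com", "example.com")`; which server to ask for which extension is something you have to look up per TLD.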
About domain names specifically: the former restriction to ASCII led some registrants to use Punycode to encode certain strings (accented ones, for example) in the DNS, so you may want to be prepared to meet these in Whois answers as well. The existence of Internationalized Domain Names since 2003 will require you to support Unicode. The algorithms that convert names are complex; RFC 3490 should give you some useful details about this.
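As a small illustration of the conversion, Python ships an `idna` codec that implements RFC 3490, so you do not have to write the algorithm yourself. The helper names below are made up for the example, and `bücher.example` is just a sample name.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII (Punycode) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


print(to_ascii("bücher.example"))           # xn--bcher-kva.example
print(to_unicode("xn--bcher-kva.example"))  # bücher.example
```

Names that are already plain ASCII pass through unchanged, so it should be safe to apply `to_ascii` to every name before sending the query.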
Good luck!
You need to detect the format and use a different regular expression for each one. Alternatively, as you mentioned, you can use XML or even JSON APIs:
http://whoisxmlapi.com/
http://www.domaintools.com/api/docs/
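If you go the regular-expression route, the detection could look something like the sketch below. Both layouts are invented for the example (one registrant block, one set of `Registrant ...:` lines), so the patterns have to be adapted to what each server really returns; line endings are assumed to be normalized to `\n` first.

```python
import re
from typing import Optional

# Hypothetical layout A: "Registrant:" followed by an indented free-form block.
BLOCK_FORMAT = re.compile(
    r"^Registrant:[ \t]*\n(?P<block>(?:[ \t]+\S.*\n?)+)", re.MULTILINE
)
# Hypothetical layout B: one "Registrant <Field>: value" line per field.
FIELD_FORMAT = re.compile(
    r"^Registrant (?P<key>[^:\n]+):[ \t]*(?P<value>.*)$", re.MULTILINE
)


def extract_registrant(text: str) -> Optional[str]:
    """Return the registrant details as one comma-separated string, or None."""
    match = BLOCK_FORMAT.search(text)
    if match:  # layout A detected
        lines = [line.strip() for line in match.group("block").splitlines()]
        return ", ".join(line for line in lines if line)
    # Otherwise try layout B and skip fields that are empty.
    values = [m.group("value").strip() for m in FIELD_FORMAT.finditer(text)]
    values = [v for v in values if v]
    return ", ".join(values) if values else None
```

Either layout then ends up as a single string that fits in one registrant column.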
You need to extend your database and processing to better deal with the problem.
The data provided by the remote services comes in different formats, as you've already noted. So you need to separate the concerns of fetching the data and parsing it, because the two are independent of each other. For example, the format for one TLD can change over time.
So first of all, fetch the plain-text data for each domain and store it together with its metadata.
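One way to do this fetch-and-store step is sketched below with SQLite. The table name and the metadata columns (TLD, server, fetch time) are assumptions chosen for the example, not a fixed list; store whatever you will need later to pick a parser.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: the raw answer plus the metadata needed to parse it later.
SCHEMA = """
CREATE TABLE IF NOT EXISTS whois_raw (
    id         INTEGER PRIMARY KEY,
    domain     TEXT NOT NULL,
    tld        TEXT NOT NULL,   -- used later to choose a parser
    server     TEXT NOT NULL,   -- whois server that produced the answer
    fetched_at TEXT NOT NULL,   -- UTC timestamp, ISO 8601
    raw_text   TEXT NOT NULL    -- the answer exactly as received
)
"""


def store_raw(db: sqlite3.Connection, domain: str, server: str, raw_text: str) -> int:
    """Store one unparsed answer with its metadata; return the new row id."""
    tld = domain.rsplit(".", 1)[-1].lower()
    fetched_at = datetime.now(timezone.utc).isoformat()
    cur = db.execute(
        "INSERT INTO whois_raw (domain, tld, server, fetched_at, raw_text) "
        "VALUES (?, ?, ?, ?, ?)",
        (domain, tld, server, fetched_at, raw_text),
    )
    db.commit()
    return cur.lastrowid
```

Keeping the answer untouched means you can re-run an improved parser later without querying the whois server again.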
You can then do the parsing later, in a second processing step. You can use the metadata you already stored to decide which parsing algorithm you need. That also helps you maintain your application over time.
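Choosing the parser from the metadata could be done with a small registry, for instance keyed by TLD. This is only a sketch: the registry, the decorator and the `Key: value` layout assumed for `.org` are illustrative, and a real parser has to be written against actual answers.

```python
from typing import Callable, Dict

Parser = Callable[[str], Dict[str, str]]
PARSERS: Dict[str, Parser] = {}  # maps a TLD to its parsing function


def parser_for(tld: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parsing function for one TLD."""
    def register(func: Parser) -> Parser:
        PARSERS[tld] = func
        return func
    return register


@parser_for("org")
def parse_org(raw_text: str) -> Dict[str, str]:
    # Hypothetical "Key: value" layout; check real answers empirically.
    fields = {}
    for line in raw_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


def parse_record(tld: str, raw_text: str) -> Dict[str, str]:
    """Pick the parser from the stored metadata and apply it."""
    try:
        parser = PARSERS[tld]
    except KeyError:
        raise LookupError(f"no parser registered for .{tld}") from None
    return parser(raw_text)
```

When a TLD changes its format, only the function registered for that TLD has to change, and the stored raw answers can simply be parsed again.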
Once the parsing has succeeded, you have the normalized format, which is what you are aiming for.
Besides the technical processing, you should pay attention to the usage conditions of the whois service(s). Not everything that is technically possible is legally or morally acceptable. Take care and treat other people's personal records with the respect they deserve. Protect the data you collect, e.g. archive and scramble or lock away data you no longer need for your ongoing processing.