Script or library to find contact information on a website
Does anyone know a script/recipe/library to find most relevant contact information on a website?
Some possible cases:
- Find contact phone number on a personal web page
- Find owner email address on a blog
- Find url of the contact page
Check out WSO2's Mashup Server. You can run it on your local machine and follow its scraping tutorial. You can pass the dynamic parameters you need into the <http> element of the scraper to loop through multiple sites running the same scrape, then push everything to a collection target (an AJAX application for capturing the information, or storage inside the WSO2 server). You can write very complex search patterns using XPath and XSLT to capture the information you want. I don't have enough information about the specific sites you are scraping to help with the script, but whichever way you go, it's going to take a lot of trial and error until you get the result you are looking for.
Happy scraping!
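Setting the WSO2-specific configuration aside, the XPath idea can be sketched in plain Python. Everything here (the markup, the addresses) is made up for illustration; a real scrape would run these selections against fetched pages.

```python
# A minimal sketch of XPath-style extraction: select <a> elements and
# keep the ones whose href is a mailto: link. All data is hypothetical.
import xml.etree.ElementTree as ET

XHTML = """<html><body>
  <p>Questions? <a href="mailto:support@example.com">Support</a></p>
  <p><a href="/about">About us</a></p>
  <p>Sales: <a href="mailto:sales@example.com">email us</a></p>
</body></html>"""

def extract_mailto_addresses(markup: str) -> list[str]:
    root = ET.fromstring(markup)
    addresses = []
    # ElementTree supports a limited XPath subset; './/a' selects every
    # anchor element anywhere in the tree.
    for anchor in root.findall(".//a"):
        href = anchor.get("href", "")
        if href.startswith("mailto:"):
            addresses.append(href[len("mailto:"):])
    return addresses

print(extract_mailto_addresses(XHTML))
# → ['support@example.com', 'sales@example.com']
```

Note that ElementTree expects well-formed XML; for real-world HTML you would need a tolerant parser in front of the XPath query.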
I'm not aware of any libraries that do this.
Hm, I would use regular expressions to match phone numbers and email addresses, combined with a web spider that walks the site, and then a method for ranking the contact information.
Typically, contact information will also be paired with one of a few common labels such as "Support", "Support email", "Sales", etc. There are probably a dozen or so versions of these that will cover 95% of all sites in English.
So, basically, I would start by building a simple recursive web spider that walks all the publicly accessible pages in a given domain, parses the HTML for email addresses and phone numbers, builds a list of them, and then ranks them based on whether or not they appear near any of the common labels.
It won't be perfect, but then again, that's part of the value of the algorithm - making it smarter and tweaking it over time until it gets better.
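The approach above (spider + regex + label-based ranking) can be sketched in Python. The in-memory SITE dict stands in for real HTTP fetches, and the regexes and label list are deliberately simplified placeholders; a production spider would need far more robust patterns and politeness handling.

```python
import re

# Hypothetical in-memory "site": page path -> HTML, standing in for
# real HTTP fetches so the sketch stays self-contained.
SITE = {
    "/": '<a href="/contact">Contact</a> Welcome!',
    "/contact": 'Support email: a@example.com. Also b@example.com. '
                'Call sales: 555-0100',
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")   # toy pattern for the example
LINK_RE = re.compile(r'href="(/[^"]*)"')    # same-domain relative links
LABELS = ("support", "sales", "contact")    # common labels to rank by

def crawl(path="/", seen=None):
    """Recursively walk same-domain links, collecting (match, score) pairs."""
    seen = seen if seen is not None else set()
    if path in seen or path not in SITE:
        return []
    seen.add(path)
    html = SITE[path]
    found = []
    for match in list(EMAIL_RE.finditer(html)) + list(PHONE_RE.finditer(html)):
        # Rank higher when a common label appears near the match.
        window = html[max(0, match.start() - 40):match.end() + 40].lower()
        score = sum(label in window for label in LABELS)
        found.append((match.group(), score))
    for link in LINK_RE.findall(html):
        found.extend(crawl(link, seen))
    return sorted(found, key=lambda pair: -pair[1])

print(crawl())
# → [('a@example.com', 2), ('b@example.com', 2), ('555-0100', 1)]
```

The proximity window is the "tweakable" part the answer mentions: widening it, weighting labels differently, or preferring pages whose URL contains "contact" are all ways to make the ranking smarter over time.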