How can I extract links from a web page using JSP?
My requirement is to extract all links (the "a href" values) from a web page dynamically. I am using JSP. To be more specific, I am building a meta search engine in JSP, so when a user enters a query, I have to extract the links from the search results pages of Yahoo, Ask, Google, Mamma, etc.
To get a page as a string, this is the code I am using right now:
    try {
        // Search results URL (mamma.com here, despite the variable name)
        String sUrl_yahoo = "http://www.mamma.com/result.php?type=web&q=hai+bird&j_q=&l=";
        String nextLine;
        String webPage;
        StringBuffer wPage = new StringBuffer(30 * 1024);

        java.net.URL siteURL = new java.net.URL(sUrl_yahoo);
        java.net.URLConnection siteConn = siteURL.openConnection();
        java.io.BufferedReader in = new java.io.BufferedReader(
                new java.io.InputStreamReader(siteConn.getInputStream()));
        try {
            // readLine() strips line terminators, so add them back
            while ((nextLine = in.readLine()) != null) {
                wPage.append(nextLine).append('\n');
            }
        } finally {
            // Close the reader even if reading fails
            in.close();
        }
        webPage = wPage.toString();
        out.println(webPage);
    } catch (Exception e) {
        out.println("Error: " + e);
    }
Now, my request is: can you suggest a way to extract the links from the String webPage?
Or is there some other way to extract those links? I would prefer to do it without using any external packages.
Comments (1)
One quick solution would be to use a regex Matcher object to pull the URLs out:
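A minimal sketch of this approach, using only `java.util.regex`. The helper class `LinkExtractor`, its method `extractLinks`, and the pattern itself are illustrative assumptions rather than anything from the original answer. The pattern looks for `<a ... href=...>` tags and accepts double-quoted, single-quoted, or unquoted values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the pattern are illustrative assumptions.
class LinkExtractor {

    // Matches an <a> tag's href attribute whose value is double-quoted,
    // single-quoted, or unquoted. Case-insensitive, so <A HREF=...> also matches.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values in the order they appear in the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Only the capture group for the quoting style that matched is non-null.
            String link = m.group(1) != null ? m.group(1)
                        : m.group(2) != null ? m.group(2)
                        : m.group(3);
            links.add(link);
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">A</a>"
                + "<A HREF='/b'>B</A><a class=x href=c.html>C</a>";
        // Prints the three extracted href values as a list.
        System.out.println(extractLinks(page));
    }
}
```

In the JSP from the question, this could be called as `LinkExtractor.extractLinks(webPage)`, assuming the class is compiled into the web application. The links are returned as written in the page, so relative ones such as `/b` would still need resolving against the page URL, for instance with the `java.net.URL(URL context, String spec)` constructor.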
You might have to play around with the URL pattern a little to make it more airtight, but this is a quick and dirty solution without using external libraries.