Retrieving keywords from the META tag of an HTML web page using Java
I want to retrieve all the content words from an HTML web page, and all the keywords contained in the META tag of the same page, using Java.
For example, consider this HTML source:
<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document.
<br>
It has just 2 'lines'.
</body>
</html>
The content words here are: my, very, short, html, document, it, has, just, lines
Note: punctuation and the number '2' are excluded.
The keywords here are: deception, intricacy, treachery
I have created a class for this purpose called WebDoc; this is as far as I have been able to get.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;
public class WebDoc {
    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        //URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line: strip the tags, then store what is left.
            contentWords.add(RemoveTag(inputLine));
            //System.out.println(RemoveTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }

    // Removes anything that looks like a tag, plus any '&' characters.
    public String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>","");
        html = html.replaceAll("&","");
        return html;
    }

    public Set<String> getKeywords() {
        //NO IDEA !
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}
So, after RedSoxFan's answer about the meta keywords, you only need to split your content lines.
You can use a similar method there: instead of adding each whole line to the set with contentWords.add(...), combine the following calls.
.split(...) splits your line at all non-letters (I hope this works, please try it and report back), giving back an array of substrings, each of which should contain only letters, with some empty strings in between.
Arrays.asList(...) wraps this array in a list.
addAll(...) adds all the elements of this list to the set, without duplicates.
At the end you should delete the empty string "" from your contentWords set.
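The splitting steps described above could look roughly like the sketch below. The class name ContentWordsSketch, the helper wordsOf and the pattern "[^a-zA-Z]+" are assumptions made for illustration, not the snippet originally posted. Lower-casing is added only because the expected content words in the question are all lower case.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, used only to illustrate the split/asList/addAll idea.
class ContentWordsSketch {

    // Splits one tag-free line at every run of non-letters and returns the
    // lower-cased words. The regex "[^a-zA-Z]+" is an assumed pattern.
    static Set<String> wordsOf(String line) {
        Set<String> words = new TreeSet<>();
        String[] parts = line.toLowerCase(Locale.ROOT).split("[^a-zA-Z]+");
        words.addAll(Arrays.asList(parts)); // the set silently drops duplicates
        words.remove("");                   // a leading separator or empty line yields ""
        return words;
    }

    public static void main(String[] args) {
        Set<String> contentWords = new TreeSet<>();
        contentWords.addAll(wordsOf("My very short html document."));
        contentWords.addAll(wordsOf("It has just 2 'lines'."));
        // prints [document, has, html, it, just, lines, my, short, very]
        System.out.println(contentWords);
    }
}
```

In getContents() this would mean passing RemoveTag(inputLine) to such a helper and calling addAll on the result, rather than adding the whole line as a single element.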
Process each line and use
If you need an explanation, let me know.
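One possible way to process each line for the META keywords is sketched below with a regular expression. This is only an illustration under stated assumptions: the class name MetaKeywordsSketch, the helper keywordsOf and the pattern itself are hypothetical. It also assumes that the whole meta tag sits on one line and that the name attribute comes before content, as in the example page. A real HTML parser would be more robust for arbitrary pages.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate per-line keyword extraction.
class MetaKeywordsSketch {

    // Matches <meta name="keywords" content="..."> with optional spaces around '='
    // and either quote style. Assumes name precedes content on a single line.
    private static final Pattern META_KEYWORDS = Pattern.compile(
        "<meta\\s+name\\s*=\\s*[\"']keywords[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the comma-separated keywords found in one line of HTML,
    // trimmed and lower-cased; an empty set if the line has no such tag.
    static Set<String> keywordsOf(String line) {
        Set<String> keywords = new TreeSet<>();
        Matcher m = META_KEYWORDS.matcher(line);
        while (m.find()) {
            for (String k : m.group(1).split(",")) {
                String trimmed = k.trim().toLowerCase(Locale.ROOT);
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String line = "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">";
        // prints [deception, intricacy, treachery]
        System.out.println(keywordsOf(line));
    }
}
```

Inside getKeywords() the same read loop as in getContents() could call such a helper on every line and collect the results into _keyWords.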