Retrieving keywords from the META tag of an HTML web page using Java

Posted on 2024-10-18 22:48:54

I want to retrieve all the content words from an HTML web page, along with all the keywords contained in the META tag of the same page, using Java.
For example, consider this HTML source code:

<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document. 
<br>
It has just 2 'lines'.
</body>
</html>

The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines

Note: The punctuation and the number '2' are ruled out.

The KEYWORDS here are: deception, intricacy, treachery

I have created a class for this purpose called WebDoc; this is as far as I have been able to get.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {

    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        //URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;

        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            contentWords.add(RemoveTag(inputLine));
            //System.out.println(RemoveTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }    

    public String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>","");
        html = html.replaceAll("&","");
        return html;
    }



    public Set<String> getKeywords() {
        //NO IDEA !
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}

Comments (2)

情话难免假 2024-10-25 22:48:54

So, after the answer from RedSoxFan about the meta-keywords, you only need to split your content lines.
You can use a similar method there:

Instead of

contentWords.add(RemoveTag(inputLine));

use

contentWords.addAll(Arrays.asList(RemoveTag(inputLine).split("[^\\p{L}]+")));
  • .split(...) splits your line at every run of non-letters (I hope this works, please try and report), giving back an array of substrings, each of which should contain only letters, possibly along with an empty string (for example when the line starts with a non-letter).
  • Arrays.asList(...) wraps this array in a list (this needs import java.util.Arrays).
  • addAll(...) adds all the elements of this list to the set, skipping duplicates.

At the end you should delete the empty string "" from your contentWords set.
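
The splitting steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the class name ContentWordSketch and the helper wordsOf are hypothetical and not part of the original WebDoc class, the words are lowercased because the expected output in the question is lowercase, and tags are replaced by a space (rather than by nothing, as RemoveTag does) so that words on either side of a tag do not fuse together.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class ContentWordSketch {

    // Hypothetical helper (not in the original WebDoc class):
    // strips tags, lowercases, splits on runs of non-letters, drops "".
    static Set<String> wordsOf(String line) {
        // Assumption: a tag opens and closes on the same line.
        String text = line.replaceAll("<[^>]*>", " ");
        Set<String> words = new TreeSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")));
        // split() can yield an empty string (e.g. for an empty line,
        // or when the line starts with a non-letter).
        words.remove("");
        return words;
    }

    public static void main(String[] args) {
        Set<String> all = new TreeSet<>();
        all.addAll(wordsOf("My very short html document. "));
        all.addAll(wordsOf("<br>"));
        all.addAll(wordsOf("It has just 2 'lines'."));
        // prints [document, has, html, it, just, lines, my, short, very]
        System.out.println(all);
    }
}
```

Inside getContents(), the same idea would amount to calling contentWords.addAll(...) once per line and removing "" after the loop. A TreeSet keeps the words sorted; a HashSet would do if order does not matter.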

七秒鱼° 2024-10-25 22:48:54

Process each line and use

// Needs: import java.util.HashSet; import java.util.Set;
public Set<String> getKeywords(String str) {
    Set<String> s = new HashSet<String>();
    str = str.trim();
    if (str.toLowerCase().startsWith("<meta ")) {
        if (str.toLowerCase().matches("<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\".*\"/?>")) {
            // Keep only what's in the content attribute (case-insensitive match)
            str = str.replaceAll("(?i)<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\"(.*)\"/?>", "$1");
            // Split the comma-separated list and trim each keyword
            for (String st : str.split(",")) s.add(st.trim());
            return s;
        }
    }
    // Not a keywords meta tag
    return null;
}

If you need an explanation, let me know.
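
For comparison, here is a minimal sketch of the same regex idea using Pattern and Matcher. It is not the answerer's code: the class name MetaKeywordSketch and the method keywordsOf are hypothetical, and it assumes that name="keywords" appears before content="..." within a single tag, with double-quoted attribute values. It returns an empty set instead of null when nothing matches, so a caller can pass the result straight to addAll.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaKeywordSketch {

    // Assumption: name comes before content, both values double-quoted.
    private static final Pattern META_KEYWORDS = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: collects keywords from every matching meta tag
    // found in the given text, in order of appearance.
    static Set<String> keywordsOf(String html) {
        Set<String> keywords = new LinkedHashSet<>();
        Matcher m = META_KEYWORDS.matcher(html);
        while (m.find()) {
            // group(1) is the raw content attribute, e.g. "a, b, c"
            for (String k : m.group(1).split(",")) {
                String trimmed = k.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String line = "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">";
        // prints [deception, intricacy, treachery]
        System.out.println(keywordsOf(line));
    }
}
```

Because find() is used rather than matches(), the tag does not have to fill the whole line. Regex-based extraction like this is fragile on real-world pages (single quotes, reversed attribute order, tags split across lines), so a proper HTML parser may be a better fit outside of small exercises.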
