Retrieve keywords from the META tags of an HTML web page using Java

Posted on 2024-10-18 22:48:54

I want to retrieve all the content words from an HTML web page, and all the keywords contained in the META tag of the same page, using Java.
For example, consider this HTML source code:

<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document. 
<br>
It has just 2 'lines'.
</body>
</html>

The CONTENT WORDS here are: my, very, short, html, document, it, has, just, lines

Note: The punctuation and the number '2' are ruled out.

The KEYWORDS here are: deception, intricacy, treachery

I have created a class for this purpose called WebDoc. This is as far as I have been able to get:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {

    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        //URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;

        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            contentWords.add(RemoveTag(inputLine));
            //System.out.println(RemoveTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }    

    public String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>","");
        html = html.replaceAll("&","");
        return html;
    }



    public Set<String> getKeywords() {
        //NO IDEA !
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}

情话难免假 2024-10-25 22:48:54

So, after the answer from RedSoxFan about the meta-keywords, you only need to split your content lines.
You can use a similar method there:

Instead of

contentWords.add(RemoveTag(inputLine));

use

contentWords.addAll(Arrays.asList(RemoveTag(inputLine).split("[^\\p{L}]+")));

.split(...) splits your line at every run of non-letter characters (I have not tested this, so please try it and report back). It returns an array of substrings that each contain only letters, plus possibly one empty string when the line is empty or starts with a non-letter.
Arrays.asList(...) wraps this array in a list (this needs import java.util.Arrays).
addAll(...) adds all the elements of that list to the set; duplicates are ignored.

At the end you should remove the empty string "" from your contentWords set, for example with contentWords.remove("").
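
The splitting step described above can be sketched as a small standalone helper. This is only an illustration under stated assumptions: the class name ContentWordSplitter and the method splitWords are hypothetical and not part of the original WebDoc, and lower-casing the words is assumed from the expected output in the question ("my", "it", ...). Here the empty string is skipped while adding, instead of being removed from the set afterwards.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the original WebDoc class.
final class ContentWordSplitter {

    // Splits tag-free text at every run of non-letter characters and
    // returns the distinct words, lower-cased, in sorted order.
    static Set<String> splitWords(String text) {
        Set<String> words = new TreeSet<String>();
        for (String token : text.split("[^\\p{L}]+")) {
            // split() can yield one empty string (empty input, or input
            // that starts with a non-letter); skip it here.
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT)); // assumption: case is ignored
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Digits and punctuation act as separators, so "2" is dropped.
        System.out.println(splitWords("It has just 2 'lines'.")); // [has, it, just, lines]
    }
}
```

Inside getContents() the call would then look like contentWords.addAll(ContentWordSplitter.splitWords(RemoveTag(inputLine))), assuming the helper is on the classpath.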

七秒鱼° 2024-10-25 22:48:54

Process each line and use:

// Requires java.util.HashSet and java.util.Set.
public Set<String> getKeywords(String str) {
    Set<String> s = new HashSet<String>();
    str = str.trim();
    if (str.toLowerCase().startsWith("<meta ")) {
        if (str.toLowerCase().matches("<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\".*\"/?>")) {
            // Keep only what is in the content attribute (case-insensitive)
            str = str.replaceAll("(?i)<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\"(.*)\"/?>", "$1");
            // Split the comma-separated list and trim each keyword
            for (String st : str.split(",")) s.add(st.trim());
            return s;
        }
    }
    // Not a keywords meta line
    return null;
}

If you need an explanation, let me know.
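
For comparison, the same idea can be sketched with java.util.regex.Pattern and Matcher, which allows searching a whole document instead of one trimmed line. This is a sketch under stated assumptions, not the answer's method: the class name MetaKeywordExtractor and the method extractKeywords are hypothetical, the pattern assumes that name appears before content and that both values are double-quoted, and it returns an empty set rather than null when nothing matches.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original WebDoc class.
final class MetaKeywordExtractor {

    // Assumption: name comes before content and both values use double quotes.
    private static final Pattern KEYWORDS_META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns the keywords in document order; empty set if no tag matches.
    static Set<String> extractKeywords(String html) {
        Set<String> keywords = new LinkedHashSet<String>();
        Matcher m = KEYWORDS_META.matcher(html);
        while (m.find()) {
            // group(1) is the raw content attribute, e.g. "a, b, c"
            for (String keyword : m.group(1).split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String html = "<head>\n"
                + "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">\n"
                + "</head>";
        System.out.println(extractKeywords(html)); // [deception, intricacy, treachery]
    }
}
```

A regular expression like this only covers the tag layout shown in the question. Pages that swap the attribute order or use single quotes would need a more tolerant pattern or a real HTML parser.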
