HTML parsing/scraping algorithm help (Java)

Posted on 2024-08-07 13:39:05

I am writing an HTML scraper. After it fetches a page's HTML, I want to pick out the words written in all capital letters and store them in a database. My problem right now is that I cannot write the algorithm that parses each line of the returned HTML to find those words. The format I am working with is shown below. IMPORTANT: the all-caps words are always the first word on a line, so I only need to look at the first word of each line and decide whether the whole word is capitalized. If it is, I want to add the word to a list; if it isn't, I want to move on to the next line. So it would look like this:

list of names ----> This line should be skipped because first word is not all CAPS
AARON ....
ABRAHAM ....
ANGELA ...
AMY ...
ASHLEY....

       AARON through ASHLEY should be added to list because first word is all CAPS 

I am able to get the html in the format above, but now I am having a hard time writing the algorithm for getting the first word of each line, and then

Does anybody know how to do this without external parsing, using just loops and lists? Thanks, I appreciate your help.

Comments (3)

剩余の解释 2024-08-14 13:39:05

First, instead of reinventing the wheel, and because parsing bad HTML can be a pain, I'd use an existing HTML parser such as TagSoup or Jericho. I'd prefer Jericho here, as it has built-in functionality to extract all text from HTML markup.

Then, I'd use a regex (\p{Upper}+) to extract all words in uppercase. See java.util.regex.
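As a rough sketch of the regex step only (it assumes the text has already been extracted from the markup, and the class and method names are made up for illustration), the pattern can be applied with java.util.regex as below. The word boundaries (\b) are an addition to the pattern given above, so that the capital letter at the start of a mixed-case word like "Amy" is not picked up. Note that this collects every all-uppercase word in the text, not just the first word of each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class UpperWordExtractor {
    // A run of uppercase letters with a word boundary on both sides.
    // By default \p{Upper} covers ASCII A-Z only.
    private static final Pattern UPPER_WORD = Pattern.compile("\\b\\p{Upper}+\\b");

    static List<String> extractUpperWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = UPPER_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "list of names\nAARON ....\nABRAHAM ....\nAmy ...";
        System.out.println(extractUpperWords(text));
    }
}
```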

毁梦 2024-08-14 13:39:05

You could do this with a regular expression:

for (String line: lines) {
    if (line.matches("[A-Z]+\\b.*")) {
        ...
    }
}

This matches any line that has one or more capital letters [A-Z]+, followed by a word boundary \\b, followed by anything else .*. You could get rid of the \\b.* if you only expect there to be a single name on each line and nothing after.
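To actually pull the name out rather than just test the line, one option is to wrap [A-Z]+ in a capturing group. The following is a minimal sketch under that assumption; the class name, the method name, and the use of a List<String> for the lines are illustrative choices, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class RegexNameCollector {
    // Same pattern as above, with the leading uppercase run captured as group 1.
    private static final Pattern FIRST_WORD_CAPS = Pattern.compile("([A-Z]+)\\b.*");

    static List<String> collect(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            Matcher m = FIRST_WORD_CAPS.matcher(line);
            // matches() requires the whole line to fit the pattern,
            // like String.matches() does.
            if (m.matches()) {
                names.add(m.group(1));
            }
        }
        return names;
    }
}
```

Because the match is anchored at the start of the line, a line with leading whitespace would be skipped; calling trim() on each line first is one way to handle that.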

Alternatively, you could use String.split() to break the line into words and then check whether the first word is all caps:

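for (String line: lines) {
    // trim() avoids an empty first token when the line starts with whitespace
    String[] words = line.trim().split("\\s+");

    // split() returns [""] for a blank line, so also reject an empty first word
    if (!words[0].isEmpty() && words[0].equals(words[0].toUpperCase())) {
        ...
    }
}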

Here \\s matches any space, tab, or other whitespace character.
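One caveat: comparing a word with its toUpperCase() form also accepts tokens that contain no letters at all, such as "1234" or "---->", because uppercasing leaves them unchanged. If that matters, a stricter check can be written with plain loops, which also fits the "loops and lists only" constraint in the question. The sketch below is one way to do it; the class and method names are hypothetical, and it treats the first word as the leading run of letters so that "ASHLEY...." yields "ASHLEY".

```java
// Hypothetical helper class, not part of any library.
class CapsCheck {
    /** Returns the leading run of letters in the line, ignoring leading whitespace. */
    static String leadingWord(String line) {
        String trimmed = line.trim();
        int end = 0;
        while (end < trimmed.length() && Character.isLetter(trimmed.charAt(end))) {
            end++;
        }
        return trimmed.substring(0, end);
    }

    /** True only if the word is non-empty and every character is an uppercase letter. */
    static boolean isAllCaps(String word) {
        if (word.isEmpty()) {
            return false;
        }
        for (int i = 0; i < word.length(); i++) {
            if (!Character.isUpperCase(word.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Inside the loop over lines, the condition would then be isAllCaps(leadingWord(line)).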

疯了 2024-08-14 13:39:05

String line = "AARON asdfasdflökj";

int i;
String cmp;

if( (i=line.indexOf(' ')) != -1 ) {
    cmp = line.substring( 0, i );
} else {
    cmp = line;
}

if( cmp.equals( cmp.toUpperCase() ) ) {
    // Line starts with all capitals
} else {
    // ...
}

The first if checks whether there is a space in the String line and, if so, keeps only the text before it. The second if checks whether that first word is unchanged by toUpperCase(), i.e. that it contains no lowercase letters.
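Putting this together for the whole page, a minimal sketch might look like the following. It assumes the scraped text arrives as one String, splits it into lines with the \R linebreak matcher (Java 8 or later), and adds a check that the word contains at least one letter; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class FirstWordScanner {
    static List<String> collectCapsWords(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Keep only the text before the first space, as in the snippet above.
            int i = trimmed.indexOf(' ');
            String cmp = (i != -1) ? trimmed.substring(0, i) : trimmed;

            // Unchanged by uppercasing means no lowercase letters; changed by
            // lowercasing means at least one letter, so "---->" is not collected.
            boolean noLowercase = cmp.equals(cmp.toUpperCase(Locale.ROOT));
            boolean hasLetter = !cmp.equals(cmp.toLowerCase(Locale.ROOT));
            if (noLowercase && hasLetter) {
                result.add(cmp);
            }
        }
        return result;
    }
}
```

A first word with trailing punctuation and no space, such as "ASHLEY....", is stored with the dots attached, so the result may need a cleanup pass before it goes into the database.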
