Extracting non-English articles with boilerpipe

Posted 01-04 15:46 · 655 words · 4 views · 0 comments

I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for English text, but for text with special characters, for example words with accent marks (história), these characters are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any workaround to get the text correctly?

How I'm using the library:
(first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, based on the HTML source code):

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);


Comments (6)

蹲墙角沉默2025-01-11 15:46:24

You don't have to modify boilerpipe's internal classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:

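// Requires: java.net.URL, org.xml.sax.InputSource,
// de.l3s.boilerpipe.extractors.ArticleExtractor
URL url = new URL("http://some-page-with-utf8-encodeing.tld");

InputSource is = new InputSource();
is.setEncoding("UTF-8");             // force the charset used to decode the bytes
is.setByteStream(url.openStream());  // pass the raw, undecoded byte stream

String text = ArticleExtractor.INSTANCE.getText(is);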

Regards!

许一世地老天荒2025-01-11 15:46:24

Well, from what I see, when you use it like that, the library automatically chooses which encoding to use. From the HTMLFetcher source:

public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();

    // Default to Cp1252 unless the Content-Type header names a charset
    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if (m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }
    // ... (rest of the method not shown in this excerpt)

Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
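
To make the excerpt's logic easier to experiment with outside the library, here is a minimal standalone sketch of the same idea: read a charset name out of a Content-Type header and fall back to a default when none is usable. The class name CharsetSniffer, the method fromContentType, and the regular expression are assumptions made for illustration; boilerpipe's actual PAT_CHARSET may be defined differently.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {
    // Hypothetical pattern, not copied from boilerpipe.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback if the header
    // is missing, has no charset parameter, or names an unusable charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // keep default, as in the excerpt above
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType("text/html; charset=UTF-8", fallback));
        System.out.println(fromContentType("text/html", fallback));
    }
}
```

If a server sends no charset in its header, the fallback is what decides how accented characters are decoded, which is consistent with the symptom described in the question.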

我要还你自由2025-01-11 15:46:24

OK, I found a solution.
As Andrei said, I had to change the HTMLFetcher class, which is in the package de.l3s.boilerpipe.sax.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I added two lines and changed the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes("UTF-8"); // new: decode with the detected charset, re-encode as UTF-8
cs = Charset.forName("UTF-8"); // new: set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line: return the converted bytes
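
The conversion step can be tried in isolation. The sketch below assumes nothing about boilerpipe; the class name Transcode and the method toUtf8 are made up for illustration. It decodes bytes with a known source charset and re-encodes them as UTF-8, which is the same round trip as the two added lines above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    // Decode with the source charset, then encode the resulting text as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" written with a Unicode escape so the source file's own
        // encoding cannot affect the result.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out");
    }
}
```

Note that the conversion only helps when the source charset is right: if the bytes are decoded with the wrong charset first, the re-encoded UTF-8 will faithfully preserve the wrong characters.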

烟若柳尘2025-01-11 15:46:24

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e., every other language) these algorithms will be less accurate.

Additionally, the library uses some English phrases to try to find the end of the article (comments, post a comment, have your say, etc.), which will clearly not work in other languages.

This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.

◇流星雨2025-01-11 15:46:24

Java:

import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {

    public static void main(String[] args) {
        try{
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());

            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        }catch(Exception e){
            e.printStackTrace();
        }
    }

}

Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
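
If changing the Eclipse run configuration is not an option, a possible alternative is to wrap the output stream in a PrintStream with an explicit encoding, so the program no longer depends on the platform default when printing. This is a sketch only; the class name Utf8Console and the helper utf8 are hypothetical, and the terminal that displays the output still has to interpret the bytes as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Console {
    // Wraps any OutputStream in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        out.println("hist\u00f3ria");
    }
}
```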


探春2025-01-11 15:46:24

I had the same problem; cnr's solution works great. Just change the encoding from UTF-8 to ISO-8859-1. Thanks.

URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());

String text = ArticleExtractor.INSTANCE.getText(is);
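
Whether UTF-8 or ISO-8859-1 is the right value depends on how the page's bytes were actually encoded. The following sketch, which does not use boilerpipe and whose class name MojibakeDemo is made up for illustration, shows what happens when UTF-8 bytes are decoded with the wrong charset: the accented letter turns into two unrelated characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // "história" encoded as UTF-8: the letter ó becomes the two bytes C3 B3.
        byte[] utf8 = "hist\u00f3ria".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the two bytes are read back as one character.
        System.out.println(decode(utf8, StandardCharsets.UTF_8));

        // Wrong charset: each byte is read as its own character (U+00C3, U+00B3).
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

If the extracted text shows this kind of two-character garbage where an accented letter should be, the page is probably UTF-8 being read as a single-byte charset, and forcing UTF-8 on the InputSource is the value to try first.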