摘录文章'文本来自维基百科

发布于 2024-12-19 02:33:05 字数 3085 浏览 5 评论 0原文

我正在编写一些java代码,以便获取一些维基百科文章的原始文本(给出单词的jList,在维基百科中搜索它们并提取相应文章的第一句)。我的 GUI 包含一个按钮,我为其定义了以下操作侦听器:

private void loadButtonActionPerformed(java.awt.event.ActionEvent evt) {                                           

final DefaultListModel conceptsListFilesModel = new DefaultListModel();

conceptsList.setModel(conceptsListFilesModel);

final List definitionWiki = new ArrayList();        

//Remplir la list avec la première collone de la liste
final Thread updater = new Thread(){
@Override public void run() {        
for(int i=0; i< 20 /*dataTable.getRowCount()*/ ; i++) {
conceptsListFilesModel.addElement(dataTable.getValueAt(i, 0));

try {
Object concept = conceptsListFilesModel.elementAt(i);
WikipediaParser parser = new WikipediaParser("en");
System.out.println(concept+"");
String firstParagraph = parser.fetchFirstParagraph(concept+"");
int point = firstParagraph.indexOf(".");
String firstsentence = firstParagraph.substring(0, point+1);
definitionWiki.add(i, firstsentence) ;
} catch (IOException ex) {
Logger.getLogger(Tex2TaxView.class.getName()).log(Level.SEVERE, null, ex);
}

try { Thread.sleep(1000);
} catch (InterruptedException e) {throw new RuntimeException(e) ;}
}
JOptionPane.showMessageDialog(null, "Successful loading !")  ;
}
};
updater.start(); 
} 

WikipediaParser 类:

public class WikipediaParser {

private final String baseUrl; 

public WikipediaParser(String lang) {
this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
}

public String fetchFirstParagraph(String article) throws IOException {
String url = baseUrl + article;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
return firstParagraph.text();
}

}

执行生成以下异常列表:

nov. 30, 2011 12:42:55 AM tex2tax.Tex2TaxView$11 run
Grave: null java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)

at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:641)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:589)
at  
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1319)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
at tex2tax.WikipediaParser.fetchFirstParagraph(WikipediaParser.java:25)
at tex2tax.Tex2TaxView$11.run(Tex2TaxView.java:595)

需要帮助来解决此问题

I'm writing some java code in order to get the raw text of some Wikipedia articles (Giving a jList of words, search them in wikipedia and extract the first sentence of the corresponding article). My GUI contains a button for which I defined the following action listener:

private void loadButtonActionPerformed(java.awt.event.ActionEvent evt) {                                           

final DefaultListModel conceptsListFilesModel = new DefaultListModel();

conceptsList.setModel(conceptsListFilesModel);

final List definitionWiki = new ArrayList();        

//Remplir la list avec la première collone de la liste
final Thread updater = new Thread(){
@Override public void run() {        
for(int i=0; i< 20 /*dataTable.getRowCount()*/ ; i++) {
conceptsListFilesModel.addElement(dataTable.getValueAt(i, 0));

try {
Object concept = conceptsListFilesModel.elementAt(i);
WikipediaParser parser = new WikipediaParser("en");
System.out.println(concept+"");
String firstParagraph = parser.fetchFirstParagraph(concept+"");
int point = firstParagraph.indexOf(".");
String firstsentence = firstParagraph.substring(0, point+1);
definitionWiki.add(i, firstsentence) ;
} catch (IOException ex) {
Logger.getLogger(Tex2TaxView.class.getName()).log(Level.SEVERE, null, ex);
}

try { Thread.sleep(1000);
} catch (InterruptedException e) {throw new RuntimeException(e) ;}
}
JOptionPane.showMessageDialog(null, "Successful loading !")  ;
}
};
updater.start(); 
} 

The WikipediaParser class:

public class WikipediaParser {

private final String baseUrl; 

public WikipediaParser(String lang) {
this.baseUrl = String.format("http://%s.wikipedia.org/wiki/", lang);
}

public String fetchFirstParagraph(String article) throws IOException {
String url = baseUrl + article;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
return firstParagraph.text();
}

}

The execution generates the following list of exceptions:

nov. 30, 2011 12:42:55 AM tex2tax.Tex2TaxView$11 run
Grave: null java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:150)
at java.net.SocketInputStream.read(SocketInputStream.java:121)

at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:641)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:589)
at  
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1319)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
at tex2tax.WikipediaParser.fetchFirstParagraph(WikipediaParser.java:25)
at tex2tax.Tex2TaxView$11.run(Tex2TaxView.java:595)

Need help to solve this problem

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

魄砕の薆 2024-12-26 02:33:06

确保您的网址正确。连接超时通常意味着存在一些连接问题。

如果您向维基百科发出了很多请求,您可能会被阻止。

您还应该使用 Wikipedia API 而不是请求和解析网页。它会比请求和解析 HTML 快得多。

Ensure that your URL is correct. A connection timeout usually means that there is some connectivity problem.

If you have been making many requests to wikipedia, you might get blocked.

You should also be using the Wikipedia API instead of requesting and parsing web pages. It will be much faster than requesting and parsing the HTML.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文