Jsoup tutorial
This is an introductory tutorial on the Jsoup HTML parser. In the tutorial we parse HTML data from an HTML string, a local HTML file, and a web page. We also sanitize data and perform a Google search.
Jsoup is a Java library for extracting and manipulating HTML data. It implements the HTML5 specification, and parses HTML to the same DOM as modern browsers. The project's web site is jsoup.org.
With Jsoup we are able to:
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.9.2</version>
</dependency>
In the examples of this tutorial, we have used the above Maven dependency.
The Jsoup class provides the core public access point to the jsoup functionality.
Parsing an HTML string
In the first example, we are going to parse an HTML string.
JSoupFromStringEx.java
package com.zetcode;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupFromStringEx {

    public static void main(String[] args) {

        String htmlString = "<html><head><title>My title</title></head>"
                + "<body>Body content</body></html>";

        Document doc = Jsoup.parse(htmlString);
        String title = doc.title();
        String body = doc.body().text();

        System.out.printf("Title: %s%n", title);
        System.out.printf("Body: %s", body);
    }
}
The example parses an HTML string and outputs its title and body content.
String htmlString = "<html><head><title>My title</title></head>"
        + "<body>Body content</body></html>";
This string contains simple HTML data.
Document doc = Jsoup.parse(htmlString);
With Jsoup's parse() method, we parse the HTML string. The method returns an HTML document.
String title = doc.title();
The document's title() method gets the string contents of the document's title element.
String body = doc.body().text();
The document's body() method returns the body element; its text() method gets the text of the element.
Parsing a local HTML file
In the second example, we are going to parse a local HTML file. We use the overloaded Jsoup.parse() method that takes a File object as its first parameter.
index.html
<!DOCTYPE html>
<html>
    <head>
        <title>My title</title>
        <meta charset="UTF-8">
    </head>
    <body>
        <div id="mydiv">Contents of a div element</div>
    </body>
</html>
For the example, we use the above HTML file.
JSoupFromFileEx.java
package com.zetcode;

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JSoupFromFileEx {

    public static void main(String[] args) throws IOException {

        String fileName = "src/main/resources/index.html";

        Document doc = Jsoup.parse(new File(fileName), "utf-8");
        Element divTag = doc.getElementById("mydiv");

        System.out.println(divTag.text());
    }
}
The example parses the index.html file, which is located in the src/main/resources/ directory.
Document doc = Jsoup.parse(new File(fileName), "utf-8");
We parse the HTML file with the Jsoup.parse() method.
Element divTag = doc.getElementById("mydiv");
With the document's getElementById() method, we get the element by its ID.
System.out.println(divTag.text());
The text of the tag is retrieved with the element's text() method.
Reading a web site's title
In the following example, we scrape and parse a web page and retrieve the content of the title element.
JSoupTitleEx.java
package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupTitleEx {

    public static void main(String[] args) throws IOException {

        String url = "http://www.something.com";

        Document doc = Jsoup.connect(url).get();
        String title = doc.title();
        System.out.println(title);
    }
}
In the code example, we read the title of a specified web page.
Document doc = Jsoup.connect(url).get();
Jsoup's connect() method creates a connection to the given URL. The get() method executes a GET request and parses the result; it returns an HTML document.
String title = doc.title();
With the document's title() method, we get the title of the HTML document.
Reading HTML source
The next example retrieves the HTML source of a web page.
JSoupHTMLSourceEx.java
package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;

public class JSoupHTMLSourceEx {

    public static void main(String[] args) throws IOException {

        String webPage = "http://www.something.com";

        String html = Jsoup.connect(webPage).get().html();

        System.out.println(html);
    }
}
The example prints the HTML of a web page.
String html = Jsoup.connect(webPage).get().html();
The html() method returns the HTML of an element; in our case the HTML source of the whole document.
Getting meta information
The meta information of an HTML document provides structured metadata about a web page, such as its description and keywords.
JSoupMetaInfoEx.java
package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JSoupMetaInfoEx {

    public static void main(String[] args) throws IOException {

        String url = "http://www.jsoup.org";

        Document document = Jsoup.connect(url).get();

        String description = document.select("meta[name=description]").first().attr("content");
        System.out.println("Description : " + description);

        String keywords = document.select("meta[name=keywords]").first().attr("content");
        System.out.println("Keywords : " + keywords);
    }
}
The code example retrieves meta information about a specified web page.
String keywords = document.select("meta[name=keywords]").first().attr("content");
The document's select() method finds elements that match the given query. The first() method returns the first matched element. With the attr() method, we get the value of the content attribute.
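Not every page declares every meta tag. When nothing matches the selector, first() returns null, so chaining attr() directly on it throws a NullPointerException. The following is a minimal sketch of a guarded lookup on an in-memory page; the class name MetaContentSketch, the helper metaContent(), and the sample HTML are illustrative assumptions, not part of the tutorial's examples.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaContentSketch {

    // Hypothetical helper: returns the content attribute of the named
    // meta tag, or the fallback when the page has no such tag.
    static String metaContent(String html, String name, String fallback) {
        Document doc = Jsoup.parse(html);
        Element meta = doc.select("meta[name=" + name + "]").first();
        return (meta != null) ? meta.attr("content") : fallback;
    }

    public static void main(String[] args) {
        // Sample page with a description but no keywords tag.
        String html = "<html><head>"
                + "<meta name=\"description\" content=\"A sample page\">"
                + "</head><body></body></html>";

        System.out.println("Description : " + metaContent(html, "description", "(none)"));
        System.out.println("Keywords : " + metaContent(html, "keywords", "(none)"));
    }
}
```

For a live page, the same null check can be applied to the document returned by Jsoup.connect(url).get().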
Parsing links
The next example parses links from an HTML page.
JSoupLinksEx.java
package com.zetcode;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSoupLinksEx {

    public static void main(String[] args) throws IOException {

        String url = "http://jsoup.org";

        Document document = Jsoup.connect(url).get();
        Elements links = document.select("a[href]");

        for (Element link : links) {

            System.out.println("link : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    }
}
In the example, we connect to a web page and parse all its link elements.
Elements links = document.select("a[href]");
To get a list of links, we use the document's select() method.
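The attr("href") call returns the attribute exactly as written in the page, so relative links such as /download stay relative. As a small sketch, the element's absUrl() method resolves them against the document's base URI. The example parses an in-memory string with an explicit base URI; the class name AbsoluteLinksSketch, the helper absoluteLinks(), and the example.com addresses are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinksSketch {

    // Hypothetical helper: collects the absolute URL of every link.
    static List<String> absoluteLinks(String html, String baseUri) {
        // The second argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();

        for (Element link : doc.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }

        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/download\">Download</a>"
                + "<a href=\"http://example.org/news\">News</a>";

        for (String url : absoluteLinks(html, "http://example.com/")) {
            System.out.println("link : " + url);
        }
    }
}
```

The relative /download link is printed as http://example.com/download, while the already absolute link is left unchanged.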
Sanitizing HTML data
Jsoup provides methods for sanitizing HTML data.
JsoupSanitizeEx.java
package com.zetcode;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class JsoupSanitizeEx {

    public static void main(String[] args) {

        String htmlString = "<html><head><title>My title</title></head>"
                + "<body><center>Body content</center></body></html>";

        boolean valid = Jsoup.isValid(htmlString, Whitelist.basic());

        if (valid) {

            System.out.println("The document is valid");
        } else {

            System.out.println("The document is not valid.");
            System.out.println("Cleaned document");

            Document dirtyDoc = Jsoup.parse(htmlString);
            Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);

            System.out.println(cleanDoc.html());
        }
    }
}
In the example, we sanitize and clean HTML data.
String htmlString = "<html><head><title>My title</title></head>"
        + "<body><center>Body content</center></body></html>";
The HTML string contains the center element, which is deprecated.
boolean valid = Jsoup.isValid(htmlString, Whitelist.basic());
The isValid() method tests whether the HTML contains only the tags and attributes that the given white list allows. A white list is a list of HTML elements and attributes that can pass through the cleaner. Whitelist.basic() defines a set of basic clean HTML tags.
Document dirtyDoc = Jsoup.parse(htmlString);
Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);
With the help of the Cleaner, we clean the dirty HTML document.
The document is not valid.
Cleaned document
<html>
 <head></head>
 <body>
  Body content
 </body>
</html>
This is the output of the program. We can see that the center element was removed.
Performing a Google search
The following example performs a Google search with Jsoup.
JsoupGoogleSearchEx.java
package com.zetcode;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupGoogleSearchEx {

    private static Matcher matcher;
    private static final String DOMAIN_NAME_PATTERN
            = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}";
    private static Pattern patrn = Pattern.compile(DOMAIN_NAME_PATTERN);

    public static String getDomainName(String url) {

        String domainName = "";
        matcher = patrn.matcher(url);

        if (matcher.find()) {
            domainName = matcher.group(0).toLowerCase().trim();
        }

        return domainName;
    }

    public static void main(String[] args) throws IOException {

        String query = "Milky Way";

        String url = "https://www.google.com/search?q=" + query + "&num=10";

        Document doc = Jsoup
                .connect(url)
                .userAgent("Jsoup client")
                .timeout(5000).get();

        Elements links = doc.select("a[href]");

        Set<String> result = new HashSet<>();

        for (Element link : links) {

            String attr1 = link.attr("href");
            String attr2 = link.attr("class");

            if (!attr2.startsWith("_Zkb") && attr1.startsWith("/url?q=")) {

                result.add(getDomainName(attr1));
            }
        }

        for (String el : result) {
            System.out.println(el);
        }
    }
}
The example creates a search request for the "Milky Way" term. It prints ten domain names that match the term.
private static final String DOMAIN_NAME_PATTERN
        = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}";
private static Pattern patrn = Pattern.compile(DOMAIN_NAME_PATTERN);
A Google search returns long links from which we want to get the domain names. For this we use a regular expression pattern.
public static String getDomainName(String url) {

    String domainName = "";
    matcher = patrn.matcher(url);

    if (matcher.find()) {
        domainName = matcher.group(0).toLowerCase().trim();
    }

    return domainName;
}
The getDomainName() method returns a domain name from the search link using the regular expression matcher. If the link contains no domain name, it returns an empty string.
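To see what the pattern does without sending a request, it can be applied to a few href values directly. The sketch below uses the same regular expression with a local Matcher instead of a shared static field; the class name DomainNameSketch, the helper domainOf(), and the sample links are made up for illustration and only imitate the shape of the search links described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainNameSketch {

    // Same pattern as in the tutorial example: one or more dot-terminated
    // labels followed by a top-level domain of 2 to 15 letters.
    private static final Pattern DOMAIN = Pattern.compile(
            "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,15}");

    // Hypothetical helper: returns the first domain-like substring,
    // or an empty string when there is none.
    static String domainOf(String href) {
        Matcher m = DOMAIN.matcher(href);
        return m.find() ? m.group(0).toLowerCase() : "";
    }

    public static void main(String[] args) {
        // Sample hrefs shaped like "/url?q=<target>&..." (illustrative only).
        System.out.println(domainOf("/url?q=https://en.wikipedia.org/wiki/Milky_Way&sa=U"));
        System.out.println(domainOf("/url?q=http://www.bbc.co.uk/science&sa=U"));
        System.out.println(domainOf("/search?q=milky").isEmpty());
    }
}
```

The three calls print en.wikipedia.org, www.bbc.co.uk, and true. Note that find() stops at the first substring that looks like a domain, so the approach relies on the target address appearing before any other dotted name in the link.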
String query = "Milky Way";
This is our search term.
String url = "https://www.google.com/search?q=" + query + "&num=10";
This is the URL to perform a Google search.
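The search term is concatenated into the URL as it is. A term that contains characters with a special meaning in a query string, such as & or #, would change the meaning of the URL. One possible way to guard against this, sketched here with the JDK's URLEncoder, is to encode the term first; the class name SearchUrlSketch and the helper searchUrl() are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlSketch {

    // Hypothetical helper: builds the search URL with an encoded query.
    static String searchUrl(String query, int num) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "https://www.google.com/search?q=" + encoded + "&num=" + num;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("Milky Way", 10));
        System.out.println(searchUrl("rock & roll", 10));
    }
}
```

The first call prints https://www.google.com/search?q=Milky+Way&num=10 and the second prints https://www.google.com/search?q=rock+%26+roll&num=10, so the ampersand in the term is no longer read as a parameter separator.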
Document doc = Jsoup
        .connect(url)
        .userAgent("Jsoup client")
        .timeout(5000).get();
We connect to the URL, set the user agent and a 5-second timeout, and send a GET request. An HTML document is returned.
Elements links = doc.select("a[href]");
From the document, we select the links.
Set<String> result = new HashSet<>();

for (Element link : links) {

    String attr1 = link.attr("href");
    String attr2 = link.attr("class");

    if (!attr2.startsWith("_Zkb") && attr1.startsWith("/url?q=")) {

        result.add(getDomainName(attr1));
    }
}
We keep the links whose class attribute does not start with _Zkb and whose href attribute starts with /url?q=. Note that these are hard-coded values that might change in the future.
for (String el : result) {
    System.out.println(el);
}
Finally, we print the domain names to the console.
en.wikipedia.org
www.space.com
www.nasa.gov
sk.wikipedia.org
www.bbc.co.uk
imagine.gsfc.nasa.gov
www.forbes.com
www.milkywayproject.org
www.youtube.com
www.universetoday.com
These are top Google search results for the "Milky Way" term.
This tutorial was dedicated to the Jsoup HTML parser.
You might also be interested in the related tutorials: Java tutorial, Reading a web page in Java, Reading text files in Java, or Jtwig tutorial.