Class Parser


  • public class Parser
    extends java.lang.Object
    Parses HTML or XML into a Document. Generally, it is simpler to use one of the parse methods in Jsoup.

    Note that a Parser instance is not thread-safe. To reuse a Parser configuration in a multi-threaded environment, use newInstance() to make a copy for each thread.

    • Constructor Detail

      • Parser

        public Parser​(TreeBuilder treeBuilder)
        Create a new Parser, using the specified TreeBuilder.
        Parameters:
        treeBuilder - TreeBuilder to use to parse input into Documents.
      • Parser

        private Parser​(Parser copy)
    • Method Detail

      • newInstance

        public Parser newInstance()
        Creates a new Parser as a deep copy of this one, including a newly initialized TreeBuilder. The copy can be used independently of the original, for example on another thread.
        Returns:
        a copied parser
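
        As an illustration of the copy-per-task pattern described above, the following minimal sketch keeps one configured Parser as a template and calls newInstance() for each unit of work. The class name ParserPerTask, the titles helper, and the pool size are hypothetical choices for this example, not part of jsoup.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParserPerTask {
    // Holds configuration only; this instance is never used to parse.
    private static final Parser TEMPLATE = Parser.htmlParser().setTrackPosition(true);

    // Hypothetical helper: parses each page on a worker thread and returns the titles in order.
    static List<String> titles(List<String> pages) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String html : pages) {
                // Copy the template on the submitting thread; each task owns its copy.
                Parser parser = TEMPLATE.newInstance();
                pending.add(pool.submit(() -> {
                    Document doc = parser.parseInput(html, "");
                    return doc.title();
                }));
            }
            List<String> result = new ArrayList<>();
            for (Future<String> f : pending) {
                result.add(f.get());
            }
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> pages = List.of("<title>First</title>", "<title>Second</title>");
        System.out.println(titles(pages));
    }
}
```

        Copying on the submitting thread means the template is only ever read from one thread, which sidesteps any question about concurrent access to it.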
      • parseInput

        public Document parseInput​(java.lang.String html,
                                   java.lang.String baseUri)
      • parseInput

        public Document parseInput​(java.io.Reader inputHtml,
                                   java.lang.String baseUri)
      • parseFragmentInput

        public java.util.List<Node> parseFragmentInput​(java.lang.String fragment,
                                                       Element context,
                                                       java.lang.String baseUri)
      • getTreeBuilder

        public TreeBuilder getTreeBuilder()
        Get the TreeBuilder currently in use.
        Returns:
        current TreeBuilder.
      • setTreeBuilder

        public Parser setTreeBuilder​(TreeBuilder treeBuilder)
        Update the TreeBuilder used when parsing content.
        Parameters:
        treeBuilder - new TreeBuilder
        Returns:
        this, for chaining
      • isTrackErrors

        public boolean isTrackErrors()
        Check if parse error tracking is enabled.
        Returns:
        current track error state.
      • setTrackErrors

        public Parser setTrackErrors​(int maxErrors)
        Enable or disable parse error tracking for the next parse.
        Parameters:
        maxErrors - the maximum number of errors to track. Set to 0 to disable.
        Returns:
        this, for chaining
      • getErrors

        public ParseErrorList getErrors()
        Retrieve the parse errors, if any, from the last parse.
        Returns:
        list of parse errors, up to the size of the maximum errors tracked.
        See Also:
        setTrackErrors(int)
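
        The interplay of setTrackErrors, parseInput, and getErrors might be used as in the sketch below. The class name TrackErrorsExample, the parseAndCollect helper, and the sample markup are assumptions made for illustration. ParseError.getPosition() and ParseError.getErrorMessage() are used to print each error.

```java
import org.jsoup.parser.ParseError;
import org.jsoup.parser.ParseErrorList;
import org.jsoup.parser.Parser;

public class TrackErrorsExample {
    // Hypothetical helper: parses the input and returns whatever errors were tracked.
    static ParseErrorList parseAndCollect(String html, int maxErrors) {
        // Tracking must be configured before the parse it should apply to.
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        return parser.getErrors();
    }

    public static void main(String[] args) {
        // The stray </span> has no matching start tag, so it is reported as a parse error.
        String html = "<p>One</span></p>";
        for (ParseError error : parseAndCollect(html, 10)) {
            System.out.println(error.getPosition() + ": " + error.getErrorMessage());
        }
    }
}
```

        Passing 0 as maxErrors disables tracking, in which case the returned list stays empty.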
      • isTrackPosition

        public boolean isTrackPosition()
        Test if position tracking is enabled. If it is, Nodes will have a Position to track where in the original input source they were created from. By default, tracking is not enabled.
        Returns:
        current track position setting
      • setTrackPosition

        public Parser setTrackPosition​(boolean trackPosition)
        Enable or disable source position tracking. If enabled, Nodes will have a Position to track where in the original input source they were created from.
        Parameters:
        trackPosition - position tracking setting; true to enable
        Returns:
        this Parser, for chaining
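
        A minimal sketch of position tracking follows. It assumes that the position is read back through Element.sourceRange(), which returns an org.jsoup.nodes.Range. The class name TrackPositionExample and the rangeOfFirstParagraph helper are hypothetical.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Range;
import org.jsoup.parser.Parser;

public class TrackPositionExample {
    // Hypothetical helper: returns the source range of the first <p> in the input.
    // The input is assumed to contain at least one <p>.
    static Range rangeOfFirstParagraph(String html, boolean track) {
        Parser parser = Parser.htmlParser().setTrackPosition(track);
        Document doc = parser.parseInput(html, "");
        Element p = doc.selectFirst("p");
        return p.sourceRange();
    }

    public static void main(String[] args) {
        String html = "<div>\n<p>Hi</p></div>";
        Range range = rangeOfFirstParagraph(html, true);
        if (range.isTracked()) {
            System.out.println("line " + range.start().lineNumber()
                    + ", column " + range.start().columnNumber());
        }
    }
}
```

        Checking isTracked() first guards against reading a range from a document that was parsed without tracking.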
      • settings

        public Parser settings​(ParseSettings settings)
        Update the ParseSettings of this Parser, to control the case sensitivity of tags and attributes.
        Parameters:
        settings - the new settings
        Returns:
        this Parser
      • settings

        public ParseSettings settings()
        Gets the current ParseSettings for this Parser.
        Returns:
        current ParseSettings
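
        To show how the settings affect case handling, the sketch below parses the same markup with ParseSettings.preserveCase and ParseSettings.htmlDefault and reads back the first attribute key. The class name ParseSettingsExample and the firstAttributeKey helper are hypothetical.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class ParseSettingsExample {
    // Hypothetical helper: returns the key of the first attribute on the first body child.
    // The input is assumed to start with an element that has at least one attribute.
    static String firstAttributeKey(String html, ParseSettings settings) {
        Parser parser = Parser.htmlParser().settings(settings);
        Document doc = parser.parseInput(html, "");
        Element first = doc.body().child(0);
        return first.attributes().asList().get(0).getKey();
    }

    public static void main(String[] args) {
        String html = "<div ID=main>Text</div>";
        System.out.println(firstAttributeKey(html, ParseSettings.preserveCase));
        System.out.println(firstAttributeKey(html, ParseSettings.htmlDefault));
    }
}
```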
      • isContentForTagData

        public boolean isContentForTagData​(java.lang.String normalName)
        An internal method, visible for Element. For an HTML parse, signals that script and style text should be treated as Data Nodes.
      • defaultNamespace

        public java.lang.String defaultNamespace()
      • parse

        public static Document parse​(java.lang.String html,
                                     java.lang.String baseUri)
        Parse HTML into a Document.
        Parameters:
        html - HTML to parse
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        parsed Document
      • parseFragment

        public static java.util.List<Node> parseFragment​(java.lang.String fragmentHtml,
                                                         Element context,
                                                         java.lang.String baseUri)
        Parse a fragment of HTML into a list of nodes. If supplied, the context element provides the parsing context.
        Parameters:
        fragmentHtml - the fragment of HTML to parse
        context - (optional) the element that this HTML fragment is being parsed for (i.e. for inner HTML). This provides stack context (for implicit element creation).
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        list of nodes parsed from the input HTML. Note that the context element, if supplied, is not modified.
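
        The effect of the context element can be seen with table markup, which is only valid inside a row. In the sketch below, the class name FragmentContextExample and the parseIn helper are hypothetical; the context is built with the Element(String) constructor.

```java
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.List;

public class FragmentContextExample {
    // Hypothetical helper: parses the fragment as if it were the inner HTML of <contextTag>.
    static List<Node> parseIn(String contextTag, String fragmentHtml) {
        Element context = new Element(contextTag);
        return Parser.parseFragment(fragmentHtml, context, "");
    }

    public static void main(String[] args) {
        String cells = "<td>1</td><td>2</td>";
        // In a <tr> context the cells are kept as elements.
        for (Node node : parseIn("tr", cells)) {
            System.out.println("tr context: " + node.nodeName());
        }
        // In a <body> context the <td> tags are out of place and are not kept as elements.
        for (Node node : parseIn("body", cells)) {
            System.out.println("body context: " + node.nodeName());
        }
    }
}
```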
      • parseFragment

        public static java.util.List<Node> parseFragment​(java.lang.String fragmentHtml,
                                                         Element context,
                                                         java.lang.String baseUri,
                                                         ParseErrorList errorList)
        Parse a fragment of HTML into a list of nodes, recording any parse errors in the supplied list. If supplied, the context element provides the parsing context.
        Parameters:
        fragmentHtml - the fragment of HTML to parse
        context - (optional) the element that this HTML fragment is being parsed for (i.e. for inner HTML). This provides stack context (for implicit element creation).
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        errorList - list to add errors to
        Returns:
        list of nodes parsed from the input HTML. Note that the context element, if supplied, is not modified.
      • parseXmlFragment

        public static java.util.List<Node> parseXmlFragment​(java.lang.String fragmentXml,
                                                            java.lang.String baseUri)
        Parse a fragment of XML into a list of nodes.
        Parameters:
        fragmentXml - the fragment of XML to parse
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        list of nodes parsed from the input XML.
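
        A short sketch of parsing an XML fragment follows. The class name XmlFragmentExample, the parseItems helper, and the sample markup are assumptions made for illustration.

```java
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

import java.util.List;

public class XmlFragmentExample {
    // Hypothetical helper: parses the fragment as XML with an empty base URI.
    static List<Node> parseItems(String fragmentXml) {
        return Parser.parseXmlFragment(fragmentXml, "");
    }

    public static void main(String[] args) {
        // No html, head, or body wrapper is created; the top-level nodes are returned as parsed.
        for (Node node : parseItems("<item>a</item><item>b</item>")) {
            System.out.println(node.nodeName() + " -> " + node.outerHtml());
        }
    }
}
```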
      • parseBodyFragment

        public static Document parseBodyFragment​(java.lang.String bodyHtml,
                                                 java.lang.String baseUri)
        Parse a fragment of HTML into the body of a Document.
        Parameters:
        bodyHtml - fragment of HTML
        baseUri - base URI of document (i.e. original fetch location), for resolving relative URLs.
        Returns:
        Document, with empty head, and HTML parsed into body
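
        The sketch below parses a body fragment and uses the base URI to resolve a relative link. The class name BodyFragmentExample, the parse helper, and the https://example.com/ base URI are placeholders chosen for this example.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class BodyFragmentExample {
    // Hypothetical helper: wraps the fragment in a full Document shell.
    static Document parse(String bodyHtml) {
        return Parser.parseBodyFragment(bodyHtml, "https://example.com/");
    }

    public static void main(String[] args) {
        Document doc = parse("<a href='/about'>About</a>");
        Element link = doc.body().selectFirst("a");
        // absUrl resolves the relative href against the base URI given at parse time.
        System.out.println(link.absUrl("href"));
        System.out.println("head children: " + doc.head().childNodeSize());
    }
}
```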
      • unescapeEntities

        public static java.lang.String unescapeEntities​(java.lang.String string,
                                                        boolean inAttribute)
        Utility method to unescape HTML entities in a string.
        Parameters:
        string - HTML escaped string
        inAttribute - if the string is to be unescaped in strict mode (as attributes are)
        Returns:
        an unescaped string
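
        A brief usage sketch follows. The class name UnescapeExample and the unescapeText helper are hypothetical. The second call in main illustrates strict (attribute) mode, where an entity-like sequence that lacks a closing semicolon is expected to be handled more conservatively; the exact output is not asserted here.

```java
import org.jsoup.parser.Parser;

public class UnescapeExample {
    // Hypothetical helper: unescapes entities the way text content is unescaped (non-strict).
    static String unescapeText(String escaped) {
        return Parser.unescapeEntities(escaped, false);
    }

    public static void main(String[] args) {
        System.out.println(unescapeText("&lt;p&gt;One &amp; Two&lt;/p&gt;"));
        // Strict mode, as used for attribute values.
        System.out.println(Parser.unescapeEntities("a=1&amp=2", true));
    }
}
```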
      • htmlParser

        public static Parser htmlParser()
        Create a new HTML parser. This parser treats input as HTML5 and enforces the creation of a normalised document, based on knowledge of the semantics of the incoming tags.
        Returns:
        a new HTML parser.
      • xmlParser

        public static Parser xmlParser()
        Create a new XML parser. This parser assumes no knowledge of the incoming tags and does not treat the input as HTML; instead, it creates a simple tree directly from the input.
        Returns:
        a new simple XML parser.
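
        The difference between the two factory methods can be illustrated by parsing the same input with each. In this sketch, the class name HtmlVsXmlExample and the hasBody helper are hypothetical.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class HtmlVsXmlExample {
    // Hypothetical helper: reports whether the parsed tree contains a body element.
    static boolean hasBody(Parser parser, String input) {
        Document doc = parser.parseInput(input, "");
        return !doc.select("body").isEmpty();
    }

    public static void main(String[] args) {
        String input = "<p>Hi</p>";
        // The HTML parser normalises the document by adding html, head, and body.
        System.out.println("html parser adds body: " + hasBody(Parser.htmlParser(), input));
        // The XML parser builds the tree as written, with no implied elements.
        System.out.println("xml parser adds body: " + hasBody(Parser.xmlParser(), input));
    }
}
```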