Class StreamParser

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public class StreamParser
    extends java.lang.Object
    implements java.io.Closeable
    A StreamParser provides a progressive parse of its input. As each Element is completed, it is emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an (empty) next sibling, if applicable.

    Elements (or their children) may be removed from the DOM during the parse, for example to conserve memory. This makes it possible to parse an input document that would otherwise be too large to fit into memory, while still providing a DOM interface to the document and its elements.
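
    As a minimal sketch of that pattern (assuming jsoup 1.18.1 or later on the classpath; the class and method names are illustrative only), each li element is read and then detached as soon as the iterator emits it:

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PruneWhileStreaming {
    /** Collects the text of every li element, detaching each one after it is read. */
    static List<String> itemTexts(String html) {
        List<String> texts = new ArrayList<>();
        // try-with-resources closes the parser even if the loop exits early
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Iterator<Element> it = streamer.iterator();
            while (it.hasNext()) {
                Element el = it.next();
                if (el.normalName().equals("li")) {
                    texts.add(el.text());
                    el.remove(); // detach from the DOM so it can be garbage collected
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(itemTexts("<ul><li>A</li><li>B</li><li>C</li></ul>"));
    }
}
```

    Only the extracted strings are retained here; the partially built Document never holds more than the elements that have not yet been emitted.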

    Additionally, the parser provides the selectFirst(String query) and selectNext(String query) methods, which run the parser until a match is found, at which point the parse is suspended. It can be resumed via another select call, or via the stream() or iterator() methods.

    Once the input has been fully read, the input Reader will be closed. If the whole document does not need to be read, call stop() and close() instead.

    The document() method will return the Document being parsed into, which will be only partially complete until the input is fully consumed.

    A StreamParser can be reused via a new parse(Reader, String) call, but it is not thread-safe for concurrent inputs. Use a new parser in each thread.

    If created via Connection.Response.streamParser(), or with another Reader that is I/O backed, the iterator and stream consumers will throw an UncheckedIOException if the underlying Reader fails during a read.

    The StreamParser interface is currently in beta and may change in subsequent releases. Feedback on the feature and how you're using it is very welcome via the jsoup discussions.

    Since:
    1.18.1
    • Constructor Detail

      • StreamParser

        public StreamParser(Parser parser)
        Construct a new StreamParser, using the supplied base Parser.
        Parameters:
        parser - the configured base parser
    • Method Detail

      • parse

        public StreamParser parse(java.io.Reader input,
                                  java.lang.String baseUri)
        Provide the input for a Document parse. The input is not read until a consuming operation is called.
        Parameters:
        input - the input to be read.
        baseUri - the URL of this input, for absolute link resolution
        Returns:
        this parser, for chaining
      • parse

        public StreamParser parse(java.lang.String input,
                                  java.lang.String baseUri)
        Provide the input for a Document parse. The input is not read until a consuming operation is called.
        Parameters:
        input - the input to be read
        baseUri - the URL of this input, for absolute link resolution
        Returns:
        this parser
      • parseFragment

        public StreamParser parseFragment(java.io.Reader input,
                                          Element context,
                                          java.lang.String baseUri)
        Provide the input for a fragment parse. The input is not read until a consuming operation is called.
        Parameters:
        input - the input to be read
        context - the optional fragment context element
        baseUri - the URL of this input, for absolute link resolution
        Returns:
        this parser
        See Also:
        completeFragment()
      • parseFragment

        public StreamParser parseFragment(java.lang.String input,
                                          Element context,
                                          java.lang.String baseUri)
        Provide the input for a fragment parse. The input is not read until a consuming operation is called.
        Parameters:
        input - the input to be read
        context - the optional fragment context element
        baseUri - the URL of this input, for absolute link resolution
        Returns:
        this parser
        See Also:
        completeFragment()
      • stream

        public java.util.stream.Stream<Element> stream()
        Creates a Stream of Elements, with the input being parsed as each element is consumed. Each Element returned will be complete (that is, all of its children will be included, and if it has a next sibling, that (empty) sibling will exist at Element.nextElementSibling()). The stream will be emitted in document order as each element is closed. That means that child elements will be returned prior to their parents.

        The stream will start from the current position of the backing iterator and the parse.

        When consuming the stream, if the Reader that the Parser is reading throws an I/O exception (for example a SocketTimeoutException), that will be emitted as an UncheckedIOException.

        Returns:
        a stream of Element objects
        Throws:
        java.io.UncheckedIOException - if the underlying Reader throws an exception during a read (in stream consuming methods)
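
        The close-order rule can be observed with a short sketch (assuming jsoup 1.18.1 or later on the classpath; the class and method names are illustrative only). It records tag names in the order the stream emits them, so a child such as p appears before its parent div:

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.List;
import java.util.stream.Collectors;

public class CloseOrder {
    /** Returns the tag names in the order the stream emits their elements. */
    static List<String> tagsInEmitOrder(String html) {
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            return streamer.stream()
                .map(Element::normalName) // lower-cased tag name
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        // Elements are emitted as they close, so "p" is listed before "div".
        System.out.println(tagsInEmitOrder("<div><p>Hi</p></div>"));
    }
}
```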
      • iterator

        public java.util.Iterator<Element> iterator()
        Returns an Iterator of Elements, with the input being parsed as each element is consumed. Each Element returned will be complete (that is, all of its children will be included, and if it has a next sibling, that (empty) sibling will exist at Element.nextElementSibling()). The elements will be emitted in document order as each element is closed. That means that child elements will be returned prior to their parents.

        The iterator will start from the current position of the parse.

        The iterator is backed by this StreamParser, and the resources it holds.

        Returns:
        an iterator of Element objects
      • stop

        public StreamParser stop()
        Flags that the parse should be stopped; the backing iterator will not return any more Elements.
        Returns:
        this parser
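
        A minimal sketch of stopping early (assuming jsoup 1.18.1 or later on the classpath; the class and method names are illustrative only): once the first h1 has been seen, stop() is called and the iterator loop ends without reading the rest of the input.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.util.Iterator;

public class StopEarly {
    /** Returns the text of the first h1, or null if the input has none. */
    static String firstHeading(String html) {
        String heading = null;
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Iterator<Element> it = streamer.iterator();
            while (it.hasNext()) {
                Element el = it.next();
                if (el.normalName().equals("h1")) {
                    heading = el.text();
                    streamer.stop(); // hasNext() now reports false, ending the loop
                }
            }
        } // close() releases the reader even though the input was not fully read
        return heading;
    }

    public static void main(String[] args) {
        System.out.println(firstHeading("<h1>First</h1><p>x</p><h1>Second</h1>"));
    }
}
```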
      • close

        public void close()
        Closes the input and releases resources including the underlying parser and reader.

        The parser will also be closed when the input is fully read.

        The parser can be reused with another call to parse(Reader, String).

        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
      • document

        public Document document()
        Get the current Document as it is being parsed. It will be only partially complete until the input is fully read. Structural changes (e.g. insert, remove) may be made to the Document contents.
        Returns:
        the (partial) Document
      • complete

        public Document complete()
                          throws java.io.IOException
        Runs the parser until the input is fully read, and returns the completed Document.
        Returns:
        the completed Document
        Throws:
        java.io.IOException - if an I/O error occurs
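
        As an illustration (assuming jsoup 1.18.1 or later on the classpath; the class name, method name, and example.com URL are placeholders), the sketch below parses from a Reader, completes the parse, and resolves a relative link against the baseUri passed to parse(Reader, String):

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CompleteParse {
    /** Returns the absolute URL of the first link, or null if there is none. */
    static String firstLinkTarget(Reader input, String baseUri) {
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(input, baseUri)) {
            Document doc = streamer.complete(); // reads whatever input remains
            Element link = doc.selectFirst("a[href]");
            return link == null ? null : link.absUrl("href");
        } catch (IOException e) {
            // rethrown unchecked only to keep this sketch's signature simple
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Reader input = new StringReader("<a href='/page'>x</a>");
        System.out.println(firstLinkTarget(input, "https://example.com/"));
    }
}
```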
      • completeFragment

        public java.util.List<Node> completeFragment()
                                              throws java.io.IOException
        When initialized as a fragment parse, runs the parser until the input is fully read, and returns the completed fragment child nodes.
        Returns:
        the completed child nodes
        Throws:
        java.io.IOException - if an I/O error occurs
        See Also:
        parseFragment(Reader, Element, String)
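
        A short sketch of a fragment parse (assuming jsoup 1.18.1 or later on the classpath; the class name, method name, and the choice of a div context element are illustrative assumptions): the input is parsed as if it appeared inside the context element, and the resulting top-level nodes are returned.

```java
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class FragmentParse {
    /** Returns the node names of the top-level nodes in the parsed fragment. */
    static List<String> fragmentNodeNames(String html) {
        Element context = new Element("div"); // assumed context: parse as div content
        try (StreamParser streamer =
                 new StreamParser(Parser.htmlParser()).parseFragment(html, context, "")) {
            List<String> names = new ArrayList<>();
            for (Node node : streamer.completeFragment()) {
                names.add(node.nodeName());
            }
            return names;
        } catch (IOException e) {
            // rethrown unchecked only to keep this sketch's signature simple
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fragmentNodeNames("<p>One</p><p>Two</p>"));
    }
}
```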
      • selectFirst

        public Element selectFirst(java.lang.String query)
                            throws java.io.IOException
        Finds the first Element that matches the provided query. If the parsed Document does not already have a match, the input will be parsed until the first match is found, or the input is completely read.
        Parameters:
        query - the Selector query.
        Returns:
        the first matching Element, or null if there's no match
        Throws:
        java.io.IOException - if an I/O error occurs
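
        For example, a page title can be read without parsing the body. This is a minimal sketch, assuming jsoup 1.18.1 or later on the classpath; the class and method names are illustrative only.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;

public class FirstMatch {
    /** Returns the text of the title element, or null if the input has none. */
    static String titleOf(String html) {
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Element title = streamer.selectFirst("title"); // the parse pauses at the first hit
            return title == null ? null : title.text();
        } catch (IOException e) {
            // rethrown unchecked only to keep this sketch's signature simple
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOf(
            "<html><head><title>Hello</title></head><body><p>Body</p></body></html>"));
    }
}
```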
      • expectFirst

        public Element expectFirst(java.lang.String query)
                            throws java.io.IOException
        Just like selectFirst(String), but if there is no match, throws an IllegalArgumentException. This is useful if you want to simply abort processing on a failed match.
        Parameters:
        query - the Selector query.
        Returns:
        the first matching element
        Throws:
        java.lang.IllegalArgumentException - if no match is found
        java.io.IOException - if an I/O error occurs
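
        The difference from selectFirst(String) is sketched below (assuming jsoup 1.18.1 or later on the classpath; the class name, method name, and fallback string are illustrative only). A missing match surfaces as an IllegalArgumentException rather than a null return:

```java
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;

public class RequiredMatch {
    /** Returns the text of the first h1, or a placeholder when none exists. */
    static String requiredHeading(String html) {
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            return streamer.expectFirst("h1").text(); // no null check needed
        } catch (IllegalArgumentException e) {
            return "(no heading)"; // illustrative fallback for a failed match
        } catch (IOException e) {
            // rethrown unchecked only to keep this sketch's signature simple
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(requiredHeading("<h1>Title</h1>"));
        System.out.println(requiredHeading("<p>Body only</p>"));
    }
}
```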
      • selectFirst

        public Element selectFirst(Evaluator eval)
                            throws java.io.IOException
        Finds the first Element that matches the provided evaluator. If the parsed Document does not already have a match, the input will be parsed until the first match is found, or the input is completely read.
        Parameters:
        eval - the Selector evaluator.
        Returns:
        the first matching Element, or null if there's no match
        Throws:
        java.io.IOException - if an I/O error occurs
      • selectNext

        public Element selectNext(java.lang.String query)
                           throws java.io.IOException
        Finds the next Element that matches the provided query. The input will be parsed until the next match is found, or the input is completely read.
        Parameters:
        query - the Selector query.
        Returns:
        the next matching Element, or null if there's no match
        Throws:
        java.io.IOException - if an I/O error occurs
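
        Because each call resumes where the previous one stopped, selectNext(String) can drive a loop over all matches. This is a minimal sketch, assuming jsoup 1.18.1 or later on the classpath; the class and method names are illustrative only.

```java
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class NextMatches {
    /** Collects the text of each p element, one selectNext call at a time. */
    static List<String> paragraphTexts(String html) {
        List<String> texts = new ArrayList<>();
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            Element p;
            // a null return means the input was fully read without another match
            while ((p = streamer.selectNext("p")) != null) {
                texts.add(p.text());
            }
        } catch (IOException e) {
            // rethrown unchecked only to keep this sketch's signature simple
            throw new UncheckedIOException(e);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(paragraphTexts("<p>One<p>Two<p>Three"));
    }
}
```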
      • expectNext

        public Element expectNext(java.lang.String query)
                           throws java.io.IOException
        Just like selectNext(String), but if there is no match, throws an IllegalArgumentException. This is useful if you want to simply abort processing on a failed match.
        Parameters:
        query - the Selector query.
        Returns:
        the next matching element
        Throws:
        java.lang.IllegalArgumentException - if no match is found
        java.io.IOException - if an I/O error occurs
      • selectNext

        public Element selectNext(Evaluator eval)
                           throws java.io.IOException
        Finds the next Element that matches the provided evaluator. The input will be parsed until the next match is found, or the input is completely read.
        Parameters:
        eval - the Selector evaluator.
        Returns:
        the next matching Element, or null if there's no match
        Throws:
        java.io.IOException - if an I/O error occurs
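
        The Evaluator overload avoids re-parsing the query string on every call. The sketch below assumes jsoup 1.18.1 or later on the classpath and uses QueryParser.parse(String) from org.jsoup.select to compile the query once; the class and method names are illustrative only.

```java
import org.jsoup.parser.Parser;
import org.jsoup.parser.StreamParser;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

import java.io.IOException;
import java.io.UncheckedIOException;

public class CompiledQuery {
    /** Counts the elements matching the query, compiling the query only once. */
    static int countMatches(String html, String query) {
        Evaluator eval = QueryParser.parse(query); // compiled once, reused per call
        int count = 0;
        try (StreamParser streamer = new StreamParser(Parser.htmlParser()).parse(html, "")) {
            while (streamer.selectNext(eval) != null) {
                count++;
            }
        } catch (IOException e) {
            // rethrown unchecked only to keep this sketch's signature simple
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<a href='/1'>1</a><a>no href</a><a href='/2'>2</a>";
        System.out.println(countMatches(html, "a[href]"));
    }
}
```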