Package org.cyberneko.html.filters
Class Purifier
java.lang.Object
org.cyberneko.html.filters.DefaultFilter
org.cyberneko.html.filters.Purifier
- All Implemented Interfaces:
org.apache.xerces.xni.parser.XMLComponent, org.apache.xerces.xni.parser.XMLDocumentFilter, org.apache.xerces.xni.parser.XMLDocumentSource, org.apache.xerces.xni.XMLDocumentHandler, HTMLComponent
This filter purifies the HTML input to ensure XML well-formedness.
The purification process includes:
- fixing illegal characters in the document, including
  - element and attribute names,
  - processing instruction target and data,
  - document text;
- ensuring the string "--" does not appear in the content of a comment;
- ensuring the string "]]>" does not appear in the content of a CDATA section;
- ensuring that the XML declaration has required pseudo-attributes and that the values are correct; and
- synthesizing missing namespace bindings.
Illegal characters in XML names are converted to the character sequence "_u####_", where "####" is the hexadecimal value of the Unicode character. Illegal characters appearing in document content are converted to the character sequence "\\u####".
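The two escaping rules above can be sketched in plain Java. This is a minimal illustration, not the Purifier source: the class name EscapeSketch and its helpers are hypothetical, the name check is deliberately limited to ASCII (real XML names allow a much larger Unicode range), upper-case hex digits are an assumption, and the documented "\\u####" is read here as a Java string literal, that is, a single backslash followed by "u" and the hex value.

```java
/** Sketch only: simplified stand-ins for the name and text escaping rules. */
final class EscapeSketch {

    /** Hex form of value, left-padded with zeros to at least padlen digits (upper case assumed). */
    static String toHex(int value, int padlen) {
        StringBuilder sb = new StringBuilder(Integer.toHexString(value).toUpperCase());
        while (sb.length() < padlen) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    /** ASCII-only approximation of an XML name start character (assumption). */
    static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || c == ':';
    }

    /** ASCII-only approximation of an XML name character (assumption). */
    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';
    }

    /** Replaces each character that is illegal at its position with "_u####_". */
    static String escapeName(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean legal = (i == 0) ? isNameStart(c) : isNameChar(c);
            if (legal) {
                out.append(c);
            } else {
                out.append("_u").append(toHex(c, 4)).append('_');
            }
        }
        return out.toString();
    }

    /** True if cp is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each illegal code point with a backslash, 'u', and its hex value. */
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("\\u").append(toHex(cp, 4));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeName("a b"));
        System.out.println(escapeText("x" + (char) 7 + "y"));
    }
}
```

In this sketch a space inside a name becomes "_u0020_", and a leading digit is escaped because it cannot start a name, so the result is always usable as a name under the simplified rules.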
In comments, the character '-' is replaced by the character sequence "- " to prevent "--" from ever appearing in the comment content. For CDATA sections, the character ']' is replaced by the character sequence "] " to prevent "]]" from appearing.
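The replacement described above amounts to inserting a space after every occurrence of one character. A minimal sketch, assuming plain String input rather than the XMLString buffers the filter receives; the class and method names are hypothetical:

```java
/** Sketch only: pads '-' in comments and ']' in CDATA content with a trailing space. */
final class DelimiterSketch {

    /** Returns s with a space inserted after every occurrence of target. */
    static String padAfter(String s, char target) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c);
            if (c == target) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    /** Comment content: "--" can no longer occur after padding. */
    static String purifyComment(String text) {
        return padAfter(text, '-');
    }

    /** CDATA content: "]]" (and therefore "]]>") can no longer occur after padding. */
    static String purifyCData(String text) {
        return padAfter(text, ']');
    }

    public static void main(String[] args) {
        System.out.println(purifyComment("a--b"));   // a- - b
        System.out.println(purifyCData("x]]>y"));    // x] ] >y
    }
}
```

Padding every occurrence, rather than only doubled ones, is more aggressive than strictly necessary, but it is a single pass and also covers a trailing '-' at the end of a comment, which would otherwise form "--" together with the closing delimiter.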
The URI used for synthesized namespace bindings is "http://cyberneko.org/html/ns/synthesized/number" where number is generated to ensure uniqueness.
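One way to generate such URIs is to append a per-document counter to a fixed prefix. The sketch below assumes that approach; the class name, the map of bindings, and the starting value of the counter are illustrative assumptions and are not taken from the Purifier source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: hands out a unique synthesized namespace URI per unbound prefix. */
final class BindingSketch {

    /** Prefix from the class description; a number is appended to make each URI unique. */
    static final String URI_PREFIX = "http://cyberneko.org/html/ns/synthesized/";

    private final Map<String, String> bindings = new LinkedHashMap<>();
    private int count = 0; // assumed to start at 0 for each document

    /** Returns the URI bound to prefix, synthesizing a new binding if none exists yet. */
    String uriFor(String prefix) {
        String uri = bindings.get(prefix);
        if (uri == null) {
            uri = URI_PREFIX + count++;
            bindings.put(prefix, uri);
        }
        return uri;
    }

    public static void main(String[] args) {
        BindingSketch sketch = new BindingSketch();
        System.out.println(sketch.uriFor("o"));
        System.out.println(sketch.uriFor("v"));
    }
}
```

Because the counter only increases, two different prefixes never receive the same synthesized URI within one instance.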
- Version: $Id: Purifier.java,v 1.5 2005/02/14 03:56:54 andyc Exp $
- Author: Andy Clark
-
Field Summary
Fields (Modifier and Type, Field: Description)
- protected static final String AUGMENTATIONS: Include infoset augmentations.
- protected boolean fAugmentations: Augmentations.
- protected boolean fInCDATASection: True if inside a CDATA section.
- protected org.apache.xerces.xni.NamespaceContext fNamespaceContext: Namespace information.
- protected boolean fNamespaces: Namespaces.
- protected String fPublicId: Public identifier of doctype declaration.
- protected boolean fSeenDoctype: True if the doctype declaration was seen.
- protected boolean fSeenRootElement: True if root element was seen.
- protected int fSynthesizedNamespaceCount: Synthesized namespace binding count.
- protected String fSystemId: System identifier of doctype declaration.
- protected static final String NAMESPACES: Namespaces.
- protected static final HTMLEventInfo SYNTHESIZED_ITEM: Synthesized event info item.
- static final String SYNTHESIZED_NAMESPACE_PREFX: Synthesized namespace binding prefix.
Fields inherited from class org.cyberneko.html.filters.DefaultFilter
fDocumentHandler, fDocumentSource
-
Constructor Summary
Constructors
- Purifier()
-
Method Summary
Methods (Modifier and Type, Method: Description)
- void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Characters.
- void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs): Comment.
- void doctypeDecl(String root, String pubid, String sysid, org.apache.xerces.xni.Augmentations augs): Doctype declaration.
- void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Empty element.
- void endCDATA(org.apache.xerces.xni.Augmentations augs): End CDATA section.
- void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs): End element.
- protected void handleStartDocument(): Handle start document.
- protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs): Handle start element.
- void processingInstruction(String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs): Processing instruction.
- protected String purifyName(String name, boolean localpart): Purify name.
- protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname): Purify qualified name.
- protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text): Purify content.
- void reset(org.apache.xerces.xni.parser.XMLComponentManager manager): Resets the component.
- void startCDATA(org.apache.xerces.xni.Augmentations augs): Start CDATA section.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.Augmentations augs): Start document.
- void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs): Start document.
- void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs): Start element.
- protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, String ns): Synthesize namespace binding.
- protected final org.apache.xerces.xni.Augmentations synthesizedAugs(): Returns an augmentations object with a synthesized item added.
- protected static String toHexString(int c, int padlen): Returns a padded hexadecimal string for the given value.
- void xmlDecl(String version, String encoding, String standalone, org.apache.xerces.xni.Augmentations augs): XML declaration.
Methods inherited from class org.cyberneko.html.filters.DefaultFilter
endDocument, endGeneralEntity, endPrefixMapping, getDocumentHandler, getDocumentSource, getFeatureDefault, getPropertyDefault, getRecognizedFeatures, getRecognizedProperties, ignorableWhitespace, merge, setDocumentHandler, setDocumentSource, setFeature, setProperty, startGeneralEntity, startPrefixMapping, textDecl
-
Field Details
-
SYNTHESIZED_NAMESPACE_PREFX
public static final String SYNTHESIZED_NAMESPACE_PREFX
Synthesized namespace binding prefix.
-
NAMESPACES
protected static final String NAMESPACES
Namespaces.
-
AUGMENTATIONS
protected static final String AUGMENTATIONS
Include infoset augmentations.
-
SYNTHESIZED_ITEM
protected static final HTMLEventInfo SYNTHESIZED_ITEM
Synthesized event info item.
-
fNamespaces
protected boolean fNamespaces
Namespaces.
-
fAugmentations
protected boolean fAugmentations
Augmentations.
-
fSeenDoctype
protected boolean fSeenDoctype
True if the doctype declaration was seen.
-
fSeenRootElement
protected boolean fSeenRootElement
True if root element was seen.
-
fInCDATASection
protected boolean fInCDATASection
True if inside a CDATA section.
-
fPublicId
protected String fPublicId
Public identifier of doctype declaration.
-
fSystemId
protected String fSystemId
System identifier of doctype declaration.
-
fNamespaceContext
protected org.apache.xerces.xni.NamespaceContext fNamespaceContext
Namespace information.
-
fSynthesizedNamespaceCount
protected int fSynthesizedNamespaceCount
Synthesized namespace binding count.
-
-
Constructor Details
-
Purifier
public Purifier()
-
-
Method Details
-
reset
public void reset(org.apache.xerces.xni.parser.XMLComponentManager manager) throws org.apache.xerces.xni.parser.XMLConfigurationException
Description copied from class: DefaultFilter
Resets the component. The component can query the component manager about any features and properties that affect the operation of the component.
- Specified by: reset in interface org.apache.xerces.xni.parser.XMLComponent
- Overrides: reset in class DefaultFilter
- Parameters: manager - The component manager.
- Throws: org.apache.xerces.xni.parser.XMLConfigurationException
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start document.
- Overrides: startDocument in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
startDocument
public void startDocument(org.apache.xerces.xni.XMLLocator locator, String encoding, org.apache.xerces.xni.NamespaceContext nscontext, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start document.
- Specified by: startDocument in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: startDocument in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
xmlDecl
public void xmlDecl(String version, String encoding, String standalone, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
XML declaration.
- Specified by: xmlDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: xmlDecl in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
comment
public void comment(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Comment.
- Specified by: comment in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: comment in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
processingInstruction
public void processingInstruction(String target, org.apache.xerces.xni.XMLString data, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Processing instruction.
- Specified by: processingInstruction in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: processingInstruction in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
doctypeDecl
public void doctypeDecl(String root, String pubid, String sysid, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Doctype declaration.
- Specified by: doctypeDecl in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: doctypeDecl in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
startElement
public void startElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start element.
- Specified by: startElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: startElement in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
emptyElement
public void emptyElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Empty element.
- Specified by: emptyElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: emptyElement in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
startCDATA
public void startCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Start CDATA section.
- Specified by: startCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: startCDATA in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
endCDATA
public void endCDATA(org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
End CDATA section.
- Specified by: endCDATA in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: endCDATA in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
characters
public void characters(org.apache.xerces.xni.XMLString text, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
Characters.
- Specified by: characters in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: characters in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
endElement
public void endElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.Augmentations augs) throws org.apache.xerces.xni.XNIException
End element.
- Specified by: endElement in interface org.apache.xerces.xni.XMLDocumentHandler
- Overrides: endElement in class DefaultFilter
- Throws: org.apache.xerces.xni.XNIException
-
handleStartDocument
protected void handleStartDocument()
Handle start document.
-
handleStartElement
protected void handleStartElement(org.apache.xerces.xni.QName element, org.apache.xerces.xni.XMLAttributes attrs)
Handle start element.
-
synthesizeBinding
protected void synthesizeBinding(org.apache.xerces.xni.XMLAttributes attrs, String ns)
Synthesize namespace binding.
-
synthesizedAugs
protected final org.apache.xerces.xni.Augmentations synthesizedAugs()
Returns an augmentations object with a synthesized item added.
-
purifyQName
protected org.apache.xerces.xni.QName purifyQName(org.apache.xerces.xni.QName qname)
Purify qualified name.
-
purifyName
protected String purifyName(String name, boolean localpart)
Purify name.
-
purifyText
protected org.apache.xerces.xni.XMLString purifyText(org.apache.xerces.xni.XMLString text)
Purify content.
-
toHexString
protected static String toHexString(int c, int padlen)
Returns a padded hexadecimal string for the given value.
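A padded hexadecimal conversion of this kind can be sketched as follows. This is an illustration of the described behavior under assumptions, not the method body: zero is assumed as the pad character, upper-case digits are assumed, and values wider than padlen are assumed to be left untruncated. The class name HexSketch is hypothetical.

```java
/** Sketch only: one plausible reading of "padded hexadecimal string". */
final class HexSketch {

    /** Returns c in hexadecimal, left-padded with '0' to at least padlen digits. */
    static String toHexString(int c, int padlen) {
        StringBuilder str = new StringBuilder(Integer.toHexString(c));
        while (str.length() < padlen) {
            str.insert(0, '0');
        }
        return str.toString().toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(toHexString(0x20, 4));    // 0020
        System.out.println(toHexString(0xAB, 4));    // 00AB
        System.out.println(toHexString(0x1F600, 4)); // 1F600
    }
}
```

A fixed width of four digits matches the "####" placeholders used in the class description.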
-