Input validation: XML

Remember that using XML does not mean input validation is done for you; many XML parsers perform no validation whatsoever. This problem is CWE-112, Missing XML Validation. XML parsers that do perform validation require a well-defined schema: an XML description of the data structures that a document, whether read locally or received over the network, may contain. Creating a schema is time-consuming and difficult to get right. Even with a schema, verifying all but the simplest business rules is rarely, if ever, possible.

When using XML, you have to decide between two parser models: the Document Object Model (DOM) and the Simple API for XML (SAX). DOM-based parsers build a tree that represents the whole document. This tree is normally held in memory, but some parsers can use persistent (disk) storage. Before handing you the tree, they (should) require that the XML document be well-formed, meaning that all open tags are closed and nesting is correct, and, when validation is enabled, that all the data conforms to the schema. This is better from an input validation perspective. However, attackers have several options against DOM parsers, for example exhausting memory by sending a very large document (CAPEC-231) or a document with a deeply nested (recursive) structure (CAPEC-230).
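As a rough illustration of the DOM trade-off, here is a minimal Java sketch that rejects oversized input before building a tree and refuses DOCTYPE declarations. The class name, the method names, and the MAX_BYTES limit are hypothetical. The disallow-doctype-decl feature URI is specific to Xerces-derived parsers such as the one bundled with the JDK, so another parser implementation might not recognize it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class DomParseSketch {
    // Hypothetical limit; pick one that matches what your application expects.
    static final int MAX_BYTES = 1_000_000;

    static Document parse(byte[] xml)
            throws ParserConfigurationException, SAXException, IOException {
        // Refuse oversized input before the parser allocates a tree for it.
        if (xml.length > MAX_BYTES) {
            throw new IOException("XML input exceeds " + MAX_BYTES + " bytes");
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Ask the implementation to apply its built-in processing limits.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Assumption: a Xerces-derived parser (the JDK default) that knows this feature.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // A document that is not well-formed makes parse() throw a SAXException.
        return builder.parse(new ByteArrayInputStream(xml));
    }

    /** Returns true only if the whole document parsed without error. */
    static boolean isAccepted(String xml) {
        try {
            parse(xml.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return false;
        }
    }
}
```

Note that this checks well-formedness only; no schema is involved, so the content of the document is still unvalidated.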

The other predominant parser model, SAX, is event-based. It reports tags as it encounters them. It does not build an in-memory tree, and it tends to be faster than DOM-based parsers. However, if you need to look up data in an order other than the one in which it appears, you end up building an in-memory structure yourself, so this benefit might or might not apply to your use. Code using SAX parsers tends to be weaker in terms of validation, because you cannot know until the end of the document whether it was well-formed. Programs using SAX parsers rarely verify this.
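The SAX hazard can be sketched in Java as follows. The class, the handler, and the method names are hypothetical; the point is only that events reach your handler before the parser knows whether the document is well-formed, so results should be used only after parse() returns normally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxAllOrNothingSketch {
    /** Records element names in the order the parser reports them. */
    static class NameCollector extends DefaultHandler {
        final List<String> seen = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            seen.add(qName);
        }
    }

    private static void parse(String xml, DefaultHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // DefaultHandler.fatalError() rethrows, so malformed input ends in an exception.
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    }

    /** All-or-nothing: returns the names only if the whole document parsed. */
    static Optional<List<String>> elementNames(String xml) {
        NameCollector collector = new NameCollector();
        try {
            parse(xml, collector);
            return Optional.of(collector.seen);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // The collector already holds partial data; discard it.
            return Optional.empty();
        }
    }

    /** Demonstration only: counts the events delivered, even when parsing fails. */
    static int eventsDelivered(String xml) {
        NameCollector collector = new NameCollector();
        try {
            parse(xml, collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Deliberately ignored to expose the partial state.
        }
        return collector.seen.size();
    }
}
```

For a document such as `<order><item>1</item><item>2</order>`, eventsDelivered() is greater than zero even though the parse fails, which is exactly the state a careless handler would have acted on.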

While schemas are necessary for XML to do the input validation, using them can lead to some substantial issues. First, if the schema is loaded from a remote site, what happens when there are network problems? How do you know the server is the correct one and not an impostor? SSL/TLS can help here, but your public key infrastructure (PKI) must be working properly. How do you know the schema is the right one? If you use a digital signature to verify the schema, you need a functioning PKI and you have to verify the signature. All of this takes time: your time to write the code, time spent waiting for the schema to come through the network, and CPU time to verify the signature. These are but a few of the issues that make XML input validation much more difficult than it might seem at first.
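One way to sidestep the remote-schema questions is to ship the schema with the application and tell the schema machinery not to fetch anything else. The Java sketch below assumes a JAXP 1.5 or later implementation (Java 7u40 and up), which defines the ACCESS_EXTERNAL_SCHEMA and ACCESS_EXTERNAL_DTD properties. The schema text, class name, and method names are hypothetical, and a real application would more likely read the schema from a bundled resource file than from a string constant.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class LocalSchemaSketch {
    // Hypothetical schema: a single <note> element holding text.
    static final String NOTE_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/>"
        + "</xs:schema>";

    static Schema loadBundledSchema() throws SAXException {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // An empty protocol list denies access to external schemas and DTDs,
        // so an import or include cannot pull anything in over the network.
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        return factory.newSchema(new StreamSource(new StringReader(NOTE_XSD)));
    }

    /** Returns true only if the document is well-formed and matches the schema. */
    static boolean isValid(String xml) {
        try {
            Validator validator = loadBundledSchema().newValidator();
            // Apply the same restrictions to the instance document.
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            // With no ErrorHandler set, a Validator throws on validation errors.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

Reloading the schema on every call keeps the sketch short; since a Schema object is immutable and thread-safe, production code would normally load it once and reuse it.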

If you are not using a schema to validate the data, YOU must do all of the validation yourself. Attackers know that this validation is often weak.
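To give a sense of what doing the validation yourself involves, here is a small Java sketch that checks a parsed document by hand. The order format, the SKU pattern, and the quantity range are invented for illustration; real rules would come from your application's requirements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ManualValidationSketch {
    // Hypothetical allowlist pattern for a stock-keeping unit, e.g. ABC-1234.
    static final Pattern SKU = Pattern.compile("[A-Z]{3}-[0-9]{4}");

    /** Returns a list of problems; an empty list means the order was accepted. */
    static List<String> problems(String xml) {
        List<String> problems = new ArrayList<>();
        Element root;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            root = factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            problems.add("document could not be parsed");
            return problems;
        }
        if (!"order".equals(root.getTagName())) {
            problems.add("unexpected root element");
            return problems;
        }
        String sku = singleText(root, "sku");
        if (sku == null || !SKU.matcher(sku).matches()) {
            problems.add("sku missing or malformed");
        }
        String quantity = singleText(root, "quantity");
        try {
            int n = Integer.parseInt(quantity == null ? "" : quantity);
            if (n < 1 || n > 100) {
                problems.add("quantity out of range");
            }
        } catch (NumberFormatException e) {
            problems.add("quantity missing or not a number");
        }
        return problems;
    }

    /** Text of the element with this name, or null unless exactly one exists. */
    static String singleText(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        return nodes.getLength() == 1 ? nodes.item(0).getTextContent().trim() : null;
    }
}
```

Even this toy version has to handle missing elements, duplicates, non-numeric text, and range limits, and it still ignores unexpected extra elements, which suggests how easily hand-written validation ends up incomplete.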

With XML, you also have to be aware of the possibility of injection attacks; we cover this in more detail in Section 7.2.

Language specifics

Java's DOM and SAX parsers are in javax.xml.parsers. You can validate with either one by using javax.xml.validation.
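A minimal sketch of combining the two packages for a DOM parse follows. The schema, class name, and method names are hypothetical. One detail worth noting is the explicit ErrorHandler: schema violations are reported as recoverable errors, and a DocumentBuilder without a handler that throws may simply log them and return the tree anyway.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidatingDomSketch {
    // Hypothetical schema: an <order> holding one positive integer <quantity>.
    static final String ORDER_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='order'><xs:complexType><xs:sequence>"
        + "<xs:element name='quantity' type='xs:positiveInteger'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Treat schema violations as failures instead of letting the parse continue.
    static final ErrorHandler STRICT = new ErrorHandler() {
        @Override public void warning(SAXParseException e) { /* not fatal here */ }
        @Override public void error(SAXParseException e) throws SAXException { throw e; }
        @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
    };

    static Document parseValidated(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema =
            schemaFactory.newSchema(new StreamSource(new StringReader(ORDER_XSD)));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XML Schema validation expects this
        factory.setSchema(schema);       // validate while the tree is built
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(STRICT);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Convenience wrapper: true only if the document parsed and validated. */
    static boolean accepts(String xml) {
        try {
            parseValidated(xml);
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return false;
        }
    }
}
```

The same Schema object can be handed to SAXParserFactory.setSchema() when a SAX parser is used instead.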

The C# DOM XML parser is System.Xml.XmlDocument. It does not perform full schema validation by default, but you can call XmlDocument.Validate() or pass a validating XmlReader to the Load() method.

The most commonly used C++ DOM and SAX XML parsers are the Xerces-C++ XML parsers. You control validation with the validation scheme, which is one of Val_Never, Val_Always, or Val_Auto. You set it with setValidationScheme(); related methods such as setValidationSchemaFullChecking() control how thorough the schema checking is.

Published by

Kenneth Ingham

Kenneth has been working with security since the early 1980s. He was a system administrator when the Morris Worm was released, and one of his work areas was helping secure the systems. Since then, Kenneth has studied network and computer security, and spent many, many hours looking at how software fails. With knowledge about failures, he can help you produce software that is less likely to suffer a security breach. Kenneth has a Ph.D. in Computer Science, and his research topic was on computer security. Kenneth has helped developers creating systems for financial services institutions, telecommunications companies, and embedded system and process control.
