Input validation: XML

Remember that using XML does not mean input validation is done for you; many XML parsers perform no validation whatsoever. This problem is CWE-112, Missing XML Validation. XML parsers that do perform validation require a well-defined schema: an XML description of the data structures that may appear in the document. The schema might be contained in the document itself or obtained over the network. Creating a schema is time consuming and difficult to get right, and even with a schema, verifying all but the simplest business rules is rarely, if ever, possible.

When using XML, you have to decide between two parser models: Document Object Model (DOM) and the Simple API for XML (SAX). DOM-based parsers build a tree representing the whole document. This tree is normally held in memory, but some parsers can use persistent (disk) storage. They (should) require that the XML document be well-formed and valid; this means that all open tags are closed, nesting is correct, and all the data conforms to the schema. This is better from an input validation perspective. However, attackers have several attacks they can mount against DOM parsers, for example memory exhaustion by sending an oversized document (CAPEC-231) or by sending a document with a deeply nested, recursive structure (CAPEC-230).

The other predominant parser model, SAX, is event-based: it returns tags as it encounters them. It does not build an in-memory tree, and it tends to be faster than DOM-based parsers. However, if you need to look up data in an order other than the order in which it appears, you end up building an in-memory structure anyway, so this benefit might or might not apply to your use. Code using SAX parsers tends to be weaker in terms of validation, because you cannot know until the end of the document whether it was well-formed. Programs using SAX parsers rarely verify this.
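This weakness is easy to demonstrate. The sketch below (in Python, with assumed element names) shows a SAX parser happily delivering events for a document whose closing tag is missing; the malformedness only surfaces at the end of input, after the events have already been processed.

```python
# Sketch: a SAX parser delivers events before it can know whether the
# document is well-formed. Element names here are illustrative only.
import xml.sax


class TagCollector(xml.sax.ContentHandler):
    """Records every start tag the parser hands us."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def parse_tags(document: bytes):
    """Return (tags_seen, well_formed) for the given XML bytes."""
    handler = TagCollector()
    try:
        xml.sax.parseString(document, handler)
        return handler.tags, True
    except xml.sax.SAXParseException:
        # The error surfaces only when the parser reaches the problem;
        # every event before that point was already delivered.
        return handler.tags, False


# The closing </order> tag is missing, yet two tags were "processed".
tags, ok = parse_tags(b"<order><item>widget</item>")
print(tags, ok)   # ['order', 'item'] False
```

Unless code like this checks the final well-formedness result, it has acted on data from a malformed document.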

While schemas are necessary for XML to do the input validation, using them can lead to some substantial issues. First, if the schema is loaded from a remote site, what about network problems? How do you know the server is the correct one and not an imposter? SSL/TLS can help here, but your public key infrastructure (PKI) must be working properly. How do you know the schema is the right one? If you use a digital signature to verify the schema, you need a functioning PKI, and you have to verify the signature. All of this takes time: your time to write the code, time spent waiting for the schema to come over the network, and CPU time to do the signature verification. These are but a few of the issues that make XML input validation much more difficult than it might seem at first.

If you are not using a schema to validate the data, YOU must be doing all of the validation. Attackers know that this validation is often weak.

With XML, you have to also be aware of the possibility of injection attacks; we cover this in more detail in Section 7.2.

Language specifics

Java DOM and SAX XML parsers are in javax.xml.parsers. You validate with either one by using javax.xml.validation.

The C# DOM XML parser is System.Xml.XmlDocument. It does not perform full schema validation by default, but you can call XmlDocument.Validate() or pass a validating XmlReader to the Load() method.

The most commonly used C++ DOM and SAX XML parsers are the Xerces-C++ XML parsers. You control validation with the validation scheme, which can be one of Val_Never, Val_Always, or Val_Auto. You set these via methods such as setValidationScheme() and setValidationSchemaFullChecking().

Input validation: What must be validated

Given that input validation is critical and every system needs it, let’s talk about what you need to be checking. Each of the following must be checked somewhere: character set and encoding, length, boundaries, syntax, type, and business rules. In addition to the data that enters your system, you need to also consider the rate at which it enters. Computers are far faster than humans, and rate limiting can reduce the speed at which an attack is delivered, thereby reducing the damage it causes.

Character set and encoding

When communicating, peers need to agree on the character set in use and how it is encoded. For example, old IBM mainframes used EBCDIC, an 8-bit encoding, while other computers of that era used 7-bit ASCII. ISO/IEC 8859 is an 8-bit-per-character family of encodings that includes most of the characters that people using Latin alphabets might want. However, there are many other scripts in use in the world: think of Korean, Chinese, Arabic, Hebrew, Sanskrit, Cyrillic, Hindi, and many others. Since version 2.0, Unicode has had a code space of well over a million code points, including Egyptian hieroglyphs.

The problem comes when a system assumes the character set or its encoding. An attacker can often cause data to be misinterpreted, leading to various attacks. Unicode has multiple equivalent encodings for the same character, making input validation more challenging; canonicalization is necessary. String comparisons with different character sets might not have the result you expect. Vulnerabilities have included cross-site scripting (XSS): for example, CVE-2013-3192 describes how a bug in Internet Explorer (IE) 6 through 10 permits attackers to inject arbitrary HTML (i.e., conduct a cross-site scripting attack) by using EUC-JP encoding. IE is not alone; Firefox suffered a similar problem (CVE-2008-0416).
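The need for canonicalization is easy to see with the standard library. In the sketch below, "é" is written two equivalent ways; naive comparison says the strings differ until both are normalized to a canonical form (NFC here).

```python
# Sketch: two equivalent Unicode encodings of the same text compare
# unequal until they are canonicalized.
import unicodedata

precomposed = "caf\u00e9"    # é as a single code point (U+00E9)
decomposed = "cafe\u0301"    # e followed by a combining acute accent (U+0301)

print(precomposed == decomposed)   # False: same text, different code points


def canonical(s: str) -> str:
    """Normalize to NFC before comparing or validating."""
    return unicodedata.normalize("NFC", s)


print(canonical(precomposed) == canonical(decomposed))   # True
```

Validation (and white-list matching) should happen after canonicalization, never before.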

What this means is that every communication should specify the character set and encoding used. Always.

Length

Length validation means verifying that the input is neither too long nor too short. This check might seem trivial, however:

  1. CWE-120 is Buffer Copy without Checking Size of Input ('Classic Buffer Overflow'). Attackers have been using it since the 1960s, and exploitable instances continue to be found.

  2. How many programs will do the right thing when a 0-length input is provided?

  3. How long is the real maximum? The answer might come from the business rules.

As a simple example, a quick check of an online English-language dictionary showed antidisestablishmentarianisms (29 characters long) as the longest non-technical word. Therefore, we might state that when validating a word, the length, l, must satisfy 0 < l < 30.
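That rule translates directly into code. The following Python sketch checks only the length rule stated above; a real validator would combine it with the character-set and syntax checks discussed in this section.

```python
# Sketch: the dictionary-word length rule 0 < l < 30.
def valid_word_length(word: str) -> bool:
    """Reject empty input and anything 30 characters or longer."""
    return 0 < len(word) < 30


print(valid_word_length("antidisestablishmentarianisms"))  # 29 chars: True
print(valid_word_length(""))                               # too short: False
print(valid_word_length("x" * 30))                         # too long: False
```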

Boundaries

Boundary conditions are often mishandled. Software engineering studies of where bugs are found show that failures to properly check boundary conditions are common. Attackers know this, and they will be testing all of your boundaries.

How often do you check to ensure that an array index is not negative? Unless your programming language provides unsigned integer types and you use them properly, you need to be checking. One system allowed a denial-of-service attack due to a failure to check for this problem (CVE-2013-2175).
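Some languages make this check even easier to forget. In Python, for instance, a negative index does not fault at all: it silently wraps around to the end of the sequence, so the program reads the wrong element instead of failing. The sketch below adds the explicit check.

```python
# Sketch: Python's negative indexes wrap around silently, so an explicit
# bounds check is needed to reject them.
def get_item(items, index):
    """Bounds-checked access that rejects negative indexes."""
    if not 0 <= index < len(items):
        raise IndexError(f"index {index} out of range [0, {len(items)})")
    return items[index]


data = ["a", "b", "c"]
print(data[-1])           # 'c': silent wraparound, probably not what you meant
print(get_item(data, 2))  # 'c': explicit, validated access
```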

Assuming you know the input sources (Section 3.6), the first step in doing your own testing is to identify all boundaries for all input. For example, for each integer boundary n, a full boundary test uses three values: n-1, n, and n+1. Making this more concrete, if you are representing months as integers, then the month m ∈ [1,12], and the test values you need to use are: 0, 1, 2, 11, 12, and 13.
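Generating those probe values mechanically keeps you from missing one. This sketch produces the n-1, n, n+1 probes for each boundary and runs them through a hypothetical month validator (both function names are illustrative).

```python
# Sketch: generate boundary-test probes and exercise a month validator.
def boundary_probes(*boundaries):
    """For each boundary n, yield n-1, n, and n+1, deduplicated and sorted."""
    return sorted({v for n in boundaries for v in (n - 1, n, n + 1)})


def valid_month(m: int) -> bool:
    return 1 <= m <= 12


probes = boundary_probes(1, 12)
print(probes)                                 # [0, 1, 2, 11, 12, 13]
print([m for m in probes if valid_month(m)])  # [1, 2, 11, 12]
```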

There are many natural boundaries you should consider:

  • Data type sizes: 8-, 16-, 32-, and 64-bit signed and unsigned integer values. Many integer overflow errors (Section 27.1) occur at these boundaries.

  • Characters with the high bit set (decimal values above 127).

  • Any system accepting dates has boundaries on the number of days in the month (you are getting your leap year calculations correct, aren’t you?), the number of months in the year, the year itself, etc.

  • Times likewise have the number of seconds in a minute, minutes in an hour, hours in a day, etc. How about leap seconds?
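For the date boundaries above, leaning on the standard library avoids hand-rolling the leap-year rules (including the tricky 100- and 400-year cases). A sketch:

```python
# Sketch: a date-boundary check using the standard library's leap-year logic.
import calendar


def valid_date(year: int, month: int, day: int) -> bool:
    if not 1 <= month <= 12:
        return False
    # monthrange() returns (weekday_of_first_day, days_in_month).
    return 1 <= day <= calendar.monthrange(year, month)[1]


print(valid_date(2000, 2, 29))  # True: 2000 is a leap year (divisible by 400)
print(valid_date(1900, 2, 29))  # False: 1900 is not (divisible by 100, not 400)
```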

Syntax

How you put the smaller pieces together to make a valid item is the syntax. For example, a floating point number matches the following regular expression:

-?[0-9]*\.[0-9]*

Or does it? “-.” matches that expression, but 0 does not. This shows that accurately defining syntax can be challenging. The difference between what you accept and what you should accept is where the attacker will place their attack. When developing a system, think carefully about the syntax. When testing, be brutal.
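You can see the gap between "what you accept" and "what you should accept" directly. The sketch below compares the naive pattern with a stricter one; the stricter pattern is itself only a sketch (it still rejects forms like "1e9", which your business rules may or may not want).

```python
# Sketch: where a naive number pattern and a stricter one disagree.
import re

naive = re.compile(r"-?[0-9]*\.[0-9]*")
strict = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?")


def full(pattern, s):
    """True if the whole string matches the pattern."""
    return pattern.fullmatch(s) is not None


print(full(naive, "-."))     # True: accepted, but meaningless
print(full(naive, "0"))      # False: rejected, but a valid number
print(full(strict, "-."))    # False
print(full(strict, "0"))     # True
print(full(strict, "3.14"))  # True
print(full(strict, "007"))   # False: leading zeros rejected
```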

Type

The basic types that your program receives as input are defined by the data the program works with. For example: integer, floating point, alphanumeric, relational operator, XML tag, etc. Type is intermixed with syntax, in that higher-level parts of syntax are the ways the different types can be combined to make a valid input.

Business rules

What the data means comes from the business rules. This category is also called the semantics of the data. Input validation frameworks can often handle length, type, and maybe even syntax, but they are rarely able to do business rule validation. This means that you must verify that the input makes sense in the context of how you will use it.

As one example, a bank might limit the amount of money that can be withdrawn at an ATM to $300. Therefore, the withdrawal amount, w, is not simply a fixed-point number, but 0 ≤ w ≤ 300.00.

As another example: square and circle are shapes, but green and salty are not. You cannot assume that a length-limited word composed solely of letters is valid when you need a shape.

As yet another example, the sum of all of the percentages (should) always add up to 100.

Business rule validation typically occurs later in processing than the other parts of input validation.

Remember that white lists (Section 3.9.6) are not only for characters. You can use them for checking higher-level constructs in your syntax and semantics. Depending on the business context, only a subset of the white list of possible values might be valid.
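A white list of higher-level constructs, narrowed by business context, might look like the following sketch (the shape names and rules are assumptions for illustration). Note that the lower-level checks, such as character class and length, still run first.

```python
# Sketch: a white list applied to a higher-level value (a shape name),
# optionally narrowed further by the business context.
SHAPES = frozenset({"square", "circle", "triangle"})  # assumed full white list


def valid_shape(word: str, allowed=SHAPES) -> bool:
    # Lower-level checks first (character class, length), then the white list.
    return word.isalpha() and 0 < len(word) < 30 and word in allowed


print(valid_shape("circle"))                      # True
print(valid_shape("green"))                       # False: a word, not a shape
print(valid_shape("circle", allowed={"square"}))  # False: narrowed by context
```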

Rate limiting

How many attempts will you give the attacker? How often and rapidly can they provide bad input before something happens? For example, with login attempts, you might limit total failed logins as well as the number of login attempts per hour. Or, for another example, how many database transactions per second or minute make sense? Remember that, with a botnet, an attacker might be able to deliver millions of attempts per minute.
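A minimal sliding-window limiter illustrates the idea; the class name and limits here are assumptions, and a production version would also need per-client keying and persistence.

```python
# Sketch: a sliding-window rate limiter. Limits are illustrative only.
import time
from collections import deque


class RateLimiter:
    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window = window_seconds
        self.events = deque()   # timestamps of allowed events

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.max_events:
            self.events.append(now)
            return True
        return False


# At most 3 login attempts per hour.
limiter = RateLimiter(max_events=3, window_seconds=3600)
print([limiter.allow(now=t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(limiter.allow(now=3601))                       # True: the window has slid
```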

Input validation: ALL input is hostile until proven otherwise

Until you have done complete input validation, you do not know whether input is hostile. This applies to all of your communication peers, even software developed at your company, and even software YOU developed. As a simple example, Oracle once assumed that only their clients would be communicating with their servers. When attackers tried directly connecting to the server, they discovered that input validation was weak at best. Examples include CVE-2003-0095, CVE-2002-1641, and CVE-2003-1208, but these are far from a complete list of Oracle’s problems in the early 2000s.

Even if the entity has authenticated, they are not to be trusted—successful authentication does not show intent. Plenty of attacks are initiated by insiders; they are often behind some of the most expensive attacks that you never hear about. Cross-site request forgeries mean an authenticated user might be inadvertently attacking your system.

Never trust the client

Clients are always hostile until proven otherwise. For example, some web applications (incorrectly) rely on the client to validate data. After all, in a web browser, JavaScript can validate data. However, the user controls the client, and hence the execution environment. What this means is that everything the server sends to the client is a request that the client can execute, ignore, or handle however it wants. An attacker might run the client in a debugger, allowing her to change arbitrary memory locations. Or she might not be using the expected client at all, but an attack tool instead. So if you are working on the server, you must assume that the client is sending hostile data.

In web applications, you can use client-side data validation to give the user a better experience with your software. However, you must assume that an attacker disabled the validation software, so you must also validate the data on the server. For more information, here is but one link: Client Side Data Validation: A False Sense of Security by Tarak Modi.

Never trust the server

Counterfeit cell base stations have been used to eavesdrop on conversations, collect phone information, etc. Examples include a demonstration project at Iowa State University, the IMSI-catcher, a verified bogus GSM base station, information on how to build an inexpensive bogus GSM base station, and a reported man-in-the-middle (MITM) attack on 4G and CDMA systems. Unfortunately, the GSM standard originally, and explicitly, did not authenticate the base station; this changed in the 3G standard, but how long will it take to upgrade every GSM base station and handset? How much will it cost?

Hostile web servers exploit bugs in web clients. Examples of what they do include:

  1. make the client part of a botnet to send spam, attack other computers, etc.
  2. track the client’s movements on the web, including to the bank, and then send the username and password to the attacker
  3. extract personal information, sending it to marketing types and/or criminals.

Browser flaws and social engineering can cause users to visit malicious websites. Also, hostile websites can be ordinary websites that have been compromised. A 2008 report by John Leyden noted, “More than 10,000 web pages have been booby trapped with malware in one of the largest attacks of its kind to date.” Unfortunately, what was then the largest is now normal. This type of attack is called a “drive-by download attack”. Microsoft reported:

Blacole, a family of exploits used by the so-called Blackhole exploit kit to deliver malicious software through infected web pages, was the most commonly detected exploit family in the first half of 2012 by a large margin

This problem is not limited to web servers. Trusted servers (e.g., software repositories) can be and have been compromised. Do you always verify the digital signature on everything you download? Where did you get the key you use to check it? Could the attacker be controlling both? What about downloads without signatures? Many popular applications have had insecure update mechanisms; for examples, see Secure Software Updates: Disappointments and New Challenges by Bellissimo, Burgess, and Fu.

Even if the real server is still secure, DNS cache poisoning attacks, ARP poisoning, man-in-the-middle (MITM) attacks, and host-based malware can compromise other applications via server impersonation. Consider this example described in Malicious JavaScript insertion through ARP Poisoning Attacks by Bojan Zdrnja, in IEEE Security & Privacy, volume 7, number 3, May/June 2009:

  1. An attacker compromises a machine at a hosting company.
  2. They use this machine to perform an ARP poisoning attack on another web server hosted at the same place. The result of this attack is routing all of the second victim’s traffic through the first victim.
  3. They modify the web pages in transit to add hostile JavaScript. A simple Internet search will turn up lots of examples.

Note that the second victim can be uncompromised and yet still be a “hostile web server” as far as their web clients are concerned. This means that if you are writing a client, you must assume that the server is hostile.

In other words, when you are working on a system, no matter whether it is an embedded system, an app for a mobile device, a client, or a server, you have to assume that your communication peer is hostile until proven otherwise. It takes more time, but the benefit is more secure code. You also get the benefit of more robust code that better withstands non-hostile bugs in peers as well.

Input validation: overview

Input validation problems represent a contributing factor in about half of the non-design security flaws. The basic problem is that an attacker provides input that the programmer did not check. The result is that the attacker is able to make the software do something unexpected (and useful for the attacker). The following OWASP Top 10 problems have input validation as at least a part of the problem: A1: Injection, A3: Cross-Site Scripting (XSS), A4: Insecure Direct Object References, and A10: Unvalidated Redirects and Forwards. On the Top 25 list, items 1–4, 10, and 13 all have input validation as at least a part of the problem. In particular, 10 is CWE-807, Reliance on Untrusted Inputs in a Security Decision, and 13 is CWE-22, Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal'). CWE-20 is Improper Input Validation, and CWE-602 is Client-Side Enforcement of Server-Side Security.

Consequences of poor input validation can be substantial. Examples of problems that can arise include:

  • Crashing the program, denial of service (CVE-2008-1737, CVE-2007-5893).

  • Command injection (CWE-77). Input validation is not sufficient to stop injection attacks, but it should be your first line of defense.

  • SQL injection (CWE-89) (CVE-2008-2223, CVE-2006-5525). Again, input validation is not sufficient to stop injection attacks, but it should be your first line of defense.

  • Cross-site scripting (CWE-79) (CVE-2008-0971). Similarly, input validation is not sufficient to stop cross-site scripting attacks, but it should be your first line of defense.

  • Privilege escalation (CVE-2008-3494, CVE-2008-3174)

  • Buffer overflow attacks (CWE-120) (thousands of examples)

The CVE references are only examples and are far from a complete list. What this really means is that almost any bad outcome you can imagine can result from poor input validation.