Input validation: What must be validated

Given that input validation is critical and every system needs it, let’s talk about what you need to be checking. Each of the following must be checked somewhere: character set and encoding, length, boundaries, syntax, type, and business rules. In addition to the data that enters your system, you need to also consider the rate at which it enters. Computers are far faster than humans, and rate limiting can reduce the speed at which an attack is delivered, thereby reducing the damage it causes.

Character set and encoding

When communicating, peers need to agree on the character set in use and how it is encoded. For example, old IBM mainframes used EBCDIC, an 8-bit code, while other computers from that era used ASCII, which uses 7 bits per character. ISO/IEC 8859 is a family of 8-bit-per-character encodings whose Latin parts include most of the characters that people using Latin alphabets might want. However, there are many other characters in use in the world: think about Korean, Chinese, Arabic, Hebrew, Sanskrit, Cyrillic, Hindi, and many others. Since version 2.0, Unicode has had room for over a million code points, and well over 100,000 characters have been assigned, including Egyptian hieroglyphs.

The problem comes when a system assumes the character set or its encoding. An attacker can often cause data to be misinterpreted, leading to various attacks. Unicode can represent the same character in more than one way, which makes input validation more challenging, so canonicalization is necessary. String comparisons with different character sets might not have the result you expect. Vulnerabilities have included cross-site scripting (XSS). For example, CVE-2013-3192 describes how a bug in Internet Explorer (IE) 6 through 10 permits attackers to inject arbitrary HTML (i.e., conduct a cross-site scripting attack) by using EUC-JP encoding. IE is not alone; Firefox suffered a similar problem (CVE-2008-0416).

What this means is that every communication should specify the character set and encoding used. Always.
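
As a minimal sketch of what this can look like in Python, the hypothetical helper `decode_and_canonicalize` below assumes the peers have agreed on UTF-8. It rejects bytes that are not valid in that encoding rather than guessing, and then canonicalizes to Unicode Normalization Form C so that equivalent spellings of a character compare equal. The choice of NFC is an assumption; some applications need a different normalization form or further checks.

```python
import unicodedata


def decode_and_canonicalize(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes with the agreed encoding, then canonicalize to NFC."""
    # errors="strict" raises UnicodeDecodeError instead of silently
    # replacing or dropping bytes that are invalid in this encoding.
    text = raw.decode(encoding, errors="strict")
    return unicodedata.normalize("NFC", text)


# "é" as a single code point versus "e" followed by a combining acute accent.
composed = "\u00e9".encode("utf-8")
decomposed = "e\u0301".encode("utf-8")

print(composed == decomposed)  # the raw byte sequences differ
print(decode_and_canonicalize(composed) == decode_and_canonicalize(decomposed))
```

Comparing the canonicalized strings, not the raw bytes, is what keeps a later check (for example, against a list of allowed values) from being bypassed by an alternate spelling of the same text.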

Length

Length validation means verifying that the input is neither too long nor too short. This check might seem trivial, however:

  1. CWE-120 is Buffer Copy without Checking Size of Input ('Classic Buffer Overflow'). Attackers have been using it since the 1960s, and new instances that attackers can exploit continue to be found.

  2. How many programs will do the right thing when a 0-length input is provided?

  3. How long is the real maximum? The answer might come from the business rules.

As a simple example, a quick check of an online English-language dictionary showed antidisestablishmentarianisms (29 characters long) as the longest non-technical word. Therefore, we might state that when validating a word, the length, l, must meet: 0 < l < 30.
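
The word-length rule above might be sketched in Python as follows. The constant and function names are hypothetical, and the limits are simply the ones from the dictionary example, not a recommendation. Note that `len()` on a Python string counts code points, not bytes, so a byte-oriented storage limit would need a separate check on the encoded form.

```python
MIN_WORD_LENGTH = 1   # reject zero-length input
MAX_WORD_LENGTH = 29  # longest word found in the dictionary check above


def is_valid_word_length(word: str) -> bool:
    """Return True when the length l of word satisfies 0 < l < 30."""
    return MIN_WORD_LENGTH <= len(word) <= MAX_WORD_LENGTH


for candidate in ["", "cat", "antidisestablishmentarianisms", "a" * 30]:
    print(len(candidate), is_valid_word_length(candidate))
```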

Boundaries

Boundary conditions are often mishandled. Software engineering studies of where bugs are found show that a failure to properly check boundary conditions is common. Attackers know this, and they will be testing all of your boundaries.

How often do you check to ensure that an array index is not negative? Unless your programming language allows unsigned integers and you use them properly, you need to be checking. One system allowed a denial-of-service attack due to a failure to check for this problem (CVE-2013-2175).
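
Python is a useful illustration of why the check matters even in a memory-safe language: a negative index does not fail, it silently counts from the end of the sequence. The sketch below uses a hypothetical helper, `get_item`, that accepts only indexes in the range 0 to len(items) - 1.

```python
def get_item(items: list, index: int):
    """Return items[index], rejecting negative and too-large indexes."""
    # Without this check, items[-1] would quietly return the last element
    # instead of signalling that the untrusted index was out of range.
    if not 0 <= index < len(items):
        raise IndexError(f"index {index} is outside 0..{len(items) - 1}")
    return items[index]


values = [10, 20, 30]
print(values[-1])           # plain indexing accepts the negative index
print(get_item(values, 2))  # the checked helper accepts only 0, 1, or 2
```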

Assuming you know the input sources (Section 3.6), the first step in doing your own testing is to identify all boundaries for all input. For example, for an integer boundary n, a full boundary test uses three values: n-1, n, and n+1. Making this more concrete, if you are representing months as integers, then the month m ∈ [1,12], and the test values you need to use are: 0, 1, 2, 11, 12, and 13.
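
The month example can be turned into a small test-value generator. This is only a sketch: `boundary_values` and `is_valid_month` are hypothetical names, and the code assumes an inclusive integer range.

```python
def boundary_values(low: int, high: int) -> list[int]:
    """Return n-1, n, and n+1 for each end of the inclusive range [low, high]."""
    values = set()
    for boundary in (low, high):
        values.update((boundary - 1, boundary, boundary + 1))
    return sorted(values)


def is_valid_month(m: int) -> bool:
    """A month is valid when it lies in [1, 12]."""
    return 1 <= m <= 12


for value in boundary_values(1, 12):
    print(value, is_valid_month(value))
```

Using a set means that very narrow ranges, where the values around the two boundaries overlap, do not produce duplicate test cases.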

There are many natural boundaries you should consider:

  • Data type sizes: 8-, 16-, 32-, and 64-bit signed and unsigned integer values. Many integer overflow errors (Section 27.1) occur at these boundaries.

  • Characters with the high bit set (decimal values above 127).

  • Dates have boundaries at the number of days in each month (you are getting your leap year calculations correct, aren’t you?), the months in the year, the year itself, etc.

  • Times have boundaries at the number of seconds in a minute, minutes in an hour, hours in a day, etc. How about leap seconds?
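
For the date boundaries, one option is to let a well-tested library do the calendar arithmetic instead of reimplementing leap-year rules. The sketch below assumes Python's standard `datetime.date`, which raises `ValueError` for dates that do not exist in the Gregorian calendar; `is_valid_date` is a hypothetical wrapper name.

```python
from datetime import date


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Return True when the year, month, and day form a real calendar date."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


# Boundary cases around leap years and month lengths.
for candidate in [(2024, 2, 29), (2023, 2, 29), (1900, 2, 29), (2000, 2, 29), (2024, 4, 31)]:
    print(candidate, is_valid_date(*candidate))
```

This only answers whether the date exists. Whether it is acceptable (not in the future, not before an account was opened, and so on) is a business-rule question.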

Syntax

How you put the smaller pieces together to make a valid item is the syntax. For example, a floating point number matches the following regular expression:

-?[0-9]*\.[0-9]*

Or does it? “-.” matches that expression, but 0 does not. This shows that accurately defining syntax can be challenging. The difference between what you accept and what you should accept is where the attacker will place their attack. When developing a system, think carefully about the syntax. When testing, be brutal.
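
Testing the expression against a few strings makes the gap visible. In the sketch below, `LOOSE` is the expression from the text with the decimal point written as a literal dot (`\.`), and `STRICT` is one possible tighter definition. What `STRICT` accepts is an assumption for illustration: it allows an optional minus sign and plain decimal notation only, with no exponents.

```python
import re

# The expression from the text, with the decimal point as a literal dot.
LOOSE = re.compile(r"-?[0-9]*\.[0-9]*")

# One possible tighter definition (an assumption, not the only answer):
# an optional sign, then either digits with an optional fraction,
# or a fraction that has at least one digit after the point.
STRICT = re.compile(r"-?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")

for candidate in ["-.", "0", "3.14", "-.5", "7.", "1.2.3", ""]:
    # fullmatch requires the whole string to match, not just a prefix.
    print(repr(candidate), bool(LOOSE.fullmatch(candidate)), bool(STRICT.fullmatch(candidate)))
```

Using `fullmatch` is itself part of the syntax check: `match` or `search` would accept valid-looking text followed or preceded by anything the attacker chooses.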

Type

The basic types that your program receives as input are defined by the data the program works with. For example: integer, floating point, alphanumeric, relational operator, XML tag, etc. Type is intermixed with syntax, in that higher-level parts of syntax are the ways the different types can be combined to make a valid input.

Business rules

What the data means comes from the business rules. This category is also called the semantics of the data. Input validation frameworks can often handle length, type, and maybe even syntax, but they are rarely able to do business rule validation. This means that you must verify that the input makes sense in the context of how you will use it.

As one example, a bank might limit the amount of money that can be withdrawn at an ATM to $300. Therefore, the withdrawal amount, w, is not simply a fixed-point number, but 0 ≤ w ≤ 300.00.
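
A sketch of the withdrawal rule in Python might look like the following. The name `parse_withdrawal` and the two-decimal-place rule are assumptions for illustration; the $300 limit is the one from the example. `Decimal` is used because binary floating point cannot represent most cent values exactly.

```python
from decimal import Decimal, InvalidOperation

MAX_WITHDRAWAL = Decimal("300.00")  # limit from the ATM example


def parse_withdrawal(text: str) -> Decimal:
    """Parse a withdrawal amount w and enforce 0 <= w <= 300.00."""
    try:
        amount = Decimal(text)
    except InvalidOperation:
        raise ValueError("not a decimal number") from None
    # Decimal accepts "NaN" and "Infinity"; neither is a valid amount.
    if not amount.is_finite():
        raise ValueError("amount must be finite")
    # Assumed rule: amounts have at most two decimal places (whole cents).
    if amount.as_tuple().exponent < -2:
        raise ValueError("more than two decimal places")
    if not Decimal("0") <= amount <= MAX_WITHDRAWAL:
        raise ValueError("amount outside the permitted range")
    return amount


print(parse_withdrawal("120.50"))
```

The `Decimal` constructor is lenient about surrounding whitespace and some other notations, so in practice a syntax check would usually run before this business-rule check.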

As another example: square and circle are shapes, but green and salty are not. You cannot assume that a length-limited word composed solely of letters is valid when you need a shape.

As yet another example, the percentages in a breakdown should always add up to 100.

Business rule validation typically occurs later in processing than the other parts of input validation.

Remember that white lists (Section 3.9.6) are not only for characters. You can use them for checking higher-level constructs in your syntax and semantics. Depending on the business context, only a subset of the white list of possible values might be valid.
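
Applied to the shape example, a white list of values might be sketched like this. The set contents and names are hypothetical; `ROUND_SHAPES` stands in for a business context where only a subset of the known shapes is acceptable.

```python
# Hypothetical white list of every shape the system knows about.
ALL_SHAPES = frozenset({"square", "circle", "triangle"})

# Hypothetical context-specific subset, e.g. for a form that only
# makes sense for round shapes.
ROUND_SHAPES = frozenset({"circle"})


def is_valid_shape(word: str, allowed: frozenset = ALL_SHAPES) -> bool:
    """Return True only when word is one of the explicitly allowed shapes."""
    return word in allowed


print(is_valid_shape("square"))                # a known shape
print(is_valid_shape("green"))                 # a valid word, but not a shape
print(is_valid_shape("square", ROUND_SHAPES))  # a shape, but not valid in this context
```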

Rate limiting

How many attempts will you give the attacker? How often and rapidly can they provide bad input before something happens? For example, with login attempts, you might limit total failed logins as well as the number of login attempts per hour. Or, for another example, how many database transactions per second or minute make sense? Remember, with a botnet, an attacker might be able to perform trillions of attempts per minute.
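
One way to limit failed logins per hour is a sliding window of timestamps per key. The sketch below is a minimal in-memory, single-process illustration; the class name `FailedLoginLimiter`, its defaults, and the idea of keying on an account name are all assumptions. A real deployment would typically need shared storage and limits on more than one key (account, source address, and overall), since a botnet spreads its attempts across many sources.

```python
import time
from collections import defaultdict, deque


class FailedLoginLimiter:
    """Block a key after max_failures failures within window_seconds."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = defaultdict(deque)

    def _recent(self, key: str) -> deque:
        """Drop failures older than the window and return the rest."""
        cutoff = self._clock() - self.window_seconds
        attempts = self._failures[key]
        while attempts and attempts[0] <= cutoff:
            attempts.popleft()
        return attempts

    def is_blocked(self, key: str) -> bool:
        return len(self._recent(key)) >= self.max_failures

    def record_failure(self, key: str) -> None:
        self._recent(key).append(self._clock())


limiter = FailedLoginLimiter(max_failures=3, window_seconds=3600.0)
for _ in range(3):
    limiter.record_failure("alice")
print(limiter.is_blocked("alice"), limiter.is_blocked("bob"))
```

The caller would check `is_blocked` before verifying a password and call `record_failure` after each failed attempt.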

Published by

Kenneth Ingham

Kenneth has been working with security since the early 1980s. He was a system administrator when the Morris Worm was released, and one of his work areas was helping secure the systems. Since then, Kenneth has studied network and computer security, and spent many, many hours looking at how software fails. With knowledge about failures, he can help you produce software that is less likely to suffer a security breach. Kenneth has a Ph.D. in Computer Science, and his research topic was computer security. Kenneth has helped developers create systems for financial services institutions, telecommunications companies, and embedded systems and process control.
