Input validation in Java

java.util.Scanner is a simple text parser.  It breaks its input into tokens and converts each token to the type you request.  It can parse primitive types (boolean, byte, double, float, int, long, short) and strings that match regular expressions.  This class is also available in Java on Android.

Example:

Scanner s = new Scanner("CAFE false");
System.out.println(s.nextInt(16));
System.out.println(s.nextBoolean());

Prints

51966
false
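
For validation, it helps to test a token before consuming it. The following is a minimal sketch, not a definitive implementation: the class name ScannerValidation, the helper readBoundedInt, and the range 1 to 100 are assumptions made for illustration. It uses Scanner.hasNextInt() so that a malformed token never reaches nextInt().

```java
import java.util.OptionalInt;
import java.util.Scanner;

final class ScannerValidation {
    // Hypothetical helper: returns the next token as an int in [min, max],
    // or empty if the token is missing, is not an integer, or is out of range.
    static OptionalInt readBoundedInt(Scanner in, int min, int max) {
        if (!in.hasNextInt()) {
            if (in.hasNext()) {
                in.next(); // discard the bad token so the caller can continue
            }
            return OptionalInt.empty();
        }
        int value = in.nextInt();
        return (value >= min && value <= max)
                ? OptionalInt.of(value)
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        Scanner in = new Scanner("42 banana 1000");
        System.out.println(readBoundedInt(in, 1, 100)); // OptionalInt[42]
        System.out.println(readBoundedInt(in, 1, 100)); // OptionalInt.empty
        System.out.println(readBoundedInt(in, 1, 100)); // OptionalInt.empty
    }
}
```

Returning an empty OptionalInt forces the caller to decide what to do with bad input instead of silently receiving a default value.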

You can also work with regular expressions using the String methods matches() and split().  s.matches("regex") returns true only if the entire string matches the expression.  s.split("regex") returns an array of substrings divided at each match of "regex" (the characters matching "regex" are not included).

Example

String s = "The food is in the barn.";
boolean b;
b = s.matches("foo.*bar"); // false
b = s.matches("The.*barn."); // true

You can also work with regular expressions using the java.util.regex package.  You call Pattern.compile() to set the regular expression to match, then call matcher() on the returned Pattern.  You use the returned Matcher to test matches and perform other related operations.

You should be aware of the worst-case complexity of your expressions. Some expressions can take exponential time to match and lead to a denial-of-service (DoS) vulnerability.

String s = "The food is in the barn.";
Pattern p = Pattern.compile("foo.*bar");
Matcher m = p.matcher(s);
boolean b = m.matches(); // false
b = Pattern.compile("The.*barn.").matcher(s).matches(); // true
b = Pattern.matches("The.*barn.",s); // true
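
One common way to limit the regular expression DoS risk mentioned above is to cap the input length before matching and to avoid nested quantifiers such as (a+)+. The sketch below assumes a hypothetical user-name rule; the class name BoundedRegexCheck, the limit of 32 characters, and the pattern itself are illustrative choices, not requirements from any specification.

```java
import java.util.regex.Pattern;

final class BoundedRegexCheck {
    // Assumed limit for this sketch; pick one that fits your data.
    static final int MAX_LENGTH = 32;

    // Compiled once and reused. A single character class with a bounded
    // repeat, so matching time grows linearly with the input length.
    static final Pattern USERNAME =
            Pattern.compile("[A-Za-z][A-Za-z0-9_]{0,31}");

    static boolean isValidUsername(String input) {
        if (input == null || input.length() > MAX_LENGTH) {
            return false; // reject before the regex engine sees oversized input
        }
        return USERNAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String huge = new String(new char[10_000]).replace('\0', 'a');
        System.out.println(isValidUsername("alice_01")); // true
        System.out.println(isValidUsername("1alice"));   // false
        System.out.println(isValidUsername(huge));       // false
    }
}
```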

Techniques to avoid input validation errors

As you develop code, keep this tenet in mind:

ALL input is hostile until proven otherwise.

Never trust the client. Never trust the server. Even if the entity has authenticated, it should still be considered hostile. Successful authentication does not show intent. This means that you should assume that parameters sent to object methods or functions are sent by a hostile entity. Besides being more secure, you will also catch reliability bugs in the code. Expect the unexpected.

In my experience, supposedly impossible conditions occur at least once a month, so your code should check for them. In languages that support exceptions, use them for these impossible situations.

Too much error checking and handling can obscure the real purpose of the code, so at some point you have to draw a line. Personally, I often draw the “trusted/untrusted” line at the object boundary. When data comes into an object, I validate it thoroughly, calling validation methods as appropriate to encapsulate common code. Once the data has made it in, I presume that it remains valid. In some circumstances this assumption does not hold, so be sure to do the right thing for your specific project.
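
Validation at the object boundary can be sketched as follows. The class UserProfile, its fields, and the age range of 0 to 150 are hypothetical, chosen only to show the idea: the constructor is the single entry point, it calls shared validation methods, and the fields are final so validated data cannot change afterwards.

```java
final class UserProfile {
    private final String name;
    private final int age;

    // All data enters through the constructor, so this is the boundary
    // where untrusted values are checked.
    UserProfile(String name, int age) {
        this.name = requireNonBlank(name, "name");
        this.age = requireInRange(age, 0, 150, "age");
    }

    String name() { return name; }
    int age() { return age; }

    // Shared validation methods encapsulate the common checks.
    private static String requireNonBlank(String value, String field) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(field + " must not be blank");
        }
        return value.trim();
    }

    private static int requireInRange(int value, int min, int max, String field) {
        if (value < min || value > max) {
            throw new IllegalArgumentException(
                    field + " must be in [" + min + ", " + max + "] but was " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        UserProfile ok = new UserProfile("  Ada ", 36);
        System.out.println(ok.name() + " " + ok.age()); // Ada 36
        try {
            new UserProfile("Bob", -1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Immutable fields are what make the "once it is in, it stays valid" assumption reasonable here; a class with setters would need the same checks in each setter.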

Assume everybody is out to get you. If they are not, you can receive a pleasant surprise. It’s better than the alternative.

What to do when input validation fails

Clearly, you must decide what to do when you find a problem. A robust program keeps running in the presence of problems. A correct program always produces correct results. You cannot produce correct results from bad input data, and a crashed program will probably do less damage than one that continues operating on incorrect input. The proper behavior when you discover bad input depends on your threat model. After all, life-critical systems cannot simply stop providing life support (see the Ariane 5 abort issue). If you are going to fail, a general rule is that failing early tends to be better than failing late, because the failure point is closer to the root bug or problem. Also note that your code is assumed to be bad until you can show that someone else’s code contains the bug. Good input checking can show this earlier and faster.

Options include:

  • Halt (abort) the program, with or without a stack dump, etc.

  • Have the user retry or repeatedly retry (until when?).

  • Apply a “standard” fix to the input data. For example, if a value is above the maximum, lower it to the maximum and continue.

  • Log the failure.

  • Give an error message to the user. Be sure to indicate what they can do to fix the problem; “File not found” as an error message is useless.

  • Is an input validation error a security problem?
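
Two of the options above, the "standard" fix with logging and failing early with an actionable message, can be sketched like this. The class FailurePolicies, the volume setting, and its range of 0 to 100 are hypothetical; which policy is appropriate depends on your threat model.

```java
import java.util.logging.Logger;

final class FailurePolicies {
    private static final Logger LOG =
            Logger.getLogger(FailurePolicies.class.getName());

    // "Standard fix": clamp the value into range, log that it happened,
    // and continue.
    static int clampVolume(int requested) {
        int fixed = Math.max(0, Math.min(100, requested));
        if (fixed != requested) {
            LOG.warning("volume " + requested + " out of range; using " + fixed);
        }
        return fixed;
    }

    // Fail early: reject the value with a message that tells the user
    // what to do about it.
    static int requireVolume(int requested) {
        if (requested < 0 || requested > 100) {
            throw new IllegalArgumentException(
                    "Volume was " + requested
                    + "; enter a whole number between 0 and 100.");
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(clampVolume(250)); // 100 (and a warning is logged)
        System.out.println(clampVolume(40));  // 40
        try {
            requireVolume(250);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```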

Examples of bad input validation code

Because poor input validation is the largest single class of vulnerability, there are many, many examples. This section has only a few. With very little work, you should be able to find plenty more on the web. These examples are in Java and C/C++. However, the problems occur across all languages, and the concepts are the same in all of them.

Java

This example is from CWE-20: Improper Input Validation. Here is some Java code with a serious bug:

// BAD CODE
public static final double price = 20.00;
int quantity = currentUser.getAttribute("quantity");
double total = price * quantity;
chargeUser(total);

Before reading on, answer: How can an attacker exploit this?

The attacker can provide a negative quantity and end up with a credit. Unfortunately, many programmers forget that integers (other than unsigned) can be negative, even if such a value makes no sense. Boundary-value testing (Section 3.3.3) would have caught this type of bug. Careful thought about business rules would also have provided the necessary guidance.
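
One possible repair is sketched below. It is not the fix from the CWE entry: the class OrderTotal, the method totalFor, and the limit MAX_QUANTITY are assumptions standing in for whatever your business rules require, and the quantity arrives as a raw string to represent untrusted input.

```java
final class OrderTotal {
    static final double PRICE = 20.00;
    // Hypothetical business rule: at most 1000 items per order.
    static final int MAX_QUANTITY = 1_000;

    static double totalFor(String rawQuantity) {
        if (rawQuantity == null) {
            throw new IllegalArgumentException("Quantity is required");
        }
        int quantity;
        try {
            quantity = Integer.parseInt(rawQuantity.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Quantity must be a whole number", e);
        }
        // Check both ends of the range, not just the type.
        if (quantity < 1 || quantity > MAX_QUANTITY) {
            throw new IllegalArgumentException(
                    "Quantity must be between 1 and " + MAX_QUANTITY);
        }
        return PRICE * quantity;
    }

    public static void main(String[] args) {
        System.out.println(totalFor("3")); // 60.0
        try {
            totalFor("-5");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```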

Java

This example is also from CWE-20: Improper Input Validation. Here is some more Java code with a serious bug:

// BAD CODE
private void buildList ( int untrustedListSize ){
    if ( 0 > untrustedListSize ){
        die("Negative value supplied for list size");
    }
    Widget[] list = new Widget [ untrustedListSize ];
    list[0] = new Widget();
}

Before reading on, answer: How can an attacker exploit this?

If the attacker provides a 0, an exception will be thrown when the first element is inserted. Again, this is a boundary condition (Section 3.3.3) that is not properly handled.
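
A version that handles both boundaries might look like the following sketch. The upper limit MAX_LIST_SIZE is an assumption added here (an unbounded size could exhaust memory), the empty Widget class is a stand-in for the real type, and an exception replaces the original die() call.

```java
final class WidgetListBuilder {
    // Hypothetical upper bound; an unlimited size invites memory exhaustion.
    static final int MAX_LIST_SIZE = 10_000;

    // Stand-in for the Widget type used in the example above.
    static final class Widget { }

    static Widget[] buildList(int untrustedListSize) {
        // The list must hold at least one element because index 0 is used.
        if (untrustedListSize < 1 || untrustedListSize > MAX_LIST_SIZE) {
            throw new IllegalArgumentException(
                    "List size must be between 1 and " + MAX_LIST_SIZE
                    + " but was " + untrustedListSize);
        }
        Widget[] list = new Widget[untrustedListSize];
        list[0] = new Widget();
        return list;
    }

    public static void main(String[] args) {
        System.out.println(buildList(3).length); // 3
        try {
            buildList(0);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```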

C/C++

Here is some real-life code that has a vulnerability:

// BAD CODE
main(argc, argv)
char *argv[ ];
{
    register char *sp;
    char line[512];
    struct sockaddr_in sin;
    int i, p[2], pid, status;
    FILE *fp; char *av[4];

    i = sizeof (sin);
    if (getpeername(0, &sin, &i) < 0)
        fatal(argv[0], "getpeername");
    line[0] = '\0';
    gets(line); /* receive user name */
    sp = line;
    /* . . . */

Before reading on, answer: How can an attacker exploit this?

What user would have a user name 512 bytes long when the system maximum at the time was 8? This code did no length checking and the result was a buffer overflow allowing the attacker to execute arbitrary code. This code is from fingerd, and is one of the vulnerabilities that the Morris worm attacked.

C/C++

This example is also from CWE-20: Improper Input Validation. Here is some bad C/C++ code:

// BAD CODE
#define MAX_DIM 100
int m,n, error; /* board dimensions */
board_square_t *board;

printf("Please specify the board height: \n");
error = scanf("%d", &m);
if ( EOF == error )
    die("No integer passed: Die evil hacker!\n");

printf("Please specify the board width: \n");
error = scanf("%d", &n);
if ( EOF == error )
    die("No integer passed: Die evil hacker!\n");
if ( m > MAX_DIM || n > MAX_DIM )
    die("Value too large: Die evil hacker!\n");
board = (board_square_t*) 
    malloc( m * n * sizeof(board_square_t));

Before reading on, answer: How can an attacker exploit this?

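The code checks only the upper bound. The attacker can provide negative values: if m * n is negative, the result is converted to an unsigned size_t and represents a very large allocation request. Two large negative values can also overflow the multiplication. In addition, comparing the scanf result only against EOF misses input that is not a number at all; scanf then returns 0 and leaves the variable unchanged. The failure is again a boundary-value one.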
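
The missing checks are sketched below in Java rather than C, to stay with the language used in the rest of these notes. The names BoardAllocation and allocateBoard are hypothetical, and an int array stands in for the board squares. Both bounds are checked, and Math.multiplyExact is a second line of defense that throws ArithmeticException instead of silently overflowing if the limits are ever raised.

```java
final class BoardAllocation {
    static final int MAX_DIM = 100;

    static int[] allocateBoard(int height, int width) {
        requireDimension(height, "height");
        requireDimension(width, "width");
        // Overflow-checked multiplication; cannot overflow with MAX_DIM = 100,
        // but stays safe if someone later increases the limit.
        int cells = Math.multiplyExact(height, width);
        return new int[cells];
    }

    private static void requireDimension(int value, String name) {
        // Check the lower bound as well as the upper bound.
        if (value < 1 || value > MAX_DIM) {
            throw new IllegalArgumentException(
                    name + " must be between 1 and " + MAX_DIM
                    + " but was " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(allocateBoard(3, 4).length); // 12
        try {
            allocateBoard(-1, 5);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```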

Example input validation vulnerabilities

You can find many examples of input validation vulnerabilities by searching for CVE entries with references to CWE-20. This search only returns results where input validation is the primary problem, not where it is a contributing factor. In spite of this limitation, the search returns thousands of results. This section lists a very few of the examples you can easily find.

Example: 3D3.Com ShopFactory

3D3.Com ShopFactory is an e-commerce shopping cart application. In 2002, the software stored the items in the shopping cart and their prices in cookies. However, it is trivial to change cookie values; examples include using the Firefox web developer toolbar or using a proxy such as OWASP ZAP. This means that users could alter their cookies and thereby alter the price they paid for items.

A later version “encrypted” the cookies. The encryption/decryption code was in JavaScript running on the client. Look into Greasemonkey for Firefox as one simple way of attacking this approach.

The solution was to use server-side validation of all input data and cryptographic tamper detection (Section 16.4.1) for any values stored in the browser (e.g., cookies).
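
Cryptographic tamper detection for a cookie value can be sketched with an HMAC from the standard javax.crypto package. This is a minimal illustration, not ShopFactory's actual fix: the class SignedCookie, the "value.tag" cookie layout, and the demo key are assumptions. The key must stay on the server; a real deployment would load it from protected configuration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Optional;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class SignedCookie {
    private static final String ALGORITHM = "HmacSHA256";

    // Produces "value.tag", where tag is the HMAC of value under the server key.
    static String sign(String value, byte[] serverKey) {
        return value + "." + tag(value, serverKey);
    }

    // Returns the value only if the tag matches; otherwise empty.
    static Optional<String> verify(String cookie, byte[] serverKey) {
        int dot = (cookie == null) ? -1 : cookie.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String value = cookie.substring(0, dot);
        byte[] expected = tag(value, serverKey).getBytes(StandardCharsets.US_ASCII);
        byte[] actual = cookie.substring(dot + 1).getBytes(StandardCharsets.US_ASCII);
        // MessageDigest.isEqual avoids leaking the mismatch position via timing.
        return MessageDigest.isEqual(expected, actual)
                ? Optional.of(value)
                : Optional.empty();
    }

    private static String tag(String value, byte[] serverKey) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(serverKey, ALGORITHM));
            byte[] raw = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            // URL-safe Base64 never contains '.', so the last dot is the separator.
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    public static void main(String[] args) {
        // Demo key only; never hard-code a real key.
        byte[] key = "demo-key-demo-key-demo-key-12345".getBytes(StandardCharsets.UTF_8);
        String cookie = sign("price=20.00", key);
        String forged = "price=0.01" + cookie.substring(cookie.lastIndexOf('.'));
        System.out.println(verify(cookie, key).isPresent()); // true
        System.out.println(verify(forged, key).isPresent()); // false
    }
}
```

Even with tamper detection, the server should still validate the recovered value, and prices are better looked up server-side than carried in the browser at all.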

Example: VMware vSphere API

CVE-2012-5703 describes a bug in some versions of VMware ESX and ESXi. By sending an invalid value in a SOAP request to RetrieveProp or RetrievePropEx, the attacker could crash the system, causing all guest systems to become unresponsive. SOAP is a remote procedure call protocol using XML and HTTP. Note that using XML or SOAP did not do the input validation for the programmers. For more information including an example exploit, see the Core Security advisory VMware vSphere Hypervisor Vulnerability.

Example: Internet Explorer URL validation error

CVE-2010-0027 describes an input validation vulnerability in several versions of Microsoft Internet Explorer. The attacker (e.g., a compromised web site sending malware, hostile email, etc) supplies a specially-crafted URL that allows the attacker to run an arbitrary program.

Example: Buffer overflow in Poster Software PUBLISH-iT

CVE-2014-0980 describes a classic buffer overflow vulnerability where the developer(s) failed to perform length validation. The result is that a remote attacker can run arbitrary code. A proof-of-concept attack exists showing the code execution capability.

Input validation: Finding input locations

For Java (and all other languages), a static analysis system that performs data flow analysis (e.g., IBM’s AppScan Source) can identify all input locations. In particular, any static analysis system that performs taint tracking or taint analysis can identify all input locations. For a free, academic example, see Andromeda: Accurate and scalable security analysis of web applications by Marco Pistoia, Patrick Cousot, Radhia Cousot, and Salvatore Guarnieri.

Another way of finding input is through a source code review or white-box testing. Any source code that includes any classes in java.io obviously performs some kind of I/O. Similarly, network I/O occurs through classes in java.net.
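
As a first pass in such a review, you can flag the import statements for those packages. The sketch below is a rough aid under stated assumptions: the class IoImportFinder is hypothetical, it only looks at import lines, and it will miss code that uses fully qualified names without importing them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class IoImportFinder {
    // Matches import statements for the java.io and java.net packages.
    private static final Pattern IO_IMPORT =
            Pattern.compile("^\\s*import\\s+(static\\s+)?java\\.(io|net)\\..*");

    // Returns "lineNumber: import ..." for every matching source line.
    static List<String> findIoImports(List<String> sourceLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            String line = sourceLines.get(i);
            if (IO_IMPORT.matcher(line).matches()) {
                hits.add((i + 1) + ": " + line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList(
                "import java.io.BufferedReader;",
                "import java.util.List;",
                "import java.net.Socket;");
        // [1: import java.io.BufferedReader;, 3: import java.net.Socket;]
        System.out.println(findIoImports(source));
    }
}
```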

You can use a dynamic analysis system or debugger to watch for calls to classes that do I/O. For a dynamic analysis system example, you could use Chord from Georgia Tech, but doing so would require some work on your part.

For web applications, a web spider such as the one in OWASP ZAP, the one in the Burp suite, or the one in the Paros proxy can help.

When you find an input location, document it and record what tests you performed to help avoid duplication of effort. For web applications, you want to know:

  • The URI of the request.

  • The parameters for the URI, including any optional ones and hidden ones.

  • The method (GET, POST).

  • The cookies used and set.

  • All HTTP headers the system might use.

Testing for other vulnerabilities also needs this information.

Ideally, you or others in the development group should know all input locations. However, because of hidden inputs, you might not know of all of them. Therefore, this section discusses ways that you can find them (and then test the input validation). Depending on the type of program and the OS environment, different approaches and tools might be useful.

GNU/Linux systems

All I/O must go through a system call. The GNU/Linux command strace shows you all system calls a program makes as it runs, so you can identify all files opened, network connections attempted, etc. You could find this information in other ways, such as linking with a library that records all I/O, or using a debugger and setting a breakpoint at every call that opens a file or network socket, but strace is the easiest.

As an example, I ran strace on the web browser Opera. Here is part of the output:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 49
connect(49, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("91.203.99.55")}, 16) = -1 EINPROGRESS
(Operation now in progress)
...
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 48
connect(48, {sa_family=AF_INET, sin_port=htons(443),
sin_addr=inet_addr("213.236.208.94")}, 16) = -1 EINPROGRESS
(Operation now in progress)
...
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 55
connect(55, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("199.7.71.190")}, 16) = -1 EINPROGRESS
(Operation now in progress)

In that output, you can see it communicating with the following IP addresses: 91.203.99.55, 213.236.208.94, and 199.7.71.190. To learn more about the communication, I used Wireshark:

[Figure: Wireshark capture of the Opera network traffic (operacomm1)]

From this, we now know:

  • The HTTP protocol parser must be tested (unsurprising for a web browser, but in another program this might be important).

  • The application uses a certificate revocation list, so we need to test the parsing of it.

Additionally, in the full strace output, we can see that this application also opens and reads from several files, performs DNS lookups, etc. In other words, a complex real-life program has many, many input sources.
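
With many input sources, it can help to extract the endpoints from saved strace output automatically. The following sketch assumes connect() lines formatted like the excerpt above; the class StraceEndpoints and its regular expression are illustrative and would need adjusting for IPv6 or other address families.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StraceEndpoints {
    // Captures the port and IPv4 address from a connect() argument list.
    // \s* lets the match span a wrapped line, as in the excerpt above.
    private static final Pattern CONNECT = Pattern.compile(
            "sin_port=htons\\((\\d+)\\),\\s*sin_addr=inet_addr\\(\"([^\"]+)\"\\)");

    // Returns each endpoint as "address:port", in order of appearance.
    static List<String> endpoints(String straceOutput) {
        List<String> found = new ArrayList<>();
        Matcher m = CONNECT.matcher(straceOutput);
        while (m.find()) {
            found.add(m.group(2) + ":" + m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample =
                "connect(49, {sa_family=AF_INET, sin_port=htons(80),\n"
              + "sin_addr=inet_addr(\"91.203.99.55\")}, 16) = -1 EINPROGRESS\n"
              + "connect(48, {sa_family=AF_INET, sin_port=htons(443),\n"
              + "sin_addr=inet_addr(\"213.236.208.94\")}, 16) = -1 EINPROGRESS\n";
        // [91.203.99.55:80, 213.236.208.94:443]
        System.out.println(endpoints(sample));
    }
}
```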