Input validation in Java

java.util.Scanner is a simple text parser.  It breaks input into tokens and converts them to the types you request.  It can parse primitive types (boolean, byte, double, float, int, long, short) and strings that match regular expressions.  This class is also available in Java on Android.

Example:

Scanner s = new Scanner("CAFE false");
System.out.println(s.nextInt(16));
System.out.println(s.nextBoolean());

Prints

51966
false
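For validation, it usually helps to ask the Scanner whether the next token has the expected type before consuming it. The following is a minimal sketch under that assumption; the class name ScannerValidation, the method parseBoundedInt, and the range limits are made up for illustration.

```java
import java.util.Optional;
import java.util.Scanner;

class ScannerValidation {
    // Returns the first token as an int within [min, max], or an empty
    // Optional if the token is missing, is not an int, or is out of range.
    static Optional<Integer> parseBoundedInt(String input, int min, int max) {
        try (Scanner s = new Scanner(input)) {
            if (!s.hasNextInt()) {
                return Optional.empty(); // missing, non-numeric, or too large for int
            }
            int value = s.nextInt();
            if (value < min || value > max) {
                return Optional.empty(); // well-formed but outside the allowed range
            }
            return Optional.of(value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBoundedInt("42 apples", 0, 100)); // Optional[42]
        System.out.println(parseBoundedInt("forty-two", 0, 100)); // Optional.empty
        System.out.println(parseBoundedInt("1000", 0, 100));      // Optional.empty
    }
}
```

Checking with hasNextInt() first avoids the InputMismatchException that nextInt() throws on a malformed token.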

You can also work with regular expressions using String.matches() and String.split().  s.matches("regex") returns true if the entire string matches the expression.  s.split("regex") returns an array of substrings divided at each match of "regex" (the characters matching "regex" are not included).

Example

String s = "The food is in the barn.";
boolean b;
b = s.matches("foo.*bar"); // false
b = s.matches("The.*barn."); // true

You can also work with regular expressions using the java.util.regex package.  You use java.util.regex.Pattern.compile() to set the regular expression to match, then call matcher() on the resulting Pattern.  You use the returned Matcher to test matches and perform other related operations.

You should be aware of the worst-case complexity of your expressions. Some can take exponential time on certain inputs and lead to a DoS vulnerability.

String s = "The food is in the barn.";
Pattern p = Pattern.compile("foo.*bar");
Matcher m = p.matcher(s);
boolean b = m.matches(); // false
b = Pattern.compile("The.*barn.").matcher(s).matches(); // true
b = Pattern.matches("The.*barn.",s); // true
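One way to limit regular expression cost is to cap the input length before matching and to keep the pattern free of nested quantifiers (patterns such as (a+)+ are a commonly cited source of heavy backtracking). The sketch below assumes a hypothetical user name rule; the class name, the limits, and the allowed character set are illustrative, not a recommendation for any particular system.

```java
import java.util.regex.Pattern;

class BoundedRegex {
    // Hypothetical upper bound on input length; choose one that fits your data.
    static final int MAX_INPUT_LENGTH = 64;

    // Compiled once and reused. A single bounded character class with no
    // nested quantifiers, so there is little for the matcher to backtrack over.
    static final Pattern USER_NAME = Pattern.compile("[A-Za-z0-9_]{1,32}");

    static boolean isValidUserName(String input) {
        // Reject null and oversized input before the regex engine ever sees it.
        if (input == null || input.length() > MAX_INPUT_LENGTH) {
            return false;
        }
        return USER_NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidUserName("alice_01"));   // true
        System.out.println(isValidUserName("alice; rm"));  // false
    }
}
```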

Techniques to avoid input validation errors

As you develop code, keep this tenet in mind:

ALL input is hostile until proven otherwise.

Never trust the client. Never trust the server. Even if the entity has authenticated, it should still be considered hostile. Successful authentication does not show intent. This means that you should assume that parameters sent to object methods or functions are sent by a hostile entity. Besides being more secure, you will also catch reliability bugs in the code. Expect the unexpected.

In my experience, “impossible” conditions occur at least once a month, so your code should check for them. In languages that support exceptions, use them for these impossible situations.

You have to draw a line somewhere: too much error checking and handling can obscure the real purpose of the code. Personally, I often draw the “trusted/untrusted” line at the object boundary. When data comes into an object, I do thorough input validation on it, calling validation methods as appropriate to encapsulate common code. Once the data has made it in, I presume that it has remained valid. Note that in some circumstances this assumption does not hold, so be sure to do the right thing for your specific project.
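As a rough illustration of validating at the object boundary, the sketch below checks its input once in the constructor and then relies on immutability to keep it valid. The Percentage class and its methods are hypothetical names chosen for this example.

```java
final class Percentage {
    private final int value; // immutable, so it stays valid once accepted

    // The object boundary: all validation happens here.
    Percentage(int value) {
        if (value < 0 || value > 100) {
            throw new IllegalArgumentException("percentage out of range: " + value);
        }
        this.value = value;
    }

    int value() {
        return value;
    }

    // Inside the object, value is presumed valid and is not re-checked.
    // The amount is a new input, so overflow is still detected explicitly.
    long appliedTo(long amount) {
        return Math.multiplyExact(amount, (long) value) / 100;
    }

    public static void main(String[] args) {
        System.out.println(new Percentage(50).appliedTo(200)); // 100
    }
}
```

If the field were mutable or exposed by reference, the “it has remained valid” assumption would no longer hold and the methods would need to re-validate.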

Assume everybody is out to get you. If they are not, you can receive a pleasant surprise. It’s better than the alternative.

What to do when input validation fails

Clearly, there is the issue of what to do when you find a problem. A robust program keeps running in the presence of problems. A correct program always produces correct results. You cannot produce correct results when you have bad input data. Note that a crashed program will probably do less damage than one that continues operating with incorrect input data. The proper behavior when you discover bad input depends on your threat model. After all, life-critical systems cannot simply stop providing life support. (See the Ariane 5 abort issue.) If you are going to fail, a general rule is that failing early tends to be better than failing late, because the failure point is closer to the root bug or problem. Also note that your code is assumed to be bad until you can show that someone else’s code contains the bug. Good input checking can show this earlier and faster.

Options include:

  • Halt (abort) the program, with or without a stack dump, etc.

  • Have the user retry or repeatedly retry (till when?)

  • Apply a “standard” fix to the input data. For example, if a value is above the maximum, lower it to the max and continue.

  • Log the failure.

  • Give an error message to the user. Be sure to indicate what they can do to fix the problem; “File not found” as an error message is useless.

  • Is an input validation error a security problem?
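Two of the options above, applying a “standard” fix with logging and failing early, might look like the following sketch. The VolumeSetting class, its limits, and the method names are hypothetical; which behavior is right depends on your threat model.

```java
import java.util.logging.Logger;

class VolumeSetting {
    private static final Logger LOG = Logger.getLogger(VolumeSetting.class.getName());

    static final int MIN = 0;
    static final int MAX = 11;

    // "Standard fix": pull the value back into range, log it, and continue.
    static int clamp(int requested) {
        if (requested > MAX) {
            LOG.warning("volume " + requested + " above maximum; using " + MAX);
            return MAX;
        }
        if (requested < MIN) {
            LOG.warning("volume " + requested + " below minimum; using " + MIN);
            return MIN;
        }
        return requested;
    }

    // Fail early: reject the value with a message that says how to fix it.
    static int requireInRange(int requested) {
        if (requested < MIN || requested > MAX) {
            throw new IllegalArgumentException(
                "volume must be between " + MIN + " and " + MAX + ", got " + requested);
        }
        return requested;
    }
}
```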

Example input validation vulnerabilities

You can find many examples of input validation vulnerabilities by searching for CVE entries with references to CWE-20. This search only returns results where input validation is the primary problem, not where it is a contributing factor. In spite of this limitation, the search returns thousands of results. This section lists a very few of the examples you can easily find.

Example: 3D3.Com ShopFactory

3D3.Com ShopFactory is an e-commerce shopping cart application. In 2002, the software stored the items in the shopping cart and their prices in cookies. However, it is trivial to change cookie values; examples include using the Firefox web developer toolbar or using a proxy such as OWASP ZAP. This means that users could alter their cookies and thereby alter the price they paid for items.

A later version “encrypted” the cookies. The encryption/decryption code was in JavaScript running on the client. Look into Greasemonkey for Firefox as one simple way of attacking this approach.

The solution was to use server-side validation of all input data and cryptographic tamper detection (Section 16.4.1) for any values stored in the browser (e.g., cookies).
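As a sketch of what tamper detection for a cookie value can look like, the code below attaches an HMAC-SHA256 tag computed with a key that only the server knows. This is an illustration under stated assumptions, not the vendor's actual fix: the SignedCookie class, the "value.tag" cookie layout, and the key handling are all hypothetical, and a real deployment also needs key management and expiry. Prices should still be looked up on the server rather than trusted from the client.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Optional;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class SignedCookie {
    private static final String ALGORITHM = "HmacSHA256";

    private static byte[] tag(byte[] key, String value) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(key, ALGORITHM));
            return mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    // Hypothetical layout: base64url(value) + "." + base64url(tag).
    static String sign(byte[] key, String value) {
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        return enc.encodeToString(value.getBytes(StandardCharsets.UTF_8))
            + "." + enc.encodeToString(tag(key, value));
    }

    // Returns the value only if the tag verifies; otherwise empty.
    static Optional<String> verify(byte[] key, String cookie) {
        int dot = cookie.indexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        try {
            Base64.Decoder dec = Base64.getUrlDecoder();
            String value = new String(dec.decode(cookie.substring(0, dot)), StandardCharsets.UTF_8);
            byte[] presented = dec.decode(cookie.substring(dot + 1));
            // MessageDigest.isEqual compares in constant time.
            return MessageDigest.isEqual(presented, tag(key, value))
                ? Optional.of(value) : Optional.empty();
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // not valid base64url
        }
    }
}
```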

Example: VMware vSphere API

CVE-2012-5703 describes a bug in some versions of VMware ESX and ESXi. By sending an invalid value in a SOAP request to RetrieveProp or RetrievePropEx, the attacker could crash the system, causing all guest systems to become unresponsive. SOAP is a remote procedure call protocol using XML and HTTP. Note that using XML or SOAP did not do the input validation for the programmers. For more information including an example exploit, see the Core Security advisory VMware vSphere Hypervisor Vulnerability.

Example: Internet Explorer URL validation error

CVE-2010-0027 describes an input validation vulnerability in several versions of Microsoft Internet Explorer. The attacker (e.g., a compromised web site sending malware, hostile email, etc) supplies a specially-crafted URL that allows the attacker to run an arbitrary program.

Example: Buffer overflow in Poster Software PUBLISH-iT

CVE-2014-0980 describes a classic buffer overflow vulnerability where the developer(s) failed to perform length validation. The result is that a remote attacker can run arbitrary code. A proof-of-concept attack exists showing the code execution capability.

Input validation: Finding input locations

For Java (and all other languages), a static analysis system that performs data flow analysis (e.g., IBM’s AppScan Source) can identify all input locations. In particular, any static analysis system that performs taint tracking or taint analysis can identify all input locations. For a free, academic example, see Andromeda: Accurate and scalable security analysis of web applications by Marco Pistoia, Patrick Cousot, Radhia Cousot, and Salvatore Guarnieri.

Another way of finding input is through a source code review or white-box testing. Any source code that includes any classes in java.io obviously performs some kind of I/O. Similarly, network I/O occurs through classes in java.net.

You can use a dynamic analysis system or debugger to watch for calls to classes that do I/O. For a dynamic analysis system example, you could use Chord from Georgia Tech, but doing so would require some work on your part.

Input validation: Finding input locations

For web applications, a web spider such as the one in OWASP ZAP, the one in the Burp suite, or the one in the Paros proxy can help.

When you find an input location, document it and record what tests you performed to help avoid duplication of effort. For web applications, you want to know:

  • The URI of the request.

  • The parameters for the URI, including any optional ones and hidden ones.

  • The method (GET, POST).

  • The cookies used and set.

  • All HTTP headers the system might use.

Testing for other vulnerabilities also needs this information.

Input validation: Finding input locations

Ideally, you or others in the development group should know all input locations. However, because of hidden inputs, you might not know of all of them. Therefore, this section discusses ways that you can find them (and then test the input validation). Depending on the type of program and the OS environment, different approaches and tools might be useful. When you find an input location, document it and record what tests you performed to help avoid duplication of effort.

GNU/Linux systems

All I/O must go through a system call. The GNU/Linux command strace will show you all system calls a program makes as it runs. This means you can identify all files opened, network connections attempted, etc. You could find this information in other ways, such as linking with a library that records all I/O, or using a debugger and setting a breakpoint at all calls that open a file or network socket. Using strace is the easiest.

As an example, I ran strace on the web browser Opera. Here is part of the output:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 49
connect(49, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("91.203.99.55")}, 16) = -1 EINPROGRESS
(Operation now in progress)
...
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 48
connect(48, {sa_family=AF_INET, sin_port=htons(443),
sin_addr=inet_addr("213.236.208.94")}, 16) = -1 EINPROGRESS
(Operation now in progress)
...
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 55
connect(55, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("199.7.71.190")}, 16) = -1 EINPROGRESS
(Operation now in progress)

In that output, you can see it communicating with the following IP addresses: 91.203.99.55, 213.236.208.94, and 199.7.71.190. To learn more about the communication, I used Wireshark:

[Figure: Wireshark capture of Opera's network communication]

From this, we now know:

  • The HTTP protocol parser must be tested (unsurprising for a web browser, but in another program this might be important).

  • The application uses a certificate revocation list, so we need to test the parsing of it.

Additionally, in the full strace output, we can see that this application also opens and reads from several files, performs DNS lookups, etc. In other words, a complex real-life program has many, many input sources.

Input validation: Beware hidden user input

Attackers can provide input to programs using subtle channels. Examples include:

  1. Environment variables.

  2. Windows registry values; way too many programs change things in the registry that they should never change.

  3. Configuration files, including dotfiles and files in /etc on Unix/Linux/BSD, files in the application’s directory on Mac OS and Windows systems, etc.

  4. Network configuration management, including Microsoft domain controllers and Mac OS X NetInfo.

  5. DNS lookups.

  6. Data from a database lookup.

  7. HTTP and other protocol request headers, in other words, not just content, but EVERYTHING arriving from the remote system (also known as the attacker’s system).

All of these and all of the ones not listed require input validation. In other words, ALL data that originates outside of your program must be validated. Many of these things should almost never change or are not hostile in normal usage. However, attackers will not let “normal” be normal.
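For instance, a value read from an environment variable deserves the same checks as any other input. The sketch below validates a TCP port number; the variable name APP_PORT, the EnvConfig class, and the parsePort method are hypothetical names for this illustration.

```java
import java.util.OptionalInt;

class EnvConfig {
    // Validates a port number that came from an untrusted source.
    // Accepts only 1 to 5 ASCII digits with a value from 1 to 65535.
    static OptionalInt parsePort(String raw) {
        if (raw == null || raw.isEmpty() || raw.length() > 5) {
            return OptionalInt.empty();
        }
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalInt.empty(); // no signs, spaces, or non-ASCII digits
            }
        }
        int port = Integer.parseInt(raw); // cannot overflow: at most 5 digits
        return (port >= 1 && port <= 65535) ? OptionalInt.of(port) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // APP_PORT is a made-up variable name for this example.
        OptionalInt port = parsePort(System.getenv("APP_PORT"));
        if (port.isPresent()) {
            System.out.println("using port " + port.getAsInt());
        } else {
            System.out.println("APP_PORT is missing or invalid; set it to a number from 1 to 65535");
        }
    }
}
```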

Examples

At least two bugs in Windows allow an attacker to perform a privilege escalation or denial of service attack via crafted registry values: CVE-2010-3961 and CVE-2010-0238.

An Apple bug allowed attackers to use a crafted DNS PTR record to blacklist arbitrary clients. This is CVE-2010-0500.

A buffer overflow in the DNS resolver library allowed an attacker to run arbitrary code on victim machines. This is CVE-2005-0033.

To summarize the take-away from these examples: Anything you did not code is attacker-supplied data.

Input validation: JSON and input validation frameworks

Input validation is even harder for JSON than for XML, because schemas for it are just now being standardized. For example, you can play with a JSON schema generator. This lack of a standard for JSON schemas means that automatic JSON validation is extremely rare in practice. Even when this changes, the same comments about schemas and business rules from XML apply to JSON: the schema must be properly used, and a schema can rarely be used to validate business rules.

For Java, you can use Francis Galiegue’s json-schema-validator or the JSON tools com.sdicons.json.validator. For C#, you can use the Json.NET JsonSchema and JsonValidatingReader classes. For C++, you might look at the Avro library or the one that is part of the Chromium project. JavaScript is not one of the languages for this book, but for it you can use dojox.json.schema.

With JSON, you have to also be aware of the possibility of injection attacks; I will cover these in more detail in an upcoming blog post.

Just like with XML, if you are not using a schema to validate the data, YOU must be doing all of the validation.

Input validation: input validation frameworks

Some object frameworks help with the syntax part of the input validation. Examples of input validation frameworks include the OWASP ESAPI Validation API, Struts, the Apache commons validator, the Hibernate validator, Java EE Bean validation, some uses of XML, and rare JSON libraries. When these objects or frameworks are available, you should make use of them.

Remember that using the framework or object does not solve all input validation issues. When you use the framework, you must properly describe the tests to perform. Watch out for “It needs to ship yesterday. We’ll finish the input specifications later.” Also, you must ensure that you use the framework properly. For example, CWE-101 is Struts Validation Problems, of which there are 10 sub-weaknesses. In other words, while using it can help, developers regularly get it wrong and so the attackers win.

Frameworks and objects can often handle length and type. Sometimes they can handle syntax. They are rarely capable of checking business rules. This means that you are responsible for the business rules validation.
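The split between what a framework can check and what remains your job can be sketched in plain Java. In this hypothetical example, the first method stands in for the length, type, and syntax checks a framework could express declaratively, while the second is a business rule that depends on application state. The OrderValidator class and the stock-level rule are assumptions made for illustration.

```java
class OrderValidator {
    // Syntax, type, and length: 1 to 3 digits, no leading zero (1..999).
    // This is the kind of rule a validation framework can usually express.
    static boolean isWellFormedQuantity(String raw) {
        return raw != null && raw.matches("[1-9][0-9]{0,2}");
    }

    // Business rule: you cannot order more than is in stock.
    // This depends on application state, so you have to write it yourself.
    static boolean isOrderable(int quantity, int unitsInStock) {
        return quantity <= unitsInStock;
    }

    static boolean validate(String rawQuantity, int unitsInStock) {
        // Parse only after the syntax check has passed.
        return isWellFormedQuantity(rawQuantity)
            && isOrderable(Integer.parseInt(rawQuantity), unitsInStock);
    }
}
```

A request for "500" units passes the syntax check but still fails validation when only 20 units are in stock.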