PHP: Data Validation and Cleaning Up Input

These are two related but different things. They are fascinating, because as you search the web you see all kinds of alarming reports of hackers and others trying to damage websites or data by passing scripts to the server. It's a crazy world out there.

Sadly, I am not as smart as some of the hackers out there, but I spent a lot of time trying to deal with issues, and have come up with some things that might be useful.

What is the Difference?

Data Validation is checking to make sure the user is entering correct or valid data. Cleaning Up Input is dealing with a user who tries to pass a script, or something else harmful, to the server through your code.

One confusion is the term "Sanitizing" data -- the definitions I can find online deal with irrevocably deleting data from a server, which is not what I want to deal with at all. Quite the opposite, my goal is to preserve the data ...

Data Validation

I spent a lot of time dealing with this issue. I have been a database developer for years, and data validation done on a local PC or computer on a network can be fairly involved. Dealing with it on the web can also be pretty involved.

Required Data

One tool I find very useful is HTML itself. For example, HTML input controls input, textarea and select all have the attribute required which can be added. If nothing is entered or selected, the user cannot submit a form. A simple example:

            <input class="my_input_class" name="my_input_name" required />

Used in a form, if a user does not fill out the input, and clicks the submit button, they will be taken back to the control and a tooltip will appear stating that they must fill it out.
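The required attribute is enforced by the browser only, so a request that is sent without going through the form can still arrive empty. A minimal sketch of repeating the check in PHP follows. The helper name require_field() is made up for this example and is not part of my validation file:

```php
<?php
// Hypothetical helper: returns an error message if the named field is
// missing or blank, otherwise an empty string.
function require_field( array $input, string $name ): string
{
   // treat a missing key and a whitespace-only value the same way
   $value = ( isset( $input[ $name ] ) && is_string( $input[ $name ] ) ) ? trim( $input[ $name ] ) : "";
   if( $value === "" )
   {
      return "The field '" . $name . "' is required -- please try again!";
   }
   return "";
}

// In a real page $input would be $_POST; a plain array is used here
// so the sketch runs on its own.
$input = [ "my_input_name" => "   " ];
$error = require_field( $input, "my_input_name" );
echo ( $error === "" ) ? "ok" : $error, "\n";
```

The function follows the same convention as the other routines on this page: an empty string means no problem was found.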

Data Types

Another thing that can be done with the HTML input control itself is to define the type of input. There are several places on the web that list all the options. I find the one on the W3 Schools site useful: [W3 Schools Input Tag] -- this lists a lot of options you may find useful.

Minimum and Maximum Length

Also part of the HTML input tag, you can set a minimum number of characters (the minlength attribute), and/or a maximum number of characters (the maxlength attribute) that can be input.
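Like required, these limits are only checked by the browser, so it seems worth repeating the check in PHP. A rough sketch, where check_length() and the limits passed to it are names and values made up for this example:

```php
<?php
// Hypothetical helper: returns an error message when $value is shorter
// than $min or longer than $max characters, otherwise an empty string.
function check_length( string $value, int $min, int $max ): string
{
   // strlen() counts bytes, which equals characters for a single-byte
   // character set such as ISO-8859-1
   $length = strlen( $value );
   if( $length < $min )
   {
      return "Please enter at least " . $min . " characters.";
   }
   if( $length > $max )
   {
      return "Please enter no more than " . $max . " characters.";
   }
   return "";
}

echo ( check_length( "Ken", 2, 20 ) === "" ) ? "ok" : "problem", "\n";
echo check_length( "K", 2, 20 ), "\n";
```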

Regular Expressions

If you search the web for specific types of data validation, you will find that most of the results will show you how to work with Regular Expressions (or RegEx, or RegExp). I find them complex, weird, and hard to understand. I don't think in these very shortcut methods of validation, and I find most examples confusing. What's worse is trying to work with the examples and come up with your own specific version. There are a few places I may concede and use them, but honestly, I ended up writing my own validation routines.

If Not Regular Expressions, then what?

I ended up creating some routines of my own that allowed me to do some decent validation, that I can understand, and don't seem to be much if any slower than anything else I've tried to work with.

First, I have a routine that builds an array (a list) of valid characters that can be input. Then I created some functions that use that list, and allow you to add other characters for special cases. I often need to use symbols that most Americans don't run across, so you will see in the function below a lot of letters with diacritical marks (used for accents). For my purposes these are all valid characters.

Side note: in order to view the characters listed in the function shown below, I had to force the header to display upper ansi characters, by adding this statement in the PHP at the top of this page:

            // display issues -- this forces display of upper ansi character set 
            header('Content-Type: text/html; charset=ISO-8859-1');

The function to build the list of characters is valid_char_array():

            // ----------------------------------------------------------------------------------------  
            // valid_char_array()
            // the purpose of this function is to deal with the whole
            // valid character issue. I am finding preg_match to be a PITA, it seems
            // to work, but then it doesn't in the weirdest ways. Hence, I am creating
            // my own damn routine:
            // valid_char_array returns an array of 'valid characters'. You can add
            // more once you get the returned array, by simply adding them, example:
            //    $myValArray = valid_char_array();
            //    $myValArray[] = "."; // add a period/dot symbol to the
            //                         // array
            //    You would then want to call the function:
            //      $valid_string = is_string_valid( $somestring, $myValArray )
            function valid_char_array()
            {
               $valArray = [];
               // here's the fun part, we're going to add all the valid characters, which
               // is kind of stoopid, but ...:>
               // standard alphabet (a-z, then A-Z):
               $valArray = array_merge( $valArray, range( 'a', 'z' ), range( 'A', 'Z' ) );
               // upper ascii/ansi alphabet --
               // letters with diacriticals
               // (save this file as ISO-8859-1 so each of these is a single byte)
               $valArray[] = "À"; // #0192
               $valArray[] = "Á"; // #0193
               $valArray[] = "Â"; // #0194
               $valArray[] = "Ã"; // #0195
               $valArray[] = "Ä"; // #0196
               $valArray[] = "Å"; // #0197
               $valArray[] = "Æ"; // #0198
               $valArray[] = "Ç"; // #0199
               $valArray[] = "È"; // #0200
               $valArray[] = "É"; // #0201
               $valArray[] = "Ê"; // #0202
               $valArray[] = "Ë"; // #0203
               $valArray[] = "Ì"; // #0204
               $valArray[] = "Í"; // #0205
               $valArray[] = "Î"; // #0206
               $valArray[] = "Ï"; // #0207
               $valArray[] = "Ð"; // #0208
               $valArray[] = "Ñ"; // #0209
               $valArray[] = "Ò"; // #0210
               $valArray[] = "Ó"; // #0211
               $valArray[] = "Ô"; // #0212
               $valArray[] = "Õ"; // #0213
               $valArray[] = "Ö"; // #0214
               // 215 is the multiplication sign (×), so it is skipped
               $valArray[] = "Ø"; // #0216
               $valArray[] = "Ù"; // #0217
               $valArray[] = "Ú"; // #0218
               $valArray[] = "Û"; // #0219
               $valArray[] = "Ü"; // #0220
               $valArray[] = "Ý"; // #0221
               $valArray[] = "Þ"; // #0222
               $valArray[] = "ß"; // #0223
               $valArray[] = "à"; // #0224
               $valArray[] = "á"; // #0225
               $valArray[] = "â"; // #0226
               $valArray[] = "ã"; // #0227
               $valArray[] = "ä"; // #0228
               $valArray[] = "å"; // #0229
               $valArray[] = "æ"; // #0230
               $valArray[] = "ç"; // #0231
               $valArray[] = "è"; // #0232
               $valArray[] = "é"; // #0233
               $valArray[] = "ê"; // #0234
               $valArray[] = "î"; // #0238
               $valArray[] = "ï"; // #0239
               $valArray[] = "ð"; // #0240
               $valArray[] = "ñ"; // #0241
               $valArray[] = "ò"; // #0242
               $valArray[] = "ó"; // #0243
               $valArray[] = "ô"; // #0244
               $valArray[] = "õ"; // #0245
               $valArray[] = "ö"; // #0246
               // 247 is the division symbol (÷), so it is skipped
               $valArray[] = "ø"; // #0248
               $valArray[] = "ù"; // #0249
               $valArray[] = "ú"; // #0250
               $valArray[] = "û"; // #0251
               $valArray[] = "ü"; // #0252
               $valArray[] = "ý"; // #0253
               $valArray[] = "þ"; // #0254
               $valArray[] = "ÿ"; // #0255
               // standard numeric characters:
               $valArray[] = "0";
               $valArray[] = "1";
               $valArray[] = "2";
               $valArray[] = "3";
               $valArray[] = "4";
               $valArray[] = "5";
               $valArray[] = "6";
               $valArray[] = "7";
               $valArray[] = "8";
               $valArray[] = "9";
               // other symbols:
               $valArray[] = "_"; // underscore
               $valArray[] = " "; // space
               $valArray[] = "-"; // dash/minus
               // done loading the array, return it
               return $valArray;
            } // eof: valid_char_array()

The function above creates the array, but you then need to scan the values one character at a time to check for valid characters. The code must check each individual character, and if it finds one symbol that isn't valid the function fails:

            // ---------------------------------------------------------------------------------------
            // is_string_valid() -- Function returns true/false.
            //                      This code takes two parameters, the string you are
            //                      checking, and the valid character array created from
            //                      valid_char_array() (with any additional symbols needed).
            //                      It loops through the string one character at a time
            //                      and checks to see if that character is in the
            //                      array. If it is NOT we return a false value and
            //                      are done. If we never hit the error, we return true.
            // ---------------------------------------------------------------------------------------
            function is_string_valid( $string, $valArray )
            {
               // loop through each individual character in a string,
               // compare it to see if it is valid by checking to see
               // if it is contained in the valid-character-array
               // (substr() positions start at 0, so the loop does too):
               for( $i = 0; $i < strlen( $string ); $i++ )
               {
                  $char = substr( $string, $i, 1 ); // get one character at position
                  // if not contained in the array:
                  if( ! in_array( $char, $valArray, true ) )
                  {
                     // we're done ...
                     return false;
                  }
               }
               return true; // if here, all is good
            } // eof: is_string_valid()

So basically the code works off the idea of a "white list" -- a list of valid characters, and if a character passed to the is_string_valid() function is not in the list, there is a problem. If all the characters are in the list, then the character string is considered to be valid.
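To see the white list idea run on its own, here is a cut-down sketch. The names demo_valid_chars() and demo_is_valid() are stand-ins made up for this example (they are not the functions from my file), and the list is limited to ASCII letters, digits, underscore, space and dash to keep it short:

```php
<?php
// Stand-in for valid_char_array(): one array entry per allowed character.
function demo_valid_chars(): array
{
   return str_split( "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_ -" );
}

// Stand-in for is_string_valid(): false as soon as one character
// is not in the allowed list.
function demo_is_valid( string $string, array $allowed ): bool
{
   if( $string === "" )
   {
      return true; // nothing to check
   }
   foreach( str_split( $string ) as $char )
   {
      if( ! in_array( $char, $allowed, true ) )
      {
         return false;
      }
   }
   return true;
}

$allowed = demo_valid_chars();
var_dump( demo_is_valid( "some_name-2", $allowed ) ); // bool(true)
var_dump( demo_is_valid( "file.txt", $allowed ) );    // bool(false), the dot is not allowed
$allowed[] = ".";                                     // add a period/dot for a special case
var_dump( demo_is_valid( "file.txt", $allowed ) );    // bool(true)
```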

There are a couple of other functions that I use. One is for "basic" input -- names, that kind of thing. The first is valid_characters() and does not modify the list of valid characters at all:

            // ---------------------------------------------------------------------------------------
            // valid_characters() -- function returns an error message if there is a problem,
            //                       otherwise it returns nothing. This is the simplest use
            //                       of the valid_char_array() and is_string_valid() functions.
            // ---------------------------------------------------------------------------------------
            function valid_characters( $value )
            {
               // get array of valid characters
               $aValid = valid_char_array();
               if( ! is_string_valid( $value, $aValid ) )
               {
                  return "There is an illegal character -- use letters, numbers, spaces and underscores(_) only -- please try again!";
               }
            } // eof: valid_characters

The second function that I created adds some characters which might be needed for some situations, called (not very originally) valid_characters2():

            // ---------------------------------------------------------------------------------------
            // valid_characters2() -- function returns an error message if there is a problem,
            //                        otherwise it returns nothing. 
            //                        This variation of valid_characters() allows
            //                        a bunch of other symbols that might be used in
            //                        a description, a paragraph of text, etc. Punctuation,
            //                        and other symbols.
            // ---------------------------------------------------------------------------------------
            function valid_characters2( $value )
            {
               // get array of valid characters
               $aValid = valid_char_array();
               // add any other characters you want to allow:
               $aValid[] = ","; // add comma to valid array
               $aValid[] = "."; // add dot/period to valid array
               $aValid[] = "!"; // add exclamation point to valid array
               $aValid[] = "?"; // add question mark
               $aValid[] = "$"; // add dollar sign
               $aValid[] = ":"; // add colon
               $aValid[] = ";"; // add semicolon
               $aValid[] = "("; // add paren
               $aValid[] = ")"; // add close paren
               $aValid[] = "'"; // add apostrophe
               $aValid[] = "£"; // add British pound
               $aValid[] = "¥"; // add Yen
               $aValid[] = "&"; // add ampersand
               $aValid[] = "%"; // add percent
               $aValid[] = "+"; // add plus
               $aValid[] = "="; // add equal
               $aValid[] = "#"; // add hash/pound
               // problem here -- single quote/apostrophe is one of those
               // specials that needs to be dealt with ... argh:
               $value = str_replace ( "\'", "'", $value );
               // test the test string without the single quote/apostrophe -- if
               // it returns okay, then we should be fine:
               $testString = str_replace( "'", "", $value );
               // allows for letters, numbers, spaces, underscores and the
               // punctuation added above:
               if( ! is_string_valid( $testString, $aValid ) )
               {
                  return "There is an illegal character -- use letters, numbers, spaces, underscores(_) and standard punctuation only -- please try again!";
               }
               return "";
            } // eof: valid_characters2()

For me, this is more readable than regular expressions, and works pretty well.
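For comparison, here is roughly what the same kind of white list looks like as a regular expression. This is only a sketch covering ASCII letters, digits, underscore, space and dash, and the function name demo_valid_by_regex() is made up for the example:

```php
<?php
// Hypothetical regex version of a white list check.
function demo_valid_by_regex( string $value ): bool
{
   // ^ and $ anchor the whole string; the D modifier stops $ from
   // matching just before a trailing newline
   return preg_match( '/^[A-Za-z0-9_ \-]*$/D', $value ) === 1;
}

var_dump( demo_valid_by_regex( "some_name-2" ) ); // bool(true)
var_dump( demo_valid_by_regex( "<script>" ) );    // bool(false)
```

It is shorter, but every allowed character has to be squeezed into that one bracketed pattern, which is the part I find hard to read and adjust.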

There is a lot more that can be done, and at the end of this page you can download my full "validation" file, and peruse at your will.

PHP Validation

PHP itself has some validation options, and if you want you can spend time digging into the PHP manual online. One example is the use of the filter_var() function to validate an email address. Note that this does not ensure that the address exists on the web, only that it is formatted properly:

            filter_var( $email, FILTER_VALIDATE_EMAIL )

The PHP manual page for this function can be found here: [PHP filter_var() Function].
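A short sketch of how the return value can be used. filter_var() gives back the address itself when it is well formed and false when it is not; the two sample addresses are made up for the example:

```php
<?php
foreach( [ "user@example.com", "not-an-email" ] as $email )
{
   // compare against false explicitly, since the success value is a string
   if( filter_var( $email, FILTER_VALIDATE_EMAIL ) === false )
   {
      echo $email, " is not a valid email address\n";
   }
   else
   {
      echo $email, " looks fine\n";
   }
}
```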

Another useful tool, more for cleaning up input, is the PHP function htmlspecialchars(). The function converts the characters that have special meaning in HTML into HTML entities, such as the less than and greater than symbols used to wrap HTML tags (< becomes &lt; and > becomes &gt;). When storing HTML in a table this is useful.

            $my_field = htmlspecialchars( $my_field );

By default this converts &, < and > (and double quotes). Adding the ENT_QUOTES option makes sure single quotes are converted as well, which is great, as quotes can cause problems when storing text in a table, since they can be used as delimiters too:

            $my_field = htmlspecialchars( $my_field, ENT_QUOTES );

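When you want to display the text in a web page, or whatever, you might want to use the PHP function htmlspecialchars_decode(). Keep in mind that this turns the entities back into live HTML, so only decode text you trust (or run it through something like the remove_js() function shown later on this page). If the text was encoded with ENT_QUOTES, pass the same flag when decoding, since PHP versions before 8.1 leave the single-quote entity alone by default:

            $my_field = htmlspecialchars_decode( $my_field, ENT_QUOTES );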
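To see the round trip in one place, here is a small sketch. The sample text is made up, and the third argument is an assumption that the text uses the same ISO-8859-1 character set as this page:

```php
<?php
$my_field = "<b>Ken's \"site\"</b>";

// encode the tag symbols plus both kinds of quotes
$encoded = htmlspecialchars( $my_field, ENT_QUOTES, 'ISO-8859-1' );
echo $encoded, "\n";
// prints: &lt;b&gt;Ken&#039;s &quot;site&quot;&lt;/b&gt;

// decode with the same flag to get the original text back
$decoded = htmlspecialchars_decode( $encoded, ENT_QUOTES );
var_dump( $decoded === $my_field ); // bool(true)
```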

Cleaning Up Input

One of the biggest concerns with cleaning up input from your users is dealing with someone inserting JavaScript into the input.

I spent some time digging around on the web, and found several variations of the following. I modified it a bit, because the original code looks for both a beginning and ending <script> tag, but I discovered while testing that if a beginning tag exists without an ending tag, the rest of a web page doesn't display. So I dealt with that in the code as well. The function is remove_js():

            // This code is from this page:
            // This function (lifted from an anonymous user on stackoverflow,
            // and renamed, cleaned up a little and commented) is designed to sanitize
            // user input by stripping out JavaScript from input. I added the call to
            // htmlspecialchars_decode() because that's an easy one to miss ...:
            function remove_js( $string )
            {
               $do = true;
               // convert encoded html (the &lt; entity for the < symbol, etc.)
               // back to HTML so we can look for the specific tags we're trying to find:
               $string = htmlspecialchars_decode( $string );
               while( $do )
               {
                  // find occurrence (if any) of the start tag
                  // (stripos() already ignores upper/lower case)
                  $start = stripos( $string, '<script' );
                  // find occurrence (if any) of the end tag, looking only after
                  // the start tag so a stray end tag earlier on is not matched
                  $stop = is_numeric( $start ) ? stripos( $string, '/script>', $start ) : false;
                  // if both values are numeric:
                  if( ( is_numeric( $start ) ) && ( is_numeric( $stop ) ) )
                  {
                     // scrub that out -- erase everything from the beginning
                     // tag to the ending tag:
                     $string = substr( $string, 0, $start ) . substr( $string, ( $stop + strlen( '/script>' ) ) );
                  }
                  // if we start with a script tag and have no endscript,
                  // strip everything from that to the end of what was entered
                  // the start tag without the end tag will kill the rest of the
                  // web page.
                  elseif( ( is_numeric( $start ) ) && ( ! is_numeric( $stop ) ) )
                  {
                     $string = substr( $string, 0, $start );
                     $do = false;
                  }
                  else
                  {
                     // we're done
                     $do = false;
                  } // endif
               } // endwhile
               return trim( $string );
            } // eof: remove_js()

Trying to Guess Passwords

In my CMS, I decided to check the logins. One thing that hackers do to try to get passwords is run robot programs that keep trying the same userid and attempt to guess the password. The simplest solution is to count the number of attempts a person makes to log in on a specific userid.

My CMS has code that a) checks the number of attempts, and b) blocks the account if the user gets the password wrong three times, meaning the user cannot attempt to log in again until the password is fixed. The code sends an email to the email address associated with the user profile, tells them the date and time the id was blocked, and tells them to get in touch with the administrator to deal with the situation.

I am not presenting the code here, it's fairly involved, and this page is already long. If interested, you'll have to download the CMS itself and dig through the code (once I make it available).
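Without reproducing the CMS code, the counting idea on its own can be sketched in a few lines. Everything here is an assumption made for illustration: the function name record_failed_login(), the constant MAX_ATTEMPTS, and the use of a plain array where a real site would use a database table. The email notification is left out:

```php
<?php
const MAX_ATTEMPTS = 3; // assumption: three wrong passwords blocks the id

// Hypothetical helper: records one failed login for $userid in the
// $attempts array and returns true once the id should be blocked.
function record_failed_login( array &$attempts, string $userid ): bool
{
   $attempts[ $userid ] = ( $attempts[ $userid ] ?? 0 ) + 1;
   return $attempts[ $userid ] >= MAX_ATTEMPTS;
}

$attempts = []; // stand-in for persistent storage
foreach( [ 1, 2, 3 ] as $try )
{
   $blocked = record_failed_login( $attempts, "some_user" );
   echo "attempt ", $try, ": ", ( $blocked ? "blocked" : "not blocked" ), "\n";
}
```

A successful login would normally reset the count for that userid, which this sketch does not show.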

Overview / Summary

Well, that's kind of a quick look at all this. The web is a tricky place. Hackers (black hat hackers, at least) have a strange attitude about the web: they treat it like a playground. Some of them are intentionally malicious, and some just do things to see if they can. It's impossible to catch every single permutation of what these people try, but you can do your best to get most of it.


As promised, if you want to download my own validation routines, they are given here. If you use them, please credit me -- that's the least you can do. [Ken's validation.php file]