Many web applications allow their users to upload files from their computer to the application’s remote server. An example of this type of application is an image sharing service where you can upload and share your vacation photos. This type of application has no reason to accept MS Word documents, PDF files, or mp3 files. It makes sense to have the application reject undesirable file types.
So how do you prevent users from uploading PDF files, MS Word documents, etc. to your image sharing service?
One approach is to maintain a list of allowed file extensions such as .gif, .jpg, and .png. A list like this, which contains only what is allowed, is referred to as a whitelist: we accept what is on the list and reject everything else. (The opposite is a blacklist: a list of what we reject, with everything else accepted.) With this whitelist of file extensions, the Controller/Action class handling the upload can check the file name against a simple regex to see whether it is allowed. If it is not, the application can notify the user.
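The extension whitelist can be sketched in a few lines of Java. This is only an illustration: the class name ExtensionWhitelist and the method isAllowed are made up for this example, and the pattern covers just the three extensions mentioned above.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any framework.
final class ExtensionWhitelist {
    // Whitelist from the text: only .gif, .jpg, and .png are accepted.
    // The pattern is applied to the lower-cased file name and requires
    // at least one character before the extension.
    private static final Pattern ALLOWED =
            Pattern.compile("^.+\\.(gif|jpg|png)$");

    private ExtensionWhitelist() {}

    /** Returns true when the file name ends in a whitelisted extension. */
    static boolean isAllowed(String fileName) {
        if (fileName == null) {
            return false;
        }
        return ALLOWED.matcher(fileName.toLowerCase(Locale.ROOT)).matches();
    }
}
```

A controller could call ExtensionWhitelist.isAllowed(uploadedFileName) and show an error message when it returns false. Note that this check looks only at the name, never at the bytes of the file.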
The problem with this approach is that the file name has nothing to do with the actual file content. A user can simply rename test.pdf to test.jpg and the system would accept it.
A better approach is to check the file’s content for its “magic number”. Wikipedia has a good explanation of magic numbers, but basically many file types have a distinct signature at the start of the file that can be used to identify them. For example, the content of a PDF file starts with %PDF. It even includes the version, like %PDF-1.3 to indicate PDF version 1.3 or %PDF-1.6 to indicate PDF version 1.6. Likewise, the content of a GIF image starts with GIF89a or GIF87a. So if someone uploads a file that claims to be a GIF image and it does not start with one of these signatures, then you know it is not a GIF.
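To make the idea concrete, here is a minimal hand-rolled check for the two signatures mentioned above. It is a sketch, not a full file type detector: the class MagicNumberCheck and its method names are invented for this example, and it assumes the first bytes of the upload have already been read into a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; it knows only the GIF and PDF signatures.
final class MagicNumberCheck {
    private static final byte[] GIF87A = "GIF87a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] GIF89A = "GIF89a".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF = "%PDF".getBytes(StandardCharsets.US_ASCII);

    private MagicNumberCheck() {}

    /** Returns true when content begins with exactly the bytes in signature. */
    static boolean startsWith(byte[] content, byte[] signature) {
        if (content == null || content.length < signature.length) {
            return false;
        }
        // Compare only the leading bytes; the rest of the file is ignored.
        return Arrays.equals(Arrays.copyOf(content, signature.length), signature);
    }

    /** True when the content starts with GIF87a or GIF89a. */
    static boolean looksLikeGif(byte[] content) {
        return startsWith(content, GIF87A) || startsWith(content, GIF89A);
    }

    /** True when the content starts with %PDF (any version). */
    static boolean looksLikePdf(byte[] content) {
        return startsWith(content, PDF);
    }
}
```

With this in place, a renamed test.pdf would fail looksLikeGif even if its file name ended in .gif, because its first bytes are %PDF.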
One way to quickly implement this idea in your application is to use Apache Tika. Tika began as a subproject of Lucene (it is now a top-level Apache project) and is used in Solr. From the Tika website: “Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.”
    import org.apache.tika.Tika;

    FileInputStream fis = ...; // uploaded file
    Tika tika = new Tika();
    String result = tika.detect(fis); // detected MIME type, based on the content
From this, you’ll get a MIME type as a String, such as image/png, image/jpeg, text/plain, or application/pdf. For a GIF image, you’ll get image/gif.
Are magic numbers a magic solution to validating the file type of an uploaded file? Could a magic number checker be tricked? It could: anyone can put the right signature bytes at the start of a file that otherwise holds arbitrary content. It is best to follow a defense-in-depth approach and check both the file extension and the content type any time you allow users to upload files to your system.
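One way to combine the two checks is to require that the file extension and the detected content type agree. The sketch below assumes the content type has already been detected (for example, the String returned by tika.detect) and is passed in as a parameter. The class UploadTypeCheck, the method isAcceptable, and the extension-to-type table are assumptions made for this example, and Map.of needs Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; the table covers only the
// three image extensions used as the whitelist in this article.
final class UploadTypeCheck {
    private static final Map<String, String> EXPECTED_TYPE = Map.of(
            "gif", "image/gif",
            "jpg", "image/jpeg",
            "png", "image/png");

    private UploadTypeCheck() {}

    /**
     * Accepts an upload only when its extension is whitelisted AND the
     * detected content type is the one expected for that extension.
     */
    static boolean isAcceptable(String fileName, String detectedType) {
        if (fileName == null || detectedType == null) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        // Reject names with no extension, no base name, or a trailing dot.
        if (dot <= 0 || dot == fileName.length() - 1) {
            return false;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String expected = EXPECTED_TYPE.get(extension);
        return expected != null && expected.equals(detectedType);
    }
}
```

With this check, the renamed test.jpg from earlier is rejected because its detected type is application/pdf rather than image/jpeg, and a real PDF named report.pdf is rejected because .pdf is not on the whitelist.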