Friday, February 15, 2013

Code for File-type Identification

Of late I've received many requests to share, and queries about, the code used in our work 'Statistical Learning for file-type Identification' [pdf]. Unfortunately, most of the setup and experiments were run using hacky Python scripts that I no longer have access to. However, I was able to find one tool that uses 2-gram based features from files to identify the file type. Here is the source code.
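
For readers who just want a feel for what 2-gram features look like, here is a minimal sketch in Python. It is not the tool linked above. It assumes that a 2-gram means a pair of consecutive bytes and that features are normalized frequencies; the function name `byte_bigram_features` is made up for illustration, and the actual tool may differ on all of these points.

```python
from collections import Counter


def byte_bigram_features(data: bytes) -> dict:
    """Return normalized frequencies of overlapping byte 2-grams.

    Assumption: a 2-gram is a pair of consecutive bytes. Keys are
    (byte, byte) tuples of ints, values are relative frequencies.
    """
    # zip(data, data[1:]) yields every overlapping pair of adjacent bytes.
    counts = Counter(zip(data, data[1:]))
    total = sum(counts.values())
    if total == 0:
        # Inputs shorter than two bytes contain no 2-grams.
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    # Example: the bytes of "abab" contain "ab" twice and "ba" once.
    print(byte_bigram_features(b"abab"))
```

To use it on a real file, read the file in binary mode (for example `open(path, "rb").read()`) and pass the resulting bytes in. The resulting dictionary can then be turned into a fixed-length vector over all 65,536 possible byte pairs and fed to whatever classifier you prefer.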

The preprocessed version of the original RealisticDC data used in the paper is available here (log in with 'dluser' and 'dluser'). The list of file types and their distribution, along with the training and testing split for each fold, is also provided.

The unpublished manuscript here (with potential typos) may contain some details that are missing from the paper.

