Abstract:
Malware is a growing threat to computer systems and networks around the world.
Ever since the malware construction kits and metamorphic virus generators became
easily available, creating and spreading obfuscated malware has become a
simple matter. The cyber-security vendors receive thousands of new malware
samples everyday for analysis. It has become a challenging task for the malware
analysts to identify if a given malware sample is a variant of a known malware
or belongs to a new breed altogether. Since making an accurate decision about
the nature of an unknown malware sample is crucial for updating of signature
databases and propagation of the update to their customers, therefore vendors of
cyber-security products need accurate malware classi cation techniques for this
purpose.
The research community has been active for providing a solution to the above
problem, and a number of diverse avenues have been explored such as machine
learning, graph theory, nite state machines, etc. Furthermore, many syntactic
and semantic aspects of computer programs have been tried out in search of the
best aspect that could be used to distinguish between harmful and harmless computer
programs, and to di erentiate malware belonging to di erent families. All
the proposed approaches have merits and demerits of their own, and the search for
a solution that maximizes the classi cation accuracy with minimal computational
costs is continued.
This dissertation formulates malware classi cation as a sequence classi cation
problem, and evaluates a widely used sequence classi cation tool, Hidden Markov
Model (HMM), for the task of malware classi cation. HMM has been a method of
choice for a broad range of sequential pattern matching applications such as speech
analysis, behavior modeling and handwriting recognition to name a few. The dissertation
rst proposes and evaluates novel methods of malware classi cation by
combining HMM and malware behavioral features, which are attributes frequently
used to distinguish between normal and malicious programs and to di erentiate
x
among malware families. As an another major contribution, the dissertation lls
a signi cant research gap by studying the role of an important HMM parameter,
the number of hidden states, in malware classi cation applications. Based on
observations from comprehensive experiments conducted on a large and diverse
dataset consisting of malware behavioral reports, the dissertation concludes that
although HMM shows encouraging results when used for malware classi cation
tasks, its potential from a practical standpoint is fairly limited. The dissertation
makes the third contribution by proposing to replace the HMM component of
malware classi cation method with Markov Chain Model (MCM), and performing
comparative evaluation between the two models. Results of the comparison prove
that classi cation performance achieved by HMM can be attained much more e -
ciently by MCM, and therefore MCM should be preferred over HMM for malware
classi cation applications.