dc.description.abstract |
Perhaps the most important aspect of maintaining legacy software systems is understanding
their architecture. Architectural documentation is often unavailable. Thus,
efforts need to be made to recover the architectural design from the source code. This
thesis addresses the problem of recovering the architecture of software systems for greater
understanding, and modularizing them for greater maintainability, using machine learning
techniques.
We use clustering to obtain a high-level view of a software system’s architecture by identifying
major sub-systems within it. For this purpose, we analyze the behaviour of existing
similarity and distance measures when applied to software artifacts, taking software
characteristics into account, which yields explanations for some previously unanswered questions.
We develop two new hierarchical clustering algorithms that address the problem of arbitrary
decisions taken by existing hierarchical algorithms. We also propose a similarity
measure suitable for software clustering. The performance of the proposed algorithms
and similarity measure is evaluated using internal and external assessment. Instead of
using only one expert decomposition for external assessment, as is commonly done, we
use decompositions prepared by 4-5 experts for each test system. Such an approach allows
us to validate the idea of multiple views of a software system. Experiments carried out
on five open source legacy software systems show that the performance of our proposed
algorithm is better than previously used algorithms.
Interpreting the results of clustering algorithms is often difficult. To make clusters
easier to understand, we propose a labeling scheme for clusters and compare two alternative
ranking schemes that can be utilized for this purpose. We demonstrate how the
labels assigned by our scheme aid understanding of the clustering process.
We also provide a comparison between cluster analysis and concept analysis
as modularization techniques, and give examples of their application to different software
structures, thus indicating the strengths and limitations of the two techniques.
Finally, we use association rule mining to gain insight into the low-level structure of
software systems by examining relationships between architectural quarks, i.e., functions,
global variables, and user-defined types. Metarule-guided association rule mining is used to
identify problems within structured legacy systems. Re-engineering patterns that present
solutions to these problems are proposed. Results for the test systems reveal interesting
characteristics which allow us to understand legacy systems and their evolution. |
en_US |