Abstract:
Taxonomy is an effective means of organizing, managing, and accessing large amounts of data. Data, however, is changing at a rapid pace these days. A taxonomy represents the themes inherent in data, so it needs to evolve to reflect changes occurring in the data; otherwise, it may not accurately represent the theme of the underlying data. Existing taxonomy generation techniques pay little attention to the changing nature of data. Evolution of a taxonomy for changing data can be handled either non-incrementally or incrementally. A non-incremental taxonomy evolution process reruns the whole taxonomy generation process from scratch and replaces the existing taxonomy with a new one. The majority of existing taxonomy generation techniques handle the evolution of taxonomy non-incrementally. Incremental taxonomy evolution, on the other hand, tries to accommodate changes occurring in the data within the existing taxonomy without rerunning the whole taxonomy generation process from scratch. The generation from scratch can make non-incremental taxonomy evolution a time-inefficient and computationally expensive choice compared to incremental evolution. However, only a limited number of existing techniques have focused on the incremental evolution of taxonomy. This work proposes a novel Taxonomy Incremental Evolution (TIE) technique that can evolve an existing taxonomy by incrementally updating it whenever new documents are added to the data. The TIE technique relies on a clustering-based taxonomy generation technique for the generation of the initial taxonomy and then evolves the existing taxonomy whenever changes occur in the underlying data; however, it does not depend on any specific clustering technique. When new documents arrive, the TIE technique first identifies, for each new document, the closest cluster in which it can be adjusted. It then checks the impact of this possible adjustment on cluster quality. If the cluster quality does not deteriorate, the new documents are simply merged into the cluster. In the case of quality deterioration, however, the impact of the new documents on cluster quality is identified by manipulating the range of closeness of the documents with the cluster. Based on the range of closeness of the new documents, the existing clusters are restructured to adjust the new documents, ultimately resulting in an evolved taxonomy. The TIE technique was compared with different non-incremental and incremental taxonomy evolution techniques based on time and quality parameters. Since the focus of this work is on unstructured textual data, a text dataset of scholarly articles from the computing domain was selected for evaluation. The time-based evaluation clearly shows that the TIE technique takes comparatively less time to achieve taxonomy evolution. The quality-based evaluation compares the lexical and hierarchical quality of the evolved taxonomy with the reference taxonomy. It was found that the lexical quality of the taxonomy evolved using TIE is overall better than that of both its non-incremental and incremental counterparts. However, the hierarchical quality of the taxonomy evolved using TIE is lower, especially in comparison to non-incremental taxonomy evolution techniques. The significance of the obtained results was also analyzed statistically using the t-test, whose outcome supports the observations from the time- and quality-based evaluations of TIE.
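To make the incremental update step described above concrete, the following is a minimal illustrative sketch of a TIE-style update loop. It is not taken from the paper: the closeness and quality measures (cosine similarity to the cluster centroid and average cohesion) and the restructure hook are assumptions chosen purely for illustration.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two document vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def closeness(doc, cluster):
    # Assumed closeness measure: similarity of the document to the cluster centroid.
    centroid = np.mean(cluster, axis=0)
    return cosine(doc, centroid)

def quality(cluster):
    # Assumed cluster-quality proxy: average member-to-centroid similarity (cohesion).
    centroid = np.mean(cluster, axis=0)
    return float(np.mean([cosine(d, centroid) for d in cluster]))

def incremental_update(clusters, new_docs, restructure):
    """Sketch of a TIE-style incremental evolution loop (hypothetical names)."""
    for doc in new_docs:
        # 1. Identify the closest existing cluster for the new document.
        target = max(clusters, key=lambda c: closeness(doc, c))
        # 2. Check the impact of tentatively merging the document.
        if quality(target + [doc]) >= quality(target):
            # 3a. No quality deterioration: merge the document directly.
            target.append(doc)
        else:
            # 3b. Quality deteriorates: restructure the affected cluster,
            #     guided by the document's range of closeness to it.
            restructure(clusters, target, doc, closeness(doc, target))
    return clusters

# Example usage (with a no-op restructuring step):
# clusters = [[np.array([1.0, 0.0])], [np.array([0.0, 1.0])]]
# incremental_update(clusters, [np.array([0.9, 0.1])], lambda *args: None)
```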
Moreover, the time and quality metrics were combined into a single quality-time ratio metric to get an overall idea of performance. It was found that the rate of improvement in taxonomy quality per unit time is the highest in the case of TIE as compared to the other techniques. However, the quality-time ratio also shows performance deterioration of TIE with increasing dataset size. This aspect was further investigated through a sensitivity analysis, whose results show that the TIE technique performs better when new data arrives in small chunks. Thus, the scalability of the TIE technique can be improved in future work.
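One plausible reading of the quality-time ratio, stated here only as an assumption since the abstract does not give its exact definition, is the quality improvement achieved per unit of evolution time:

\[
\text{QTR} = \frac{\Delta Q}{T} = \frac{Q_{\text{evolved}} - Q_{\text{initial}}}{T_{\text{evolution}}}
\]

where \(Q\) denotes a taxonomy quality score and \(T\) the time taken by the evolution process; under this reading, a higher ratio means more quality gained per unit of time.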