Abstract:
Air pollution is a major environmental challenge in smart city environments. Real-time monitoring of pollution data enables local authorities to analyze the current traffic situation of the city and make decisions accordingly. The deployment of Internet of Things-based sensors has considerably changed the dynamics of predicting air quality. Existing research has used different machine learning tools for pollution prediction; however, a comparative analysis of these techniques is required to better understand their processing time on multiple datasets. In this paper, we perform pollution prediction using four advanced regression techniques and present a comparative study to determine the best model for accurately predicting air quality with respect to data size and processing time. We conducted the experiments using Apache Spark and performed pollution estimation on multiple datasets. The Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used as evaluation criteria for comparing these regression models. Furthermore, the processing time of each technique, both through standalone learning and through hyperparameter tuning on Apache Spark, is measured to find the best-fit model in terms of processing time and lowest error rate.
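
For readers unfamiliar with the two evaluation criteria, the following is a minimal sketch of how MAE and RMSE are commonly computed from observed and predicted values. It is written in plain Python for illustration and is not the paper's Spark implementation; the function names `mae` and `rmse` and the sample readings are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: the average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical pollutant readings and model predictions (illustrative only).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print("MAE: ", mae(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because RMSE squares each residual before averaging, it penalizes large individual errors more heavily than MAE, which is one reason the two metrics are often reported together.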