Comparison Study of Distance Methods in Nonlinear Panel Data Clustering with K-Means Method

  • Muayyad, IPB University, Jl. Raya Dramaga, Babakan, Dramaga District, Bogor City, West Java, Indonesia.
  • Indahwati, IPB University, Jl. Raya Dramaga, Babakan, Dramaga District, Bogor City, West Java, Indonesia.
  • Kusman Sadik, IPB University, Jl. Raya Dramaga, Babakan, Dramaga District, Bogor City, West Java, Indonesia.
Keywords: Coronavirus Disease, Calinski-Harabasz, Dynamic Time Warping, K-means

Abstract

Cluster analysis groups objects according to the similarity of their characteristics. This study applies cluster analysis to nonlinear panel data, using Indonesian Coronavirus Disease (COVID-19) data, with the aim of grouping provinces by their number of active positive cases using the K-means method. The first stage is a simulation to identify the best distance method for nonlinear panel data. The distance methods compared are Euclidean, Manhattan, Maximum, Frechet, and Dynamic Time Warping (DTW). After all distance methods were run on 36 scenarios from four data-generating models, the Maximum distance emerged as the best method, attaining the highest accuracy value 20 times, more often than any other distance method. The Maximum distance was then applied to the real data. The real-data results showed that the optimal number of clusters is three, with a Calinski-Harabasz (CH) criterion value of 143,459. Cluster A has 30 members, Cluster B has three members, and Cluster C has a single member, DKI Jakarta Province.
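
To make the compared distances concrete, the following is a minimal sketch, not the authors' implementation, of four of the five distances computed between two equal-length trajectories. The function names, the absolute-difference local cost used for DTW, and the two toy series are assumptions made for illustration; the Frechet distance is omitted for brevity.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared pointwise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute pointwise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    """Largest absolute pointwise difference (Chebyshev distance)."""
    return max(abs(a - b) for a, b in zip(x, y))


def dtw(x, y):
    """Dynamic Time Warping with an absolute-difference local cost (assumed)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in x only
                cost[i][j - 1],      # advance in y only
                cost[i - 1][j - 1],  # advance in both series
            )
    return cost[n][m]


# Hypothetical nonlinear trajectories: series_b is series_a delayed by one step.
series_a = [0.0, 1.0, 4.0, 9.0, 16.0]
series_b = [0.0, 0.0, 1.0, 4.0, 9.0]

distances = {
    "euclidean": euclidean(series_a, series_b),
    "manhattan": manhattan(series_a, series_b),
    "maximum": maximum(series_a, series_b),
    "dtw": dtw(series_a, series_b),
}
```

In this toy case the pointwise distances penalise the one-step delay at every time point, whereas DTW absorbs the delay through warping and the Maximum distance reflects only the single largest gap. A matrix of such pairwise distances between provinces could then be passed to a K-means-style procedure.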

Published
2022-05-16
How to Cite
Muayyad, Indahwati, & Kusman Sadik. (2022). Comparison Study of Distance Methods in Nonlinear Panel Data Clustering with K-Means Method. International Journal of Sciences: Basic and Applied Research (IJSBAR), 62(2), 106-117. Retrieved from https://gssrr.org/index.php/JournalOfBasicAndApplied/article/view/13957