Data Mining of SILC Data: Turkey Case

Authors

  • Olgun Özdemir, PhD Student, Yıldız Technical University, İstanbul, 34220, Turkey
  • İbrahim Demir, Assistant Professor, Yıldız Technical University, İstanbul, 34220, Turkey

Keywords:

Data mining, SILC, cluster analysis, latent class analysis, k-modes, random forests

Abstract

Official data produced by National Statistical Institutes (NSIs) have an essential place in governmental economic and social decision-making. Analyzing official data with data mining methods, rather than only with traditional statistical approaches, is crucial for extracting new information and hidden patterns. However, the usefulness of data mining methods for official statistics remains unexplored. In the present study, data from the 2015 Survey of Income and Living Conditions (SILC) conducted by the Turkish Statistical Institute (TurkStat) are examined with data mining methods. Cross-sectional data on 36,036 individuals were analyzed to determine the variables affecting individual income and to examine the welfare status of individuals. To determine the socio-economic profiles of individuals, latent class analysis (LCA) and k-modes cluster analysis were used. The socio-economic status of individuals was classified using clustering and random forest (RF) models. The LCA model with ten classes gives the probability that a newly selected individual belongs to each class. Latent class profiles were defined from the variable values of the class to which individuals belong with the highest probability. The ten clusters produced by k-modes were defined by their cluster modes, cluster profiles of individuals were derived from them, and the results were compared with the LCA results. In this study, which considers only categorical variables, LCA provided more consistent results than k-modes. In the RF model, where individual income is modeled as a function of all nine input variables, the importance of each variable was determined. Education, occupation, and age, in that order, were the most important variables and contributed most to the RF model.

For SILC data, which are extensive and detailed, methods such as LCA and RF appear well suited to data mining and to extracting meaningful results. Similar data mining processes could be applied to other official data sets.
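
As an illustration of how k-modes groups categorical records and describes each cluster by its mode, the following is a minimal sketch and not the authors' implementation. It assumes the simple matching dissimilarity (the count of differing categories) and random initialization from distinct records. The function names, the three variables, and the six toy records are hypothetical and are not taken from the SILC data.

```python
from collections import Counter
import random


def hamming(a, b):
    """Simple matching dissimilarity: number of positions whose categories differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, k, max_iter=100, seed=0):
    """Cluster tuples of categories; return (labels, modes)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct records so that no two modes coincide.
    modes = rng.sample(sorted(set(records)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest mode.
        new_labels = [
            min(range(k), key=lambda j: hamming(r, modes[j])) for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: recompute each mode column by column.
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the previous mode if a cluster is empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy records: (education, occupation, age group).
records = [
    ("primary", "manual", "50+"),
    ("primary", "manual", "50+"),
    ("primary", "manual", "30-49"),
    ("university", "professional", "30-49"),
    ("university", "professional", "30-49"),
    ("university", "professional", "18-29"),
]

labels, modes = k_modes(records, k=2)
for j, mode in enumerate(modes):
    print(f"cluster {j}: mode = {mode}, size = {labels.count(j)}")
```

Each cluster's mode plays the role of the cluster profile: it lists the most common category of every variable among the cluster's members. With real survey data, the result can depend on the initialization, so several seeds are usually compared.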

Published

2019-12-22

How to Cite

Özdemir, O., & Demir, İ. (2019). Data Mining of SILC Data: Turkey Case. International Journal of Sciences: Basic and Applied Research (IJSBAR), 48(7), 110–138. Retrieved from https://gssrr.org/index.php/JournalOfBasicAndApplied/article/view/10485