A 2-means Clustering Technique for Unsupervised Spam Filtering

Kostas Fragos


Unsolicited commercial e-mail, or “spam”, wastes network bandwidth and human effort in internet and mobile phone communication. Distinguishing legitimate emails from spam is also a hard problem. Most proposed algorithms rely on supervised learning, which has the drawback of requiring training on large email corpora that are costly to tag by hand. In this paper, we present an unsupervised method that filters spam emails without training on such corpora. We perform a 2-way classification using a 2-means clustering technique. To overcome the serious complications imposed by the high dimensionality of the data, the algorithm first transforms the data into a low-dimensional component space by applying Principal Component Analysis, and then clusters the transformed data. The method showed promising results when evaluated on the publicly available “SpamAssassin” corpus, which is provided by the Open Project for evaluation purposes. The achieved performance is comparable to that of systems based on supervised learning techniques.


Spam filtering; 2-means clustering; principal components analysis; feature selection.
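
The pipeline described in the abstract (reduce dimensionality with Principal Component Analysis, then split the messages into two groups with 2-means) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the feature extraction, the number of components kept, or how the clusters are initialised, so the toy word-count matrix, the choice of two components, the power-iteration PCA, and the farthest-point initialisation below are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center(rows):
    """Subtract the per-column mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(r, means)] for r in rows]


def leading_direction(rows, iters=200, seed=0):
    """Approximate the top eigenvector of X^T X by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]            # X v
        w = [dot(scores, col) for col in zip(*rows)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def pca_project(rows, k):
    """Project rows onto k approximate principal components."""
    centered = center(rows)
    residual = centered
    components = []
    for i in range(k):
        v = leading_direction(residual, seed=i)
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        residual = [[x - dot(r, v) * vj for x, vj in zip(r, v)]
                    for r in residual]
    return [[dot(r, v) for v in components] for r in centered]


def two_means(points, iters=100):
    """Lloyd's algorithm with k = 2 and a deterministic initialisation
    (assumption: first point, plus the point farthest from it)."""
    centers = [list(points[0]),
               list(max(points, key=lambda p: sqdist(p, points[0])))]
    labels = None
    for _ in range(iters):
        new_labels = [0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical word counts per message for the made-up vocabulary
# ["free", "win", "offer", "meeting", "report"].
counts = [
    [5, 4, 6, 0, 1],  # spam-like
    [6, 5, 5, 1, 0],  # spam-like
    [4, 6, 5, 0, 0],  # spam-like
    [0, 1, 0, 5, 6],  # legitimate-like
    [1, 0, 0, 6, 5],  # legitimate-like
    [0, 0, 1, 4, 6],  # legitimate-like
]

reduced = pca_project(counts, k=2)
labels = two_means(reduced)
print(labels)
```

On this toy matrix the first three messages fall into one cluster and the last three into the other. Because the method is unsupervised, the cluster indices carry no meaning by themselves; deciding which cluster is the spam one needs an extra rule, which the abstract does not describe.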












IJSBAR is published by (GSSRR).