ABSTRACT
Many applications are available for phishing detection. However, unlike spam prediction, only a few studies compare machine learning techniques for predicting phishing. The present study compares the predictive accuracy of several machine learning methods, including Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet), for predicting phishing emails. The comparison uses a data set of 2889 phishing and legitimate emails, with 43 features used to train and test the classifiers.
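As a rough illustration of what comparing predictive accuracy across classifiers can involve, the sketch below scores several classifiers on the same labeled test set. It is not the study's evaluation code: the abstract does not describe the procedure, so the metrics reported (accuracy, false positive rate, false negative rate), the two made-up features, the toy data, and the rule-based stand-in classifiers are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
# A classifier maps a feature vector to 1 (phishing) or 0 (legitimate).
Classifier = Callable[[Features], int]
Sample = Tuple[Features, int]


def evaluate(predict: Classifier, samples: List[Sample]) -> Dict[str, float]:
    """Score one classifier on labeled samples using confusion-matrix counts."""
    tp = fp = tn = fn = 0
    for features, label in samples:
        guess = predict(features)
        if guess == 1 and label == 1:
            tp += 1  # phishing correctly flagged
        elif guess == 1 and label == 0:
            fp += 1  # legitimate email wrongly flagged
        elif guess == 0 and label == 0:
            tn += 1  # legitimate email correctly passed
        else:
            fn += 1  # phishing email missed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


def compare(classifiers: Dict[str, Classifier],
            samples: List[Sample]) -> Dict[str, Dict[str, float]]:
    """Evaluate every classifier on the same test set so scores are comparable."""
    return {name: evaluate(clf, samples) for name, clf in classifiers.items()}


# Hypothetical toy test set: (link count, URL-contains-IP flag), label.
# These features and values are invented; they are not the study's 43 features.
TOY_TEST_SET: List[Sample] = [
    ((5.0, 1.0), 1),
    ((3.0, 0.0), 1),
    ((0.0, 0.0), 0),
    ((1.0, 0.0), 0),
    ((4.0, 0.0), 0),
    ((1.0, 1.0), 1),
]

# Hypothetical rule-based stand-ins for trained models such as LR or RF.
CLASSIFIERS: Dict[str, Classifier] = {
    "many_links": lambda x: 1 if x[0] >= 3 else 0,
    "ip_url": lambda x: 1 if x[1] >= 1 else 0,
}

results = compare(CLASSIFIERS, TOY_TEST_SET)
for name, scores in results.items():
    print(name, scores)
```

Reporting the two error rates alongside accuracy is a design choice made here because a missed phishing email and a blocked legitimate email are different kinds of mistake. In practice, each stand-in rule would be replaced by a model trained on a separate training split.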
- A comparison of machine learning techniques for phishing detection