DOI: 10.1145/1299015.1299021
Article

A comparison of machine learning techniques for phishing detection

Published: 04 October 2007

ABSTRACT

There are many applications available for phishing detection. However, unlike spam prediction, only a few studies compare machine learning techniques for predicting phishing. The present study compares the predictive accuracy of several machine learning methods, including Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet), for predicting phishing emails. A data set of 2889 phishing and legitimate emails is used in the comparative study, and 43 features are used to train and test the classifiers.

      • Published in

        eCrime '07: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit
        October 2007
        90 pages
        ISBN:9781595939395
        DOI:10.1145/1299015

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • Article
