Article

A comparison of machine learning techniques for phishing detection

Authors:
Saeed Abu-Nimeh, Southern Methodist University, Dallas, TX
Dario Nappa, Southern Methodist University, Dallas, TX
Xinlei Wang, Southern Methodist University, Dallas, TX
Suku Nair, Southern Methodist University, Dallas, TX

eCrime '07: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit, October 2007, Pages 60–69. https://doi.org/10.1145/1299015.1299021

Published: 04 October 2007

ABSTRACT

Many applications are available for phishing detection. However, in contrast to spam prediction, only a few studies compare machine learning techniques for predicting phishing. The present study compares the predictive accuracy of several machine learning methods for predicting phishing emails: Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNet). The comparison uses a data set of 2889 phishing and legitimate emails, and 43 features are used to train and test the classifiers.
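To make the idea of comparing classifiers by predictive accuracy concrete, the following is a minimal sketch and not the study's actual pipeline. It assumes synthetic data (5 toy features rather than the 43 used in the study), a simple train/test split, and only two models: a plain gradient-descent logistic regression and a majority-class baseline. All function names and parameters here are hypothetical.

```python
import math
import random


def make_synthetic_emails(n, n_features, rng):
    """Generate toy (features, label) pairs; label 1 = phishing, 0 = legitimate.

    Assumption: synthetic Gaussian features, NOT the data set from the study.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Shift the feature mean by class so the two classes are distinguishable.
        shift = 1.0 if label == 1 else -1.0
        features = [rng.gauss(shift, 1.0) for _ in range(n_features)]
        data.append((features, label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(train, epochs=50, lr=0.1):
    """Fit logistic regression with stochastic gradient descent; return a predictor."""
    weights = [0.0] * len(train[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in train:
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = p - label  # gradient of the log loss with respect to z
            bias -= lr * error
            weights = [w - lr * error * x for w, x in zip(weights, features)]

    def predict(features):
        z = bias + sum(w * x for w, x in zip(weights, features))
        return 1 if z >= 0 else 0

    return predict


def train_majority_baseline(train):
    """Always predict the most common training label (a sanity-check baseline)."""
    positives = sum(label for _, label in train)
    majority = 1 if positives * 2 >= len(train) else 0
    return lambda features: majority


def accuracy(predict, test):
    """Fraction of test examples classified correctly."""
    correct = sum(1 for features, label in test if predict(features) == label)
    return correct / len(test)


def compare_classifiers(seed=0):
    """Train each model on the same split and report held-out accuracy."""
    rng = random.Random(seed)
    data = make_synthetic_emails(400, 5, rng)
    train, test = data[:300], data[300:]
    trainers = {
        "logistic_regression": train_logistic_regression,
        "majority_baseline": train_majority_baseline,
    }
    return {name: accuracy(fit(train), test) for name, fit in trainers.items()}


results = compare_classifiers()
for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Training every model on the same split keeps the comparison fair. A fuller comparison would likely also report false positive and false negative rates separately, since accuracy alone hides which kind of mistake a classifier makes.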

Published in
eCrime '07: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit
October 2007, 90 pages
ISBN: 9781595939395
DOI: 10.1145/1299015
General Chair: Lorrie Faith Cranor, Carnegie Mellon University
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
BART
CART
NNet
SVM
classification
logistic regression
machine learning
phishing
random forests