Journal of Pundra University of Science and Technology

Abstract

In this paper, we propose a machine learning-based framework for detecting phishing websites using URL-derived features. Phishing remains one of the most prevalent cybersecurity threats, exploiting deceptive links to steal sensitive user credentials. Traditional blacklist-based and rule-based methods often fail to identify zero-day and rapidly evolving phishing attacks. To address this, we developed and evaluated nine machine learning models, including ensemble and hybrid approaches such as Artificial Neural Network (ANN) + Random Forest (RF) and Logistic Regression (LR) + Gradient Boosting (GB). The framework uses a dataset of 11,430 URLs with 88 lexical, structural, and domain-related attributes. Experimental results show that the hybrid ANN + RF model achieved the best performance, with an accuracy of 96.94%, precision of 97.57%, recall of 96.19%, F1-score of 96.88%, and AUC of 0.99. Moreover, comparative analysis confirmed that ensemble models outperform individual classifiers in detecting phishing URLs while maintaining high generalization capability. The proposed framework demonstrates strong potential for real-time deployment in web browsers, email filters, and cybersecurity gateways, thereby contributing to the development of adaptive, data-driven defenses against modern phishing threats.

Keywords

Phishing, Cybersecurity, URL, Supervised Learning, Website Security, Machine Learning.

Introduction

Phishing attacks have become a dominant threat vector in modern digital ecosystems, exploiting human vulnerabilities rather than software flaws to steal private information, including financial data, passwords, usernames, and personal identification numbers. These attacks often mimic legitimate websites through deceptive URLs and visual spoofing, tricking users into voluntarily disclosing private data. As organizations increasingly adopt online services, phishing poses a severe threat to data privacy, financial stability, and digital trust.

Conventional phishing detection systems are primarily blacklist-based or rule-based and struggle to cope with the rapidly evolving and ephemeral nature of phishing domains. These methods suffer from high false-negative rates and an inability to detect newly launched phishing websites, also known as zero-day attacks [1, 2]. As a result, there is increasing interest in using machine learning (ML) to create more resilient and flexible phishing detection mechanisms. ML enables models to learn URL patterns and structural behaviors that distinguish phishing from legitimate websites, without relying on prior knowledge or manual rules [3, 4]. Several studies have effectively used ML algorithms, such as Support Vector Machines (SVM), Logistic Regression (LR), Random Forest (RF), and Gradient Boosting (GB), to detect phishing using handcrafted features extracted from URL strings [5, 6, 7]. Advanced ensemble methods and neural networks have further improved classification performance by combining multiple learning paradigms [8, 9].

Recent work by Abdul Samad et al. [1] demonstrated the effectiveness of fine-tuned Random Forest models on phishing URLs, achieving notable accuracy improvements. Similarly, Ahammad et al. [2] and Alam et al. [3] explored multiple ML classifiers and reported encouraging detection rates using lexical and host-based URL features. Aljammal et al. [5] emphasized the benefits of integrating multiple datasets and models for better generalization. Meanwhile, Aljabri and Mirza [4] explored deep learning enhancements, proposing hybrid ML-DL frameworks for improved robustness.

Despite these advances, challenges remain in selecting optimal feature sets, mitigating overfitting, and ensuring performance on unseen phishing variants. In this study, we address these limitations by conducting a comparative evaluation of nine machine learning models, including individual classifiers and ensemble combinations. Using a dataset of 11,430 labeled URLs and 88 extracted features, we assess each model's effectiveness using key performance measures: accuracy, recall, precision, F1-score, and AUC. Our findings show that ensemble approaches, particularly Random Forest combined with Artificial Neural Networks (ANN), outperform standalone models, demonstrating strong generalization and high detection accuracy. These results reinforce the potential of intelligent ML-driven systems in safeguarding users from dynamic and sophisticated phishing threats.
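
To make the notion of URL-derived lexical features concrete, the following Python sketch computes a handful of simple attributes from a URL string. It is an illustration only: the function name `lexical_features` and the specific features chosen here are assumptions made for this example and are not claimed to match the 88 attributes in the dataset used in this study.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Return a few illustrative lexical features for a URL.

    Hypothetical feature set for demonstration; not the paper's 88 attributes.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is None when the URL has no netloc

    # Crude IPv4 check: four dot-separated groups of 1-3 digits.
    is_ipv4 = re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) is not None

    # Naive subdomain count: labels beyond "name.tld". This ignores
    # multi-label public suffixes (e.g. "co.uk") and is set to 0 for IPs.
    num_subdomains = 0 if is_ipv4 else max(host.count(".") - 1, 0)

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "host_is_ipv4": int(is_ipv4),
        "uses_https": int(parts.scheme == "https"),
        "num_subdomains": num_subdomains,
    }


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    for example in ("https://secure-login.example.com/account?id=42",
                    "http://192.168.0.1/login"):
        print(example, lexical_features(example))
```

Each URL is thereby mapped to a fixed-length numeric vector, which is the form of input that classifiers such as Random Forest or Logistic Regression expect. Structural and domain-related attributes would need additional information beyond the URL string itself.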
