
Complementing blacklists: An enhanced technique to learn detection of zero-hour phishing URLs.

ABSTRACT

The rise in phishing attacks despite existing anti-phishing tools suggests that the tools are not keeping pace with the attacks technically. The majority of the tools depend on blacklists, which fall well short in tackling zero-hour attacks, while existing heuristic tools also perform poorly. We propose a machine learning classifier to complement a blacklist approach. The classifier uses a wider range of predictive features than those in similar studies, categorized as URL characteristics, web page contents, domain features and domain reputation/ranking. Using six different machine learning algorithms and a dataset of 890 URLs, our classifier achieved the best performance compared to similar solutions, attaining an accuracy of 99.89%, a false positive rate of 0.0% and a false negative rate of 0.1%. Domain reputation features were the most predictive while web page content features were the least predictive. Individually, blacklist reputation and Alexa ranking were the most influential features whereas popup login windows and hexadecimal numbers were the least influential.

KEYWORDS

Phishing, blacklist, heuristics, anti-phishing, phishing characteristics, URL, web page, domain, machine learning, phishing classifier.

1. INTRODUCTION

Despite increased global technological anti-phishing efforts, phishing attacks have been increasing at a tremendous rate over the years (as shown in figure 1). This indicates that the existing anti-phishing tools are not keeping pace with the rate of innovation of the attacks. Web browsers, browser plug-ins and anti-virus suites are the main anti-phishing tools deployed to detect and block phishing attacks. These tools have built-in features which use blacklists and/or URL/web page heuristics to detect phishing sites.

A blacklist is a frequently updated database managed by the security vendor, often in collaboration with its partners and user community, containing reported and confirmed phishing URLs [2]. If a downloaded web page is in a blacklist, the page is blocked; otherwise the user is allowed to access it. Heuristic approaches, in contrast, use known characteristics of phishing URLs and/or web pages to predict the phishing nature of unknown URLs [2].

Some of these tools have adopted both blacklist and heuristic approaches while others have opted for only one of them. Internet Explorer, the Netcraft toolbar and ESET Smart Security, for instance, use both, while the Firefox browser and the TrustWatch toolbar use a blacklist only [2]. Generally, the majority of anti-phishing tools depend on blacklists. However, the effectiveness of a blacklist depends on the size of the vendor's user community, the number of its reporting agents scattered in the wild and the time taken to identify and report a phishing attack. The first two are usually unknown to the public and may vary greatly among vendors, so the tools' performances are expected to vary significantly as well.

The average uptime of a phishing website is 30 hours, but uptimes typically range from a few hours to several months before the site is reported, blacklisted and taken down [3]. Google's blacklist, for instance, takes between 20 minutes and 11 days to record a newly reported phishing site in the wild, while Microsoft's blacklist takes between 9 minutes and 9 days to do the same [4]. With these time windows, phishing attacks are able to do massive damage in their first few hours or days before being blacklisted.

The heuristics of the most widely used tools that deploy URL/web page heuristic analysis are not publicly known, and it is therefore difficult to determine which features are used and how each affects the overall performance of the tools. Several studies, including [4], [5], [6], [7], [8], [9] and [10], have evaluated the performance of common anti-phishing tools. Generally, the majority of the tools achieved a success rate below 92% while producing significantly high false positive and false negative rates. In a study by [9] comprising 2,291 unique phishing URLs, the Chrome v21 browser achieved the highest blocking rate of phishing URLs at 94% whereas Firefox v15 had the lowest, at 90%. However, the study in [7] showed that Chrome and Firefox produce false negative rates as high as 20%. Studies in [7] and [8] have reported relatively low performances for browsers in similar tests.

The study in [5] compared the anti-phishing capabilities of popular plug-ins, showing different blocking rates of phishing URLs at 0 hours and after 24 hours (see table 1). Only SpoofGuard achieved a high rate of 93% at 0 hours (though with a false negative rate of 37%), while the rest performed at 81% or below. [10] evaluated the blocking performance of eleven popular anti-virus suites against phishing URLs and observed that two products had rates of 92% and 85% while the rest performed below 64%. Despite the combination of blacklist and heuristic techniques in some of the tools, their performances still fall short of the ideal rates in both blocking and false alarms. The shortcomings of blacklists can be covered, and the overall performance of the tools improved, by complementing them with an improved heuristic technique which uses a wide range of predominant phishing features.

This study therefore proposes a machine learning technique which combines URL characteristics, web page contents, domain name features, Alexa's site ranking and top search engines' results to predict unknown phishing URLs. The technique complements a blacklist approach by including Google's blacklist reputation as one of the features. The algorithm uses a total of 24 attributes, more than other similar proposed tools (discussed in related works). As far as we could observe, this is the first study to incorporate such a wide range of features.

Using a dataset of 384 phishing URLs and 506 legitimate URLs, a classifier was developed from this technique with an accuracy of 99.89% and false positive and false negative rates of 0.0% and 0.1% respectively. Six different machine learning algorithms were trained, tested and compared to get the highest performance. The performance was observed to be higher than that of both currently used and previously proposed tools. The classifier can be built into any vendor-neutral tool and can be adopted on different platforms such as browsers, email applications, chat applications and others.

The following sections explain the classifier's predictive features, related works, the study design, the study results and lastly the conclusion.

2. FEATURES FOR THE CLASSIFIER

In this study, the features used were categorized as 1) URL characteristics, 2) web page contents, 3) domain name features, 4) Google's blacklist reputation, 5) Alexa's site ranking and 6) top search engine results. A total of 24 attributes were produced out of these categories. This section elaborates on the features.

2.1. URL Characteristics

Special Words

As phishers design fake web pages to capture user credentials, the URLs redirecting users to these pages often include particular special words related to credentials to convince users of the authenticity of the pages. Commonly used words include confirm, validate, verify, update, username, password, email, account, banking and security. Others are secure, pay, webscr, login, log and signin, among others [11], [12]. We grouped all the words to represent one attribute; that is, if at least one of the words exists in the URL, the attribute is flagged.

Proposed rule:

At least one special word exists → Phishing

Otherwise → Legitimate
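
As a minimal sketch, the special-words rule could be automated as below. The study recorded its attribute values manually, so the function name `has_special_word` is hypothetical, the word list contains only the words named in the text, and plain substring matching is an assumption (it means, for example, that "log" also matches "blog").

```python
# Words named in the text; the paper's complete list ("among others") is not given.
SPECIAL_WORDS = (
    "confirm", "validate", "verify", "update", "username", "password",
    "email", "account", "banking", "security", "secure", "pay",
    "webscr", "login", "log", "signin",
)


def has_special_word(url: str) -> int:
    """Return 1 (phishing flag) if any special word occurs in the URL, else 0."""
    lowered = url.lower()
    # Assumption: a case-insensitive substring match anywhere in the URL.
    return int(any(word in lowered for word in SPECIAL_WORDS))
```

The function returns the same binary encoding (1 for a phishing indicator, 0 otherwise) that the study uses for its attributes.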

URL Obfuscation Characters

To successfully redirect users to a phishing domain server without the users' knowledge, phishers in some cases use the character '@' or its percent-encoded equivalent '%40' in the URLs [12], [13]. The part to the left of the character, which often contains the genuine domain name of the site, is interpreted as a username on the server, while the part to the right is the actual destination. Legitimate URLs are not expected to use the character.

To obfuscate genuine domain names and generate fake ones, phishers mostly use the '-' and '=' characters [11]. In this study, we treated more than one occurrence of either character as a sign of a phishing attempt. Phishing URLs are also known to have long paths, and extensive use of '/' was therefore considered to indicate a phishing attempt. In this study, the threshold for this character is set to 4. In other scenarios, phishers tend to use multiple subdomains of hacked genuine domains to host their phishing sites, resulting in an increased number of dots in the domain part of the URLs [14]. According to studies by [15] and [16], the average number of dots (in a URL excluding its path) is four or fewer for a non-phishing URL, while a URL with more than four is suspected of phishing.

Non-standard ports can be used by phishers to divert traffic to phishing servers [13]. The standard port for https is 443; use of other ports may indicate a potential phishing attack. Non-phishing websites are not expected to use non-standard ports.

In our classifier, the existence of any of the mentioned characters or their combinations represented a flag for phishing.

Proposed rule:

@ or %40 → Phishing

OR

- > 1 → Phishing

OR

= > 1 → Phishing

OR

/ > 4 → Phishing

OR

dots > 4 → Phishing

OR

Non-standard ports used → Phishing

Otherwise → Legitimate
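
The combined rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's procedure: `obfuscation_flag` is a hypothetical name, '-' and '=' are counted over the whole URL, '/' is counted in the path only, dots are counted in the host part, port 80 is assumed to be standard for http (the text only names 443 for https), and an unparsable port is treated as non-standard.

```python
from urllib.parse import urlsplit

# Port 443 for https comes from the text; port 80 for http is an assumption.
STANDARD_PORTS = {"http": 80, "https": 443}


def obfuscation_flag(url: str) -> int:
    """Return 1 if any obfuscation condition of the rule holds, else 0."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        port = parts.port
    except ValueError:
        # Assumption: a malformed port counts as a non-standard port.
        return 1
    checks = (
        "@" in url or "%40" in url,
        url.count("-") > 1,
        url.count("=") > 1,
        parts.path.count("/") > 4,   # assumption: slashes counted in the path only
        host.count(".") > 4,         # dots in the URL excluding its path
        port is not None and port != STANDARD_PORTS.get(parts.scheme),
    )
    return int(any(checks))
```

Because the conditions are joined with OR, a single binary attribute is produced for the whole group.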

IP and Digits in URLs

Phishers use IP addresses instead of domain names to hide their phishing servers. IP numbers can be presented in decimal, hexadecimal or octal forms [13], [15]. The IP number in hexadecimal requires at least seven '%' and therefore was used in this study as a threshold value to detect hexadecimals [12]. Use of digits in a host part or in a path is popular among phishing URLs [11].

Proposed rules:

IP in a URL → Phishing

Otherwise → Legitimate

% > 6 → Phishing

Otherwise → Legitimate

Use of digits in a URL → Phishing

Otherwise → Legitimate
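
A minimal sketch of these three rules is given below, assuming that each rule yields its own binary attribute. The function name `ip_and_digit_flags` and the dictionary keys are hypothetical. Only dotted-decimal IPv4 and IPv6 hosts are recognized by the standard `ipaddress` module; hexadecimal and octal host forms are approximated solely through the '%' count, as in the text. Checking digits in the host and path (rather than the query string) is also an assumption.

```python
import ipaddress
from urllib.parse import urlsplit


def ip_and_digit_flags(url: str) -> dict:
    """Return the three binary attributes of the IP/digits rules."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)   # raises ValueError for ordinary host names
        has_ip = 1
    except ValueError:
        has_ip = 0
    return {
        "ip": has_ip,
        "hex": int(url.count("%") > 6),   # at least seven '%' characters
        "digits": int(any(ch.isdigit() for ch in host + parts.path)),
    }
```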

URL and Domain Lengths

Long URLs are usually used by phishers to hide suspicious components in them. A number of studies have shown that URL and domain name lengths are among the key identifiers of phishing sites. [13] and [17] have suggested that a URL shorter than 75 characters indicates a non-phishing site while a URL longer than 75 characters is potentially a phishing one. For a domain name, a length of more than 30 characters is a phishing feature, as suggested by [13].

Proposed rule:

URL length > 75 → Phishing

Domain length > 30 → Phishing

Otherwise → Legitimate
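
The two length thresholds can be sketched as below. The function name `length_flags` is hypothetical, and measuring the domain length over the full host name (including subdomains) is an assumption, since the text does not say which part of the host is measured.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 75      # threshold from [13] and [17]
MAX_DOMAIN_LENGTH = 30   # threshold from [13]


def length_flags(url: str) -> dict:
    """Return binary attributes for the URL length and domain length rules."""
    host = urlsplit(url).hostname or ""
    return {
        "long_url": int(len(url) > MAX_URL_LENGTH),
        "long_domain": int(len(host) > MAX_DOMAIN_LENGTH),
    }
```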

Shortened URL Services

These are services which convert long URLs into short forms for easier management [18]. Shortened URLs redirect users to the original URLs when visited [19]. Some phishers use this technique to hide their phishing URLs by providing short URLs as links in phishing emails [18], [12]. We expect relatively little use of these services by legitimate sites, and we therefore raised a flag whenever a URL used such a service. URLs using the tinyurl.com, bit.ly, ow.ly and goo.gl short URL services were observed in our phishing dataset.

Proposed rule:

Short URL → Phishing

Otherwise → Legitimate
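
As a rough sketch, the shortened-URL rule amounts to a host lookup. The set below contains only the four services observed in the phishing dataset; the name `uses_shortener` and the stripping of a leading "www." are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# The four services named in the text; other shorteners are not covered.
SHORTENERS = {"tinyurl.com", "bit.ly", "ow.ly", "goo.gl"}


def uses_shortener(url: str) -> int:
    """Return 1 if the URL's host is one of the known shortening services."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return int(host in SHORTENERS)
```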

2.2. Web Page Contents

Redirection of HTML Hyperlinks

Phishing sites, in some cases, contain links which are intended to redirect users to malicious servers. To achieve this, the links in a site's anchors are designed to display domains which are not the same as the actual domains to be visited. We inspected all anchors in each web page, and pages containing anchors with this difference were flagged as phishing ones.

Proposed rule:

Redirection links exist → Phishing

Otherwise → Legitimate
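
One way to approximate this check automatically is sketched below with the standard `html.parser` module. It is a heuristic under several assumptions: only anchors whose visible text itself looks like a host name or URL (no spaces, at least one dot) are compared, a leading "www." is ignored, and relative links are skipped. The names `AnchorCollector` and `has_redirecting_anchor` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def _host(text: str) -> str:
    """Best-effort host extraction; returns '' when no host can be parsed."""
    candidate = text if "://" in text else "http://" + text
    try:
        host = urlsplit(candidate).hostname or ""
    except ValueError:
        return ""
    return host[4:] if host.startswith("www.") else host


def has_redirecting_anchor(html: str) -> int:
    """Return 1 if some anchor displays one host but links to another."""
    collector = AnchorCollector()
    collector.feed(html)
    for href, text in collector.anchors:
        # Assumption: only text that looks like a host name or URL is compared.
        if " " in text or "." not in text:
            continue
        shown, actual = _host(text), _host(href)
        if shown and actual and shown != actual:
            return 1
    return 0
```

Anchor text such as a version number ("v1.0") would be misread as a host by this heuristic, so a practical implementation would need a stricter test for domain-like text.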

Redirection in Server Form Handlers

Designing a web page to submit captured user credentials to the phishing server is the eventual aim of any phishing attack. To hide the server's name from the URL of a credential submitting web page, a form's action link is assigned a different domain from the one appearing in the page's URL [20]. When we observed this scenario, we categorized the page as a phishing one.

Proposed rule:

Redirection of form handler → Phishing

Otherwise → Legitimate
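
The form-handler rule can be sketched by comparing the host of each form's action with the host of the page URL. The names `FormActionCollector` and `has_foreign_form_handler` are hypothetical, and the exact host comparison is an assumption: a form posting to a different subdomain of the same organization would also be flagged.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FormActionCollector(HTMLParser):
    """Collect the action attribute of every <form> element."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.actions.append(dict(attrs).get("action") or "")


def has_foreign_form_handler(page_url: str, html: str) -> int:
    """Return 1 if any form submits to a host other than the page's host."""
    page_host = urlsplit(page_url).hostname
    collector = FormActionCollector()
    collector.feed(html)
    for action in collector.actions:
        # Relative or empty actions resolve against the page URL, so they
        # stay on the same host and are not flagged.
        target_host = urlsplit(urljoin(page_url, action)).hostname
        if target_host != page_host:
            return 1
    return 0
```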

On Mouse Over to Hide the Links

To hide phishing links, phishers sometimes use 'onMouseOver' feature to display a genuine looking link on top of a phishing link when a mouse is placed over it [20]. This site design is not popular in legitimate sites.

Proposed rule:

'OnMouseOver' feature is used → Phishing

Otherwise → Legitimate

Disabled Right Click

To hide the suspicious source code of phishing web pages, phishers tend to disable right click so that the view page source feature cannot be reached [20].

Proposed rule:

Right click is disabled → Phishing

Otherwise → Legitimate

Pop Up Login Windows

Pop-up windows requesting user credentials are widely deployed by phishers to mimic original web pages doing the same. A fake pop-up window can be fitted with an image of the true URL in the address bar, tricking users into thinking a genuine web page is being accessed [21]. Pop-up windows can also be placed on top of or beside genuine pages to capture the credentials [22]. This design is very rare in legitimate sites. In this study, we flagged a site as phishing if it had a pop-up window asking for the user's sensitive information.

Proposed rule:

A pop-up window to capture credentials is used → Phishing

Otherwise → Legitimate

2.3. Domain Name Features

Number of Domains in the URL

The use of domain names of spoofed brands in URL paths is a common trend in phishing attacks [12], [14]. The technique relies on the assumption that users do not know where the domain should appear in a URL or how many domains it should contain, and that as long as they see the domain of a brand they recognize, they will not pay attention to the rest. In these scenarios, the URLs will thus have more than one domain. A genuine site's URL is expected to have only one domain, placed before the URL's path. For instance, http://2iphoto.cn/https://www.paypal.com/cgi-bin/webscr?cmd=_login-run is a phishing URL found in our dataset, spoofing PayPal. The URL has two domains: 2iphoto.cn, the actual domain of the URL, and paypal.com, placed in the path to fool the user into thinking the PayPal site is being accessed.

Proposed rule:

Number of domains in the URL > 1 → Phishing

Otherwise → Legitimate
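
Counting domains in a URL requires deciding what looks like a domain. The sketch below uses a regular expression anchored on a small, purely illustrative list of top-level domains; the list, the pattern and the function names are assumptions, and a practical version would need a complete public suffix list.

```python
import re

# A small illustrative TLD list (assumption); not taken from the study.
TLDS = ("com", "net", "org", "info", "cn", "ru", "uk")
DOMAIN_PATTERN = re.compile(r"(?:[a-z0-9-]+\.)+(?:%s)\b" % "|".join(TLDS))


def count_domains(url: str) -> int:
    """Count domain-like tokens anywhere in the URL, including its path."""
    return len(DOMAIN_PATTERN.findall(url.lower()))


def multiple_domains_flag(url: str) -> int:
    """Return 1 if the URL contains more than one domain, else 0."""
    return int(count_domains(url) > 1)
```

Applied to the PayPal example from the dataset, the pattern finds 2iphoto.cn and www.paypal.com, so the URL is flagged; a file name such as index.html is not counted because "html" is not in the TLD list.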

DNS Records

Phishers using fraudulently registered domains know that they have a small time window to effectively phish data before being suspected, taken down and possibly caught. To maximize their impact, they register domains only for specific attacks and allow them to stay alive for a short time, usually between a few days and a few months [14], [23]. For this study, any site whose domain age is below 3 months is regarded as a phishing site; otherwise it is regarded as a legitimate one. To determine domain age, we queried WHOIS for each site. WHOIS is an online database providing detailed information on every registered domain, including its date of registration.

Some sites in our dataset were observed to have no records in WHOIS. We also categorized them as phishing ones, as any legitimate site is expected to have a valid registration. Absence from WHOIS could be due to deletion of their registrations by their registrars after being reported as phishing.

Proposed rules:

Domain age < 3 months → Phishing

Otherwise → Legitimate

WHOIS record is missing → Phishing

Otherwise → Legitimate
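
Given a registration date already retrieved from WHOIS, the two rules can be sketched as follows. The WHOIS lookup itself is not shown. Approximating three months as 90 days, leaving the age attribute at 0 when the record is missing, and the names `dns_record_flags`, `young_domain` and `whois_missing` are all assumptions made for this illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "3 months" is approximated as 90 days.
MIN_AGE = timedelta(days=90)


def dns_record_flags(registered_on: Optional[date], checked_on: date) -> dict:
    """Return binary attributes for the domain age and missing-WHOIS rules.

    registered_on is None when no WHOIS record was found.
    """
    if registered_on is None:
        # Assumption: with no record the age is unknown, so only the
        # missing-record attribute is set.
        return {"young_domain": 0, "whois_missing": 1}
    young = (checked_on - registered_on) < MIN_AGE
    return {"young_domain": int(young), "whois_missing": 0}
```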

Free Subdomain Services

Instead of compromising legitimate websites to launch phishing attacks, phishers have been using subdomain services to hide their identities [3]. This approach contributes to 14% of all phishing attacks [3]. Most of these services are free of charge and allow anonymous registration, which in turn encourages malicious registrations. Hostinger (890m.com), 5gbfree.com and altervista.org are the most used services observed in our phishing dataset. Established businesses and other organizations are not expected to use these services.

Proposed rule:

Subdomain service used → Phishing

Otherwise → Legitimate
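
A minimal sketch of this rule checks whether the host is a subdomain of a known free service. Only the three services named in the text are listed; the name `uses_free_subdomain` is hypothetical and a real list would be considerably longer.

```python
from urllib.parse import urlsplit

# The three services named in the text; not an exhaustive list.
FREE_SUBDOMAIN_SERVICES = ("890m.com", "5gbfree.com", "altervista.org")


def uses_free_subdomain(url: str) -> int:
    """Return 1 if the host is a subdomain of a listed free service."""
    host = urlsplit(url).hostname or ""
    return int(any(host.endswith("." + service)
                   for service in FREE_SUBDOMAIN_SERVICES))
```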

Extended Validation SSL Certificate (EV)

To ensure that confidentiality, message integrity and entity authentication are achieved between the client's browser and a web server, legitimate web pages prompting for user credentials are expected to deploy a valid extended validation (EV) SSL certificate [12], [21]. The indicators of EV, which were also used to cross-check the usage of EV in this study, are the https protocol (the URL must begin with https://), a padlock at the start of the URL and a green address bar or green URL text if the certificate is valid. If all these indicators were missing, the site was regarded as a phishing one.

Proposed rule:

URL without EV → Phishing

Otherwise → Legitimate

To fake use of https protocol in URLs, phishers use the word 'https' in a URL path or other parts apart from the beginning of the URL. Any URL with this feature was, therefore, regarded as a phishing one.

Proposed rule:

Https not at beginning of the URL → Phishing

Otherwise → Legitimate
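
This rule reduces to a string check, sketched below. The name `https_token_inside` is hypothetical, and matching the token case-insensitively anywhere after the scheme is an assumption.

```python
def https_token_inside(url: str) -> int:
    """Return 1 if 'https' appears anywhere other than as the URL's scheme."""
    lowered = url.lower()
    # Skip the leading scheme when the URL genuinely starts with https://.
    start = len("https") if lowered.startswith("https://") else 0
    return int("https" in lowered[start:])
```

For the phishing URL quoted in section 2.3, which starts with http:// and embeds https://www.paypal.com in its path, the function returns 1.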

2.4. Google's Blacklist Reputation

To enhance the performance of the classifier, we have included Google's blacklist reputation test as one of the features. We have chosen Google's database because it is widely deployed by other anti-phishing tools such as the Firefox and Safari browsers, suggesting that it has a large user community and many partners and is thus more reliable and accurate. Each URL was opened using the latest version of the Firefox browser (at the time of testing) with the setting 'Block reported web forgeries' enabled. A blocked URL was flagged as a phishing one. This feature was also used in studies by [11] and [23].

Proposed rule:

URL exists in Google's blacklist → Phishing

Otherwise → Legitimate

2.5. Alexa's Website Ranking

Alexa (1) is an online database measuring the global popularity of websites. The database measures the average number of daily users visiting a website and its page views over the last 3 months to determine its ranking position. It also determines the number of links to the website from other sites. The higher the numbers, the higher the ranking position. Legitimate websites are expected to rank high, but phishing sites, since they are younger and thus have fewer visitors and page views, should have negligibly low rankings or not be ranked at all. Each URL's domain was searched in Alexa to determine its ranking position and number of links. This feature was adopted from the study in [24].

Proposed rule:

URL's domain does not have a reputation in Alexa → Phishing

Otherwise → Legitimate

2.6. Top Search Engine Results

Search engines can also be used to measure the popularity of sites. A known and established site should generate several results related to its web pages, but a newer and unpopular site, such as a phishing one, may not be indexed by the engines. In this study, we have used the three largest search engines, Google, Yahoo and Bing, to optimize the performance of this feature. Each URL's domain was searched in each search engine, and if no page related to the domain was listed among the results on the first page, we flagged the domain as a phishing one. [11] and [25] adopted similar features.

Proposed rules:

URL's domain has no related results in the first page of Google search → Phishing

Otherwise → Legitimate

URL's domain has no related results in the first page of Yahoo search → Phishing

Otherwise → Legitimate

URL's domain has no related results in the first page of Bing search → Phishing

Otherwise → Legitimate

3. RELATED WORKS

A number of researchers have conducted similar studies. Some studies focused on determining URL and/or web page content features that can differentiate phishing sites from legitimate ones, while others went further to design classifiers to learn and detect phishing sites.

[26] analyzed trends in phishing attacks by looking at features harvested from two datasets of phishing URLs obtained from the APWG landing page, one in 2008 and the other in 2014. A number of the features observed were similar to some of those used in this study. These include the use of multiple domains in the same URL, IP addresses in the URLs, domain age, use of free subdomain services and use of shortened URL services. They also compared the performance of five popular web browsers in detecting phishing URLs from each dataset. [23] reviewed common features of phishing sites, such as domain properties, URL characteristics and Google's blacklist, that could be used to identify them.

Using a dataset of about 30,000 legitimate and phishing URLs from the Yahoo directory, DMOZ, phishTank and Spamscatter, [27] developed a classifier to detect phishing URLs. They focused on URL features only and not on web page contents. These features were phishing URL characters and domain properties. The results of Naive Bayes, SVM and Linear Regression learning models were compared, yielding prediction accuracies between 95% and 99% with errors between 0.9% and 3%.

[25] developed a unique classifier to detect phishing, malicious and spam URLs. Attributes of the learner were lexical features, link popularity (in Altavista, AllTheWeb, Google, Yahoo and Ask search engines), web content features, network features and DNS properties. With a dataset of 72,000 URLs, multi-label classification achieved an accuracy of 98% with an error rate of up to 1.1%.

[28]'s study designed an automatic phishing classifier using a dataset of more than 500,000 and 2,778 phishing and clean URLs respectively. Attributes used were presence of IP addresses, number of host components, use of special keywords, Google's page ranking and domain reputation score by Gmail anti-spam system. Others were TF-IDF value of terms in a web page as well as domain nature of links in html hyperlinks and images. The classifier performed at an accuracy rate of 90% with false positive rate below 0.1%.

A study by [11] is another related work, which developed a classifier based on features relatively close to ours. With a dataset of 16,000 phishing and 31,000 non-phishing URLs, the classifier obtained an accuracy of 99% with a false positive rate of 0.2% and a false negative rate of 0.5%. The classifier's attributes were lexical features, URL phishing words, search engine page ranking (in Google, Yahoo and Bing) as well as the site's domain, IP and URL reputations in phishTank, stopBadware and Google's blacklist.

Our study has an edge over the others due to its use of a wide range of attributes for the classifier. For instance, the study in [27] did not use page ranking by Alexa and popular search engines, web page contents or blacklist checks, which in our study have shown high significance in predicting phishing URLs (see table 2). Similar cases can be established for the other studies mentioned. Our study's classifier has also achieved a very high accuracy rate using a relatively small dataset. Other studies have used large datasets to achieve performances that are about the same as or below ours.

4. STUDY DESIGN

4.1. Dataset Collection and Analysis

The study used a dataset of 384 phishing URLs and 506 clean URLs. Phishing URLs were collected from phishTank (2), an online database managed by openDNS to store phishing sites reported in the wild by the online community. Clean URLs were collected partly from DMOZ (3), the largest and most comprehensive human-edited web directory, and partly from a list of the top 500 popular websites in the world.

URLs were collected between July 25 and August 30, 2015. To qualify for the study, each URL had to be that of a web page that prompts for user credentials. A phishing URL also had to be alive and confirmed to be a phishing one at the time of data collection. All URLs were collected and tested against each of the classifier's features, and the results were recorded manually.

A total of 24 main features/attributes and one label were selected to design the classifier. Each attribute was assigned a binary value, either 1 if a URL has the phishing feature or 0 if it does not. We decided to scale all attributes, including those with real values, to binary values to ensure that all attributes carry the same importance in this classification. The label took one of two class values.

The following table summarizes attributes and their distribution in the dataset.

The dataset suggests that EV, redirection in the hyperlinks, Google's blacklist and domain ranking in Alexa and search engines are of great importance in prediction, due to significant differences in their counts between phishing and clean URLs. The low counts of the OnMouseOver, disabled right click and domain length features show that they are probably the least important attributes. A specific importance measure for each attribute is analyzed in the results section.

4.2. The Setup

The experiments were done on a 64-bit Windows 10 host with 12GB RAM and an Intel Core i7 processor at 2.5GHz. Firefox browser version 40 was used to access web pages and test them against Google's blacklist.

RapidMiner Studio 6.0 Starter Edition was used to develop the classifier and measure its performance. The dataset was stored in MS Excel 2013 and imported into the software from there. The 10-fold cross-validation technique was used to divide the training and testing data to effectively test the predictive accuracy of our classifier.

Two experiments were designed for this study. Experiment 1 was to train and test our dataset to develop and measure the performance of the classifier. Six different machine learning algorithms relevant to our study were used to develop the classifier with the highest possible performance. These are k-NN, Logistic Regression (LR), Support Vector Machine (SVM), Perceptron, Naive Bayes and Random Forest (RF). In experiment 2, we wanted to understand how influential each feature is in detecting phishing URLs. To achieve this, related features were grouped into four main sets and each set was then trained and tested using the best-performing learning method (from experiment 1). The results of these two experiments are discussed in section 5.

5. RESULTS

5.1. Classifier's Performance

In experiment 1, our objective was to train and test the complete dataset to develop a classifier to detect phishing URLs using the discussed features. To find the best performance of the classifier, we tested it with six selected machine learning algorithms using the 10-fold cross-validation technique. Each algorithm was tested using its default parameter settings. The performance criteria for this experiment were accuracy, false positive (FP) and false negative (FN) rates.

Table 3 shows the recorded performance of each algorithm. LR and SVM attained the highest performance in terms of accuracy, FP and FN, both at 99.89%, 0.0% and 0.1% respectively. All methods except Random Forest (RF) attained an accuracy of at least 97%, FP below 2% and FN between 0 and 0.1%. RF had the worst performance in terms of accuracy and FP. Had the parameter settings of each method been tuned rather than left at their defaults, the lower-performing methods might have performed better.
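
To make the three performance criteria concrete, the sketch below computes them from confusion-matrix counts. The source does not state how FP and FN were normalized; expressing them as a share of all URLs is an assumption, and the counts used in the example are hypothetical, chosen only because a single missed phishing URL out of 890 is consistent with the reported 99.89%, 0.0% and 0.1%.

```python
def rates(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return accuracy, FP and FN as percentages of all classified URLs."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "fp": 100.0 * fp / total,   # assumption: normalized by the whole dataset
        "fn": 100.0 * fn / total,   # assumption: normalized by the whole dataset
    }


# Hypothetical counts: 384 phishing and 506 clean URLs, one phishing URL missed.
example = rates(tp=383, tn=506, fp=0, fn=1)
```

With these counts the accuracy is 889/890, which rounds to 99.89%, and the FN share is 1/890, which rounds to 0.1%.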

5.2. Performances By Features

In experiment 2, we wanted to assess the contribution of each feature, as well as of the sets of features, in our classifier for comparison purposes. LR was selected for this experiment because it had the best classifier performance and because it generates a weight table through which we can measure the influence of each feature. To determine each individual feature's contribution, we extracted the weight table produced when we ran LR on the full dataset (from experiment 1). In table 4, large negative weights show a strong predictive contribution to phishing URLs, while small negative or positive weights indicate a weak contribution to prediction. Blacklist, Alexa ranking and EV certificate were the strongest contributors as they produced the largest negative weights. URL length, popup window and hexadecimal numbers were the weakest contributors.

To analyze predictive influence of sets of features, we grouped the features into four main sets, URL characteristics, web page contents, domain features and domain reputation. These sets represent the actual related features we have used in our classifier. In the experiment, only features of the same set of attributes, one set at a time, were picked and their dataset trained and tested with LR.

Domain reputation which represented ranking in blacklist, Alexa and search engines was observed to score the highest accuracy and the lowest classification error rates and FP. This means it was the most discriminative set of features. Web page contents had the lowest accuracy and the highest error rate and FP, meaning it was the least influential set of features in the classifier. URL characteristics were observed to have a better contribution when compared to domain features, as indicated in table 5.

6. CONCLUSION

Our classifier's accuracy of 99.89%, with 0.0% FP and 0.1% FN, was the best performance compared to the results reported by other similar studies. A major reason for this achievement is the use of a wide variety of features which were missed by other studies but have proven to be very influential in predicting phishing URLs. Domain reputation features were observed to be the most discriminative set of features, followed by URL characteristics. Web page content features, however, appeared to be the least significant in phishing prediction. This is the only study that has used Alexa ranking, Google's blacklist reputation and top results from the three major search engines to determine domain reputations.

EV certificate and redirections in hyperlinks, apart from domain reputation's blacklist and Alexa ranking, were the other most decisive individual features in the classifier. URL length, https inside the URL and hexadecimal numbers were among the least predictive features, but when combined with other URL-related characteristics their performance was far better, second only to that of domain reputation.

Despite the very high performance of the classifier, this study was limited by its small dataset size, by data gathered from very few sources and by the fact that the data was obtained in a small time window. As part of our future work, it would be interesting to see how the performance of the classifier responds to a much bigger dataset, gathered from a wide range of sources and over a longer period, for example 6 to 12 months. To collect more data, we also aim to build an automatic approach for retrieving attribute values instead of the manual approach. Moreover, it would be vital to assess the overall time taken by the tool to collect values automatically and classify a new URL, so as to determine the practical viability of the tool in real-time environments.

7. REFERENCES

[1] RSA (2013), "The year in phishing 2012", EMC. Available at: http://www.emc.com/collateral/fraud-report/online-rsa-fraud-report-012013.pdf [Accessed in August 2015].

[2] Nagunwa, T. (2014), 'Towards Mitigation of Phishing: The State of web Client Anti-phishing Technologies', International Journal of Advanced Research in Computer Science and Software Engineering, 4 (10): 720-734.

[3] APWG, (2014), "Global phishing survey 1H2014: Trends and domain name use", APWG. Available at: http://docs.apwg.org/reports/APWG_Global_Phishing_Report_1H_2014.pdf [Accessed in August 2015].

[4] Ludl, C., McAllister, S., Kirda, E. & Kruegel, C. (2007), 'On the Effectiveness of Techniques to Detect Phishing Sites', Proceedings of DIMVA '07, the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 20-39.

[5] Cranor, L., Egelman, S., Hong, J. & Zhang, Y. (2006), "Phinding Phish: An Evaluation of Anti-Phishing Toolbars", CyLab Carnegie Mellon University. Available at: https://www.cylab.cmu.edu/files/pdfs/tech_reports/cmucylab06018.pdf [Accessed in August 2015].

[6] Sheng, S., Wardman, B., Warner, G., Cranor, L., Hong, J. & Zhang, C. (2009), 'An Empirical Analysis of Phishing Blacklists', CEAS 2009 - Sixth Conference on Email and Anti-Spam, California USA, July 16-17, 2009.

[7] Zhang, J., Wu, C., Li, D., Jia, Z., Ouyang, X. & Xin, Y. (2012), 'An Empirical Analysis of the Effectiveness of Browser-based Anti-phishing Solutions', International Journal of Digital Content Technology and its Applications (JDCTA), 6 (7): 216-224.

[8] AV-Comparatives (2012), "Anti-phishing protection of popular web browsers", AV-Comparatives. Available at: http://www.av-comparatives.org/images/docs/avc_phi_browser_201212_en.pdf [Accessed in July 2015].

[9] Abrams, R., Barrera, O. & Pathak, J. (2012), "Browser Security Comparative Analysis: Phishing Protection", NSS Labs. Available at: https://library.nsslabs.com/reports/browser-security-comparative-analysis-phishing-protection-edition-2 [Accessed in July 2015].

[10] Abrams, R., Barrera, O. & Pathak, J. (2013), "Consumer AV/EPP Comparative Analysis: Phishing Protection", NSS Labs. Available at: https://library.nsslabs.com/reports/consumer-avepp-comparative-analysis-phishing-protection-edition-1 [Accessed in July 2015].

[11] Basnet, R., Sung, A. & Liu, Q. (2014), 'Learning To Detect Phishing URLs', International Journal of Research in Engineering and Technology (IJRET), 3 (6): 373-383.

[12] Nagunwa, T. (2008), Investigation of data privacy threats in online retail industry and assessment used in mitigating their impact, MSc Thesis, Dublin Institute of Technology.

[13] Basnet, R., Sung, A. & Liu, Q. (2011), 'Rule-based phishing attack detection', International Conference on Security and Management (SAM'11), Las Vegas. Available at: http://weblidi.info.unlp.edu.ar/worldcomp2011-mirror/SAM8471.pdf [Accessed August 2015].

[14] Fette, I., Sadeh, N. & Tomasic, A. (2006), 'Learning to Detect Phishing Emails', WWW '07 Proceedings of the 16th international conference on World Wide Web, ACM Digital Library, pp. 649-656.

[15] Mohammad, R., McCluskey, T.L. & Thabtah, F. A. (2012), 'An Assessment of Features Related to Phishing Websites using an Automated Technique', International Conference for Internet Technology and Secured Transactions (ICITST 2012), pp. 492-497. Available at: IEEE [Accessed August 2015].

[16] Mannan, M. & Oorschot, P.C. (2007), 'Security and Usability: The gap in real-world online banking', NSPW '07 Proceedings of the 2007 Workshop on New Security Paradigms, ACM Digital Library, pp. 1-14.

[17] McGrath, D. & Gupta, M. (2008), "Behind phishing: An examination of phisher modi operandi", APWG. Available at: http://docs.apwg.org/reports/behindPhishingWhitePaper.pdf [Accessed August 2015].

[18] Ollman, G. (2004), The Phishing Guide: Understanding and Preventing Phishing Attacks, The Next Generation Security Software. Available at: https://www.nccgroup.trust/uk/our-research/the-phishing-guide-understanding-preventing-phishing-attacks/ [Accessed in July 2015].

[19] Gilby (2008), TinyURL. Available at: http://tinyurl.com [Accessed in August 2015].

[20] Damodaram, R. & Valarmathi, M. (2012), 'Phishing website detection and optimization using Modified bat algorithm', International Journal of Engineering Research and Applications (IJERA), 2 (1): 870-876.

[21] Stebila, D. (2010), 'Reinforcing bad behavior: The misuse of security indicators on popular websites', Proceedings of the 22nd Conference of the Computer-Human Interaction, ACM Digital Library, pp. 248-251.

[22] Dhamija, R., Tygar, J. & Hearst, M. (2006), 'Why Phishing Works?', Proceedings of the conference on Human factors in Computing Systems (CHI-2006), ACM Digital Library, pp. 581-590.

[23] Singh, N. & Patil, M. (2014), "Identification of Phishing Web Pages and Target Detection", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 3 (2): 260-263.

[24] Kausar, F., Al-Otaibi, B., Al-Qadi, A. & Al-Dossari, N. (2014), 'Hybrid Client Side Phishing Websites Detection Approach', International Journal of Advanced Computer Science and Applications (IJACSA), 5 (7): 132-140.

[25] Choi, H., Zhu, B. & Lee, H. (2011), 'Detecting Malicious Web Links and Identifying Their Attack Types', Proceedings of the 2nd USENIX Conference on Web Application Development (WebApps '11), ACM Digital Library, pp. 11-11.

[26] Gupta, S. & Kumaraguru, P. (2014), 'Emerging Phishing Trends and Effectiveness of the Anti-Phishing Landing Page', Cornell University Library. Available at: http://arxiv.org/pdf/1406.3682.pdf [Accessed in August 2015].

[27] Ma, J., Saul, L., Savage, S. & Voelker, G. (2009), 'Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs', Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), ACM Digital Library, pp. 1245-1254.

[28] Whittaker, C., Ryner, B. & Nazif, M. (2010), 'Large-Scale Automatic Classification of Phishing Pages', Proceedings of 17th Annual NDSS Symposium 2010, Internet Society. Available at: http://www.internetsociety.org/sites/default/files/whit.pdf [Accessed in August 2015].

Department of Computer Science Institute of Finance Management Tanzania

nagunwa@ifm.ac.tz

(1) http://www.alexa.com/

(2) https://www.phishtank.com/index.php

(3) http://www.dmoz.org/

Table 1: Blocking rates of phishing URLs by anti-phishing plug-ins at 0 and 24 hrs [5]

Toolbar      Rate at 0 hr  Rate at 24 hr

Netcraft         81%            95%
TrustWatch       59%            80%
SpoofGuard       93%            95%
Cloudmark        40%            32%

Table 2: Attributes and dataset distribution.

Attribute                            % Phishing URLs  % Clean URLs

URL special words                         70.8            39.1
URL Obfuscation characters                50.0            30.4
IP addresses                              16.7             0.0
Hexadecimal numbers in the URL            33.3            22.9
Digits in a URL                           62.5            34.8
Shortened URL services                    18.5             0.0
Number of domains in a URL                29.2            12.1
Subdomain services                        12.5             0.0
EV certificate                           100.0            47.8
WhoIs DNS records                         29.2             0.0
Domains age                               50.0             0.0
Domain length                              8.3             0.0
URL length                                58.3            30.4
https inside the URL                      20.8             8.7
Redirection in the HTML hyperlink         70.8            13.0
OnMouseOver to hide the link               0.5             0.0
Disabled right click                       0.5             0.0
Redirection in the form handler           25.0             0.0
Pop up window                             20.8             8.7
Google Blacklist                          66.7             0.0
Ranking in Alexa database                 58.3             0.0
Top results of Google search              41.7             0.0
Top results of Yahoo search               41.7             0.0
Top results of Bing search                41.7             0.0
Phishing? (number of URLs)               384             506

Table 3: Classifier's performance with different learning algorithms.

Machine Learning Method  Accuracy (%)  FP (%)  FN (%)

k-NN                        97.75        2.0     0.0
Logistic Regression         99.89        0.0     0.1
Perceptron                  97.53        2.2     0.0
SVM                         99.89        0.0     0.1
Naive Bayes                 98.43        1.4     0.0
Random Forest               88.31       10.4     0.0
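To make the relation between the measures in Table 3 concrete, the short Python sketch below computes accuracy, FP and FN from confusion-matrix counts. The counts are hypothetical: they are chosen to be consistent with a dataset of 384 phishing and 506 clean URLs in which a single phishing URL is missed. Expressing FP and FN as a share of all URLs is also an assumption, since the normalization behind the published percentages is not stated here.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return accuracy, FP and FN as percentages of all classified URLs."""
    total = tp + fp + tn + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "fp": 100.0 * fp / total,
        "fn": 100.0 * fn / total,
    }


# Hypothetical counts: 384 phishing URLs (one missed), 506 clean URLs.
result = rates(tp=383, fp=0, tn=506, fn=1)
print({name: round(value, 2) for name, value in result.items()})
```

With these counts, 889 of 890 URLs are classified correctly, which rounds to the 99.89% accuracy reported for Logistic Regression and SVM.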

Table 4: Weight of each feature in predicting phishing URLs.

Feature                      Weight   Feature            Weight

Google Blacklist             -3.41    Yahoo search       -0.36
Alexa ranking                -2.65    Bing search        -0.36
EV certificate               -2.28    Subdomain Serv.    -0.215
Redirections in hyperlinks   -1.90    Port No            -0.20
Digits                       -1.35    Domain length      -0.17
Domain age                   -1.34    Form handler       -0.15
WhoIs                        -1.30    OnMouseOver        -0.05
No. Domains                  -1.13    Disabled R click   -0.03
Special words                -0.78    https Inside       -0.02
Shortened URLs               -0.78    URL length          0.05
URLs Characters              -0.59    Popup window        0.36
IP                           -0.55    Hexadecimal no      0.50
Google search                -0.36
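The ordering implied by Table 4 can be reproduced with a few lines of Python. The sketch below copies the weights from the table (the three search engine labels are written out as "search" for readability) and sorts them by signed value. Sorting by signed value rather than by magnitude is an assumption, chosen because it puts blacklist reputation and Alexa ranking first and popup window and hexadecimal number last, in line with the findings reported in this paper.

```python
# Weights copied from Table 4 (feature label -> weight).
weights = {
    "Google Blacklist": -3.41, "Alexa ranking": -2.65,
    "EV certificate": -2.28, "Redirections in hyperlinks": -1.90,
    "Digits": -1.35, "Domain age": -1.34, "WhoIs": -1.30,
    "No. Domains": -1.13, "Special words": -0.78,
    "Shortened URLs": -0.78, "URLs Characters": -0.59, "IP": -0.55,
    "Google search": -0.36, "Yahoo search": -0.36, "Bing search": -0.36,
    "Subdomain Serv.": -0.215, "Port No": -0.20, "Domain length": -0.17,
    "Form handler": -0.15, "OnMouseOver": -0.05,
    "Disabled R click": -0.03, "https Inside": -0.02,
    "URL length": 0.05, "Popup window": 0.36, "Hexadecimal no": 0.50,
}

# Sort ascending by signed weight (assumed ranking convention).
ranked = sorted(weights.items(), key=lambda item: item[1])

most_influential = [name for name, _ in ranked[:2]]
least_influential = [name for name, _ in ranked[-2:]]

print("Most influential:", most_influential)
print("Least influential:", least_influential)
```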

Table 5: Classifier's performance with different sets of features.

Category of features        Accuracy (%)  Error rate (%)  FP (%)  FN (%)

URL characteristics            91.12           8.88         4.2     3.7
Web page contents              76.74          23.26         9.6    11.1
Domain features                88.76          11.24         8.0     2.0
Domain reputation/ranking      96.4            3.6          3.2     0.0

Figure 1: Global growth of new phishing attacks 2010 - 2012 as detected by RSA [1].

Total 2010            187,203
Total 2011            258,461
Total 2012            445,004

Note: Table made from bar graph.
COPYRIGHT 2015 The Society of Digital Information and Wireless Communications
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2015 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation:uniform resource locator
Author:Nagunwa, Thomas
Publication:International Journal of Cyber-Security and Digital Forensics
Article Type:Report
Date:Oct 1, 2015
Words:6361