An immunological approach to computer viruses detection.ABSTRACT Within this digital era having large amount of data processed by the means of computer devices, computer security has become a necessity in protecting those assets. Since the first computer virus has been found, classical detection methods that depend on virus signature extraction has been used. In this paper we discuss the principles of artificial immune system An artificial immune system (AIS) is a type of optimisation algorithm inspired by the principles and processes of the vertebrate immune system. The algorithms typically exploit the immune system's characteristics of learning and memory to solve a problem. , introduces the terminology of malicious software and proposes an artificial immune antivirus system that can detect unknown viruses. Our system depends on simulating the biological immune system immune system Cells, cell products, organs, and structures of the body involved in the detection and destruction of foreign invaders, such as bacteria, viruses, and cancer cells. Immunity is based on the system's ability to launch a defense against such invaders. features and also uses a finite state machine See state machine. (mathematics, algorithm, theory) Finite State Machine - (FSM or "Finite State Automaton", "transducer") An abstract machine consisting of a set of states (including the initial state), a set of input events, a set of output events, and a state transition called Hidden Markov models that enables us to detect viruses according to according to prep. 1. As stated or indicated by; on the authority of: according to historians. 2. In keeping with: according to instructions. 3. their behaviour. Several experiments were conducted showing great results in virus detection. This system can be a substitute to classical detection methods. 1. INTRODUCTION Natural Biological systems act as a good source of inspiration to computational security specialists in this digital age. The immune system is from the biological systems that embody many features that are desirable for the imperfect, uncontrolled, and open environments in which most computers currently exist. It also includes Distributivity, Diversity, Disposability, Adaptability, Autonomy, Dynamic Coverage, Anomaly detection An approach to intrusion detection that establishes a baseline model of behavior for users and components in a computer system or network. Deviations from the baseline cause alerts that direct the attention of human operators to the anomalies. See IDS and anomaly. , multiple layers, Identity via behaviour, no trusted components, and imperfect detection (Charles A Janeway, 2001). Current studies and researches believe that understanding the components and algorithms of the immune system cause a production for better artificial computational systems to solve real-world problems in alternative disciplines such as computer science. For example, we can draw parallels between the problem faced by the immune system and that faced by computer security, hence deriving mechanisms and principles that enable us to design better computer security systems. Many scientists believe that this metaphor is useful because the immune system is a prime example of a computational system that successfully solves a complex real-world problem in a novel way. The claim that the immune system performs "computation" is controversial, but if we view computation as information processing information processing: see data processing. information processing Acquisition, recording, organization, retrieval, display, and dissemination of information. Today the term usually refers to computer-based operations. and storage, then the immune system certainly performs computation. It does this in a way that is scalable, flexible, robust and resilient to subversion and errors. 2. THE HUMAN IMMUNE SYSTEM 2.1 The Immune System as a Model Protecting our body from external pathogens and harmful diseases and the ability to eliminate these threats from the human body is the immune system main concern. In fact, security related software companies try to incorporate the immune system approaches and algorithms in their antivirus software See antivirus program. (tool) antivirus software - Programs to detect and remove computer viruses. The simplest kind scans executable files and boot blocks for a list of known viruses. products since there is a problem they face now in the existence of a wide range of viruses that vary in complexity. Here are some features in the immune system that make it very suitable to be a model for security applications (R. E. Vance, 2000). Adaptability: A very powerful feature in the immune system is the ability to adapt to viruses and microorganism microorganism /mi·cro·or·gan·ism/ (-or´gah-nizm) a microscopic organism; those of medical interest include bacteria, fungi, and protozoa. structures that are obscure or not known and retaining information about already known viruses in the immune memory. A computer immune system should be similarly adaptable, both learning to recognize new intrusions and remembering the signatures of previous attacks. Diversity: A feature that makes the system is harder to break. Since immune system diversity, makes a person's vulnerability to the same pathogen Pathogen Any agent capable of causing disease. The term pathogen is usually restricted to living agents, which include viruses, rickettsia, bacteria, fungi, yeasts, protozoa, helminths, and certain insect larval stages. differs from one person in a population to another so it will provide highly diverse pattern recognition, therefore the immune system can detect a wide range of harmful micro-organisms. In contrast with security breaches in computer systems, it is easy for an intruder to compromise other system easily if these systems have a common framework. This kind of problem is for example seen with Internet worms attached to e-mails, which are using the same kind of security flaw in a widely spread email reader See e-mail program. program to replicate itself to other systems. A layered system In telecommunication, a layered system is a system in which components are grouped, i.e., layered, in a hierarchical arrangement, such that lower layers provide functions and services that support the functions and services of higher layers. of defense: This feature makes the immune system more robust and strong in facing different types of diseases since every layer has its own protection strategy, the computer system can face different types of threats and viruses hence there isn't a certain mechanism that confers complete security. Rather, multiple layers of different mechanisms provide overall security when combined. Autonomy: The immune system is self-repairable since it replaces the damaged cells. In addition, it has the ability to classify and eliminate pathogens autonomously, although we cannot be sure that we will reach this level of independence from our computers due to the expansions in computer networks , existence of fast CPU's and the spread of mobile code, it will be increasingly important for computers to manage most security problems automatically. Disposability: Every part in the immune system is disposable which means that there is no single component of the human immune system is essential; Equilibrium of cell death and cell production manages this feature in the human body. Although we do not currently have self-reproducing hardware, death and reproduction at the process/agent, level is certainly possible and would have some advantages if we can control it. Distributivity: the immune cells are distributed all over the body so there is no central control takes place, which means there is no single point of failure. This feature can be used in security applications and a system overall failure is nearly impossible to happen. Identity via behaviour: Immune system has the ability to detect pathogens through their behaviour. Therefore, it does not need to know exact procedures or steps to detect a disease. Monitoring Sequences of system calls to detect abnormal behaviour of programs is analog to this feature. for example if it was found that a program tries to read a part of the computer memory and it doesn't have privilege or a program need to write some malicious content to the hard disk, then we start to doubt that this program is malicious and begin to take actions towards these programs. Anomaly detection: it is the ability of the immune system to detect unknown viruses in the human body and will be an appealing feature if incorporated in security application, until now there is no fully functional computational approach that can detect computer threats without the human intervention. 2.2 The Artificial Immune System Artificial immune systems (AIS) are adaptive systems, inspired by theoretical immunology immunology, branch of medicine that studies the response of organisms to foreign substances, e.g., viruses, bacteria, and bacterial toxins (see immunity). Immunologists study the tissues and organs of the immune system (bone marrow, spleen, tonsils, thymus, lymphatic and observed immune functions, principles and models, which are applied to problem solving problem solving Process involved in finding a solution to a problem. Many animals routinely solve problems of locomotion, food finding, and shelter through trial and error. (S. A. Hofmeyr and S. Forrest, 1999). It is important to remark that theoretical immunology refers to all mathematical and non mathematical mechanisms, principles, models and theories used to describe the functioning of the immune system. Immunology scientists traditionally described the human immune system as the problem of distinguishing "self" from dangerous "non-self" and eliminating dangerous nonself nonself /non·self/ (non´self) in immunology, pertaining to foreign antigens. non·self n. That which the immune system identifies as foreign to the body. . Similarly, the computer security problem is similar to distinguishing harmless program from malicious program such as computer virus or worm. Artificial immune system should have anomaly detection capability for facing unknown viruses. Adaptability is also a necessary property of artificial immune system to learn unknown malicious programs and to response quickly. Other properties such as Distributivity, Autonomy, Diversity and Disposability are required for the flexibility and stability of artificial immune system (S. A. Hofmeyr and S. Forrest, 2000). In figure (1) there is a proposed categorization for artificial immune system according to which computer component will be identified as a protected cell. As highlighted below in figure (1) two main categories that we are concerned about are: 1- Protecting Static Files (using Inactive Analysis) This category assumes the program's code as a cell. It is called inactive analysis because the static code for each program is fully analyzed through a system that uses Hidden Markov Models; we do not take into account newly installed programs. In addition, we use a decoy DECOY. A pond used for the breeding and maintenance of water-fowl. 11 Mod. 74, 130; S. C. 3 Salk. 9; Holt, 14 11 East, 571. program that is not changeable. If a virus modifies this program then there is an infection. Distributed Computer Virus Immune System combined Inactive analysis and decoy programs. It uses decoy program to trap viruses and employs inactive analysis to identify detected unknown virus (Richard Marko, 2002). [FIGURE 1 OMITTED] 2- Protecting Processes (using Active Analysis) This category assumes the process as a cell. In the active analysis we scan system call traces so we can detect a malicious program during its execution and this method compensate the defect found in the inactive analysis so we have a completer robust system for virus detection. 2.3 Immune Inspired Antivirus System Here we suggest a high- level view of the immune antivirus system and it's workflow which is depicted in figure(2) The main aim for a computer security system is the protection of the system from unauthorized intruders, which is analogous in functionality to the immune system in protecting the body from harmful substances. Further, a computer security system should protect against insider attacks, malfunctioning software (analogous to misbehaving cells) and other internal errors, maintaining the computer within normal operating tolerances. Because of the compelling similarity between the computer security problem and the problem of protecting a body against damage from internally and externally generated threats, we designed an artificial immune system to protect the computer system based on immunological principles, algorithms and architecture (Steven A. Hofmeyr, 1999). In figure (2) we illustrate the way that the Immune Antivirus System follows to detect and eliminate a virus or a threat through seven steps labeled on the figure: 1-A monitoring program on each PC uses a variety of heuristics based on HMMs, system calls analysis, behaviour, suspicious changes to programs, or family signature to infer that a virus may be present. The monitoring program forwards a copy of the suspected program to an administrative machine within the organization. 2-The administrative machine sends the sample to the central virus analysis machine in an encrypted format. 3-The central virus analysis machine creates an environment which makes the infected program can run safely for analysis. Techniques used for this purpose include emulation, or the creation of a protected environment that can execute and monitor the suspect program. The virus analysis machine then produces a prescription for identifying and removing the virus. 4-The central virus analysis machine sent back the resulting prescription to the administrative machine. 5-The administrative machine forwards the prescription to the infected client. 6-The central virus analysis machine forwards the prescription to other clients in the organization. 7-Subscribers around the world would receive regular antivirus updates that protect them from new viruses. [FIGURE 2 OMITTED] 3. MALICIOUS SOFTWARE 3.1 Malicious Programs Perhaps programs exploiting vulnerabilities present the most sophisticated types of threats to the computer system (S. Forrest and L. Allen, 1994). This section begins with an overview of the spectrum of such software threats. The reminder of the section is devoted to virus's nature, structure and countermeasures. As depicted below in figure (3) software threats comes into two categories: 1-Software threats that need a host program. 2-Software threats that is self replicating. The former are essentially fragments of programs that cannot exist independently of some actual application programs, utility, or system program. The latter are self-contained programs that can be scheduled and run by the operating system operating system (OS) Software that controls the operation of a computer, directs the input and output of data, keeps track of files, and controls the processing of computer programs. . Malicious programs that needs a host program performs its action according to a function done by the host program. Malicious programs that do not need a host consists of either a program fragment (virus) or an independent program (worm) that when executed, may reproduce on or more copies of itself to be activated later on the same system or other systems. Although the taxonomy taxonomy: see classification. taxonomy In biology, the classification of organisms into a hierarchy of groupings, from the general to the particular, that reflect evolutionary and usually morphological relationships: kingdom, phylum, class, order, of figure (3) is useful in organizing the information it is not the whole picture, in particular, logic bombs or Trojans horses may be part of a virus or worm. 3.2 The Nature of Viruses A virus is a program that can infect other programs by modifying them; the modification includes a copy of the virus program, which can then go on to infect other programs. Biological viruses are tiny scraps of genetic code DNA DNA: see nucleic acid. DNA or deoxyribonucleic acid One of two types of nucleic acid (the other is RNA); a complex organic compound found in all living cells and many viruses. It is the chemical substance of genes. or RNA RNA: see nucleic acid. RNA in full ribonucleic acid One of the two main types of nucleic acid (the other being DNA), which functions in cellular protein synthesis in all living cells and replaces DNA as the carrier of genetic that can take over the machinery of a living cell and trick it into making thousands of flawless replicas of the original virus. Like its biological counterpart, a computer virus carries in its instructional code the recipe for making perfect copies of itself, lodged in a host computer; the typical virus takes temporary control of the computer's disk operating system See DOS. 1. (operating system) Disk Operating System - (DOS) The original disk operating system from IBM. DOS was the low-end OS of choice on the IBM 360, the high-end system was called just "OS". . Then, whenever the infected computer encounters an uninfected piece of software, a fresh copy of the virus passes into the new program. Thus, unsuspecting users who either swap disks or send programs to each other over the network can spread the infection from one computer to another. In a network environment, the ability to access applications and system services on other computers provides a perfect culture for the virus to spread. [FIGURE 3 OMITTED] Four stages that are very important during the virus lifetime and we list them as follows: Silent stage: In this stage the virus still idle, it cannot take any action unless some event, such as a certain date, the existence of another program or file activates it. Not all viruses have this stage. Proliferation proliferation /pro·lif·er·a·tion/ (pro-lif?er-a´shun) the reproduction or multiplication of similar forms, especially of cells.prolif´erativeprolif´erous pro·lif·er·a·tion n. stage: The virus copies itself into other programs or inside system areas in important folders for the operating system on the disk. When other program is infected, it contains a clone of the virus, which will enter a proliferation phase. Release stage: The virus performs its function. As with the silent stage, a variety of system events, including a count of the number of times that the virus has made copies of itself inside other programs can cause the triggering of the virus and entering the release phase. Execution stage: The final stage is the execution stage in which the viruses do its job. 3.2.1 Virus Structure Viruses can propagate prop·a·gate v. 1. To cause an organism to multiply or breed. 2. To breed offspring. 3. To transmit characteristics from one generation to another. 4. by many ways but the traditional one is to pre-pending or post-pending to an executable program See executable code. , or embedding 1. (mathematics) embedding - One instance of some mathematical object contained with in another instance, e.g. a group which is a subgroup. 2. (theory) embedding - (domain theory) A complete partial order F in [X -> Y] is an embedding if to the program by other fashion. Figure (4) illustrates a virus structure. In this case, the virus code is pre-ended to the infected program and the entry point to the program when invoked is the first line of the virus code. [FIGURE 4 OMITTED] Here is an example that is shown in figure (4) that shows the work of the virus in which we assume that program [P.sub.1] is infected with the computer virus CV. When this program is invoked, control passes to the virus, which performs the following steps: 1-For each uninfected file [P.sub.2], the virus first compresses that file to produce [P'.sub.2], which is shorter that the original program by the size of the virus 2- A copy of the virus is pre-pended to the compressed Program. 3- The compressed version of the original infected program, [P'.sub.1] is uncompressed. 4- The uncompressed original program is executed. 3.3 Virus Countermeasures The best solution for protecting a computer system is to prevent viruses to get into the computer system; of course, this is an impossible goal to achieve so instead we suggest a sequence for detecting and removing viruses, which is the subject for the coming section. We summarize it as follows: 1-Detection: Once the infection has occurred, determination by using heuristic A method of problem solving using exploration and trial and error methods. Heuristic program design provides a framework for solving the problem in contrast with a fixed set of rules (algorithmic) that cannot vary. 1. methods (HMM HMM heavy meromyosin. ) is done and we locate the virus. 2-Identification: Here investigations made to know what does the virus made to the infected program and identifying modifications made by the virus to start eliminating it. 3-Removal: The final step is the removal for all traces of the virus from the infected program and restores the program to its original state. Afterwards we remove the virus from all infected systems so that it cannot spread further. 4. ARTIFICIAL IMMUNE ANTIVIRUS SYSTEM 4.1 System Assumptions There are some important steps in order to represent our clean files and to identify the infected files easily and we thought of these steps as assumption for running our system safely and smoothly and here are some assumptions: * Prior to our system setup, we must not connect the computer device to any network type such as (local area, internet, etc ...) * Install a new operating system in which we assure that system and user files are clean. * Install the Immune antivirus system. 4.2 Representing Self and Non-Self The Main aim of the system is to distinguish between self and non-self. The human immune system solves this problem by randomly generating millions of cells and afterwards killing those cells reacting to self. By this way the immune system holds a complete set of nonself, therefore all other substances which match with this set are considered to be nonself. The following steps illustrate the how the biological immune system detects nonself: 1-Random generation of receptors. 2-Killing receptors that react to self 3-Anything that matches the receptors must be nonself. Since the set of nonself in a computer system can be very large to be represented, the only choice is to try to represent self instead of nonself. In our system we don't randomly generate anything but just match with the set of self; here we show the steps to detect self from nonself. 1-Match with self 2-If it does not match with self it is nonself. An arising question is how we can identify self from nonself, since we detect viruses from their patterns of behaviour we can put a rule that self represents normal behaviour which is not harmful and nonself represents the abnormal behaviour which is harmful. As a result our computer immune system will maintain a list of normal behaviour for files and programs as self and anything that deviates from this profile will be regarded as nonself. 4.3 Matching Approach When the lymphocytes Lymphocytes Small white blood cells that bear the major responsibility for carrying out the activities of the immune system; they number about 1 trillion. in the human immune system recognize an infectious agent infectious agent Pathogen, see there , it is due to binding of peptides to the lymphocyte's receptors. The binding of the peptides to the lymphocyte's receptors is not a perfect match, but more like a loose match. The term known as affinity describes the strength of the binding between a peptide and a receptor. The stronger the binding between the peptide and the receptor is, the stronger a signal is sent to the core of the lymphocyte lymphocyte: see blood; immunity. lymphocyte Type of leukocyte fundamental to the immune system, regulating and participating in acquired immunity. Each has receptor molecules on its surface that bind to a specific antigen. . When enough receptors are bound, the signals to the core of the lymphocyte are so strong, then the lymphocyte will start to differentiate. The lymphocyte has now recognized an infectious agent. How should we simulate this kind of loose matching and the strength of the match, the affinity? What are we going to match, and what are our requirements to the matching algorithms? As illustrated in figure (5) we will decompose de·com·pose v. de·com·posed, de·com·pos·ing, de·com·pos·es v.tr. 1. To separate into components or basic elements. 2. To cause to rot. v.intr. 1. the information in a sequence of words, bytes or bits to speed up the matching algorithm and make it easier for the matching algorithm to work. We will assume that we are going to match sequences of discrete symbols, which could be anything like bits, bytes, chars, colours and so on. The output of the filter will be objects with references to the original bit stream, and symbol sequences, which we are going to match with. Through our research for finding different ways of doing the loose matching, we have found that the best matching approach for our problem is Hidden Markov Models. [FIGURE 5 OMITTED] 4.3.1 Hidden Markov Models Hidden markov models (HMMs) depend on the concept of the finite state machine where all the state transitions have fixed probabilities and every state in the model has a probability distribution Probability distribution A function that describes all the values a random variable can take and the probability associated with each. Also called a probability function. probability distribution for observing the symbols in a given state. In figure (6) we illustrate a HMM where the two symbols A and B can be observed in each of the three stats S1, S2, and S3 (Anders Krogh, 1998) With HMMs we are able to express that some state transitions are more likely to happen than others, for instance in figure (6) we can see that it is more likely to make a state transition from state S1 to S2 (3/5) than making a state transition from state S1 to S3 (1/5).Furthermore, with HMMs we are also able to express how likely it is to observe different kinds of symbols in each state, from figure (6) we can for instance see that it is more likely to observe symbol B (3/5) than it is to observe symbol A (2/5) in state S1. HMMs are often used in the area of speech recognition and biological sequence analysis searching for patterns in the DNA. Normally HMMs is randomly generated and then trained or re-estimated to maximize the probability of observing a certain sequence of symbols, the trained HMM is then used to match with other symbol sequences to figure out how well these match with the original one(Lawrence R. Rabiner,1989). There are some general notations we will use in the coming sections and here are their definitions: This is the observation sequence of symbols O = [O.sub.1] [O.sub.2] ......[O.sub.T] [lambda]: This is the Hidden markov Model A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. [FIGURE 6 OMITTED] 4.4 Inactive Analysis This type of analysis is used to detect abnormal behavior in programs due to the infection caused by viruses. It is very extensible to a wide range of virus types. We call it inactive because we analyze the static code of programs. The ideas behind inactive analysis is building a normal behavior profile for all normal and uninfected programs and keep watching these files to see if they will deviate from this profile. Building the normal behavior profile is done by training one or more HMMs on the uninfected programs; finally the resulting probability of observing the uninfected programs in the HMMs is saved. On the other side abnormal behavior is detected from the computation of saved probability for uninfected programs and the probability resulting from observed programs in the trained HMMs and therefore any deviation result from the trained HMMs will be reported as a change in the programs' behavior, hence we claim that this program is infection. The general procedure for Inactive analysis is divided into two parts: Training: 1. Train HMM [lambda] for program O. 2. Extract probability of observing program O in [lambda]: P(O|[lambda]). 3. Save [lambda] and P(O|[lambda]) in database. Detecting: 1. Calculate probability of observing possibly infected program O' in [lambda]: P(O'|[lambda]). 2. If P(O|[lambda]) [not equal to] P(O|[lambda]) the behavior of the program has changed. 4.4.1 Building a Normal Behaviour Profile For Building a normal behaviour Profile we will train HMMs on the code and data segments since they are our target at knowing how the suspected program behave, we have neglected the header of the program since they will not help us in building the profile quickly but will be a burden and an extra effort on HMM to compute every time while comparing probabilities. In order to build the Normal Behaviour profile we need to train the HMM on every program and to train one HMM on several programs, which will be conducted as follows: Training HMM on every program: 1. Train HMM [lambda] for program. 2. Extract probability of observing program O in [lambda]: P(O|[lambda]). 3. Save [lambda] and P(O|[lambda]) in database. The HMM together with the probability P(O|[lambda]) make up the normal behavior profile; this is our basic solution as we have illustrated above in the inactive analysis section. Training HMM on several programs: 1. Train HMM [lambda] for the programs 2. Calculate the probability of observing in [lambda]: P([O.sup.(1)]|[lambda]), P([Osup.(2)]|[lambda]), P([O.sup.(n)]|[lambda]). 3. Save [lambda] and P([O.sup.(1)|[lambda]), P([O.sup.(2)]|[lambda]), P([O.sup.(n)|[lambda]) in database. This Method will maximize the average probability of observing the programs in the HMM, the HMM [lambda] with the probabilities P([O.sup.(i)|[lambda]) make up the normal behavior profile for the ith program. 4.4.2 Detecting Abnormal Behaviour After we have built a normal behaviour profile by the method discussed in the above section, the next step is detecting abnormal behaviour for programs through these steps: 1-Compute the probability of observing the suspicious program O' in [lambda]. 2-If P(O'|[lambda]) [not equal to] P(O|[lambda]) then the behaviour of the program has changed. If we want to know about the exact matching between the program and its normal behaviour profile we calculate the match affinity as follows: [P.sub.ma] = P(O|[lambda])/P(O'|[lambda]) Where the infected program is O', O is the original program and [lambda] is the HMM trained for the original program. This detection strategy depends on the program's state, in which we can't do this process of comparing probabilities every few second but is only done when the program is in the write mode and some process need to write on it some information and eventually we will suspect this until we see it normal behaviour profile, then after the results we decide it is a virus or not. Therefore the cost for the detection operation is optimum and is done only on purpose. 4.5 Active Analysis The operating system depend heavily on some procedures called system calls, any program that need to do any action like reading, writing, coping must invoke a system call with an attribute for the needed operation. In order to computer viruses analysis more dynamic we scan traces of system calls to detect abnormal behaviour. This method is called active analysis in which we monitor every transaction done by the system calls; hence it is possible to catch viruses when trying to execute or injecting themselves inside other programs to complete their propagation stage. 4.5.1 Learning from Synthetic Behaviour Here we will try to make a simulating environment for programs in which they behave as if they are really working in real environment, this can be achieved by executing the program under normal circumstances with all kind of parameters in order to generate traces of system calls which will be used to train a HMM. For best results our system must be executed as we mentioned above, afterwards the average probability of observing normal behaving traces in the HMM is computed and saved together with the model. Finally we would like to train the HMM to be able to detect abnormalities for programs and we have devised a procedure for doing so in the following steps: 1-Execute a program with all different kind of parameters and obtain the traces of system calls generated: 2-Train HMM on these traces. 3-Compute the average probability of observing system call traces in the model [lambda]: [P.sub.average] = [n.summation summation n. the final argument of an attorney at the close of a trial in which he/she attempts to convince the judge and/or jury of the virtues of the client's case. (See: closing argument) over (i = 1)] P(O|[lambda])/n Where n is the number of generated traces of system calls. 4-Save [lambda] and [P.sub.average] in the database. 5-Obtain the traces of system calls O' that represent abnormal behaviour whenever the program is executed. 6-Compute the probability of observing O' in [lambda]. 7-Evaluate the match affinity: [P.sub.ma] = P(O|[lambda])/P(O|[lambda]) 8-If [P.sub.ma] is less than some threshold then this system call trace is thought to be abnormal. Finally we can use repeated abnormalities which is a technique adopted from the biological immune system to assure that the detected abnormal behaviour is due to a virus infection, and not just some rare program execution. In this approach we see if repeated abnormalities are detected within the traces of the same program then it is a virus infection because all subsequent traces will reflect the infection of the virus. On the contrary if single small abnormality occurs it is due to a program execution which was not simulated during the training period. 4.5.2 Learning from Real Behaviour In this section we will train the normal behaviour profile from the executing programs in real environments, the following steps show how to train HMM to detect abnormalities found in programs while they are running inside they real environment: 1-We will train the [lambda] for the first trace [O.sup.(1)] and initialize To start anew, which typically involves clearing all or some part of memory or disk. the average probability [P.sub.average] to P([O.sup.(1)]|[lambda]). 2-For every subsequent trace [O.sup.(i)] in [lambda] we will compute the probability of P([O.sup.(1)|[lambda]). [FIGURE 8 OMITTED] [FIGURE 9 OMITTED] 3-If P([O.sup.(i)]|[lambda]) differs a lot from [P.sub.average], the HMM model [lambda] is retrained to include [O.sup.(i)] then we set: [P.sub.average] = [i.summation over (j = 1)] P([O.sup.(j)|[lambda])/i 4-If the probability of the subsequent traces does not seem to differ a lot from the HMM has converged and we can stop training. 5-Finally we save [lambda] and in the database. Working in an uncontrolled environment is a big problem since a program might get infected while we are training our HMM, therefore we must monitor the environment very carefully while training since we want to be assured that there is no virus infected traces are used to train our HMM, prior to running our system we have proposed some assumptions to ensure that this problem wouldn't happen. So we must apply them before any detection or training takes place inside the system. 5. EXPERIMENTAL RESULTS This section discusses the output resulted from running our system. Graphs were generated to illustrate statistical results from using HMM Algorithms for virus detection and system training. This system is implemented in java programming language because of its compatibility and portability with a wide range of operating systems Operating systems can be categorized by technology, ownership, licensing, working state, usage, and by many other characteristics. In practice, many of these groupings may overlap. . Our Artificial immune system is deployed on a fresh copy of windows XP The previous client version of Windows. XP was a major upgrade to the client version of Windows 2000 with numerous changes to the user interface. XP improved support for gaming, digital photography, instant messaging, wireless networking and sharing connections to the Internet. professional edition; afterwards we inject a virus inside our system. Our virus relates to the family of file infector viruses and is named "Avpo". Avpo is a harmless virus but can do actions that tease the user and may exploit important information about the user's data, the Avpo virus hides itself in order not to be seen. After the user opens the device that contain the virus for example a USB USB in full Universal Serial Bus Type of serial bus that allows peripheral devices (disks, modems, printers, digitizers, data gloves, etc.) to be easily connected to a computer. device attached to the PC, Avpo begins to work do the following actions: 1-We can't see hidden folders and files. 2-The PC is running slowly, the virus affect the performance of running programs like the windows explorer See Explorer. and internet explorer Microsoft's Web browser, which comes with Windows starting with Windows 98. Commonly called "IE," versions for Mac and Unix are also available. Internet Explorer is the most widely used Web browser on the market. It has also been the browser engine in AOL's Internet access software. . 3-This virus is polymorphic polymorphic - polymorphism and can change its structure. 4-The virus adds itself to a file named autorun.inf and to a registry entry to replicate itself on system start up, or on its deletion from the system. [FIGURE 7 OMITTED] In the following sections we will estimate a log likelihood in order to observe a sequence in a HMM. The Probability of observing the sequence O in a HMM [lambda] is P(O|[lambda]). The log likelihood of observing the sequence O in a HMM [lambda] is given by log10 [P(O|[lambda])]. As mentioned before we use the HMMs to represent the normal behavior profile of programs. This means that we are training the HMM on sequence representing the program's behavior. To train a HMM for one or more sequence we use a reestimation formulae that maximize the log likelihood of observing the sequence in HMM. The reestimation formulae below will run about 500 iterations to reach our threshold. log10 [P(O|[[lambda].sub.i + 1]) - log10 [P(O|[[lambda].sub.i]) < [10.sup.-3] 5.1 Results for building the normal Behaviour profile for Single Program with Multiple HMMs In this section we have trained 29 HMMs states on a single program. The training with this number of states is done to show how it will affect the normal behaviour profile. The experiments illustrated in this section and all coming sections is implemented on a Pentium (4) 3.2 GHz processor power with hyper threading technology running windows XP professional edition on a software that emulates all the work for the experiments as if in real mode. Figure (7) shows the training of the 29 HMMs for the ntdetect.com file in which our virus try to overwrite (1) A data entry mode that writes over existing characters on screen when new characters are typed in. Contrast with insert mode. (2) To record new data on top of existing data such as when a disk record or file is updated. , the ntdetect.com file is a component of Microsoft Windows See Windows. (operating system) Microsoft Windows - Microsoft's proprietary window system and user interface software released in 1985 to run on top of MS-DOS. Widely criticised for being too slow (hence "Windoze", "Microsloth Windows") on the machines available then. NT operating systems that operate on the x86 architectures. It is used during the Windows NT (Windows New Technology) A 32-bit operating system from Microsoft for Intel x86 CPUs. NT is the core technology in Windows 2000 and Windows XP (see Windows). Available in separate client and server versions, it includes built-in networking and preemptive multitasking. start-up process, and is responsible for detecting basic hardware that will be required to start the operating system. The size of this file is 47 Kb and the virus uses this file to inject itself to every storage device on the system. Figure (8) illustrates the log likelihood as a function over the number of states in HMM; we can see the improvement in the log likelihood function when the number of states increases. 5.2 Results for detecting abnormal behaviour in a Program As we have trained 29 HMMs on our ntdetect.com file before the virus Avpo infected it, figure (9) shows how the 29 HMM states will do towards detecting the abnormal behaviour in the ntdetect.com file with respect to the log likelihood and the speed of detection. 5.4 Detecting Viruses using Traces of System Calls In this section we show the experiments done with detecting abnormal behaviour through system call traces with a program named "API (Application Programming Interface) A language and message format used by an application program to communicate with the operating system or some other control program such as a database management system (DBMS) or communications protocol. SPY" that trace all system calls and then write then to a text file (S. A. Hofmeyr, S. Forrest and A. Somayaji, 2005). Figure (10) shows the time it took to train 29 HMMs on 41 binary traces of system calls. In figure (11) we see the average log likelihood of observing the 41 binary traces with respect to HMM states. [FIGURE 8 OMITTED] The above results have convinced us to use traces of system calls generated by a program to train an HMM in order to represent the normal behaviour of a system or a user program. The experiments also showed that the deviation from the normal behaviour profile of the tested program gets very smaller. 6. CONCLUSION [FIGURE 9 OMITTED] By using the human immune system as our source for inspiration we have built an artificial immune system for computer viruses detection that resembles the function in the biological immune system from both innate and adaptive immunity, In addition our system nearly covers most features like distinguishing self from nonself, detecting normal and abnormal behaviour in programs . Finally statistical results from the experiments showed that using hidden markov model is the best approach for detecting abnormal behaviour in programs for both Active and inactive analysis for protecting our files and system processes from attack by malicious programs. [FIGURE 10 OMITTED] [FIGURE 11 OMITTED] 7. REFERENCES D. Dasgupta and F. Nino. "A comparison of negative and positive selection algorithms in novel pattern detection", in Proc. IEEE (Institute of Electrical and Electronics Engineers, New York, www.ieee.org) A membership organization that includes engineers, scientists and students in electronics and allied fields. Int. Conf. Systems, Man, and Cybernetics cybernetics [Gr.,=steersman], term coined by American mathematician Norbert Wiener to refer to the general analysis of control systems and communication systems in living organisms and machines. , vol. 1, Oct. 2000, pp. 125-130 S. Forrest, L. Allen, A. S. Perelson, and R. Cherukuri. (May 1994), "Self-nonself discrimination in a computer", in Proc IEEE Symp Research in Security and Privacy, pp. 202-212. S. A. Hofmeyr, S. Forrest. (2000), "Architecture for an artificial immune system", Evolutionary Computation evolutionary computation - Computer-based problem solving systems that use computational models of evolutionary processes as the key elements in design and implementation. Journal, 8(4):443-473. S. A. Hofmeyr, S. Forrest, A. Somayaji. (1998), "Intrusion detection See IDS and IPS. using sequences of system calls", Journal of Computer Security, 6:151-180. S. A. Hofmeyr, S. Forrest. (1999), "Immunizing Computer Networks: Getting All the Machines in Your Network to Fight the Hacker Disease", In IEEE Symposium on Security & Privacy. S. A. Hofmeyr, S. Forrest, A. Somayaji. (June 2005), "Intrusion Detection using Sequences of System Calls", In Journal of Computer Security. Steven A. Hofmeyr. (April 1999), "An Immunological Model of Distributed Detection and its Application to Computer Security", PhD thesis, Department of Computer Sciences, University of New Mexico The University of New Mexico (UNM) is a public university in Albuquerque, New Mexico. It was founded in 1889. It also offers multiple bachelor's, master's, doctoral, and professional degree programs in all areas of the arts, sciences, and engineering. . Charles A Janeway, Paul Travers, Mark Walport, Mark Shlomchik. (2001), "Immunobiology: The Immune System in Health and Disease", 5th Ed., Garland Publishing. Anders Krogh. (1998), "An Introduction to Hidden Markov Models for Biological Sequences", Computational Methods in Molecular Biology molecular biology, scientific study of the molecular basis of life processes, including cellular respiration, excretion, and reproduction. The term molecular biology was coined in 1938 by Warren Weaver, then director of the natural sciences program at the Rockefeller , Elsevier, Steve R. White, Jeffrey O. Kephart, David M. Chess. (1995), "Computer Viruses: A Global Perspective", In Proceedings of the Fifth International Virus Bulletin Conference, Boston, pages 185-191. Lawrence R. Rabiner. (February 1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, Richard Marko. (2002), "Heuristics Retrospective and Future", Virus Bulletin Conference. W. W. Cohen cohen or kohen (Hebrew: “priest”) Jewish priest descended from Zadok (a descendant of Aaron), priest at the First Temple of Jerusalem. The biblical priesthood was hereditary and male. . (1995), "Fast effective rule induction Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data. In Machine Learning", the 12th International Conference", (Ed. Morgan Kaufmann). R. E. Vance. (2000), "Cutting edge commentary: A Copernican revolution The Copernican Revolution refers to the paradigm shift away from the Ptolemaic model of the heavens, which placed Earth at the center of the Universe. It was one of the starting points for the Scientific Revolution of the 17th Century. ? Doubts about the danger theory", Journal of Immunology The Journal of Immunology (The JI) is an academic journal that publishes basic and clinical studies in all aspects of immunology. It is owned and published by The American Association of Immunologists. Having an impact factor of 6. , 165:1725-1728, Moataz Samy Mottaz@gmail.com Dept. of Computer Science Faculty of Computers and Information Cairo University Cairo University (previously the Egyptian University and later Fouad the First University) is an institute of higher education located in Giza, Egypt. The university was founded on December 21, 1908 as the result of an effort to establish a national center for Amr Badr ruaab@rusys.eg.net Dept. of Computer Science Faculty of Computers and Information Cairo University Salwa El Gamal s.elgamal@fci-cu.edu.eg Dept. of Computer Science Faculty of Computers and Information Cairo University |
|
||||||||||||||||||

Printer friendly
Cite/link
Email
Feedback
Reader Opinion