TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
“INTELLIGENT NETWORK INTRUSION DETECTION SYSTEM”
By:
Puneet Khanal
Rajiv Shrestha
Raju KC
A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND COMPUTER
ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE
BACHELOR’S DEGREE IN ELECTRONICS & COMMUNICATION / COMPUTER
ENGINEERING
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
March, 2010
LETTER OF APPROVAL
The undersigned certify that they have read, and recommend to the Institute of
Engineering for acceptance, a project report entitled “Intelligent Network Intrusion
Detection System” submitted by Puneet Khanal, Rajiv Shrestha and Raju KC in partial
fulfillment of the requirements for the degree “Bachelor of Computer Engineering”.
______________________________ ______________________________
Project Supervisor Project Supervisor
Babu Ram Dawadi Manoj Ghimire
Assistant Professor Lecturer
Department of Electronics and Computer Department of Electronics and
Engineering Computer Engineering
______________________________ ______________________________
Internal Examiner External Examiner
Purushottam Sigdel Krishna Prasad Bhandari
Director Senior Engineer
Center for Information Technology Nepal Telecom
________________________________
Project Coordinator and Deputy Head
Surendra Shrestha, Ph.D.
Department of Electronics and Computer Engineering
Institute of Engineering
DATE OF APPROVAL: 17th March, 2010
COPYRIGHT
The author has agreed that the Library, Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
available for inspection. Moreover, the author has agreed that permission for extensive
copying of this project report for scholarly purpose may be granted by the supervisors who
supervised the project work recorded herein or, in their absence, by the Head of the
Department wherein the project report was done. It is understood that the recognition will
be given to the author of this report and to the Department of Electronics and Computer
Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
project report. Copying, publication, or other use of this report for financial gain
without the approval of the Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited.
Request for permission to copy or to make any other use of the material in this report in
whole or in part should be addressed to:
Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal
ACKNOWLEDGEMENT
We are sincerely thankful to the Department of Electronics and Computer Engineering for
providing the opportunity to do this project.
We are indebted to our supervisors Mr. Babu Ram Dawadi and Mr. Manoj Ghimire for their
valuable suggestions and constant guidance for the accomplishment of the project. Besides,
we are also thankful to the Project Coordinator Mr. Surendra Shrestha for assisting and
guiding us in the project.
Last but not least, we are thankful to our friends and teachers who supported us all the
way in the course of the project.
ABSTRACT
Network Intrusion Detection Systems (NIDS) aim to prevent network attacks and
unauthorized remote use of computers. Depending on the kind of attack it targets, an NIDS
can be oriented toward detecting misuse (by defining all possible attacks) or anomalies
(by modeling legitimate behavior and flagging traffic that does not fit that model).
Because the problem knowledge of a misuse detector is restricted to known attacks, misuse
detection fails to notice anomalies, and vice versa. To address this, we present the
Intelligent Network Intrusion Detection System (INIDS), a combined misuse and anomaly
detection system based on the Naive Bayes Classifier and trained on KDDCup’99 traffic
data. It analyzes network packets in full and follows a strategy for building a consistent
knowledge model that integrates misuse-based and anomaly-based knowledge.
Finally, we evaluate INIDS against well-known and new attacks, showing how it outperforms
a well-established industrial NIDS.
TABLE OF CONTENTS
LETTER OF APPROVAL
COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS AND ABBREVIATIONS
1 INTRODUCTION
1.1 What is an IDS?
1.2 What is not an IDS?
1.3 Attack Types
1.4 Existing System
1.5 Problem Statement
1.6 Objectives
1.7 Scope of the Project
2 LITERATURE REVIEW
2.1 The TCP/IP Reference Model
2.1.1 Internet Protocol (IP)
2.1.2 Internet Control Message Protocol (ICMP)
2.1.3 User Datagram Protocol (UDP)
2.1.4 Transmission Control Protocol (TCP)
2.2 Naive Bayes Classifier
2.3 Some Well-Known Attacks
2.3.1 DoS
2.3.2 Probe
2.4 jNetPcap
2.5 jSMILE
3 SYSTEM DESIGN
3.1 System Block Diagram
3.2 Data Flow Diagrams (DFDs)
3.3 Unified Modeling Language (UML)
6 TESTING
6.1 Level of Testing
6.2 Software Testing Strategies
7 RESULT
7.1 Screenshots
7.2 Comparison with Other Existing System
8 CONCLUSIONS AND FURTHER WORK
8.1 Conclusions
8.2 Further Work
REFERENCES
APPENDIX A: RFCs
APPENDIX B: UDP and TCP Ports
APPENDIX D: CD Contents
LIST OF FIGURES
Figure 2.1 TCP/IP Internet Model
Figure 2.2 IP Header Format
Figure 2.3 ICMP Header Format
Figure 2.4 UDP Header Format
Figure 2.5 TCP Header Format
Figure 2.6 Smurf attack
Figure 3.1 System Block Diagram
Figure 3.2 Level-0 DFD
Figure 3.3 Level-1 DFD
Figure 3.4 Level-2 DFD
Figure 3.5 Use Case Diagram
Figure 7.1 Naive Bayes Classifier
Figure 7.2 GUI Layout
Figure 7.3 Detection of normal packets only
Figure 7.4 Detection of anomalous packets only
Figure 7.5 Detection of both normal and anomalous packets
Figure 7.6 Accuracy of known attack
Figure 7.7 Accuracy of unknown attack
Figure 7.8 Ease of Use
LIST OF TABLES
Table 2.1 Types of Service
Table 2.2 Description of flags in the control field
Table A.1 RFCs for each protocol
Table B.1 List of UDP and TCP ports
Table C.1 List of permitted ICMP messages
LIST OF SYMBOLS AND ABBREVIATIONS
∏ Product
ACK Acknowledgment
API Application Programming Interface
DFDs Data Flow Diagrams
DNS Domain Name System
DoS Denial-of-Service
DS Dataset
DSCP Differentiated Services Code Point
GUI Graphical User Interface
HIDS Host-based Intrusion Detection System
ICMP Internet Control Message Protocol
IDS Intrusion Detection System
INIDS Intelligent Network Intrusion Detection System
IP Internet Protocol
NIDS Network Intrusion Detection System
OS Operating System
TCP Transmission Control Protocol
TCP/IP Transmission Control Protocol / Internet Protocol
TOS Type of Service
TTL Time to Live
UDP User Datagram Protocol
1. INTRODUCTION
Nowadays, as more people make use of the Internet, their computers and the valuable data
stored on them become a more interesting target for intruders. Attackers scan the Internet
constantly, searching for potential vulnerabilities in the machines that are connected to
the network. Intruders aim to gain control of a machine and to insert malicious code into
it. Later on, using these compromised machines (also called zombies), the intruder may
launch attacks such as worm attacks, Denial-of-Service (DoS) attacks and probing attacks.
1.1. What is an IDS?
Intrusion is any set of actions that threaten the integrity, availability, or confidentiality of a
network resource. An intrusion detection system (IDS) monitors network traffic for
suspicious activity and alerts the system or network administrator. In some cases the IDS
may also respond to anomalous or malicious traffic by taking action such as blocking the
user or source IP address from accessing the network.
IDS come in a variety of “flavors” and approach the goal of detecting suspicious traffic in
different ways. There are network based (NIDS) and host based (HIDS) intrusion detection
systems.
a) NIDS: Network Intrusion Detection Systems (NIDS) are a subset of security
management systems that are used to discover inappropriate, incorrect, or anomalous
activities within networks.
b) HIDS: Host-based intrusion detection system (HIDS) monitors and analyzes the
internals of a computing system rather than the network packets on its external interfaces.
Some IDS detect threats by looking for specific signatures of known threats, similar to
the way antivirus software typically detects and protects against malware, while others
compare traffic patterns against a baseline and look for anomalies.
a) Signature Based: A signature based IDS will monitor packets on the network and
compare them against a database of signatures or attributes from known malicious threats.
This is similar to the way most antivirus software detects malware. The issue is that there
will be a lag between a new threat being discovered in the wild and the signature for
detecting that threat being applied to the IDS. During that lag time, the IDS would be
unable to detect the new threat. The limitation of this approach lies in its dependence on
frequent updates of the signature database and its inability to generalize and detect novel or
unknown intrusions.
b) Anomaly Based: An anomaly-based IDS monitors network traffic and compares it against
an established baseline. The baseline identifies what is “normal” for that network (what
sort of bandwidth is generally used, what protocols are used, and what ports and devices
generally connect to each other), and the IDS alerts the administrator or user when it
detects traffic that is anomalous, or significantly different from the baseline.
However, statistical anomaly detection is not based on an adaptive intelligent model and
cannot learn from normal and malicious traffic patterns.
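The baseline comparison described above can be illustrated with a minimal sketch. It is not the detection method used in this project; it assumes a single numeric feature (for example, packets per second), a hypothetical function name is_anomalous, and an arbitrary threshold of three standard deviations.

```python
import statistics


def is_anomalous(baseline, observed, k=3.0):
    """Flag a value that lies more than k standard deviations from the baseline mean.

    baseline: past measurements of one feature (at least two values).
    k: hypothetical sensitivity threshold, chosen here only for illustration.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return abs(observed - mean) > k * stdev


# Packets per second seen during normal operation (made-up numbers).
normal_rates = [100, 110, 90, 105, 95]
print(is_anomalous(normal_rates, 102))
print(is_anomalous(normal_rates, 500))
```

A fixed statistical threshold like this one does not adapt to new traffic patterns, which is the limitation noted above.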
There are IDS that simply monitor and alert and there are IDS that perform an action or
actions in response to a detected threat.
a) Passive IDS: A passive IDS simply detects and alerts. When suspicious or malicious
traffic is detected an alert is generated and sent to the administrator or user and it is up to
them to take action to block the activity or respond in some way.
b) Reactive IDS: Reactive IDS will not only detect suspicious or malicious traffic and
alert the administrator, but will take pre-defined proactive actions to respond to the threat.
Typically this means blocking any further network traffic from the source IP address or
user.
Intrusion detection systems help network administrators prepare for and deal with network
security attacks. These systems collect information from a variety of systems and network
sources, and analyze them for signs of intrusion and misuse. A variety of techniques have
been employed for analysis ranging from traditional statistical methods to new machine
learning approaches.
1.2. What is not an IDS?
Contrary to popular marketing belief and terminology employed in the literature on
intrusion detection systems, not everything falls into this category. In particular, the
following security devices are not IDS:
Network logging systems, for example network traffic monitoring systems.
Anti-virus products designed to detect malicious software such as viruses, trojan
horses, worms, logic bombs.
Firewalls.
Security/cryptographic systems, for example VPN, SSL, S/MIME, Kerberos,
Radius etc.
1.3. Attack Types
Attacks can be classified into three types. They are as follows:
a) Reconnaissance: These attacks involve the gathering of information about a system in
order to find its weaknesses such as port sweeps, ping sweeps, port scans, and Domain
Name System (DNS) zone transfers.
b) Exploits: These attacks take advantage of a known bug or design flaw in the system.
c) Denial-of-Service (DoS): These attacks disrupt or deny access to a service or resource.
1.4. Existing System
One of the best-known and most widely used intrusion detection systems is the open
source, freely available Snort. It is available for a number of platforms and operating
systems including both Linux and Windows. Snort has a large and loyal following and
there are many resources available on the Internet where we can acquire signatures for
detecting the latest threats.
1.5. Problem Statement
The classical signature-based approach:
Cannot detect unknown or new intrusions.
Patches and regular updates are required.
The statistical anomaly-based approach:
Not based on an adaptive intelligent model.
Cannot learn from normal and malicious traffic patterns.
An alternative approach based on machine learning must be developed.
1.6. Objectives
To implement intrusion detection system using Naïve Bayes Classifier,
To protect secure information of an organization from outside and inside intruders,
To detect novel or unknown intrusions in real-time.
1.7. Scope of the Project
Increased network complexity, greater access, and a growing emphasis on the Internet have
made network security a major concern for organizations. The number of computer
security breaches has risen significantly in the last three years. In February 2000, several
major web sites including Yahoo, Amazon, eBay, Datek, and E*Trade were shut down
due to denial-of-service attacks on their web servers.
Today, a large amount of sensitive information is processed through computer networks,
thus it is increasingly important to make information systems, especially those used for
critical functions in the military and commercial sectors, resistant and tolerant to network
intrusions. Hence Intrusion Detection has become an integral part of the information
security process.
2. LITERATURE REVIEW
2.1. The TCP/IP Reference Model
The TCP/IP model is a multi-layered architecture. This means that one piece of
functionality runs at one layer, another at the next layer, and so forth. We can add new
functionality to the application layer, for example, without having to re-implement the
whole TCP/IP stack code, or to include a complete TCP/IP stack in the actual application.
The following four layers comprise the TCP/IP Internet model:
a) Application layer
Handles implementation of user applications.
b) Transport layer
Manages end-to-end communications between hosts.
The two transport layer protocols are TCP and UDP.
c) Network layer
Gets data from source to destination.
d) Link layer
Manages data transfer to and from physical medium.
Figure 2.1 TCP/IP Internet Model
2.1.1. Internet Protocol (IP)
The IP protocol resides in the network layer. It is an unreliable and connectionless
datagram protocol, that is, a best-effort delivery service. The term best-effort means that
IPv4 provides no error control or flow control (except for error detection on the header).
IPv4 assumes the unreliability of the underlying layers and does its best to get a
transmission through to its destination, but with no guarantees. If reliability is important,
IPv4 must be paired with a reliable protocol such as TCP.
IP Header
A datagram is a variable-length packet consisting of two parts: header and data.
The header is 20 to 60 bytes in length and contains information essential to routing and
delivery. The header has a 20-byte fixed part and a variable length optional part of
maximum of 40-bytes. The header format is shown below:
Figure 2.2 IP Header Format
Version (VER): This four-bit field gives the version of the IP protocol. For IPv4 the
value is 4 (binary 0100).
Header Length (HLEN): This four-bit field defines the total length of the datagram
header in four-byte words. This field is needed because the length of the header is variable
(between 20 and 60 bytes). When there are no options, the header length is 20 bytes, and
the value of this field is five (5 x 4 = 20). When the option field is at its maximum size, the
value of this field is 15 (15 x 4 = 60).
Service: This has two interpretations. They are:
a) Service Type
In this interpretation, the first three bits are called precedence bits. The next four bits are
called type of service (TOS) bits, and the last bit is not used.
Table 2.1 Types of Service
TOS Bits Description
0000 Normal (default)
0001 Minimize cost
0010 Maximize reliability
0100 Maximize throughput
1000 Minimize delay
b) Differentiated Services
According to this standard, bits [0-5] are the Differentiated Services Code Point (DSCP).
The remaining two bits [6-7] were originally unused and are now assigned to Explicit
Congestion Notification (ECN).
Total Length: This field defines the total length (header plus data) of the IPv4 datagram in
bytes. The maximum size is 65535 octets, or bytes, for a single packet.
Identification: This field is used in reassembly of fragmented packets.
Flags: This three-bit field is used in fragmentation. The first bit is reserved and must be
set to zero. The second bit (do not fragment) is set to zero if the packet may be fragmented
and to one if it may not be fragmented. The third bit (more fragments) is set to zero if this
is the last fragment and to one if more fragments of the same packet follow.
Fragmentation Offset: This field tells where in the original datagram this fragment
belongs. The offset is measured in units of 8 bytes (64 bits), and the first fragment has
offset zero.
Time to Live: The TTL field defines how long the packet may live, or rather how many
"hops" it may take over the Internet. After processing the datagram, each router
decrements this number by one. If this value, after being decremented, is zero, the router
discards the datagram.
Protocol: This field indicates the protocol of the next higher layer, for example TCP,
UDP or ICMP.
Checksum: This field is used for error detection.
Source Address: This field contains the source address.
Destination Address: This field contains the destination address.
Option: If the Header Length is greater than five, it means that the Options field is present
and must be considered. The options field contains different optional settings such as
Internet timestamp, record route or source route options.
Padding: This field is used to make the header end at an even 32 bit boundary. The field
must always be set to zeroes straight through to the end.
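To make the field layout concrete, the following is a minimal sketch of how the fixed 20-byte part of an IPv4 header could be decoded. It is only an illustration in Python and not the project's packet decoder; the names IPv4Header and parse_ipv4_header are hypothetical, and options are ignored.

```python
import struct
from dataclasses import dataclass


@dataclass
class IPv4Header:
    version: int
    header_len_bytes: int
    total_length: int
    identification: int
    dont_fragment: bool
    more_fragments: bool
    fragment_offset_bytes: int
    ttl: int
    protocol: int
    src: str
    dst: str


def parse_ipv4_header(packet: bytes) -> IPv4Header:
    """Decode the fixed 20-byte part of an IPv4 header (options are not parsed)."""
    if len(packet) < 20:
        raise ValueError("an IPv4 header needs at least 20 bytes")
    (ver_hlen, _service, total_length, identification, flags_offset,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return IPv4Header(
        version=ver_hlen >> 4,                       # upper four bits
        header_len_bytes=(ver_hlen & 0x0F) * 4,      # HLEN counts four-byte words
        total_length=total_length,
        identification=identification,
        dont_fragment=bool(flags_offset & 0x4000),   # second flag bit
        more_fragments=bool(flags_offset & 0x2000),  # third flag bit
        fragment_offset_bytes=(flags_offset & 0x1FFF) * 8,  # offset is in 8-byte units
        ttl=ttl,
        protocol=protocol,
        src=".".join(str(b) for b in src),
        dst=".".join(str(b) for b in dst),
    )


# A hand-built sample header: TCP (protocol 6) from 192.168.0.1 to 192.168.0.199.
sample = bytes.fromhex("4500003c1c4640004006b1e6c0a80001c0a800c7")
print(parse_ipv4_header(sample))
```

The sample header sets HLEN to five, so the decoded header length is 5 x 4 = 20 bytes, matching the calculation given for the Header Length field.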
2.1.2. Internet Control Message Protocol (ICMP)
The Internet Control Message Protocol (ICMP) gives important information about the
health of the network.
Types of Messages
ICMP messages are divided into two broad categories:
a) error-reporting messages, and
b) query messages.
The error-reporting messages report problems that a router or a host (destination) may
encounter when it processes an IP packet. Five types of errors are handled: destination
unreachable, source quench, time exceeded, parameter problems, and redirection. The
query messages, which occur in pairs, help a host or a network manager get specific
information from a router or another host. For example, nodes can discover their
neighbors. Also, hosts can discover and learn about routers on their network, and routers
can help a node redirect its messages. The four types of query messages are echo request
and reply, timestamp request and reply, address-mask request and reply, and router
solicitation and advertisement.
ICMP Header
8-bits 8-bits 16-bits
Type Code Checksum
Rest of the header
Data Sections
Figure 2.3 ICMP Header Format
ICMP Header Field Description
Type: The type field contains the ICMP type of the packet. Each kind of ICMP message
has its own type value.
Code: All ICMP types can contain different codes as well. Some types only have a single
code, while others have several codes that they can use.
Checksum: This field is used for error detection.
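The Type, Code and Checksum fields can be illustrated with a short sketch. The checksum shown is the standard Internet checksum (the one's-complement sum of 16-bit words). The function names are hypothetical and the code is not taken from the project; it builds an echo request (type 8, code 0) and then verifies it.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement of the one's-complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo request (type 8, code 0) with a valid checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


def parse_icmp_header(message: bytes):
    """Return (type, code, checksum_ok) for an ICMP message."""
    icmp_type, code, _checksum = struct.unpack("!BBH", message[:4])
    # A message with a correct checksum sums to zero when checksummed as a whole.
    return icmp_type, code, internet_checksum(message) == 0


print(parse_icmp_header(build_echo_request(1, 1, b"ping")))
```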
2.1.3. User Datagram Protocol (UDP)
The User Datagram Protocol (UDP) is called a connectionless, unreliable transport
protocol. It does not add anything to the services of IP except to provide process-to-
process communication instead of host-to-host communication. Also, it performs very
limited error checking.
If UDP is so powerless, why would a process want to use it? With the disadvantages come
some advantages. UDP is a very simple protocol using a minimum of overhead. If a
process wants to send a small message and does not care much about reliability, it can use
UDP.
UDP Header
The UDP header can be seen as a very basic and simplified TCP header. It contains the
source port, destination port, total length and a checksum as seen in the image below.
16-bits 16-bits
Source Port Destination Port
Total Length Checksum
Figure 2.4 UDP Header Format
UDP Header Field Description
Source Port: This field indicates the port number used by the process running on the
source host. It is 16-bits long. The port number can range from 0 to 65,535.
Destination Port: This field indicates the port number used by the process running on the
destination host. It is also 16-bits long.
Total Length: The length field specifies the length of the whole packet (header and data
portions).
Checksum: This field is used to detect errors over the entire user datagram (header plus
data).
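Because the UDP header consists of just four 16-bit fields, decoding it takes a single unpacking step. The sketch below is illustrative only, and parse_udp_header is a hypothetical name.

```python
import struct


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the 8-byte UDP header that precedes the payload."""
    if len(datagram) < 8:
        raise ValueError("a UDP header needs 8 bytes")
    src_port, dst_port, total_length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "total_length": total_length,      # header plus data, in bytes
        "checksum": checksum,
        "payload": datagram[8:total_length],
    }


# A hand-built datagram: 8-byte header plus a 4-byte payload sent to port 53.
sample = struct.pack("!HHHH", 5353, 53, 12, 0) + b"data"
print(parse_udp_header(sample))
```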
2.1.4. Transmission Control Protocol (TCP)
TCP, like UDP, is a process-to-process (program-to-program) protocol. TCP, therefore,
like UDP, uses port numbers. Unlike UDP, TCP is a connection-oriented protocol; it
creates a virtual connection between two TCPs to send data. In addition, TCP uses flow
and error control mechanisms at the transport level. In brief, TCP is called a
connection-oriented, reliable transport protocol. It adds connection-oriented and
reliability features to the services of IP.
TCP Header
32-bits
Source Port Address (16-bits) Destination Port Address (16-bits)
Sequence Number (32-bits)
Acknowledgment Number (32-bits)
HLEN (4-bits) Reserved (6-bits) URG ACK PSH RST SYN FIN Window Size (16-bits)
Checksum (16-bits) Urgent Pointer (16-bits)
Options and Padding
Figure 2.5 TCP Header Format
TCP Header Field Description
Source Port: This field indicates the source port of the packet. The source port is directly
bound to the process on the sending system.
Destination Port: This field indicates the destination port of the TCP packet. Just as with
the source port, this port is directly bound to the process on the receiving system.
Sequence Number: This field numbers the first byte of data carried in each TCP segment so
that the TCP stream can be properly sequenced. The receiver uses it to work out the value
it returns in the Acknowledgment Number field.
Acknowledgement Number: This field is used to acknowledge data a host has received. For
example, when we receive a segment and everything is okay with it, we reply with an ACK
whose Acknowledgment number is set to the sequence number of the next byte we expect,
which confirms receipt of every byte before it.
Header Length: This four bits field indicates the number of four byte words in the TCP
header. The length of the header can be between 20 and 60 bytes. Therefore, the value of
this field can be between five (5 x 4 = 20) and 15 (15 x 4 = 60).
Reserved: This is a six bits field reserved for future usage.
Control: This field defines six different control flags as:
Table 2.2 Description of flags in the control field
Flag Description
URG The value of the urgent pointer field is valid.
ACK The value of the acknowledgment field is valid.
PSH Push the data.
RST Reset the connection.
SYN Synchronize sequence numbers during connection.
FIN Terminate the connection.
Window: This field is used by the receiving host to tell the sender how much data the
receiver permits at the moment. Each ACK carries the Acknowledgment number together
with a Window value giving the number of bytes, starting from that Acknowledgment
number, that the sending host may transmit before it receives the next ACK. The next ACK
updates the window that the sender may use.
Checksum: This field contains a checksum computed over the TCP header and data. The
checksum also covers a 96-bit pseudo header containing the source address, destination
address, protocol, and TCP length, which helps detect segments delivered to the wrong host.
Urgent Pointer: This field contains a pointer that points to the end of the data which is
considered urgent. If the connection has important data that should be processed as soon as
possible by the receiving end, the sender can set the URG flag and set the Urgent pointer to
indicate where the urgent data ends.
Option: The Option field is a variable length field and contains optional headers that we
may want to use.
Padding: This padding field pads the TCP header until the whole header ends at a 32-bit
boundary. This ensures that the data part of the packet begins on a 32-bit boundary, and no
data is lost in the packet. The padding always consists of only zeros.
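The header fields and the six control flags of Table 2.2 can be decoded as in the sketch below. It is an illustration only, with the hypothetical name parse_tcp_header; it assumes the six-bit flag layout described in this section and does not parse options.

```python
import struct

# Flag masks within the 16-bit word that holds HLEN, Reserved and the control bits.
TCP_FLAGS = (("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
             ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01))


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header."""
    if len(segment) < 20:
        raise ValueError("a TCP header needs at least 20 bytes")
    (src_port, dst_port, seq, ack, hlen_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len_bytes": (hlen_flags >> 12) * 4,   # HLEN counts four-byte words
        "flags": [name for name, mask in TCP_FLAGS if hlen_flags & mask],
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }


# A hand-built SYN segment to port 80 with HLEN = 5 and no options.
syn = struct.pack("!HHIIHHHH", 40000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(syn)["flags"])
```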
2.2. Naive Bayes Classifier
A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem
with strong (naive) independence assumptions. A more descriptive term for the underlying
probability model would be "independent feature model".
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a
particular feature of a class is unrelated to the presence (or absence) of any other feature.
Depending on the precise nature of the probability model, naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. In spite of their naive design and
apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in
many complex real-world situations.
An advantage of the naive Bayes classifier is that it requires a small amount of training
data to estimate the parameters (means and variances of the variables) necessary for
classification. Because independent variables are assumed, only the variances of the
variables for each class need to be determined and not the entire covariance matrix. The
Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales
linearly with the number of predictors and rows. The build process for Naive Bayes is
parallelized. Naive Bayes can be used for both binary and multiclass classification
problems.
The Naive Bayes algorithm is based on conditional probabilities. It uses Bayes' Theorem, a
formula that calculates a probability by counting the frequency of values and combinations
of values in the historical data.
Bayes' Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another
event that has already occurred. If B represents the dependent event and A represents the
prior event, Bayes' theorem can be stated as follows.
Prob(B given A) = Prob(A and B) / Prob(A)
To calculate the probability of B given A, the algorithm counts the number of cases where
A and B occur together and divides it by the number of cases where A occurs, with or
without B.
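The counting procedure can be shown on a few made-up records. In this sketch A is the protocol of a connection and B is its label; the function name and the sample data are hypothetical and serve only to illustrate the calculation.

```python
def conditional_probability(records, a, b):
    """Estimate Prob(B given A) by counting over (a_value, b_value) records."""
    count_a = sum(1 for x, _ in records if x == a)
    count_a_and_b = sum(1 for x, y in records if x == a and y == b)
    if count_a == 0:
        raise ValueError("A never occurs, so Prob(B given A) is undefined")
    return count_a_and_b / count_a


# Made-up records of the form (protocol, label).
records = [("icmp", "attack"), ("icmp", "attack"), ("icmp", "normal"), ("tcp", "normal")]
# A = "icmp" occurs 3 times, and A together with B = "attack" occurs 2 times.
print(conditional_probability(records, "icmp", "attack"))
```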
Naive Bayes Algorithm
Let X be a set of instances $x_i = (a_1, a_2, \ldots, a_n)$
Let V be a set of classifications $v_j$
Naive Bayes assumption:
$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j)$ (2.1)
This leads to the following algorithm:
Naive_Bayes_Learn ( examples )
    for each target value vj
        estimate P ( vj )
        for each attribute value ai of each attribute a
            estimate P ( ai | vj )
Classify_New_Instance ( x )
    $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{a_i \in x} P(a_i \mid v_j)$
We generally estimate P ( ai | vj ) using m-estimates:
$P(a_i \mid v_j) = \dfrac{n_c + m\,p}{n + m}$ (2.2)
where:
n = the number of training examples for which v = vj
nc = number of examples for which v = vj and a = ai
p = a priori estimate for P ( ai | vj )
m = the equivalent sample size
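The learning and classification steps, together with the m-estimate, can be sketched as follows. This is a minimal illustration rather than the classifier built for INIDS. The class name is hypothetical, the prior p is assumed to be uniform over the values observed for each attribute, m is assumed to be greater than zero, and the training records are toy data that merely borrow value names in the style of the KDDCup’99 protocol and service attributes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self, m: float = 1.0):
        self.m = m  # equivalent sample size of the m-estimate (assumed > 0)

    def learn(self, examples):
        """examples: list of (attribute_tuple, class_label) pairs."""
        self.class_counts = Counter(label for _, label in examples)
        self.total = len(examples)
        self.value_counts = defaultdict(Counter)  # (position, label) -> value counts
        self.values = defaultdict(set)            # position -> observed values
        for attrs, label in examples:
            for i, value in enumerate(attrs):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)

    def likelihood(self, i, value, label):
        """m-estimate of P(ai | vj) = (nc + m * p) / (n + m)."""
        n = self.class_counts[label]
        nc = self.value_counts[(i, label)][value]
        p = 1.0 / len(self.values[i])  # assumption: uniform prior over seen values
        return (nc + self.m * p) / (n + self.m)

    def classify(self, attrs):
        """Return the label maximizing P(vj) * product of P(ai | vj), using log sums."""
        def score(label):
            log_p = math.log(self.class_counts[label] / self.total)
            for i, value in enumerate(attrs):
                log_p += math.log(self.likelihood(i, value, label))
            return log_p
        return max(self.class_counts, key=score)


# Toy training set of (protocol, service) tuples with labels (made-up data).
examples = [
    (("icmp", "ecr_i"), "smurf"),
    (("icmp", "ecr_i"), "smurf"),
    (("tcp", "http"), "normal"),
    (("tcp", "http"), "normal"),
    (("udp", "domain_u"), "normal"),
]
model = NaiveBayes(m=1.0)
model.learn(examples)
print(model.classify(("icmp", "ecr_i")))
```

Summing logarithms instead of multiplying probabilities is a common design choice that avoids numeric underflow when there are many attributes. Because m is positive, a value never seen with a class still receives a small non-zero probability.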
2.3. Some Well-Known Attacks
2.3.1. DoS
A denial of service attack (DoS attack) or distributed denial of service (DDoS) attack is an
attempt to make a computer resource unavailable to its intended users. Perpetrators of DoS
attacks typically target sites or services hosted on high-profile web servers such as banks,
credit card payment gateways, etc. The term is generally used with regards to computer
networks, but is not limited to this field, for example, it is also used in reference to CPU
resource management.
One common method of attack involves saturating the target (victim) machine with
external communications requests, such that it cannot respond to legitimate traffic, or
responds so slowly as to be rendered effectively unavailable. In general terms, DoS attacks
are implemented by either forcing the targeted computer(s) to reset, or consuming its
resources so that it can no longer provide its intended service or obstructing the
communication media between the intended users and the victim so that they can no longer
communicate adequately.
Denial-of-service attacks are considered violations of the IAB's Internet proper use policy,
and also violate the acceptable use policies of virtually all Internet Service Providers. They
also commonly constitute violations of the laws of individual nations.
There are many varieties of denial of service (or DoS) attacks. Some DoS attacks (like a
mailbomb, neptune, or smurf attack) abuse a perfectly legitimate feature. Others (teardrop,
Ping of Death) create malformed packets that confuse the TCP/IP stack of the machine that
is trying to reconstruct the packet. Still others (apache2, back, syslogd) take advantage of
bugs in a particular network daemon.
Some Captured DoS attacks are as follows:
a) Smurf
b) Neptune
c) Teardrop
d) Pod
e) Land
f) Nuke
Smurf
The smurf attack is a denial-of-service attack that generates significant network traffic on
a victim network by flooding the target system with replies to spoofed broadcast ping
messages.
In the "smurf" attack, attackers use ICMP echo request packets directed to IP broadcast
addresses from remote locations. There are three parties in these attacks: the attacker, the
intermediary, and the victim (note that the intermediary can also be a victim). The attacker
sends ICMP “echo request” packets to the broadcast address (xxx.xxx.xxx.255) of many
subnets with the source address spoofed to be that of the intended victim. Any machines
that are listening on these subnets will respond by sending ICMP “echo reply” packets to
the victim. The smurf attack is effective because the attacker is able to use broadcast
addresses to amplify what would otherwise be a rather innocuous ping flood. This
amplification effect is illustrated by Figure 2.6. The attacking machine sends a single
spoofed packet to the broadcast address of some network, and every machine that is located
on that network responds by sending a packet to the victim machine. Because a subnet whose
broadcast address has the form xxx.xxx.xxx.255 can hold up to 254 hosts, in the best case
(from an attacker’s point of view) the attacker can flood the victim with roughly 250 times
as many packets as it could send without such amplification. Figure 2.6 is a simplification
of the smurf attack. In an actual attack, the attacker sends a stream of ICMP “ECHO”
requests to the broadcast addresses of many subnets, resulting in a large, continuous stream
of “ECHO” replies that flood the victim.
Figure 2.6 Smurf attack (one echo request sent to a broadcast address; hundreds of echo
replies flood the victim)
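The amplification argument above can be made concrete with a little arithmetic. The following is only an illustrative sketch: the function name and the example numbers are assumptions chosen for the illustration, not measurements from this project.

```python
def smurf_reply_count(requests_per_subnet, responders_per_subnet, subnets=1):
    """Echo replies reaching the victim if every responding host answers every spoofed request."""
    return requests_per_subnet * responders_per_subnet * subnets


# One spoofed request to one broadcast address with 200 responding hosts:
single_subnet = smurf_reply_count(requests_per_subnet=1, responders_per_subnet=200)

# Ten requests to each of three subnets with 254 responding hosts each:
several_subnets = smurf_reply_count(requests_per_subnet=10, responders_per_subnet=254, subnets=3)
```

The attacker's own bandwidth is spent only on the requests, so the ratio of replies to requests is the amplification factor.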
Teardrop
A teardrop attack is a denial-of-service attack that exploits IP fragment reassembly to crash
the target computer. The attacker sends fragments whose header information is erroneous
and indicates overlapping fragments, so that some data in one fragment would have to
overwrite data in another when the packet is re-assembled. Attempting to re-assemble such
overlapping fragments can cause the computer to crash if its software is not prepared to
handle the erroneous packet header information.
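The overlap condition described above can be checked from the fragment offsets and lengths alone. The sketch below is a minimal illustration, assuming the fragments of one datagram are already available as (offset in bytes, payload length) pairs; the function name is hypothetical and this is not the detection logic used in the INIDS.

```python
def has_overlapping_fragments(fragments):
    """fragments: iterable of (offset_bytes, payload_length) pairs for one IP datagram."""
    end_of_previous = 0
    for offset, length in sorted(fragments):
        if offset < end_of_previous:
            # This fragment starts inside data already covered by an earlier one.
            return True
        end_of_previous = max(end_of_previous, offset + length)
    return False
```

For example, two fragments of (0, 1480) and (1480, 1480) fit end to end, whereas (0, 36) followed by (24, 4) places the second fragment inside the first.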
Neptune
Neptune (SYN flood) is a denial-of-service attack to which every TCP/IP implementation
is vulnerable (to some degree). To distinguish a Neptune attack, network traffic is
monitored for a large number of simultaneous SYN packets destined for a particular machine.
The host that appears to send these packets is usually unreachable.
Each half-open TCP connection made to a machine causes the “tcpd” server to add a
record to the data structure that stores information describing all pending connections. This
data structure is of finite size, and it can be made to overflow by intentionally creating too
many partially-open connections. The half-open connections data structure on the victim
server system will eventually fill and the system will be unable to accept any new
incoming connections until the table is emptied out. Normally there is a timeout associated
with a pending connection, so the half-open connections will eventually expire and the
victim server system will recover. However, the attacking system can simply continue
sending IP-spoofed packets requesting new connections faster than the victim system can
expire the pending connections. In some cases, the system may exhaust memory, crash, or
be rendered otherwise inoperative.
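Monitoring for many near-simultaneous SYN packets to one machine can be sketched with a sliding time window. This is a minimal sketch under stated assumptions: the class name, the window length and the threshold are hypothetical, timestamps are assumed to arrive in non-decreasing order, and this is not necessarily how the "syn flood" feature of the INIDS is computed.

```python
from collections import defaultdict, deque


class SynFloodMonitor:
    """Counts SYN packets per destination inside a sliding time window."""

    def __init__(self, window_seconds=1.0, threshold=100):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.recent = defaultdict(deque)  # destination -> timestamps of recent SYNs

    def observe_syn(self, timestamp, destination):
        """Record one SYN and report whether the destination now exceeds the threshold."""
        times = self.recent[destination]
        times.append(timestamp)
        # Forget SYNs that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.threshold
```

A caller would invoke `observe_syn` for every packet whose SYN flag is set and ACK flag is clear, and treat a `True` result as a sign of a possible Neptune attack on that destination.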
POD
A ping of death (abbreviated "POD") is a type of attack on a computer that involves
sending a malformed or otherwise malicious ping to a computer. A ping is normally 64
bytes in size (or 84 bytes when the IP header is considered); many computer systems cannot
handle a ping larger than the maximum IP packet size, which is 65,535 bytes. Sending a
ping that exceeds this size can crash the target computer.
Traditionally, this bug has been relatively easy to exploit. Sending a ping packet of 65,536
bytes or more violates the Internet Protocol, but a packet of such a size can be sent if it is
fragmented; when the target computer reassembles the packet, a buffer overflow can occur,
which often causes a system crash.
This exploit has affected a wide variety of systems, including Unix, Linux, Mac, Windows,
printers, and routers. However, most systems have been fixed since 1997-1998, so this bug
is mostly historical.
In recent years, a different kind of ping attack has become widespread: ping flooding
simply floods the victim with so much ping traffic that normal traffic fails to reach the
system (a basic denial-of-service attack).
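The reassembly overflow in the ping of death follows from the IPv4 header field sizes: the total length field is 16 bits wide, while the fragment offset is counted in 8-byte units, so a late fragment can point beyond 65,535 bytes. The sketch below works through that arithmetic; the function names are hypothetical and a 20-byte header without options is assumed.

```python
MAX_IP_DATAGRAM = 65535  # largest value of the 16-bit IPv4 total-length field


def reassembled_size(fragment_offset_units, fragment_payload_bytes, header_bytes=20):
    """Datagram size implied by the last fragment (the offset is in 8-byte units)."""
    return header_bytes + fragment_offset_units * 8 + fragment_payload_bytes


def is_oversized(fragment_offset_units, fragment_payload_bytes):
    """True if reassembly would produce a datagram larger than IP allows."""
    return reassembled_size(fragment_offset_units, fragment_payload_bytes) > MAX_IP_DATAGRAM
```

With an offset of 8189 units (65,512 bytes), a 3-byte payload gives exactly 65,535 bytes, while a 4-byte payload already exceeds the limit.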
Land
The Land attack occurs when an attacker sends a spoofed SYN packet in which the source
address is the same as the destination address. The attack works because it causes the
machine to reply to itself continuously. Directed against vulnerable systems, this attack
causes them to lock up or become unstable.
Nuke
Nuke is an old DoS attack against computer networks. It consists of fragmented or otherwise
invalid ICMP packets sent to the target, typically by using a modified ping utility to send
the corrupt data repeatedly, thus slowing down the affected computer until it comes to a
complete stop.
2.3.2. Probe
Probing is a class of attacks in which an attacker scans a network of computers to collect
information or find known vulnerabilities. An intruder with a map of machines and
services that are available on a network can use this information to look for exploits. There
are different types of probing: some of them abuse the computer’s legitimate features;
others use social engineering techniques. This class of attacks is the most commonly
encountered and requires very little technical expertise. Examples are Ipsweep, Mscan, Nmap,
Saint, Satan, Pingsweep and Portsweep attacks.
The following probe attacks were captured:
a) Satan
b) Ipsweep
c) Portsweep
d) Nmap
Nmap
Nmap is a "Network Mapper", used to discover computers and services on a computer
network, thus creating a "map" of the network. Just like many simple port scanners, Nmap
is capable of discovering passive services on a network despite the fact that such services
aren't advertising themselves with a service discovery protocol. In addition Nmap may be
able to determine various details about the remote computers. These include operating
system, device type, uptime, software product used to run a service, exact version number
of that product, presence of some firewall techniques and, on a local area network, even
vendor of the remote network card.
Nmap can be used for black hat hacking, or attempting to gain unauthorized access to
computer systems. It would typically be used to discover open ports which are likely to be
running vulnerable services, in preparation for attacking those services with another
program.
System administrators often use Nmap to search for unauthorized servers on their network,
or for computers which don't meet the organization's minimum level of security.
Satan
Satan is a probing intrusion which automatically scans a network of computers to gather
information or find known vulnerabilities.
SATAN is an early predecessor of the SAINT scanning program. While SAINT and SATAN are
quite similar in purpose and design, the particular vulnerabilities that each tool checks
for are slightly different. Like SAINT, SATAN is distributed as a collection of Perl and C
programs that can be run either from within a web browser or from the UNIX command prompt.
SATAN supports three levels of scanning: light, normal, and heavy. The vulnerabilities that
SATAN checks for in heavy mode are:
NFS export to unprivileged programs
NFS export via portmapper
NIS password file access
REXD access
tftp file access
remote shell access
unrestricted NFS export
unrestricted X Server access
write-able ftp home directory
several Sendmail vulnerabilities
several ftp vulnerabilities
Scans in light and normal mode simply check for smaller subsets of these vulnerabilities.
Ipsweep
An Ipsweep attack is a surveillance sweep to determine which hosts are listening on a
network. This information is useful to an attacker in staging attacks and searching for
vulnerable machines. There are many methods an attacker can use to perform an Ipsweep
attack. The most common method and the method used within the simulation is to send
ICMP Ping packets to every possible address within a subnet and wait to see which
machines respond.
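A sweep of this kind shows up as one source contacting many distinct targets. The sketch below is a minimal illustration, assuming the ICMP echo requests have already been reduced to (source, target) pairs; the function name and the threshold are hypothetical and not taken from the INIDS.

```python
from collections import defaultdict


def find_sweepers(probes, min_distinct_targets=10):
    """probes: iterable of (source, target) pairs; returns sources that touched many distinct targets."""
    targets_by_source = defaultdict(set)
    for source, target in probes:
        targets_by_source[source].add(target)
    return {
        source
        for source, targets in targets_by_source.items()
        if len(targets) >= min_distinct_targets
    }
```

Counting distinct targets rather than packets keeps a host that pings a single machine many times from being flagged. The same counting works for port sweeps if each target is a (host, port) pair.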
Portsweep
Port Sweep is a network testing tool that lets an attacker learn a great deal about a
network and its functionality. It combines several utilities to obtain results more
efficiently and more easily. With it, an attacker can gather information (such as location
and network type) about a given computer, identified by IP address, server name or e-mail
address, and about other computers that are connected to the Internet. An attacker can also
sweep a network to see whether any open ports are waiting to be exploited and to observe
what data is being sent.
2.4. jNetPcap
jNetPcap is a Java wrapper around the libpcap and WinPcap native libraries found on various
Unix and Windows platforms. jNetPcap exposes their functionality through a Java programming
interface (API), which helps in capturing packets on the network.
The main classes which implement libpcap and WinPcap functionality are:
org.jnetpcap.Pcap class - core libpcap methods available on all platforms
org.jnetpcap.winpcap.WinPcap class - extensions based on the WinPcap library,
typically only available on Windows-based systems
The core libpcap implementation of jNetPcap provides methods to do the following:
Find a complete list of network interfaces the system has
Open either a network interface or a PCAP capture file for reading packets
Apply a packet filter
Dump packets into a PCAP capture file
Transmit raw link layer packets over a network interface
Gather statistics on network interface and report counters
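To illustrate what "opening a PCAP capture file for reading packets" involves, the sketch below parses the classic pcap file layout directly with the Python standard library. It is not the jNetPcap API; it only assumes the well-known layout of a 24-byte global header followed by 16-byte record headers, and it handles only little-endian files with microsecond timestamps. The function name is hypothetical.

```python
import struct

# magic, version major, version minor, thiszone, sigfigs, snaplen, link type
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")
# timestamp seconds, timestamp microseconds, captured length, original length
PCAP_RECORD_HEADER = struct.Struct("<IIII")


def read_pcap_packets(data):
    """Yield (timestamp_seconds, packet_bytes) from a little-endian classic pcap byte string."""
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic != 0xA1B2C3D4:
        raise ValueError("not a little-endian microsecond pcap file")
    position = PCAP_GLOBAL_HEADER.size
    while position + PCAP_RECORD_HEADER.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = PCAP_RECORD_HEADER.unpack_from(data, position)
        position += PCAP_RECORD_HEADER.size
        yield ts_sec + ts_usec / 1_000_000, data[position:position + incl_len]
        position += incl_len
```

A library such as jNetPcap hides this parsing, along with byte-order handling and live capture, behind its own classes.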
2.5. jSMILE
jSMILE is a platform-independent library of Java classes for reasoning in graphical
probabilistic models, such as Bayesian networks and influence diagrams. It can be
embedded in programs that use graphical probabilistic models as their reasoning engines.
jSMILE requires only an installed JRE, so it can be used to create stand-alone
applications, applets, and servlets. Model building and inference are under full control of
the application program, as the jSMILE library serves merely as a set of tools and
structures that facilitate them.
3. SYSTEM DESIGN
Our aim is to design and develop an Intelligent Network Intrusion Detection System
(INIDS) that is accurate, low in false alarms, not easily fooled by small variations
in patterns, adaptive, and capable of real-time detection.
Attributes Used
For our INIDS, we have extracted 18 features from tcpdump files which can identify
packet characteristics. The features are:
protocol type,
ip length,
don’t fragment flag(df),
more fragment flag(mf),
fragmentation offset,
syn flood,
urgent pointer,
tcp flags(urg, ack, psh, rst, syn, fin),
tcp window size,
udp checksum,
icmp flood,
icmp checksum, and
type (packet is normal or attack)
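Several of the features listed above come straight from fixed positions in the IPv4 header. The sketch below shows how a few of them could be read from the first 20 bytes of a packet's IP header; the function name and dictionary keys are hypothetical, and the INIDS itself obtains these values through jNetPcap rather than through code like this.

```python
import struct


def ip_header_features(packet):
    """Read a few header-based features from the first 20 bytes of an IPv4 header."""
    (_version_ihl, _tos, total_length, _ident, flags_fragment,
     _ttl, protocol, _checksum, _src, _dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "protocol_type": protocol,                   # 1 = ICMP, 6 = TCP, 17 = UDP
        "ip_length": total_length,                   # total datagram length in bytes
        "df": (flags_fragment >> 14) & 1,            # don't fragment flag
        "mf": (flags_fragment >> 13) & 1,            # more fragments flag
        "fragment_offset": flags_fragment & 0x1FFF,  # in 8-byte units
    }
```

The TCP, UDP and ICMP features would be read in the same way from the transport header that follows the IP header.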
3.1. System Block Diagram
Figure 3.1 System Block Diagram
3.2. Data Flow Diagrams (DFDs)
DFD is a structured, diagrammatic technique for showing the functions performed by a
system and the data flowing into, out of, and within it.
The ‘Context Diagram’ or ‘level-0 DFD’ is an overall, simplified view of the target
system, which contains only one process box and the primary inputs and outputs.
Figure 3.2 Level-0 DFD (diagram labels: Network, Sniffer, Detector, File System, Knowledge
Based Engine, Training DataSet, Captured, Normal, Attack, Trained)
The ‘level-1 DFD’ shows all processes at the first level of numbering, data stores, external
entities and the data flows between them. The purpose of this level is to show the major
high-level processes of the system and their interrelation.
Figure 3.3 Level-1 DFD
The ‘level-2 DFD’ is a decomposition of a process shown in a level-1 diagram. Here we
have decomposed the “inference engine” process.
Figure 3.4 Level-2 DFD
3.3. Unified Modeling Language (UML)
UML is now the most widely used graphical representation scheme for modeling
object-oriented systems. An attractive feature of the UML is its flexibility. The UML is
extensible and is independent of any particular object-oriented analysis and design (OOAD)
process. We have created a use case diagram to model the interactions of network
administrators and crackers with their use cases.
Figure 3.5 Use Case Diagram (actors: Network Admin, Cracker; system: INIDS; use cases:
Train Dataset, Test Dataset, Attack System, Add to Dataset, Run System)
4. METHODOLOGY
To develop our system, we adopted the traditional waterfall model. The waterfall
model is a sequential software development process in which progress is seen as flowing
steadily downwards, like a waterfall, through the phases of conception, analysis, design,
construction, testing and maintenance. One proceeds from one phase to the next in a
sequential manner. For example, when the requirements are fully completed, one proceeds to
design. When the design is fully completed, it is implemented by coders. Towards the later
stages of the implementation phase, the separately produced software components are
combined to introduce new functionality and to reduce risk through the removal of errors.
Thus the waterfall model maintains that one should move to a phase only when its preceding
phase is completed and perfected.
As this project is knowledge-based, a sizeable proportion of time was spent
researching strategies for implementation. To reach our goal, we consulted several books
and websites and received valuable suggestions from friends and seniors. We studied
existing systems that are applicable in several fields and identified their
characteristics, applicability and limitations. In this regard, the existing intrusion
detection system "snort" inspired our work: it is signature-based, relies on signatures
extracted by human experts, and fails to detect unknown intrusions.
A learning algorithm is good if it produces accurate predictions for the classification of
unseen examples. We first train our model with a training dataset and then test it with a
test dataset, so we adopt the following methodology:
Collect a large set of examples.
Divide it into two disjoint sets: the training set and the test set.
Apply the learning algorithm to the training set.
Measure the percentage of examples in the test set that are correctly classified.
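The four steps above can be sketched as follows. This is a minimal illustration, assuming each example is a (features, label) pair; the function names, the 30% test fraction and the fixed seed are hypothetical choices, not values used in this project.

```python
import random


def split_examples(examples, test_fraction=0.3, seed=0):
    """Shuffle the examples and divide them into disjoint training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - test_size
    return shuffled[:cut], shuffled[cut:]


def accuracy(classify, test_set):
    """Percentage of (features, label) pairs for which classify(features) equals label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return 100.0 * correct / len(test_set)
```

Here `classify` stands for whatever function the trained model provides; it must only ever be fitted on the training part of the split.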
For the training and testing of our INIDS, we have used the 1998 DARPA dataset
provided by MIT Lincoln Laboratory. It is a widely used dataset for training and testing
intrusion detection systems. It provides around 4 gigabytes of compressed tcpdump data
for 7 weeks of network traffic. Each week has five days, and each day has its tcpdump
data. It also provides a tcpdump list file, which labels every flow as attack or not.
Every entry consists of the flow identifier number, the date, the time when the first
packet of the flow arrived, the duration, the service name, the source port number, the
destination port number, the source IP address, the destination IP address, the attack
score, and the name of the attack. With this file, we are able to recognize which flow is
an attack and to extract the corresponding data from the tcpdump data.
The first and second weeks of training data consist of normal traffic, and the other weeks
consist of mixed traffic, i.e. normal traffic and attack traffic. To train our intrusion
detection system, we extracted normal traffic from the outside tcpdump of Wednesday and
Thursday of the second week. Similarly, we extracted attack traffic from the other weeks’
traffic. We used the editcap tool to split the huge tcpdump files and Wireshark to filter
the desired packets.
For our INIDS, we have extracted 18 features from tcpdump files which can identify
packet characteristics. The features have to be preprocessed to suit the naive Bayes
classifier used here, because it cannot handle continuous values. So, while making the
dataset, the continuous features are discretized. This dataset is then used to train the
naive Bayes classifier. During inference, we extract all the features for each packet and
feed them to the naive Bayes classifier, which calculates the probability that the packet
is normal; based on a threshold, the packet is classified as normal or attack.
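The discretize, train and threshold pipeline described above can be sketched with the standard library alone. This is a stand-in for illustration only: the project performs learning and inference through jSMILE, whereas the class and function names, the bin edges and the add-one (Laplace) smoothing below are assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def discretize(value, bin_edges):
    """Map a continuous value to the index of the first bin edge it does not exceed."""
    for index, edge in enumerate(bin_edges):
        if value <= edge:
            return index
    return len(bin_edges)


class DiscreteNaiveBayes:
    def fit(self, rows, labels):
        """Count how often each discrete feature value occurs with each label."""
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
        self.values_seen = defaultdict(set)       # feature index -> distinct values
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values_seen[i].add(value)
        return self

    def posterior(self, row):
        """Return P(label | row) for every label, with add-one smoothing."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for i, value in enumerate(row):
                seen = self.value_counts[(i, label)][value]
                # The extra +1 in the denominator reserves mass for unseen values.
                score += math.log((seen + 1) / (count + len(self.values_seen[i]) + 1))
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: weight / norm for label, weight in weights.items()}


def classify(model, row, threshold=0.5):
    """Label a packet as normal if P(normal | row) reaches the threshold, else as attack."""
    probability_normal = model.posterior(row).get("normal", 0.0)
    return "normal" if probability_normal >= threshold else "attack"
```

Raising the threshold makes the classifier more willing to raise alarms, which trades more false positives for fewer missed attacks.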
5. IMPLEMENTATION
5.1. Object-Oriented Design
In this technique, the various objects that occur in the problem domain and the solution
domain are first identified, along with the different kinds of relationships that exist
among them. This object structure is further refined to obtain the detailed design.
This approach has several advantages, such as less development effort and time, and better
maintainability.
During the implementation phase, each component of the design is implemented as a
program module, and each of these program modules is unit tested, debugged and
documented.
Tools Used:
NetBeans 6.5 IDE
API Used:
jSMILE API
jNetPcap
Language Used:
Java
System Installation Requirement:
Operating System - Windows XP, Windows Vista, Windows 7
CPU - 500 MHz (or above)
Memory - 128 MB (or above)
6. TESTING
Testing is necessary to determine whether the modules and the system are working properly.
6.1. Level of Testing
While implementing our system, we went through various levels of testing, which are as
follows:
a) Unit Testing: The purpose of unit testing is to determine the correct working of the
individual modules.
b) Integration Testing: During this phase the different modules are integrated in a
planned manner. The different modules making up a system are never integrated in a single
shot. Integration is normally carried out through a number of steps. During each integration
step, the partially integrated system is tested.
c) System Testing: Finally when all the modules have been successfully integrated and
tested, system testing is carried out.
6.2. Software Testing Strategies
The two prevalent strategies that we performed are black-box testing and white-box
testing.
a) Black-Box testing: Demonstrates that software functions are operational, that input
is properly accepted and that output is correctly produced.
b) White-Box testing: Examines the fundamental aspects of the system with complete
information about and access to the internal logical structure, code and algorithms.
Many features are still to be added to our project, and many limitations are still to be
corrected. Before the final version of the software is released, alpha testing, beta
testing and acceptance testing can additionally be done.
7. RESULT
7.1. Screenshots
Figure 7.1 Naive Bayes Classifier
Figure 7.2 GUI Layout
Figure 7.3 Detection of normal packets only
Figure 7.4 Detection of anomalous packets only
Figure 7.5 Detection of both normal and anomalous packets
7.2. Comparison with Other Existing System
Our INIDS can be compared with an existing IDS such as Snort, which is often regarded
as a reference intrusion detection system. Snort is signature-based, whereas our system is
machine learning-based. For known attacks, Snort performs better, whereas for unknown
attacks, our system performs better. Snort is configured from the command line, whereas
our system provides a GUI for configuration. As a result, our system is easier to use.
Figure 7.6 Accuracy of known attack (bar chart comparing SNORT and INIDS on a Low to High scale)
Figure 7.7 Accuracy of unknown attack (bar chart comparing SNORT and INIDS on a Low to High scale)
Figure 7.8 Ease of Use (bar chart comparing SNORT and INIDS on a Low to High scale)
8. CONCLUSIONS AND FURTHER WORK
8.1. Conclusions
We completed a project on the detection of network intrusions based on the Naive
Bayes algorithm. Using learning techniques, the completed system can detect novel attacks
that were not detected by the existing system, Snort. Although Snort provides high
accuracy, it is more time consuming to maintain because it requires regular updates.
Our system can detect intrusions more efficiently and in less time.
Through this project we learned to work as a team, to divide tasks and to cooperate on
them. The successful work not only made us feel proud but also made us good companions.
In this way we completed our project successfully.
8.2. Further Work
Our system works only for IPv4 networks. In future, it can be extended to IPv6 networks.
We have analyzed only packet headers, so our system cannot detect “Exploits”
intrusions. Payload analysis features could be added to our system in future.
A naïve Bayesian network is a restricted network that has only two layers and assumes
complete independence between the information nodes. This poses a limitation to this
research work. To alleviate this problem and reduce false positives, active platform or
event-based classification using a Bayesian network may be considered. We will
continue our work in this direction in order to build an efficient intrusion detection model.