Over the past few decades, Internet-based applications and computer networking
technology have progressed rapidly. Data is the most important asset of any
organization, and organizations require proper protection and management of private
and highly sensitive information. Cyber-attacks have become very common, and
Intrusion Detection Systems are one means of providing network security. An
intrusion detection system gathers and analyses information from various areas within
a network or computer to identify possible security breaches, which include both
misuse and intrusion. Researchers are increasingly interested in applying data mining
techniques to intrusion detection. Data mining is the process of knowledge discovery
by analysing huge volumes of data from various perspectives and summarizing them
into useful information. It is used to find hidden patterns in a large data set.
Classification is one of the most important applications of data mining.
Classification techniques assign data items to predefined class labels. During data
mining, the classifier builds classification models from an input data set, and these
models are used to predict future data trends. This work aims to provide an intrusion
detection system using Bagging Ensemble Selection.
First of all, I extend my gratitude from the bottom of my heart to the honorable
Chairman Dr. L P Thangavelu, MS., FAIS, FIAGES, FICS, and to our respected
Correspondent Mrs. Shanthi Thangavelu of PPG Institute of Technology, for
providing me the best platform to satisfy my quest for knowledge throughout my
curriculum and for paving the way to take up a challenging project.
I feel elated in recording my heartfelt gratitude to our Principal
Dr. R Prakasam, B.E. (Hons)., M.E., Ph.D., FIE., C.Engg., MISTE, for his
motivation and his ardent attitude in providing all the necessary facilities that have
helped me in shaping this project.
I am highly grateful to Mr. P P Joby., M.Tech., (Ph.D.), Associate Professor
and Head, Department of Computer Science and Engineering, for offering incessant
help in all possible ways towards the execution of this work.
I take immense pleasure in expressing my humble note of gratitude to
Ms. Santhamani V, M.E., Mrs. T Poongodi, M.Tech., (Ph.D.), Project Coordinators,
Department of Computer Science and Engineering, for their valuable suggestions and
remarkable guidance throughout the completion of the project.
I thank my project guide Mrs. D Sumathi, M.E., (Ph.D.), Assistant Professor
[Senior Grade], Department of Computer Science and Engineering, who guided me
throughout the project and generously shared her knowledge.
I also extend my thankfulness to all our department faculty members, my
parents and friends for their moral support that helped me in carrying out this project.
Above all, I thank the “ETERNITY” for giving me the strength and courage in
accomplishing this project work.
As network technology expands rapidly, securing it is becoming a requirement for an
organization's survival. A large share of organizations rely on the web to
communicate with people and systems, providing news, online shopping, email and
personal data. Because of the rapid growth of the technology and the widespread use of
the Internet, securing an organization's critical data, inside or across networks, has
become difficult, because many individuals attempt to attack systems in order to
extract information. An enormous number of attacks have been seen in the last few
years. Intrusion Detection Systems play a major part against those attacks by
protecting the system's critical data. As firewalls and antivirus software are
insufficient to give full protection to the system, organizations need to deploy an
Intrusion Detection System to protect their critical data against different sorts of
attacks.
Intrusions are activities that attempt to bypass the security mechanisms of
computer systems. They are thus activities that threaten the integrity,
availability, or confidentiality of a system resource. These properties have the
following meanings:
• Confidentiality – means that data is not made available or disclosed to
unauthorized people, entities or processes;
• Integrity – means that information has not been altered or destroyed in an
unauthorized way;
• Availability – means that a system or a system resource is accessible and
usable upon demand by an authorized user.
Intrusion detection is the process of monitoring the events occurring in a
computer network or system and analysing them for signs of intrusions, such as
unauthorized entry, activity, or file modification.
A detailed literature study related to the project was carried out, and the most
relevant findings are presented in the following sections.
2.1 Intrusion Detection System
An Intrusion Detection System (IDS) is a defence system that detects
hostile activities in a network. The key is to detect and possibly prevent activities
that may compromise system security, or a hacking attempt in progress, including
reconnaissance and data collection phases that involve, for example, port scans. One
key feature of intrusion detection systems is their ability to provide a view of unusual
activity and to issue alerts notifying administrators and/or block a suspected
connection. According to Amoroso, intrusion detection is "a process of identifying and
responding to malicious activity targeted at computing and networking resources". In
addition, IDS tools are capable of distinguishing between insider attacks originating
from inside the organization (coming from its own employees or customers) and
external ones (attacks and the threat posed by hackers).
2.2 What is not an IDS?
Contrary to popular market belief and terminology employed in the literature
on intrusion detection systems, not everything falls into this category. In particular, the
following security devices are NOT IDS:
• Network logging systems used, for example, to detect complete vulnerability to
any Denial of Service (DoS) attack across a congested network. These are
network traffic monitoring systems.
• Vulnerability assessment tools that check for bugs and flaws in operating
systems and network services (security scanners), for example Cyber Cop
• Anti-virus products designed to detect malicious software such as viruses,
Trojan horses, worms, bacteria and logic bombs, although feature by feature these
are very similar to intrusion detection systems and often provide an effective
security breach detection tool.
• Firewalls
• Security/cryptographic systems, for example VPN, SSL, Kerberos, Radius etc.
2.3 Taxonomy of attacks and intrusions
Since intrusion detection systems deal with hacking breaches, let us take a
closer look at these dangerous activities. To assist in the discussion of their taxonomy,
some definitions will be helpful although they may vary [10]:
• Intrusion – a series of concatenated activities that pose a threat to the safety of
IT resources through unauthorized access to a specific computer or address domain;
• Incident – a violation of the system security policy rules that may be identified as
a successful intrusion;
• Attack – a failed attempt to enter the system (no violation committed) [19];
• Modelling of intrusions – a time-based modelling of the activities that compose an
intrusion. The intruder starts the attack with an introductory action followed by
auxiliary ones (or evasions) to proceed to successful access; in practice, any
attempt undertaken during the attack by any person, for example by the IT
resource manager, can be identified as a threat.
Generally, attacks can be categorized in two areas:
• Passive (aimed at gaining access to penetrate the system without compromising
IT resources),
• Active (results in an unauthorized state change of IT resources).
In terms of the intruder-victim relation, attacks are categorized as:
• Internal, coming from the enterprise's own employees or their business partners, or
• External, coming from outside, frequently via the Internet.
Attacks are also identified by the source category, namely those performed
from internal systems (local network), the Internet or from remote dial-in sources [11].
Now, let us see what types of attacks and abuses are detectable (sometimes only with
difficulty) by IDS tools, and place them in an ad-hoc categorization. The following
types of attacks can be identified:
• Those related to unauthorized access to the resources (often as introductory
steps toward more sophisticated actions):
    • Password cracking and access violation,
    • Trojan horses,
    • Interceptions, most frequently associated with TCP/IP stealing, and
      interceptions that often employ additional mechanisms to compromise
      the operation of attacked systems [16] (for example by flooding or
      man-in-the-middle attacks),
    • Spoofing (masquerading the host identity, for example by placing forged
      data in the cache of the name server, i.e. DNS spoofing),
    • Scanning ports and services, including ICMP scanning (Ping), UDP
      scanning and TCP stealth scanning (which takes advantage of a partial
      TCP connection establishment), etc.,
    • Remote OS fingerprinting, for example by testing typical responses
      to specific packets, addresses of open ports, standard application
      responses (banner checks), IP stack parameters, etc.,
    • Network packet listening (a passive attack that is difficult to
      detect but sometimes possible),
    • Stealing information, for example disclosure of proprietary information,
    • Authority abuse, a kind of internal attack, for example suspicious
      access by authorized users with odd attributes (at unexpected times,
      coming from unexpected addresses),
    • Unauthorized network connections,
    • Usage of IT resources for private purposes, for example to
      access pornography sites,
    • Taking advantage of system weaknesses to gain access to
      resources or privileges.
• Unauthorized alteration of resources (after gaining unauthorized access):
    • Falsification of identity, for example to gain system administrator
      privileges,
    • Information altering and deletion,
    • Unauthorized transmission and creation of data (sets), for
      example arranging a database of stolen credit card numbers on a
      government computer,
    • Unauthorized configuration changes to systems and network
      services (servers).
• Denial of Service (DoS):
    • Flooding – compromising a system by sending huge amounts of useless
      information to lock out legitimate traffic and deny services:
        • Ping flood (Smurf) – a large number of ICMP packets sent to a
          broadcast address,
        • Send mail flood – flooding with hundreds of thousands of
          messages in a short period of time; also POP and SMTP
        • SYN flood – initiating huge amounts of TCP requests and not
          completing handshakes as required by the protocol,
        • Distributed Denial of Service (DDoS), coming from multiple sources.
    • Compromising the systems by taking advantage of their vulnerabilities:
        • Buffer overflow, for example Ping of Death – sending a very
          large ICMP packet (exceeding 64 KB),
        • Remote system shutdown.
• Web application attacks; attacks that take advantage of application bugs may
cause the same problems as described above.
It is important to remember that most attacks are not a single action, but rather a
series of individual events developed in a coordinated manner.
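The SYN flood entry above can be illustrated with a minimal Python sketch. It assumes a simplified event stream of (source, flag) pairs and an arbitrary threshold; the function names, addresses and threshold are hypothetical and are not taken from the systems surveyed here.

```python
from collections import Counter

def half_open_counts(events):
    """Count handshakes each source started but never completed.

    `events` is an iterable of (source_ip, flag) tuples, where flag is
    "SYN" (handshake started) or "ACK" (handshake completed).
    """
    pending = Counter()
    for source, flag in events:
        if flag == "SYN":
            pending[source] += 1
        elif flag == "ACK" and pending[source] > 0:
            pending[source] -= 1
    return pending

def suspected_syn_flooders(events, threshold=3):
    """Return sources whose half-open count exceeds the assumed threshold."""
    counts = half_open_counts(events)
    return sorted(src for src, n in counts.items() if n > threshold)

# Made-up traffic: one source opens five handshakes and completes none.
events = [("10.0.0.5", "SYN")] * 5 + [("10.0.0.9", "SYN"), ("10.0.0.9", "ACK")]
print(suspected_syn_flooders(events))
```

A real detector would also expire old entries and work on captured packets rather than pre-labelled tuples; the sketch only shows the counting idea.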
2.4 Structure and architecture of Intrusion Detection System
An intrusion detection system always has a core element: a sensor (an
analysis engine) that is responsible for detecting intrusions. This sensor contains
decision-making mechanisms regarding intrusions [17]. Sensors receive raw data from
three major information sources (Fig.2.1): the IDS's own knowledge base, syslog and
audit trails. The syslog may include, for example, the configuration of the file
system, user authorizations, etc. This information creates the basis for a further
decision-making process. In the figure, the arrow width is proportional to the amount
of information flowing between system components.
Fig.2.1: A sample IDS
The sensor is integrated with the component responsible for data collection
(Fig.2.2), an event generator [8]. The manner of collection is determined by the event
generator policy, which defines the filtering mode of event notification information.
The event generator (operating system, network, application) produces a
policy-consistent set of events that may be a log (or audit) of system events, or
network packets. This set, along with the policy information, can be stored either in
the protected system or outside it. In certain cases no data storage is employed, for
example when event data streams are transferred directly to the analyser. This applies
in particular to network packets.
Fig.2.2: IDS components
The role of the sensor is to filter information and discard any irrelevant data
obtained from the event set associated with the protected system, thereby detecting
suspicious activities. The analyser uses the detection policy database for this purpose.
The latter comprises the following elements: attack signatures, normal behaviour
profiles, necessary parameters (for example, thresholds). In addition, the database
holds IDS configuration parameters, including modes of communication with the
response module. The sensor also has its own database containing the dynamic history
of potential complex intrusions (composed from multiple actions).
Intrusion detection systems can be arranged as either centralized (for example,
physically integrated within a firewall) or distributed. A distributed IDS (DIDS)
consists of multiple intrusion detection systems spread over a large network, all of
which communicate with each other. More sophisticated systems follow an agent
structure principle, where small autonomous modules are organized on a per-host basis
across the protected network. The role of the agent is to monitor and filter all
activities within the protected area and, depending on the approach adopted, to make
an initial analysis and even undertake a response action. The cooperative agent
network that reports to the central analysis server is one of the most important
components of intrusion detection systems. A DIDS can employ more sophisticated
analysis tools, particularly in connection with the detection of distributed attacks.
Another separate role of the agent is associated with its mobility and roaming across
multiple physical locations. In addition, agents can be specifically devoted to
detecting certain known attack signatures. This is a decisive factor when introducing
protection measures associated with new types of attacks. IDS agent-based solutions
also use less sophisticated mechanisms for response policy updating.
One multi-agent architecture solution, which originated in 1994, is AAFID
(Autonomous Agents for Intrusion Detection), shown in Fig.2.3. It uses agents that
monitor a certain aspect of the behaviour of the system they reside on at the time. For
example, an agent can see an abnormal number of telnet sessions within the system it
monitors. An agent has the capacity to issue an alert when detecting a suspicious event.
Agents can be cloned and shifted onto other systems (autonomy feature). Apart from
agents, the system may have transceivers to monitor all operations effected by agents
of a specific host [15]. Transceivers always send the results of their operations to a
unique single monitor. Monitors receive information from a specific network area (not
only from a single host), which means that they can correlate distributed information.
Additionally, some filters may be introduced for data selection and aggregation.
Fig.2.3: AAFID representation of an intrusion detection system employing
autonomous agents.
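The agent, transceiver and monitor roles described for AAFID can be sketched as a small Python class hierarchy. This is only an illustration of the reporting chain under simplifying assumptions: the class layout, the telnet-session threshold and the host names are hypothetical and do not reproduce the actual AAFID implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Watches one aspect of a host; here, the number of telnet sessions."""
    limit: int  # assumed alert threshold

    def check(self, telnet_sessions):
        return telnet_sessions > self.limit

@dataclass
class Transceiver:
    """Collects the results of all agents running on a single host."""
    host: str
    agents: list = field(default_factory=list)

    def report(self, telnet_sessions):
        alerts = sum(agent.check(telnet_sessions) for agent in self.agents)
        return self.host, alerts

@dataclass
class Monitor:
    """Receives reports from several transceivers and correlates them."""
    transceivers: list = field(default_factory=list)

    def hosts_with_alerts(self, observations):
        flagged = []
        for transceiver in self.transceivers:
            sessions = observations.get(transceiver.host, 0)
            host, alerts = transceiver.report(sessions)
            if alerts:
                flagged.append(host)
        return flagged

monitor = Monitor([
    Transceiver("hostA", [Agent(limit=10)]),
    Transceiver("hostB", [Agent(limit=10)]),
])
print(monitor.hosts_with_alerts({"hostA": 25, "hostB": 3}))
```

Because the monitor sees reports from every host, it is the natural place to correlate events that look harmless on any single machine.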
2.5 Classification of Intrusion Detection Systems
Intrusion Detection Systems are partitioned into the following categories:
host-based (HIDS), network-based (NIDS), and hybrid intrusion detection [4]. A
HIDS requires small programs (agents) to be installed on the individual systems to be
supervised. The programs monitor the operating system and write their results to log
files and/or trigger alarms. A NIDS customarily consists of a network appliance with
a Network Interface Card (NIC) working in promiscuous mode and a separate management
interface [19]. The Intrusion Detection System is placed on a boundary or network
segment and observes all traffic on that segment. The prevailing tendency in intrusion
detection is to combine network-based and host-based information to develop hybrid
systems that are more effective.
• Host Based Intrusion Detection System (HIDS): A host-based Intrusion
Detection System places monitoring agents on network resource nodes to
monitor the audit logs generated by application programs or the network
operating system. Audit logs contain records of the activities and events
taking place at each network resource. A HIDS can detect attacks that cannot
be seen by a NIDS, such as intrusion and misuse by a trusted insider. A HIDS
uses a signature rule base that is determined by the site-specific security
policy. It overcomes the problems associated with the NIDS by alerting
security personnel, who can identify the source with the help of the
site-specific security policy. A HIDS can also verify whether an attack was
foiled, either because of an immediate response to the alarm or for any other
reason. It can also record user login and logoff actions and all activities
that generate audit records [1].
• Network Based Intrusion Detection System (NIDS): A NIDS is used to
analyse and monitor network traffic in order to protect a system from
network-based threats carried in the data passing through the network. A NIDS
tries to find malicious activities such as port scans, ping sweeps,
denial-of-service (DoS) attacks, and packet sniffing. A NIDS includes one or
more servers for management functions, a number of sensors to oversee packet
traffic, and one or more management consoles for the human interface. It
examines the traffic packet by packet in real time or near real time to detect
intrusion patterns. The analysis of traffic to detect intrusions is done by the
agents on the management
servers [12]. These network based procedures are regarded as the active
• Hybrid Intrusion Detection: Network-based and host-based Intrusion Detection
System solutions each have unique benefits and strengths over the other, which
is why the next generation of Intrusion Detection Systems is evolving to
combine tightly fused network and host components. A hybrid intrusion
detection system increases the security level and promises better flexibility. It
reports attacks that are aimed at the entire network or at particular segments and
combines Intrusion Detection System agent locations [20].
Each technique has its own methodology for monitoring and securing
information, and every category has strengths and weaknesses that ought to be
weighed against the requirements of each target environment. The two sorts
of Intrusion Detection Systems differ fundamentally from one another, but they
complement each other well. In a proper Intrusion Detection System
implementation, it is better to integrate the network intrusion detection
system completely, so that it channels alarms and warnings in the same way as
the host-based part of the system and is controlled from the same central
location. This gives a convenient means of managing and responding to
attacks using both sorts of intrusion detection.
2.6 Intrusion detection approaches
The desirable elements of an Intrusion Detection System can be achieved
through a variety of approaches. There are two popular approaches to intrusion
detection: abuse detection and anomaly detection [2, 13].
Abuse detection: Systems possessing information on abnormal, unsafe
behaviour (attack signature-based systems) are often used in real-time intrusion
detection systems (because of their low computational complexity).
The misbehaviour signatures fall into two categories [6]:
• Attack signatures – they describe action patterns that may pose a
security threat. Typically, they are presented as a time-dependent
relationship between series of activities that may be interlaced with
neutral ones.
• Selected text strings – signatures to match text strings which look for
suspicious action (for example – calling /etc/passwd).
In these systems, any action that is not explicitly recognized as prohibited is
allowed. Hence, their accuracy is very high (low number of false alarms). Typically,
however, they do not achieve completeness and are not immune to novel attacks.
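The "selected text strings" category and the default-permit behaviour described above can be sketched in a few lines of Python. The signature table is an assumption for illustration: only the /etc/passwd string comes from the text, and the second entry and all names are hypothetical.

```python
# Signature name -> text string to look for in an event record.
SIGNATURES = {
    "passwd-access": "/etc/passwd",   # example string mentioned in the text
    "shadow-access": "/etc/shadow",   # hypothetical additional signature
}

def match_signatures(event, signatures=SIGNATURES):
    """Return the names of all signatures whose text occurs in the event."""
    return sorted(name for name, text in signatures.items() if text in event)

def is_allowed(event, signatures=SIGNATURES):
    """Default-permit: an event that matches no signature is allowed."""
    return not match_signatures(event, signatures)

print(match_signatures("GET /cgi-bin/view?file=/etc/passwd"))
print(is_allowed("GET /index.html"))
```

The sketch also shows why such systems miss novel attacks: an event that matches no stored string passes, whatever its intent.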
There are two main approaches associated with signature detection (already mentioned
in the section describing real-time detectors):
• Verification of the pathology of lower layer packets— many types of
attacks (Ping of Death or TCP Stealth Scanning) exploit flaws in IP,
TCP, UDP or ICMP packets. With a very simple verification of flags
set on specific packets it is possible to determine whether a packet is
legitimate or not. Difficulties may be encountered with possible packet
fragmentation and the need for re-assembly. Similarly, some problems
may be associated with the TCP/IP layer of the system being protected.
It is well known that hackers use packet fragmentation to bypass many
IDS tools [7].
• Verification of application layer protocols – many types of attacks
exploit programming flaws, for example out-of-band data sent to an
established network connection. In order to detect such attacks
effectively, the IDS must implement many application layer protocols.
The signature detection methods have the following advantages: a very low false alarm
rate, simple algorithms, easy creation of attack signature databases, easy
implementation and typically minimal system resource usage.
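The flag check mentioned under verification of lower-layer packets can be sketched as follows. The bit values are the standard TCP header flag bits; the particular set of combinations treated as suspicious is an illustrative assumption, not an exhaustive rule base, and packet re-assembly is not handled.

```python
# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def is_suspicious(flags):
    """Return True for flag combinations not seen in a normal TCP exchange."""
    if flags == 0:                        # "null" packet: no flags set
        return True
    if flags & SYN and flags & FIN:       # opens and closes at the same time
        return True
    if flags & SYN and flags & RST:       # opens and resets at the same time
        return True
    if flags == FIN | PSH | URG:          # pattern used by "Xmas" scans
        return True
    return False

print(is_suspicious(SYN | FIN))   # illegitimate combination
print(is_suspicious(SYN | ACK))   # normal second step of the handshake
```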
Some disadvantages:
• Difficulties in updating information on new types of attacks (that is, in
keeping the attack signature database up to date).
• They are inherently unable to detect unknown, novel attacks. A
continuous update of the attack signature database for correlation is a
• Maintenance of IDS is necessarily connected with analysing and
patching of security holes, which is a time-consuming process.
• The attack knowledge is operating environment–dependent, so
misbehaviour signature-based intrusion detection systems must be
configured in strict compliance with the operating system (version,
platform, applications used etc.)
• They seem to have difficulty handling internal attacks. Typically,
abuse of legitimate user privileges is not sensed by the system as a
malicious activity (because of the lack of information on user privileges
and attack signature structure).
Anomaly detection: Normal behaviour patterns are useful in predicting both
user and system behaviour. Here, anomaly detectors construct profiles that represent
normal usage and then use current behaviour data to detect a possible mismatch
between profiles and recognize possible attack attempts [5, 10].
In order to match event profiles, the system is required to produce initial user
profiles to train the system with regard to legitimate user behaviours. There is a
problem associated with profiling: when the system is allowed to “learn” on its own,
experienced intruders (or users) can train the system to the point where previously
intrusive behaviour becomes normal behaviour. An inappropriate profile will not be
able to detect all possible intrusive activities. Furthermore, there is an obvious need
for profile updating and system training, which is a difficult and time-consuming task.
Given a set of normal behaviour profiles, everything that does not match the stored
profile is considered to be a suspicious action. Hence, these systems are characterized
by very high detection efficiency (they are able to recognize many attacks that are new
to the system), but their tendency to generate false alarms is generally a problem.
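The profile-and-mismatch idea can be illustrated with a minimal statistical sketch. It assumes a single numeric feature (logins per hour), a profile made of the mean and standard deviation, and a threshold of three standard deviations; the data, names and threshold are hypothetical choices, not the method of any system cited here.

```python
from statistics import mean, stdev

def build_profile(training_values):
    """Summarise normal behaviour as the mean and standard deviation."""
    return mean(training_values), stdev(training_values)

def is_anomalous(value, profile, k=3.0):
    """Flag values more than k standard deviations from the profile mean."""
    mu, sigma = profile
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > k

# Made-up training data collected during a period assumed to be attack-free.
logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
profile = build_profile(logins_per_hour)

print(is_anomalous(40, profile))  # far outside the learned profile
print(is_anomalous(6, profile))   # within normal variation
```

The training-data caveat in the text applies directly: if the training window already contains intrusive behaviour, the profile absorbs it and later occurrences are no longer flagged.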
The advantages of the anomaly detection method are: the possibility of detecting
novel attacks as intrusions; anomalies are recognized without examining their
causes and characteristics; less dependence of the IDS on the operating environment
(as compared with attack signature-based systems); and the ability to detect abuse of
user privileges.
The biggest disadvantages of this method are:
• A substantial false alarm rate. System usage is not monitored during the
profile construction and training phases. Hence, all user activities
skipped during these phases will be illegitimate.
• User behaviours can vary with time, thereby requiring a constant
update of the normal behaviour profile database (this may imply the
need to close the system from time to time and may also be associated
with greater false alarm rates).
• The necessity of training the system for changing behaviour makes a
system immune to anomalies detected during the training phase (false
3.1 Existing System
Framework of the proposed EDADT (Efficient Data Adapted Decision Tree)
algorithm: The pseudo code of the proposed EDADT algorithm, shown in Fig.3.1,
uses the hybrid PSO technique to identify the local and global best values over n
iterations and thereby obtain the optimal solution. The best solution is obtained by
calculating the average value and by finding the exact efficient features from the
given training data set.
Fig.3.1: Pseudo code of EDADT algorithm
For each attribute ‘a’, select all unique values of ‘a’ and determine whether those
values belong to the same class label. If n unique values belong to the same class
label, split them into ‘m’ intervals, where ‘m’ must be less than ‘n’. If the unique
values belong to ‘c’ different class labels, check the probability with which each
value belongs to each class, and relabel the value with the class label of highest
probability. Split the unique values into ‘c’ intervals, then repeat the check of unique
values against class labels for all values in the data set. Compute the normalized
information gain for each attribute; the attribute with the highest normalized
information gain becomes the decision node. Sub-lists are generated using the best
attribute, and those nodes form the child nodes. These processes continue until the data set
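The attribute selection step can be illustrated with a short Python sketch. It assumes that "normalized information gain" means the C4.5-style gain ratio (information gain divided by split information); the interval-splitting and PSO steps of EDADT are not reproduced, and the attribute names and rows are made up.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of labels or values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of an attribute divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

def best_attribute(rows, labels):
    """Pick the attribute with the highest gain ratio.

    `rows` is a list of dicts mapping attribute name -> value.
    """
    return max(rows[0], key=lambda a: gain_ratio([r[a] for r in rows], labels))

# Hypothetical connection records.
rows = [
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "S0"},
    {"protocol": "udp", "flag": "SF"},
]
labels = ["attack", "normal", "attack", "normal"]
print(best_attribute(rows, labels))
```

In this toy data the flag attribute separates the classes perfectly while the protocol attribute carries no information, so flag would become the decision node.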
3.1.1 Drawbacks of existing system
• The KDD Cup 99 data set was used for performance evaluation.
• A better-performing technique is needed.
3.2 Feasibility Study
All projects are feasible, given unlimited resources and infinite time. Before
proceeding to the steps of software development, the system analyst has to analyze
whether the proposed system will be feasible for the organization and must identify the
customer needs. The main purpose of a feasibility study is to determine whether the
problem is worth solving. The success of a system also depends on the amount of
feasibility study done on it. Many feasibility studies can be carried out on any system,
but there are three main feasibility tests to be performed: Operational
Feasibility, Technical Feasibility and Economic Feasibility.
3.2.1 Operational Feasibility
Operational feasibility is mainly concerned with whether the system will be
used once it is developed and implemented, and whether there will be resistance
from users that will reduce the possible application benefits. The essential
questions that help in testing the operational feasibility of a system are the following.
• Does management support the project?
• Are the users not happy with current business practices? Will it reduce the time
(operation) considerably?
• Have the users been involved in the planning and development of the project?
• Will the proposed system really benefit the organization? Does the overall
response increase? Will accessibility of information be lost?
3.2.2 Economic Feasibility
For any system, if the expected benefits equal or exceed the expected costs, the
system can be judged to be economically feasible. Economic analysis is used for
evaluating the effectiveness of the proposed system, and its most important part is
cost-benefit analysis. As the name suggests, this is an analysis of the costs to be
incurred by the system and the benefits derivable from it.
3.2.3 Technical Feasibility
In technical feasibility the following issues are taken into consideration.
• Whether the required technology is available or not.
• Whether the required resources are available: manpower (programmers,
testers and debuggers), software and hardware.
Once technical feasibility is established, it is important to consider the monetary
factors as well, since developing a particular system may be technically possible but
may require huge investments while yielding few benefits. To evaluate this, the
economic feasibility of the proposed system is assessed.
4.1 Hardware Requirements
Processor : Core i3 1.8 GHz
Hard Disk : 500 GB
RAM       : 4 GB
4.2 Software Requirements
Operating System
Windows XP and above
Programming language
Integrated Development Environment:
Weka 3.7.11
5.1 Weka
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of
machine learning software written in Java, developed at the University of Waikato,
New Zealand. Weka is free software available under the GNU General Public License.
The Weka (pronounced Weh-Kuh) workbench contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together
with graphical user interfaces for easy access to this functionality. The original
non-Java version of Weka was a TCL/TK front-end to (mostly third-party) modeling
algorithms implemented in other programming languages, plus data pre-processing
utilities in C, and a Makefile-based system for running machine learning experiments.
This original version was primarily designed as a tool for analyzing data from
agricultural domains, but the more recent fully Java-based version (Weka 3), for which
development started in 1997, is now used in many different application areas, in
particular for educational purposes and research. Advantages of Weka include:
• free availability under the GNU General Public License
• portability, since it is fully implemented in the Java programming language and
thus runs on almost any modern computing platform
• a comprehensive collection of data pre-processing and modelling techniques
• ease of use due to its graphical user interfaces
Weka supports several standard data mining tasks, more specifically, data preprocessing, clustering, classification, regression, visualization, and feature selection.
All of Weka's techniques are predicated on the assumption that the data is available as
a single flat file or relation, where each data point is described by a fixed number of
attributes (normally, numeric or nominal attributes, but some other attribute types are
also supported). Weka provides access to SQL databases using Java Database
Connectivity and can process the result returned by a database query. It is not capable
of multi-relational data mining, but there is separate software for converting a
collection of linked database tables into a single table that is suitable for processing
using Weka. Another important area that is currently not covered by the algorithms
included in the Weka distribution is sequence modeling.
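The single-flat-file assumption can be made concrete with Weka's native ARFF text format, in which a header declares a fixed list of attributes and each data row supplies one value per attribute. The following Python sketch writes such a file; the helper function, relation name, attribute names and rows are hypothetical and only illustrate the layout.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)

# Hypothetical connection records with two features and a class label.
text = to_arff(
    "connections",
    [("duration", "numeric"),
     ("protocol", ["tcp", "udp"]),
     ("label", ["normal", "anomaly"])],
    [(0, "tcp", "normal"), (12, "udp", "anomaly")],
)
print(text)
```

Every row has the same number of fields as there are declared attributes, which is the fixed-attribute, single-relation structure the paragraph above describes.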
Weka's main user interface is the Explorer, but essentially the same
functionality can be accessed through the component-based Knowledge Flow interface
and from the command line. There is also the Experimenter, which allows the
systematic comparison of the predictive performance of Weka's machine learning
algorithms on a collection of datasets.
The Explorer interface features several panels providing access to the main
components of the workbench:
• The Preprocess panel has facilities for importing data from a database, a CSV
file, etc., and for preprocessing this data using a so-called filtering algorithm.
These filters can be used to transform the data (e.g., turning numeric attributes
into discrete ones) and make it possible to delete instances and attributes
according to specific criteria.
• The Classify panel enables the user to apply classification and regression
algorithms (indiscriminately called classifiers in Weka) to the resulting dataset,
to estimate the accuracy of the resulting predictive model, and to visualize
erroneous predictions, ROC curves, etc., or the model itself (if the model is
amenable to visualization like, e.g., a decision tree).
• The Associate panel provides access to association rule learners that attempt to
identify all important interrelationships between attributes in the data.
• The Cluster panel gives access to the clustering techniques in Weka, e.g., the
simple k-means algorithm. There is also an implementation of the expectation
maximization algorithm for learning a mixture of normal distributions.
• The Select attributes panel provides algorithms for identifying the most
predictive attributes in a dataset.
• The Visualize panel shows a scatter plot matrix, where individual scatter plots
can be selected and enlarged, and analyzed further using various selection
6.1 Data sets
NSL-KDD data set: The NSL-KDD data set was proposed to resolve a number of
the inherent issues of the KDD CUP'99 data set. KDD CUP'99 is the most widely used
data set for anomaly detection. However, Tavallaee et al. conducted a statistical analysis
of this data set and identified two important issues that greatly affect the performance
of evaluated systems and result in a very poor evaluation of anomaly detection
approaches. To resolve these problems, they proposed a new data set, NSL-KDD, that
consists of selected records of the complete KDD data set [18]. NSL-KDD has the
following benefits over the original KDD data set. First, it does not include
redundant records in the train set, so classifiers are not biased towards more
frequent records. Second, the number of selected records from each difficulty level
group is inversely proportional to the percentage of records in the original KDD data set. As
a result, the classification rates of distinct machine learning methods vary over a
wider range, which makes it more efficient to evaluate different
learning techniques accurately. Third, the number of records in the train and test sets is
reasonable, which makes it affordable to run the experiments on the complete set without the
need to randomly select a small portion. Consequently, the evaluation results of
different research works will be consistent and comparable.
Honeypot data set: This data set was collected from a honeypot that was
set up at Tokyo University.
Denial of service data sets: Ping flood and UDP flood data sets were
collected for performance testing on individual attacks.
6.2 Bagging Ensemble Selection
The simple forward model selection based ensemble selection algorithm is
superior to many other prominent ensemble learning algorithms, such as bagging
decision trees, stacking with linear regression at the meta-level and boosting decision
stumps [3]. However, the performance of the final ensemble is sometimes
reduced because ensemble selection overfits the hill-climb set. The authors of [3] explain that the
performance of ensemble selection on the hill-climb set gradually increases as the
number of models in the model library increases. The performance on the test set does
not always increase; it may reach a local or global optimum and then gradually decline.
For certain data sets, performance in terms of root-mean-squared error may degrade
very quickly, as depicted in [3]. The authors of [3] proposed three additions to the simple forward
selection procedure to reduce the chance of overfitting the hill-climb set:
(1) an individual classifier can be selected multiple times, so that
some classifiers get larger weights than others; (2) the models in the library are
sorted by their performance, and the best N models are put into the initial ensemble,
which avoids starting with an empty ensemble; (3) K bags of models are randomly
selected from the model library and ensemble selection is done inside each bag; the
final ensemble is the union of the subsets selected from each of the bags.
Based on the observations from [8], a new ensemble
learning algorithm called bagging ensemble selection was proposed: the simple forward ensemble
selection algorithm is treated as an unstable base classifier, and bagging is applied to it
to construct an ensemble of simple ensemble selection classifiers, which
results in a more robust technique than an individual ensemble selection classifier. The
hill-climb set can be taken from the out-of-bag samples.
In the Bagging Ensemble Selection algorithm, the full bootstrap sample is used for
model generation and the corresponding out-of-bag instances are used as the hill-climb
set for ensemble selection [14]. The bootstrap sample is expected to contain about 1 - 1/e ≈ 63.2% of
the unique examples of the training set. Hence the hill-climb set is expected to have about
1/e ≈ 36.8% of the unique examples of the training set in each bagging iteration. Here,
Reduced Error Pruning Tree is used as the base classifier.
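The 63.2% / 36.8% split follows from sampling with replacement: a given example is missed by one draw with probability 1 - 1/m, so it is missed by all m draws with probability (1 - 1/m)^m, which approaches 1/e as m grows. The following is a minimal simulation sketch of that reasoning; the class and method names are illustrative and are not part of the thesis code.

```java
import java.util.Random;

/** Estimates the fraction of distinct training examples that land in a bootstrap sample. */
final class BootstrapFraction {

    /** Draws m indices with replacement from 0..m-1 and returns the fraction of distinct indices drawn. */
    static double uniqueFraction(int m, long seed) {
        Random rng = new Random(seed);
        boolean[] inBag = new boolean[m];
        int distinct = 0;
        for (int draw = 0; draw < m; draw++) {
            int idx = rng.nextInt(m);
            if (!inBag[idx]) {
                inBag[idx] = true;
                distinct++;
            }
        }
        return (double) distinct / m;
    }

    public static void main(String[] args) {
        double inBag = uniqueFraction(100_000, 1L);
        System.out.printf("in-bag fraction     ~ %.3f (theory %.3f)%n", inBag, 1 - 1 / Math.E);
        System.out.printf("out-of-bag fraction ~ %.3f (theory %.3f)%n", 1 - inBag, 1 / Math.E);
    }
}
```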
The following shows the pseudocode for Bagging Ensemble Selection:
Inputs: Training set S, Ensemble Selection classifier E, Integer T (number of bootstrap samples)
Basic Procedure:
for i = 1 to T
    S_b = bootstrap sample from S
    S_ob = out-of-bag sample
    Train base classifiers (can be a diverse model library) in E on S_b
    E_i = do ensemble selection based on base classifiers' performance on S_ob
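The data split performed in each iteration of the loop can be sketched in Java as follows. This is only an illustration of how S_b and S_ob might be derived as row indices; the class name `BagSplit` and its fields are hypothetical, and training and ensemble selection are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** One bagging iteration's split: bootstrap sample S_b and out-of-bag sample S_ob, as row indices. */
final class BagSplit {
    final int[] inBag;            // indices drawn with replacement (duplicates are possible)
    final List<Integer> outOfBag; // indices that were never drawn

    private BagSplit(int[] inBag, List<Integer> outOfBag) {
        this.inBag = inBag;
        this.outOfBag = outOfBag;
    }

    /** Draws n indices with replacement from 0..n-1 and collects the indices left out. */
    static BagSplit draw(int n, Random rng) {
        int[] inBag = new int[n];
        boolean[] seen = new boolean[n];
        for (int k = 0; k < n; k++) {
            inBag[k] = rng.nextInt(n);
            seen[inBag[k]] = true;
        }
        List<Integer> oob = new ArrayList<>();
        for (int idx = 0; idx < n; idx++) {
            if (!seen[idx]) {
                oob.add(idx);
            }
        }
        return new BagSplit(inBag, oob);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int numIterations = 3; // plays the role of T in the pseudocode
        for (int i = 1; i <= numIterations; i++) {
            BagSplit split = draw(20, rng);
            System.out.println("iteration " + i + ": |S_b| = " + split.inBag.length
                    + ", |S_ob| = " + split.outOfBag.size());
        }
    }
}
```

Because the out-of-bag rows are never seen by the base classifiers trained in that iteration, they can serve as the hill-climb set without reserving a separate validation split.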
6.3 Algorithms related to Bagging Ensemble Selection
The data mining community has always faced the problem of constructing
ensemble classifiers with the best predictive performance on practical problems.
Ensemble methods are generally more stable and accurate than individual
classifiers. The mathematical expression used in [9] to illustrate the idea of
ensemble learning is as follows: let y be an instance and n_i, i = 1...k, a set of base
classifiers that output probability distributions n_i(y, d_j) for each class label d_j, j = 1...n.
The final classifier ensemble output x(y) for instance y is shown in equation 6.1,
where w_i is the weight of base classifier n_i.
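A minimal sketch of such a weighted combination is given below. It assumes the usual weighted-vote form, in which the ensemble outputs the class label d_j with the largest weighted sum of n_i(y, d_j) over all base classifiers; this form is an assumption made for illustration. The class name `WeightedVote` and the example numbers are hypothetical.

```java
/** Combines per-classifier class distributions into one prediction by a weighted vote. */
final class WeightedVote {

    /**
     * @param dists   dists[i][j] is n_i(y, d_j): classifier i's probability for class j
     * @param weights weights[i] is w_i: the weight of classifier i
     * @return index j of the class label with the largest weighted score
     */
    static int predict(double[][] dists, double[] weights) {
        int numClasses = dists[0].length;
        double[] score = new double[numClasses];
        for (int i = 0; i < dists.length; i++) {
            for (int j = 0; j < numClasses; j++) {
                score[j] += weights[i] * dists[i][j];
            }
        }
        int best = 0;
        for (int j = 1; j < numClasses; j++) {
            if (score[j] > score[best]) {
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three base classifiers, two class labels (0 = normal, 1 = attack).
        double[][] dists = { {0.9, 0.1}, {0.4, 0.6}, {0.3, 0.7} };
        System.out.println("equal weights   -> class " + predict(dists, new double[] {1, 1, 1}));
        System.out.println("unequal weights -> class " + predict(dists, new double[] {1, 2, 2}));
    }
}
```

With equal weights the confident first classifier decides the outcome; giving the other two classifiers twice the weight reverses the decision, which is the effect the weights w_i are meant to have.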
Many ensemble methods have been proposed since the mid-1990s. The bagging
method exploits the instability of base classifiers to improve the
predictive performance of such unstable base classifiers. The basic idea is that,
given a training set T of size m and a classifier C, bagging generates n new training
sets T_i, each of size m′ ≤ m, by sampling from T with replacement. Then, bagging applies C to each T_i to
build n models. The final output of bagging is determined by simple voting
over the n models.
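The voting step can be sketched as a plain majority vote over the class labels predicted by the n models. The sketch below assumes class labels are encoded as integers 0..numClasses-1 and breaks ties in favour of the lower label; both choices are illustrative assumptions.

```java
/** Simple (unweighted) majority vote over the labels predicted by the bagged models. */
final class MajorityVote {

    /** Returns the label predicted by most models; ties go to the lower label index. */
    static int vote(int[] predictions, int numClasses) {
        int[] counts = new int[numClasses];
        for (int p : predictions) {
            counts[p]++;
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] predictions = {1, 0, 1, 1, 0}; // labels predicted by five bagged models
        System.out.println("bagged prediction: " + vote(predictions, 2));
    }
}
```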
AdaBoost (Adaptive Boosting) is a popular ensemble algorithm that uses an
iterative process to improve on the simple boosting algorithm [6]. This algorithm gives
more focus to patterns that are difficult to classify. P-AdaBoost is a
distributed version of AdaBoost that was introduced later. P-AdaBoost works in two
phases. In the first phase, it runs in sequential fashion, like AdaBoost, for a bounded
number of steps. In the second phase, the classifiers are trained in parallel using
weights that are estimated from the first phase.
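How AdaBoost shifts its focus towards hard patterns can be illustrated with one round of instance reweighting. The sketch below assumes the standard two-class AdaBoost update (it is not taken from [6]): with weighted error eps, the classifier weight is alpha = 0.5 ln((1 - eps) / eps), misclassified instances are scaled by e^alpha, correctly classified ones by e^-alpha, and the weights are renormalized. The class name is hypothetical.

```java
/** One round of the standard two-class AdaBoost instance reweighting. */
final class AdaBoostReweight {

    /**
     * @param w             current instance weights, assumed to sum to 1
     * @param misclassified true where the current weak classifier was wrong
     * @return normalized weights for the next round (assumes 0 < weighted error < 0.5)
     */
    static double[] update(double[] w, boolean[] misclassified) {
        double eps = 0.0; // weighted error of the current weak classifier
        for (int i = 0; i < w.length; i++) {
            if (misclassified[i]) {
                eps += w[i];
            }
        }
        double alpha = 0.5 * Math.log((1.0 - eps) / eps);
        double[] next = new double[w.length];
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) {
            next[i] = w[i] * Math.exp(misclassified[i] ? alpha : -alpha);
            sum += next[i];
        }
        for (int i = 0; i < next.length; i++) {
            next[i] /= sum;
        }
        return next;
    }

    public static void main(String[] args) {
        double[] w = {0.25, 0.25, 0.25, 0.25};
        boolean[] wrong = {true, false, false, false};
        System.out.println(java.util.Arrays.toString(update(w, wrong)));
    }
}
```

After the update, the misclassified instances together carry half of the total weight, so the next weak classifier is pushed to get them right.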
The method of constructing ensembles from a library of base classifiers is
called ensemble selection. Initially, different machine learning algorithms are used
to build the base models. Well-performing subsets of all models are then extracted
using a construction strategy such as forward stepwise selection, guided by some scoring
function. The simple forward model selection procedure proposed in [8] works as
follows: (1) initialize with an empty ensemble; (2) add to the ensemble the model in the
library that maximizes the ensemble's performance with respect to the error metric on a
hill-climb set; (3) repeat step 2 until all models have been examined; (4) return the
subset of models that yields the maximum performance on the hill-climb set.
The advantage of ensemble selection is that it can be optimized for many
common performance metrics or a combination of metrics.
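The four steps of simple forward selection can be sketched as follows. The sketch makes several assumptions that are not taken from [8]: each model is represented only by its predicted scores on the hill-climb set, the ensemble prediction is the plain average of its members, and the error metric is RMSE (lower is better). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Greedy forward ensemble selection on a hill-climb set, scored by RMSE. */
final class ForwardSelection {

    /** RMSE between the averaged predictions of the given members and the targets. */
    static double rmse(double[][] preds, List<Integer> members, double[] target) {
        double sumSq = 0.0;
        for (int r = 0; r < target.length; r++) {
            double mean = 0.0;
            for (int m : members) {
                mean += preds[m][r];
            }
            mean /= members.size();
            double err = mean - target[r];
            sumSq += err * err;
        }
        return Math.sqrt(sumSq / target.length);
    }

    /** preds[m][r] is model m's score for hill-climb row r. Returns the chosen model indices. */
    static List<Integer> select(double[][] preds, double[] target) {
        List<Integer> ensemble = new ArrayList<>();        // step 1: start empty
        List<Integer> best = new ArrayList<>();
        double bestScore = Double.POSITIVE_INFINITY;
        boolean[] used = new boolean[preds.length];
        for (int step = 0; step < preds.length; step++) {   // step 3: until all models examined
            int pick = -1;
            double pickScore = Double.POSITIVE_INFINITY;
            for (int m = 0; m < preds.length; m++) {        // step 2: try each remaining model
                if (used[m]) {
                    continue;
                }
                ensemble.add(m);
                double score = rmse(preds, ensemble, target);
                ensemble.remove(ensemble.size() - 1);       // undo the trial addition
                if (score < pickScore) {
                    pickScore = score;
                    pick = m;
                }
            }
            used[pick] = true;
            ensemble.add(pick);
            if (pickScore < bestScore) {                    // step 4: remember the best subset seen
                bestScore = pickScore;
                best = new ArrayList<>(ensemble);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] target = {1, 0, 1, 0};
        double[][] preds = {
            {1.0, 0.4, 1.0, 0.4},   // model 0: tends to score too high
            {0.7, 0.0, 0.7, 0.0},   // model 1: tends to score too low
            {0.5, 0.5, 0.5, 0.5}    // model 2: uninformative
        };
        System.out.println("selected models: " + select(preds, target));
    }
}
```

In this toy library the two complementary models are kept and the uninformative one is dropped, because adding it raises the hill-climb error.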
6.4 Classifiers for performance comparison
The following classifier algorithms have been implemented for the
performance comparison on the datasets mentioned in 6.1.
• OneR
• HoeffdingTree
• DecisionStump
6.4.1 OneR
OneR, short for "One Rule", is a simple yet accurate classification algorithm that
generates one rule for every predictor in the data and then selects the rule with the
smallest total error as its "one rule". To create a rule for a predictor, OneR constructs a
frequency table for that predictor against the target. It has been shown that OneR
produces rules only slightly less accurate than state-of-the-art classification algorithms
while producing rules that are easy for humans to interpret.
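The frequency-table procedure can be sketched as follows for nominal predictors. The sketch assumes that attribute values and class labels are encoded as small integers; the class name, the toy data and the attribute meanings are hypothetical and are not drawn from the data sets in 6.1.

```java
/** OneR sketch: one frequency table per predictor, keep the predictor with the fewest errors. */
final class OneRSketch {

    /** Counts how often each (attribute value, class) pair occurs for attribute attr. */
    static int[][] frequencyTable(int[][] x, int[] y, int attr, int numValues, int numClasses) {
        int[][] freq = new int[numValues][numClasses];
        for (int r = 0; r < x.length; r++) {
            freq[x[r][attr]][y[r]]++;
        }
        return freq;
    }

    /** Returns the rule for attr: the majority class for each attribute value. */
    static int[] rule(int[][] x, int[] y, int attr, int numValues, int numClasses) {
        int[][] freq = frequencyTable(x, y, attr, numValues, numClasses);
        int[] majority = new int[numValues];
        for (int v = 0; v < numValues; v++) {
            for (int c = 1; c < numClasses; c++) {
                if (freq[v][c] > freq[v][majority[v]]) {
                    majority[v] = c;
                }
            }
        }
        return majority;
    }

    /** Returns the index of the attribute whose rule makes the fewest training errors. */
    static int bestAttribute(int[][] x, int[] y, int numValues, int numClasses) {
        int best = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < x[0].length; a++) {
            int[][] freq = frequencyTable(x, y, a, numValues, numClasses);
            int errors = 0;
            for (int v = 0; v < numValues; v++) {
                int total = 0;
                int max = 0;
                for (int c = 0; c < numClasses; c++) {
                    total += freq[v][c];
                    max = Math.max(max, freq[v][c]);
                }
                errors += total - max; // everything outside the majority class is an error
            }
            if (errors < bestErrors) {
                bestErrors = errors;
                best = a;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: attribute 0 = protocol (0/1), attribute 1 = flag (0/1); label 1 = attack.
        int[][] x = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {0, 0}, {1, 1} };
        int[] y = {0, 1, 0, 1, 0, 1};
        int attr = bestAttribute(x, y, 2, 2);
        System.out.println("one rule uses attribute " + attr + ": "
                + java.util.Arrays.toString(rule(x, y, attr, 2, 2)));
    }
}
```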
6.4.2 HoeffdingTree
A Hoeffding tree [3] is an incremental, anytime decision tree induction
algorithm that is capable of learning from data streams, assuming that the distribution
generating the examples does not change over time. Hoeffding trees exploit the fact
that a small sample is often enough to choose the optimal splitting attribute.
This idea is supported mathematically by the Hoeffding bound, which quantifies the
number of observations required to estimate some statistic within a prescribed
precision. A theoretically appealing feature of Hoeffding trees not shared by other
incremental decision tree learners is that they have sound guarantees of performance. Using
the Hoeffding bound, one can show that their output is asymptotically nearly identical to that
of a non-incremental learner using infinitely many examples.
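The Hoeffding bound states that, with probability 1 - delta, the true mean of a random variable with range R lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the mean observed over n independent observations. The sketch below evaluates this bound and its inverse; the class name and the example parameter values are illustrative assumptions.

```java
/** Hoeffding bound: precision reached after n observations, and observations needed for a precision. */
final class HoeffdingBound {

    /** epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Smallest n for which the bound guarantees a precision of at most epsilon. */
    static long requiredObservations(double range, double delta, double epsilon) {
        return (long) Math.ceil(range * range * Math.log(1.0 / delta) / (2.0 * epsilon * epsilon));
    }

    public static void main(String[] args) {
        double range = 1.0;   // e.g. information gain with two classes has range log2(2) = 1
        double delta = 1e-7;  // allowed probability that the bound fails
        for (long n : new long[] {100, 1_000, 10_000}) {
            System.out.printf("n = %6d -> epsilon = %.4f%n", n, epsilon(range, delta, n));
        }
        System.out.println("n needed for epsilon 0.1: " + requiredObservations(range, delta, 0.1));
    }
}
```

A Hoeffding tree splits a leaf once the observed difference between the two best attributes exceeds epsilon, which is why a modest number of examples is often enough to commit to a split.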
6.4.3 DecisionStump
A decision stump [19] is a machine learning model consisting of a one-level
decision tree. That is, it is a decision tree with one internal node (the root) that is
immediately connected to the terminal nodes (its leaves). A decision stump makes a
prediction based on the value of just a single input feature. Decision stumps are also
known as 1-rules. They are often used as base learners in machine learning ensemble
techniques such as boosting and bagging. For example, the Viola–Jones face detection
algorithm employs AdaBoost with decision stumps as weak learners.
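A decision stump on a single numeric feature can be sketched as a threshold test. The sketch below assumes binary labels 0 and 1 and picks the threshold and orientation with the fewest training errors; this is one simple fitting strategy chosen for illustration, not the procedure of any particular toolkit. All names and data are hypothetical.

```java
/** Decision stump on one numeric feature: a single threshold test with two leaves. */
final class DecisionStumpSketch {
    final double threshold;
    final int labelAbove; // class predicted when x > threshold; the other class otherwise

    DecisionStumpSketch(double threshold, int labelAbove) {
        this.threshold = threshold;
        this.labelAbove = labelAbove;
    }

    int predict(double x) {
        return x > threshold ? labelAbove : 1 - labelAbove;
    }

    /** Tries every observed value as threshold, in both orientations, and keeps the best. */
    static DecisionStumpSketch fit(double[] x, int[] y) {
        DecisionStumpSketch best = null;
        int bestErrors = Integer.MAX_VALUE;
        for (double t : x) {
            for (int labelAbove = 0; labelAbove <= 1; labelAbove++) {
                DecisionStumpSketch stump = new DecisionStumpSketch(t, labelAbove);
                int errors = 0;
                for (int r = 0; r < x.length; r++) {
                    if (stump.predict(x[r]) != y[r]) {
                        errors++;
                    }
                }
                if (errors < bestErrors) {
                    bestErrors = errors;
                    best = stump;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 10, 11, 12}; // one numeric feature
        int[] y = {0, 0, 0, 1, 1, 1};       // 1 = attack
        DecisionStumpSketch stump = fit(x, y);
        System.out.println("predict 1 if x > " + stump.threshold + ", label above = " + stump.labelAbove);
    }
}
```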
6.5 Classifier performance measures
A confusion matrix contains information about the actual and predicted
classifications made by a classification system. The performance of such systems is often
evaluated using the data in the matrix. Fig.6.1 shows the confusion matrix.
                     Predicted negative    Predicted positive
Actual negative              a                     b
Actual positive              c                     d
Fig.6.1: confusion matrix
The entries in the confusion matrix have the following meaning in the
context of our study:
• a is the number of correct predictions that an instance is negative,
• b is the number of incorrect predictions that an instance is positive,
• c is the number of incorrect predictions that an instance is negative, and
• d is the number of correct predictions that an instance is positive.
The following metrics are used for the evaluation on the data sets:
• Accuracy: Accuracy is the proportion of the total number of predictions
that were correct. It is determined using the equation:
Accuracy = (a + d) / (a + b + c + d)
• Detection Rate: Detection Rate is the proportion of the predicted positive cases
that were correct, as calculated using the equation:
Detection Rate = d / (b + d)
• False Alarm Rate: False Alarm Rate is the proportion of negative cases that
were incorrectly classified as positive, as calculated using the equation:
False Alarm Rate = b / (a + b)
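The three measures can be computed directly from the four confusion matrix entries. The sketch below follows the equations exactly as defined above; the class name is hypothetical and the counts in the example are made up for illustration and are not experimental results.

```java
/** Accuracy, Detection Rate and False Alarm Rate from confusion matrix entries a, b, c, d. */
final class ConfusionMetrics {

    /** (a + d) / (a + b + c + d) */
    static double accuracy(long a, long b, long c, long d) {
        return (double) (a + d) / (a + b + c + d);
    }

    /** d / (b + d): proportion of predicted positives that were correct. */
    static double detectionRate(long b, long d) {
        return (double) d / (b + d);
    }

    /** b / (a + b): proportion of negatives incorrectly classified as positive. */
    static double falseAlarmRate(long a, long b) {
        return (double) b / (a + b);
    }

    public static void main(String[] args) {
        long a = 50, b = 10, c = 5, d = 35; // illustrative counts only
        System.out.printf("Accuracy         = %.4f%n", accuracy(a, b, c, d));
        System.out.printf("Detection Rate   = %.4f%n", detectionRate(b, d));
        System.out.printf("False Alarm Rate = %.4f%n", falseAlarmRate(a, b));
    }
}
```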
The classifier algorithms were simulated using Weka. Fig.7.1 shows the
Weka GUI Chooser.
Fig.7.1: Weka GUI chooser
Fig.7.2 shows the Command Line Interface of Weka, from which the
operations can be initiated.
Fig.7.2: Command Line Interface of Weka
Fig.7.3 shows the performance comparison of the classifier algorithms on the
NSL-KDD data set.
Fig.7.3: Performance comparison on NSL-KDD data set
Fig.7.4 shows the performance comparison of the classifier algorithms on the
Honeypot data set.
Fig.7.4: Performance comparison on Honeypot data set
Fig.7.5 shows the performance comparison of the classifier algorithms on the
Ping flood data set.
Fig.7.5: Performance comparison on Ping flood data set
Fig.7.6 shows the performance comparison of the classifier algorithms on the
UDP flood data set.
Fig.7.6: Performance comparison on UDP flood data set
8.1 Conclusion
The proposed Bagging Ensemble Selection classifier has been tested using the
data sets described in 6.1. The classification measures Accuracy, Detection Rate and
False Alarm Rate were computed by simulation using the Weka toolkit and compared
across classifiers. The experimental results show that Bagging
Ensemble Selection gives the best performance in terms of Accuracy, Detection Rate
and False Alarm Rate on the NSL-KDD data set and on the Honeypot data set. On the
ping flood data set, Bagging Ensemble Selection falls behind the Hoeffding Tree,
which gives the best performance in terms of Accuracy, Detection Rate and False
Alarm Rate. On the UDP flood data set, Bagging Ensemble Selection and Hoeffding
Tree both give the best performance in terms of Accuracy, Detection Rate and False
Alarm Rate.
8.2 Future Enhancement
As future work, the proposed method may be extended to streaming data.
The proposed method may be tested with more real data sets and more performance
measures may be taken into consideration.
import java.util.Enumeration;
import java.util.Vector;

import weka.classifiers.AbstractClassifier;
import weka.classifiers.trees.REPTree;
import weka.core.AdditionalMeasureProducer;
import weka.core.Capabilities;
import weka.core.Instance;
import weka.core.Option;
import weka.core.OptionHandler;
import weka.core.Randomizable;
import weka.core.SelectedTag;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformationHandler;
import weka.core.WeightedInstancesHandler;

// BESoob, BESHelper and BESDirectHillclimbingES are assumed to be in the same package.
public class BESTrees
    extends AbstractClassifier
    implements OptionHandler, Randomizable, WeightedInstancesHandler,
    AdditionalMeasureProducer, TechnicalInformationHandler {

  /** The underlying bagging ensemble selection model. */
  private BESoob m_BES = new BESoob();
  protected boolean m_UseNNLS = false;
  /** Number of bags. */
  protected int m_NumBags = 10;
  /** Number of trees built per bag. */
  protected int m_NumTreesPerBag = 10;
  protected int m_NumFeatures = 0;
  /** Seed for the random number generator. */
  protected int m_RandomSeed = 1;
  protected int m_KValue = 0;
  protected int m_FinalEnsembleSize = -1;
  /** Number of execution slots. */
  protected int m_Slots = 1;
  /** Maximum tree depth; -1 means unlimited. */
  protected int m_MaxDepth = -1;
  protected int m_NumFolds = 3;
  /** Metric optimized during hill climbing. */
  protected int m_HillclimbMetric = BESHelper.METRIC_RMSE;

  public SelectedTag getHillclimbMetricMethod() {
    // ... (method body not shown in this listing)
  }

  public void setHillclimbMetricMethod(SelectedTag method) {
    if (method.getTags() == BESDirectHillclimbingES.TAGS_METRIC) {
      m_HillclimbMetric = method.getSelectedTag().getID();
    }
  }

  public void setNumExecutionSlots(int numSlots) {
    m_Slots = numSlots;
  }

  public int getNumExecutionSlots() {
    return m_Slots;
  }

  public int getNumFeatures() {
    return m_NumFeatures;
  }

  public void setNumFeatures(int newNumFeatures) {
    m_NumFeatures = newNumFeatures;
  }

  public int getMaxDepth() {
    return m_MaxDepth;
  }

  /** Any non-positive value is stored as -1 (unlimited depth). */
  public void setMaxDepth(int value) {
    m_MaxDepth = value;
    if (m_MaxDepth <= 0) {
      m_MaxDepth = -1;
    }
  }

  public Capabilities getCapabilities() {
    return new REPTree().getCapabilities();
  }

  public void setSeed(int seed) {
    m_RandomSeed = seed;
  }

  public int getSeed() {
    return m_RandomSeed;
  }

  public int getFinalEnsembleSize() {
    if (m_UseNNLS) {
      return m_FinalEnsembleSize;
    } else {
      return m_NumBags;
    }
  }

  /** At least two trees per bag are required. */
  public void setNumTreesPerBag(int n) {
    if (n < 2) {
      m_NumTreesPerBag = 2;
    } else {
      m_NumTreesPerBag = n;
    }
  }

  public int getNumTreesPerBag() {
    return m_NumTreesPerBag;
  }

  public int getTotalNumberTrees() {
    return m_BES.getTotalNumberTrees();
  }

  /** At least one bag is required. */
  public void setNumBags(int n) {
    if (n < 1) {
      m_NumBags = 1;
    } else {
      m_NumBags = n;
    }
  }

  // ... (part of the listing is not shown here; the two statements below
  //      close a method whose beginning is missing)
    m_BES = besOOB;
    m_FinalEnsembleSize = m_BES.getFinalEnsembleSize();
  }

  public double[] distributionForInstance(Instance instance) throws Exception {
    return m_BES.distributionForInstance(instance);
  }

  public TechnicalInformation getTechnicalInformation() {
    TechnicalInformation result;
    result = new TechnicalInformation(Type.INPROCEEDINGS);
    result.setValue(Field.AUTHOR, "Quan Sun and Bernhard Pfahringer");
    result.setValue(Field.YEAR, "2012");
    result.setValue(Field.TITLE, "Bagging Ensemble Selection for Regression");
    result.setValue(Field.JOURNAL, "In Proceedings of the 25th Australasian Joint "
        + "Conference on Artificial Intelligence (AI'12)");
    result.setValue(Field.PAGES, "695-706");
    return result;
  }

  public Enumeration enumerateMeasures() {
    Vector newVector = new Vector(2);
    return newVector.elements();
  }

  public double getMeasure(String additionalMeasureName) {
    if (additionalMeasureName.equalsIgnoreCase("measureFinalEnsembleSize")) {
      return measureFinalEnsembleSize();
    } else if (additionalMeasureName.equalsIgnoreCase("measureFinalNumTrees")) {
      return measureFinalNumTrees();
    } else {
      throw new IllegalArgumentException(additionalMeasureName
          + " not supported (BESTrees)");
    }
  }

  public double measureFinalEnsembleSize() {
    return getFinalEnsembleSize();
  }

  public double measureFinalNumTrees() {
    return m_BES.getTotalNumberTrees();
  }

  public Enumeration listOptions() {
    Vector newVector = new Vector();
    newVector.addElement(new Option(
        "\tNumber of trees to build.",
        "I", 1, "-I <number of trees>"));
    newVector.addElement(new Option(
        "\tNumber of features to consider (<1=int(logM+1)).",
        "K", 1, "-K <number of features>"));
    newVector.addElement(new Option(
        "\tSeed for random number generator.\n" + "\t(default 1)",
        "S", 1, "-S"));
    newVector.addElement(new Option(
        "\tThe maximum depth of the trees, 0 for unlimited.\n" + "\t(default 0)",
        "depth", 1, "-depth <num>"));
    newVector.addElement(new Option(
        "\tSet the target metric"
        + " to use. 0 = Correlation Coefficient, 1 = RMSE, 2 = ROC, "
        + "3 = PRECISION, 4 = Recall, 5 = F score, 6 = All, "
        + "7 = Weighted TP, 8 = Mean Abs Error, 9 = Accuracy \n"
        + "\t(default 1 = RMSE)",
        "Z", 2, "-Z <target metric>"));
    // Append the options of the superclass.
    Enumeration enu = super.listOptions();
    while (enu.hasMoreElements()) {
      newVector.addElement(enu.nextElement());
    }
    return newVector.elements();
  }

  public String toString() {
    if (m_BES == null) {
      return "BESTrees: No model built yet.";
    }
    int totalNumTrees = getNumTreesPerBag() * getNumBags();
    String result = "The final BESTrees model has " + this.getTotalNumberTrees()
        + " trees (out of " + totalNumTrees + ") \n";
    return result;
  }

  public static void main(String[] argv) throws Exception {
    runClassifier(new BESTrees(), argv);
  }
}
A.2 List of publications
1. “Intrusion Detection System using Bagging Ensemble Selection,” 2015 IEEE
International Conference on Engineering and Technology (ICETECH),
20th March 2015, Coimbatore, TN, India.
2. “A Comprehensive Review on Intrusion Detection Systems,” CiiT International
Journal of Networking and Communication Engineering, Vol 6, No 9, 2014.
3. “Performance comparison of classification algorithms using Weka,”
International Journal of Advanced and Innovative Research, Vol 4, No 4, 2015.
