Big data analysis using machine learning techniques

International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 4, August 2020, pp. 3811~3818
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i4.pp3811-3818
Journal homepage: http://ijece.iaescore.com/index.php/IJECE
Performance evaluation of MapReduce jar, Pig, Hive, and Spark
with machine learning using big data
Santosh Kumar J.1, Raghavendra B. K.2, Raghavendra S.3, Meenakshi4
1Department of Computer Science and Engineering, KSSEM, Bangalore, Affiliated to VTU Belagavi, India
2Department of Computer Science and Engineering, BGSIT (ACU) Deemed to be University, India
3Department of Computer Science and Engineering, Christ Deemed to be University, India
4Department of Computer Science and Engineering, Jain Deemed to be University, India
Article Info
Article history:
Received Mar 9, 2019
Revised Feb 1, 2020
Accepted Feb 19, 2020
ABSTRACT
Big data is a major challenge, as huge processing power and good algorithms are needed to support decision making. A Hadoop environment with Pig, Hive, machine learning, and the other Hadoop ecosystem components is required. The data comes from industries, from the many devices and sensors around us, and from social media sites. According to McKinsey, there will be a shortage of 15,000,000 big data professionals by the end of 2020. There are many technologies that address the problems of big data storage and processing, such as Apache Hadoop, Apache Spark, and Apache Kafka. Here we analyse the processing speed for 4 GB of data on CloudxLab with Hadoop MapReduce (varying the numbers of mappers and reducers), with Pig scripts and Hive queries, and in the Spark environment along with machine learning technology. From the results we can say that machine learning with Hadoop and Spark enhances processing performance; that Spark is better than Hadoop MapReduce, Pig, and Hive; and that Spark with Hive and machine learning gives the best performance compared with Pig, Hive, and the Hadoop MapReduce jar.
Keywords:
Cloudxlab
Flink
Hadoop
Hbase
HDFS
Hive
Map-reduce
Pig
Spark
Copyright © 2020 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Santosh Kumar J.,
Department of Computer Science and Engineering,
KSSEM Bengaluru VTU University,
Mallasandra, Kanakapura Road, Bangalore-560109, India.
Contact: +919035636616
Email: [email protected]
1. INTRODUCTION
Big data refers to data sets whose size is beyond the ability of typical database management tools to
capture, store, manage, and analyze. Cloud computing and big data, two disruptive trends at present,
pose significant influence on current IT industry and research communities. Cloud computing provides
massive computation power and storage capacity which enable users to deploy applications without
infrastructure investment. Integrated with cloud computing, data sets have become so large and complex that
it is a considerable challenge for traditional data processing tools to handle the analysis pipeline of these data.
Generally, such data sets come from various sources and in different varieties, such as unstructured social
media content and semi-structured medical records and business transactions; they are of large volume and
arrive at high velocity [1].
The MapReduce framework has been widely adopted by a large number of companies and
organizations to process huge volumes of data. Unlike the traditional MapReduce framework, the one
incorporated with cloud computing becomes more flexible, scalable, and cost-effective. A typical example is
the Amazon Elastic MapReduce (EMR) service. Users can invoke Amazon EMR to conduct their MapReduce
computations on the powerful infrastructure offered by Amazon Web Services and are charged in
proportion to the usage of the services. In this way, it is economical and convenient for companies and
organizations to capture, store, organize, share and analyze big data to gain competitive advantages.
MapReduce is currently a major big data processing paradigm. The authors noted that existing
performance models for MapReduce only comply with specific workloads that process a small fraction of
the entire data set, thus failing to assess the capabilities of the MapReduce paradigm under heavy workloads
that process exponentially increasing data volumes. The authors discussed building and analyzing
a scalable and dynamic big data processing system, including storage, execution engine, and query language.
They mainly concentrated on the design and implementation of a resource management system, the design
and implementation of a benchmarking tool for the MapReduce processing system, and the evaluation and
modeling of MapReduce using workloads with very large data sets [2].
Spark is a framework up to 100 times faster than MapReduce with HDFS for storage and processing.
Like other Java-based frameworks, it is built on top of the OS to utilize memory and the other resources of
the CPU efficiently, and it is designed particularly for big data processing. Spark has both advantages and
disadvantages: efficient memory management is one of its disadvantages, whereas its speed in processing
big data is an advantage compared with the MapReduce framework and HDFS of Hadoop.
Flink is also a framework that works with all components of the Hadoop ecosystem. Flink is a framework for
streaming data; its latency in processing big data is much lower than Spark's. Flink has many advantages:
it processes data with very low latency, and it also addresses the memory-exception problem. Flink can
interact with many systems that have different storage back-ends, and it optimizes the program before
execution.
Big data processing technologies such as Hadoop MapReduce, Flink, and Spark, along with their caching,
data processing engines, and schedulers, are compared in Figure 1. Data processing steps such as data
understanding, data exploration, and data modeling are shown in Figure 2. Big data ecosystem components
such as Pig, Hive, Spark, Ambari, ZooKeeper, MLlib, HBase, and many more are shown in Figure 3.
Figure 1. Big data processing technology comparison
Figure 2. Data analysis processing steps
Figure 3. Ecosystem components of big data
2. LITERATURE REVIEW
Many authors have described Apache Hadoop as a framework for processing large
distributed data sets across a cluster of computers, and have discussed scaling the cluster. Because sensors
across all devices and the network tools of organizations generate big data, organizations want to store and
analyze it without investing much in managing and servicing the storage and processing; by deploying
everything on the cloud, cloud providers take care of the infrastructure, and the companies can use the data
for analysis and extract useful knowledge from it. MapReduce is the framework that allows large data to be
stored across many devices and processed by them: map functions distribute and store the data across
the devices, while reduce functions process the client's query; it works on the basis of key-value pairs.
Each line is treated as a key and a value, that is, the first word is the key and the rest is the value. Whenever
a client requests to process large data, the client first approaches the name node, which responds with the
available free nodes; after that, the client's mapper functions write data to the respective data nodes.
When the client wants to process the data, it sends a request to the name node's job tracker; the job tracker
communicates with the name node to get the data storage information and then assigns jobs to task trackers,
which process the tasks on their locally available data; finally, one of the nodes aggregates the results and
returns them to the client [3].
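The key-value flow just described can be sketched in a few lines of plain Python, independent of any Hadoop cluster; the three functions mirror the map, shuffle, and reduce stages (function names are illustrative):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values under their key,
    # as Hadoop does between the map and reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data is data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster each stage runs in parallel across data nodes; the local chain above only illustrates the data flow.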
Regarding Hadoop's optimization framework for MapReduce clusters, the authors state that the most
widely used framework for developing MapReduce-based applications is Apache Hadoop. However,
developers face a number of challenges in the Hadoop framework, which complicate the management of
resources in the MapReduce cluster that would optimize the performance of MapReduce applications
running on it. The constraints lie in the resource allocation process of the MapReduce programming model
for large-scale data processing. A novel dynamic technique was proposed to speed up the use of the available
resources. It contains two major operations: slot utilization optimization and utilization efficiency
optimization. The dynamic technique has three slot allocation techniques: dynamic Hadoop slot allocation,
speculative-execution performance balancing, and slot pre-scheduling. It achieves a performance speedup
over the recently proposed cost-based optimization approach, and the performance benefit increases with
input data set size [4].
In the performance evaluation of Hadoop and Oracle platforms for distributed parallel processing in
big data environments, the authors discussed reducing data center implementation cost by using commodity
hardware to provide high-performance computing, and the distributed processing of large data sets across
clusters of computers using distributed and parallel computing architectures. They also compared the
performance of a distributed parallel computing system with a traditional single-computer system.
Towards an optimized big data processing system, the authors discussed a resource management system for
MapReduce-based processing, for deploying and resizing MapReduce clusters; a benchmarking tool for the
MapReduce processing system; the evaluation and modeling of MapReduce using workloads with very large
data sets; and optimizing the MapReduce system to efficiently process terabytes of data. In an overview of
performance testing approaches in big data, the authors stated that many
organizations face challenges in devising test strategies for structured and unstructured data validation,
setting up an optimal test environment, working with non-relational databases, and performing
non-functional testing. These challenges cause poor data quality in production, delays in implementation,
and increased cost. MapReduce provides a parallel and scalable programming model for data-intensive
business and scientific applications, with the goal of obtaining the actual performance of big data
applications, such as response time, maximum online user data capacity, and maximum processing
capacity [5]. The authors discussed big data and cloud computing management appliances and the processing
problems of big data, with reference to cloud computing, cloud databases, cloud architecture, and MapReduce
optimization techniques [6]. The authors discussed resource management for mapper- and reducer-based
application processing, deploying and resizing MapReduce clusters, benchmarking applications and tools
for MapReduce processing, extending MapReduce performance using workloads with big data, optimizing
MapReduce to process terabytes of data proficiently, and cost optimizations for workflows in the
cloud [7, 8]. The authors discussed software to expand the scalability of data analytics, and challenges such
as availability, partitioning, virtualization, scalability, distribution, elasticity, and performance bottlenecks
in managing big data [9]. The authors benchmarked several high-performance computing (HPC)
architectures for data; name node and data node architectures with large memory and bandwidth are better
suited for big data analytics on HPC hardware; budget-driven scheduling algorithms for batches of
MapReduce jobs in heterogeneous clouds were also discussed [10, 11]. MapReduce provides a parallel and
scalable programming model for data-intensive business and scientific applications, used to obtain the actual
performance of big data applications, such as response time, maximum online user data capacity, and
maximum processing capacity [12]. In another paper, the authors discussed parallel processing
techniques [13]. Other authors discussed performance issues with cloud and big data [14]. The authors
described testing techniques and performance enhancement parameters [15], and other authors discussed
the performance of Hadoop on multicore architectures [16]. The authors discussed how machine learning
techniques with Hadoop may enhance performance [17]. The authors described Hadoop self-tuning of
mappers and reducers with machine learning, cluster architectures, and the optimization of big data
performance parameters [18, 19]. The authors compared performance on Oracle and Hadoop and found that
Hadoop enhances performance [20]. The authors discussed MapReduce execution time for the Big.txt input
file with the CloudxLab Hadoop big data framework [21], and MapReduce execution time for the Ramayana
text input file with the CloudxLab Hadoop big data framework [22]. The authors discussed how AWS
cost-based optimization of MapReduce programs may enhance performance [23], and how efficient
utilization of mappers and reducers may enhance performance [24]. The authors discussed resource-aware
adaptive scheduling for MapReduce clusters [25], and the performance of Pig, Hive, and the Hadoop jar
file [26].
3. RESULTS AND DISCUSSION
Figure 4 shows the MapReduce architectural framework for the word count program: a huge input file is
split into blocks of pages, each page is split into lines, and each line is split into words by spaces; the words
are then shuffled across the data nodes' mappers to count the occurrences of each word on each data node,
and finally the reducers combine the results from the data nodes. The character count job was run in
CloudxLab as:

hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-streaming.jar -input /data/mr/wordcount/input -output letter_count -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py

Table 1 shows the output of the character count job, which reads the input file, calculates the number of
occurrences of each character, and stores the result in the output file. Figures 5-8 show the execution times
of the word count program as a Pig script and as Hive queries. First, we create a table called doc, then load
the input file, and then run the word count query, which executes in 14 seconds, for a total of 20 seconds
(14 sec + 6 sec) to execute the word count program for the input file with Hive. Pig takes a total of
36 sec + 16 sec = 52 sec to execute the word count program for the same input file. Table 1 shows the
characters and their counts on MapReduce Hadoop after execution.
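The mapper.py and reducer.py scripts passed to hadoop-streaming.jar are not listed in the paper; a minimal character-count pair following the streaming protocol (tab-separated key/value lines on stdin/stdout) might look like the sketch below. In a real job each script runs separately over sys.stdin; here the two stages are chained in-process for a quick local check:

```python
import sys

def mapper(lines):
    # Emit "character<TAB>1" for every alphabetic character, lower-cased
    for line in lines:
        for ch in line.lower():
            if ch.isalpha():
                yield f"{ch}\t1"

def reducer(sorted_lines):
    # Hadoop streaming delivers mapper output sorted by key, so counts
    # for the same character arrive consecutively and can be summed
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Local check: sorted() stands in for Hadoop's shuffle-and-sort
    for out in reducer(sorted(mapper(["Big data", "bad cab"]))):
        print(out)
```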
The MapReduce framework for word count, shown in Figure 4, divides the huge input data and gives it to
the mappers based on key-value pairs; after the shuffle, the reducers combine the mappers' results.
The word count program executed with Spark, Hive, and the machine learning query in 6 seconds, as shown
in Figure 5; with the Spark Hive query in 14 seconds, as shown in Figure 6; and in 16 seconds, as shown in
Figure 7. With the Pig query the execution time was 36 seconds, as shown in Figure 8.
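The Hive side of the word count comparison (create the doc table, load the input file, run the counting query) is not listed in the paper; it can be sketched with a query of the following form, where the table name and input path are illustrative:

```sql
-- Create a one-column staging table and load the input file (path is illustrative)
CREATE TABLE doc (line STRING);
LOAD DATA INPATH '/data/wordcount/input.txt' OVERWRITE INTO TABLE doc;

-- Split each line on whitespace, explode to one word per row, then count
SELECT word, COUNT(1) AS count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM doc) w
GROUP BY word;
```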
Table 1. Character Count output
Character and its Count Character and its Count
a. 08096 n. 369018
b. 73168 o. 386867
c. 144974 p. 98913
d. 215706 q. 4571
e. 633821 r. 309558
f. 120875 s. 334901
g. 96916 t. 460748
h. 294683 u. 138732
i. 365641 v. 52378
j. 6436 w. 100831
k. 32798 x. 9810
l. 198648 y. 90481
m. 127063 z. 3796
Figure 4. Map Reduce framework for word count
Figure 5. Hive Query execution time 6 sec for input file
Figure 6. Hive Query execution time 14 Sec for input file
Figure 7. Word count program execution time 16 Sec for input file
Figure 8. Word count program execution time 36 Sec for the input file
4. CONCLUSION
Hadoop is a software framework for processing data of high variety, volume, and velocity. Companies like
Google, Yahoo, and Amazon have their own frameworks for processing big data, and they also provide
cloud-based big data ecosystem infrastructure to store (using HDFS) and process (using MapReduce) big
data. From the above results, the Hive query execution time is 20 seconds, whereas the Pig script execution
time is 52 seconds for the same input file without machine learning; with machine learning, the time is
enhanced to 16 seconds using the combination of ML and Spark with Hive. We can also say that, for the
word count program on the given input file, Hive is better than Pig and enhances the execution time. From
the above results we may state that machine learning and Spark with Hive give better performance than
Hadoop MapReduce, Pig, Spark, and Flink.
ACKNOWLEDGEMENTS
I would like to express my deep gratitude to the Principal, HoD, and the staff of the Computer Science and
Engineering department of KSSEM, Bangalore, for supporting me in this research work.
REFERENCES
[1] Md. Armanur Rahman, J. Hossen, “A Survey of Machine Learning Techniques for Self-tuning Hadoop
Performance,” International Journal of Electrical and Computer Engineering (IJECE), vol. 8, no. 3,
pp. 1854-1862, 2018.
[2] Aman Lodha, “Hadoop’s Optimization Framework for Map Reduce Clusters,” Imperial Journal of
Interdisciplinary Research (IJIR), vol. 3, no. 4, pp. 1648-1650, 2017.
[3] Dan Wang, Jiangchuan Liu, “Optimizing Big Data Processing Performance in the Public Cloud: Opportunities and
Approaches,” IEEE Network, September/October 2015.
[4] A. K. M. Mahbubul Hossen, A. B. M. Moniruzzaman et al., “Performance Evaluation of Hadoop and Oracle
Platform for Distributed Parallel Processing in Big Data Environments,” International Journal of Database Theory
and Application, vol. 8, no. 5, pp. 15-26, 2015.
[5] Changqing Ji, Yu Li, Wenming Qiu et al., “Big Data Processing in Cloud Computing environments,” International
Symposium on Pervasive Systems, Algorithms and Networks, 2012.
[6] Aman Lodha, “Hadoop’s Optimization Framework for Map Reduce Clusters,” Imperial Journal of Interdisciplinary
Research (IJIR), vol. 3, no. 4, 2017.
[7] Dan Wang, Jiangchuan Liu, “Optimizing Big Data Processing Performance in the Public Cloud: Opportunities and
Approaches,” IEEE Network, September/October 2015.
[8] C. Zhou, B. S. He, “Transformation-based Monetary Cost Optimizations for Workflows in the Cloud,” IEEE
Transactions on Cloud Computing, 2014.
[9] A. K. M. Mahbubul Hossen, A. B. M. Moniruzzaman et al., “Performance Evaluation of Hadoop and Oracle
Platform for Distributed Parallel Processing in Big Data Environments,” International Journal of Database Theory
and Application, vol. 8, no. 5, pp. 15-26, 2015.
[10] Changqing Ji, Yu Li, Wenming Qiu et al., “Big Data Processing in Cloud Computing environments,” International
Symposium on Pervasive Systems, Algorithms and Networks, 2012.
[11] Y. Wang, W. Shi, “Budget-Driven Scheduling Algorithms for Batches of MapReduce Jobs in Heterogeneous
Clouds,” IEEE Transactions on Cloud Computing, 2014.
[12] Bogdan Ghiţ et al., “Towards an Optimized Big Data Processing System,” 13th IEEE/ACM International
Symposium on Cluster, Cloud, and Grid Computing, 2013.
[13] Kyong-Ha Lee et. al. “Parallel Data Processing with Map Reduce: A Survey,” SIGMOD Record, vol. 40, no. 4,
December 2011.
[14] Jaliya Ekanayake and Geoffrey Fox, “High Performance Parallel Computing with Clouds and Cloud Technologies,”
International Conference on Cloud Computing, 2009.
[15] Ashlesha S. Nagdive et al, “Overview on Performance Testing Approach in Big Data,” International Journal of
Advanced Research in Computer Science, vol. 5, no. 8, Nov–Dec, pp. 165-169, 2014.
[16] Y. Zhang, “Optimized runtime systems for MapReduce applications in multi-core clusters,” Thesis, Rice
University, Texas. 2014. [Online]. Available: https://www.semanticscholar.org/paper/Optimized-Runtime-
Systems-for-MapReduce-in-Clusters-Zhang/10fa14d4c7846bf9b35e8507a8dcdcb7ee79a672
[17] Md. Armanur Rahman, and J. Hossen, “A Survey of Machine Learning Techniques for Self-tuning Hadoop
Performance,” International Journal of Electrical and Computer Engineering (IJECE), vol. 8, no. 3,
pp. 1854-1862, June 2018.
[18] Aman Lodha, “Hadoop’s Optimization Framework for Map Reduce Clusters,” Imperial Journal of
Interdisciplinary Research (IJIR), vol. 3, no. 4, pp. 1648-1650, 2017.
[19] Dan Wang, Jiangchuan Liu, “Optimizing Big Data Processing Performance in the Public Cloud: Opportunities and
Approaches,” IEEE Network, September/October 2015.
[20] A. K. M. Mahbubul Hossen, A. B. M. Moniruzzaman et al., “Performance Evaluation of Hadoop and Oracle
Platform for Distributed Parallel Processing in Big Data Environments,” International Journal of Database Theory
and Application, vol. 8, no. 5, pp. 15-26, 2015.
[21] J. Santosh Kumar, S. Raghavendra, B. K. Raghavendra et al., “Big data Performance Evaluation of Map-Reduce
Pig and Hive,” International Journal of Engineering and Advanced Technology (IJEAT), vol. 8, no. 6, August 2019.
[22] J. Santosh Kumar, S. Raghavendra, B. K. Raghavendra et al., “Big data Processing Comparison using Pig and
Hive,” International Journal of Computer Science and Engineering (IJCSE), vol. 7, no. 3, March 2019.
[23] H. Herodotou and S. Babu, “Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs,”
Proc. of the VLDB Endowment, vol. 4, no. 11, 2011.
[24] Z. H. Guo, G. Fox, M. Zhou, Y. Ruan, “Improving Resource Utilization in MapReduce,” In IEEE Cluster’12,
pp. 402-410, 2012.
[25] J. Polo, C. Castillo, D. Carrera, et al., “Resource-aware Adaptive Scheduling for MapReduce Clusters,” In
Middleware’11, pp. 187-207, 2011.
[26] J. Santosh Kumar, S. Raghavendra, B. K. Raghavendra et al., “Big data Performance evaluation of mapreduce pig
hive,” International Journal of Engineering and Advanced Technology (IJEAT), vol. 8, no. 6, Aug. 2019.
BIOGRAPHIES OF AUTHORS
Santosh Kumar J. is currently working as Associate Professor in the Department of Computer
Science and Engineering at K.S. School of Engineering and Management, Bangalore, affiliated to
VTU Belagavi. He is pursuing a Ph.D. at VTU, Belgaum, India. He has 10 years of teaching and 3
years of industry experience. He is interested in big data streaming analysis, and his research topics
include big data with machine learning.
Dr. Raghavendra B. K. pursued his Ph.D. from Dr. MGR Educational & Research Institute,
Chennai, his Masters from PESCE, Mandya (Bengaluru University), and his Bachelors from GCE,
Ramanagara (Bengaluru University), Karnataka. He has published nearly 15 papers in reputed
journals, and his areas of interest are data mining and big data. He is currently working at BGSIT,
B G Nagar (ACU), Mandya, as Professor and Head of the Department of Computer Science and
Engineering.
Dr. Raghavendra S. is currently working as Associate Professor in the Department of Computer
Science and Engineering at Christ Deemed to be University, Bangalore. He completed his Ph.D.
degree in Computer Science and Engineering from VTU, Belgaum, India in 2017 and has 14 years
of teaching experience. His interests include data mining and big data.
Meenakshi is currently working as Assistant Professor in the Department of Computer Science
and Engineering at Jain Deemed to be University, Bangalore. She completed her masters from
VTU Belagavi and has 1 year of teaching experience. She is interested in big data streaming
analysis.

International Journal of Information Security
https://doi.org/10.1007/s10207-020-00508-5
REGULAR CONTRIBUTION
A novel scalable intrusion detection system based on deep learning
Soosan Naderi Mighan1 · Mohsen Kahani1
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
This paper successfully tackles the problem of processing a vast amount of security related data for the task of network intrusion
detection. It employs Apache Spark, as a big data processing tool, for processing a large size of network traffic data.
Also, we propose a hybrid scheme that combines the advantages of deep network and machine learning methods. Initially,
stacked autoencoder network is used for latent feature extraction, which is followed by several classification-based intrusion
detection methods, such as support vector machine, random forest, decision trees, and naive Bayes which are used for fast
and efficient detection of intrusions in massive network traffic data. The real-time UNB ISCX 2012 dataset is used to validate
our proposed method, and the performance is evaluated in terms of accuracy, f-measure, sensitivity, precision, and time.
Keywords Apache Spark · Stacked autoencoder · Latent · Accuracy · ISCX · Intrusion detection
1 Introduction
Nowadays, with the development of network technology,
computer security-related issues are becoming more crucial
and should be addressed accordingly. Intrusion detection
systems are key components of defense in the computer
security ecosystem. IDSs (Intrusion Detection Systems) are
based on the hypothesis that the behavior of intruders is different
from that of normal users [40].
In general, IDSs can be divided into three categories
based on their architectures: host based, network based,
and hybrid [5]. A host-based intrusion detection system is a
software application installed on a host computer which can
monitor and analyze the system behavior. Most of host IDSs
detect intrusion using system event log files [19].
Unlike host-based IDSs, which analyze each host individually,
network-based IDSs monitor the flow of packets
through the network. This type of IDS has the advantage,
over host-based IDSs, that it can monitor the whole
network with one system, which saves the time and cost of
installing software on every host.
Hybrid intrusion detection systems combine both types
of network and host-based IDSs with high flexibility and
better security mechanism. In fact, they combine IDS spatial
sensors to report attacks, which occur at a specific point or
across the entire network [5].
Further, intrusion detection systems can be divided into
three categories based on their detection methods: anomaly
detection, signature or misuse detection, and stateful protocol
analysis detection.
In signature-based detection, attacks’ signatures are
stored in the database. Whenever an intruder attempts to
penetrate, the IDS compares the attack signature with the
signatures stored in the database and generates an alert if it
detects a match. Anomaly detection is related to the modelling
of normal behavior and distinguishing the malicious
patterns from the normal ones. This type of detection, in
comparison with the signature detection methods, has the
advantage of detecting unknown attacks.
The stateful protocol detection methods compare the
observed events and identify the deviation from the state
of the protocol [5]. Several machine learning techniques
including neural networks, fuzzy logic [48], support vector
machines (SVMs) [31, 48] have been studied for the design
of IDSs. In particular, such techniques are developed as
classifiers, which are used to classify whether the incoming
network traffics are normal or are attacks.
Due to the increase in internet-based services, the size of
network traffic data has become too large to be processed
with traditional tools. Therefore, a fast and efficient intrusion
detection system is needed that can process large
and complex network data as fast as possible to detect an
intrusion [16].
* Mohsen Kahani
[email protected]
Soosan Naderi Mighan
[email protected]
1 Computer Department, Faculty of Engineering, Ferdowsi
University of Mashhad, Mashhad, Iran
It should be noted that deep learning is a subfield of
machine learning that can generalize to new problems
in which the data is complex and highly dimensional.
Besides, deep learning algorithms enable the
scalable training of nonlinear models on large datasets. This
is why deep learning performs well in the domain of network
intrusion detection: not only can it deal with
a large amount of data, but the model can also generalize
and remain effective in new network environments, ideally
being capable of generalizing to new forms of attacks [18].
In this paper, we focus on network-based IDSs that use
anomaly detection methods. We propose an intrusion detection
scheme using Apache Spark, one of the best processing
tools for big data, for fast detection of malicious traffic.
In the proposed system, we use a stacked autoencoder
network (SAE), followed by an SVM classifier; the SAE is used
as a latent feature extraction method. The efficiency of combining
deep learning methods and an SVM classifier is evaluated
using the UNB ISCX 2012 IDS dataset. The precision and
prediction time of our proposed scheme, for the classification
task, outperform those of a plain SVM.
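The SAE-then-SVM pipeline described above can be sketched with scikit-learn as a stand-in for the authors' Spark-based implementation; the synthetic data, the hidden-layer size, and the use of MLPRegressor as a one-layer autoencoder are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))           # stand-in for preprocessed flow features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for normal/attack labels

# "Autoencoder": train the network to reconstruct its own input through a
# narrow hidden layer, then keep the hidden activations as latent features
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ae.fit(X, X)
latent = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])  # ReLU hidden layer

# SVM classifier trained on the compressed latent representation
clf = SVC(kernel="rbf").fit(latent, y)
print(clf.score(latent, y))
```

The latent features (8 dimensions here instead of 20) are what make the downstream SVM cheaper to train and apply on large traffic volumes.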
The innovations of this research are summarized as
follows:
• Combining deep learning and machine learning in a big
data processing framework with scalability and distributability
– Using deep learning methods for training latent features
to improve processing time and cost
– Using machine learning algorithms compatible with
big data framework to meet the challenge of a large
amount of data
• Improving the machine learning algorithm for better
diagnostic accuracy and error rates
The rest of this paper is organized as follows: Section 2 presents
the related work. Machine learning algorithms used
in our work are briefly explained in Sect. 3. In Sect. 4, we
introduce our proposed method in three parts: (1) data preprocessing,
(2) latent feature extraction, and (3) attack detection.
The experimental results and metrics are presented in
Sect. 5. Section 6 contains the result analysis for all the
experiments done. Finally, the conclusion is presented in
Sect. 7.
2 Related work
There are three potential approaches that can be used, separately
or in combination, to improve intrusion detection systems:
feature selection and machine learning, intrusion
detection in a big data framework, and deep learning.
In this section, the issues relating to all three approaches
are discussed.
All the research studies in this part, except the papers
in [2, 7, 10], have used the KDD CUP 1999 dataset to test the
performance of their models.
2.1 Feature selection and machine learning
algorithms
Anomaly detection systems have the ability to identify new
attacks that differ from the normal activities. Once an attack
is detected, the system administrator could be informed so
that preventive or corrective actions can be taken. In order to
detect attacks, a number of machine learning methods have
been proposed. Although all of the methods can be used
individually to improve the performance of an IDS, it was
shown that the best possible accuracy and detection rate can
be achieved using hybrid learning approaches [28].
Classification methods in machine learning have two
stages of training and classification. In the training phase,
the distribution of the features is trained and, in the classification
stage, the trained features are applied and unconventional
behaviors are detected.
One of the most famous methods in machine learning is
the k-means clustering algorithm, which is widely used in
intrusion detection. The papers in [17, 28, 30] used this method, in
various ways, for attack detection.
In [28], a k-means clustering algorithm based on particle
swarm optimization was proposed. [17] improved the
k-means algorithm to identify unconventional behaviors.
First, it filtered out the outliers and error points in the
dataset, and then it calculated the distance between all the
data points and acquired the center of k clusters using a
dynamic and repetitive process.
The paper in [8] used data mining techniques and several
preprocessing methods, such as normalization, discretization,
and feature selection.
[7] used an incremental support vector machine (ISVM)
with a half-partition strategy. This paper selected
non-support vectors, called candidate support vectors
(CSVs), which, together with the support vectors, were
considered to be the next increment in this type of
classification. This paper used the Kyoto dataset and
compared its method with several other methods. The results
showed that the method has a high detection rate and a low error rate.
A novel scalable intrusion detection system based on deep learning
The paper in [6] used a multiple-detector set of an artificial
immune system to classify intrusions based on the features of
application-layer protocols in network data flows.
There are some studies in [2, 14, 24, 29, 41] that used
hybrid methods of intrusion detection. They claimed that
a combinational method can improve the performance. All
of them had two parts: feature selection and classification.
These techniques can reduce the training and testing
times and improve the accuracy, speed of convergence and
reliability.
2.2 Intrusion detection in big data framework
As mentioned earlier, the large size of traffic data has
become a big challenge for intrusion detection. All the
methods described above fail to successfully address the
challenge of big data. Big data characteristics have been
recognized by Gartner [26] in three Vs: volume, velocity,
and variety. Volume is about the amount of data. Velocity
is associated with the data processing speed. Variety is
related to the complexity of data. Others have added more
Vs, Veracity and Value, to the characteristics [49]. Veracity
is about the integrity and quality of data including things
such as noise or null value. Value is associated with the large
size of data [50].
Two studies have used machine learning algorithms in big
data framework. [11] used naïve bayes and k2 algorithms to
predict an attack and [25] used five classification algorithms,
such as logistic regression, support vector machine, random
forest, gradient boosted decision trees and naïve Bayes for
attack detection, and Apache Spark for high speed. Also,
[11] merged the two datasets of DARPA and KDD99 by
using big data tools and map-reduce paradigm before analyzing
the dataset with WEKA tool.
Also, there are three works that used hybrid methods for
feature selection and classification in the big data framework.
[16] used the Apache Spark big data processing tool and [16]
used the Hadoop and Spark frameworks for a large volume of
network traffic data.
[16] utilized two feature selection algorithms, i.e., correlation-based
feature selection and chi-squared feature reduction,
and five classification algorithms for attack detection,
i.e., logistic regression, support vector machine, random forest,
gradient boosted decision trees, and naïve Bayes. The
first algorithm for feature selection used a correlation-based
heuristic to take advantage of individual features in predicting
labels. The second algorithm for feature selection was
used to understand how the categorical variables differ from
each other in terms of their distributions. Two datasets were
used in this work, namely the DARPA KDD99 [20] and NSL-KDD
[42] datasets.
The paper in [34] proposed DSCA as a DPI-based stream
classification algorithm. The method utilized applications
detected by DPI as labels for incoming traffic flow. In this
paper, network traffic is fed into the feature extractor module.
An inline deep packet inspector then processes the flow packets
and, finally, the stream processor forwards flows and applications
to the stream classifier, where each flow is assigned a label as
a (flow, label) pair.
The proposed system in [35] has four layers of network
traffic sniffing, filtering and load balancing, processing, and
decision making. First, the required features were selected
using forward selection ranking (FSR) and backward elimination
ranking (BER) methods, and then it used some classification
methods, such as J48, REPTree, SVM and naïve
Bayes, for attack detection.
Another paper, [10], used rough set theory for attribute
subset selection and a scalable parallel genetic algorithm,
as well as a sequential GA, in the Hadoop framework in the
form of map-reduce to find the minimum rough set reduct.
They proposed a workflow which has two main parts: the
construction of a distinction table and the GA operations.
They divided the work among a set of nodes. Their method
consisted of three parts, drivers, mappers and reducers,
to minimize the overhead.
They evaluated the experiments on four cyber security
datasets, Spambase, NSL-KDD, Kyoto and CDMC2012,
and compared the results in terms of running time and reduct
size. Also, those experiments were repeated for the sequential
GA as well as the parallel GA. The results showed that the best
performance was obtained when four mappers were used. Also,
they reported a significant reduction in the execution time
when using the parallel GA with four reducers, while the
eight-reducer configuration ran for longer periods of time.
They also used two factors, the number of attributes and
the number of instances, that can affect the performance.
The results showed that parallel GA with four reducers led to
noticeable reduction in the execution time. Moreover, when
more attributes and instances were used, better performance
was achieved.
2.3 Deep learning algorithms
Deep learning algorithms are applicable to many areas of
science, such as speech recognition, image and text processing.
The most obvious advantage of these methods is
their automatic learning of features. These techniques have
deeper effects than the traditional methods. Increase in the
size and complexity of data makes it difficult to choose the
appropriate features. This problem can be addressed using
deep learning methods.
Deep learning-based methods have also been employed
for intrusion detection [19]. The usage of deep networks in
intrusion detection systems can be categorized according
to their architectures and deployments, as shown in Fig. 1.
S. N. Mighan, M. Kahani
Generative models are referred to as graphical models, in
which nodes represent random variables and edges represent
their relationships. Graphical models have hidden variables
that are not visible. These models are associated with supervised
learning. However, they are not dependent on data label.
This category contains four subclasses as seen in Fig. 1.
Discriminative architectures deal with the posterior distribution
of classes given the input data. Recurrent neural networks
(RNN) and convolutional neural networks (CNN) are categorized
in this type of architecture. So far, CNN has not been
used to detect intrusions.
There are several papers that used a single deep network
for attack detection: [3, 12, 13, 23, 44, 46, 47]. Also, there
are several studies that used a combination of neural networks
such as [9, 27, 36].
There are some papers that are similar to our proposed
method in some ways. All of them used UNB ISCX 2012
dataset to evaluate their model. Therefore, we briefly explain
them to compare them with our framework.
The paper in [15] used a restricted Boltzmann machine
(RBM) for network intrusion detection. The main goal of
the paper was to show that it was capable of learning complex
datasets with a systematic approach intended for training
RBMs. They used an optimized approach for choosing
the training parameters of the specific configured RBMs.
Their approach comprised three major steps: weight
initialization, pretraining, and fine-tuning.
There are several studies that used hybrid methods for
feature selection and classification such as [22, 38, 39]. The
paper in [22] proposed an anomaly-based intrusion detection
system which analyzed the packet files by using Apache
Hadoop and Spark. The approach was based on Hive SQL
and unsupervised learning algorithms.
Six datasets were used in the related literature: Spambase,
NSL-KDD, Kyoto, CDMC2012, KDD Cup 99 and UNB
ISCX 2012. We briefly explain them below.
Spambase^1 is a collection of spam e-mails that came
from postmasters and individuals who had filed spam
reports; the collection of non-spam e-mails came from field
work and personal e-mails. It has two classes, spam and
non-spam, and was collected by Hewlett Packard in 1999. It has
57 attributes of continuous real and integer values and 5,054,644
records.
KDD CUP 99^2 is the dataset used for the Third International
Knowledge Discovery and Data Mining Tools Competition,
which was held in conjunction with KDD-99, the
Fifth International Conference on Knowledge Discovery and
Data Mining. The competition task was to build a network
intrusion detector, a predictive model capable of distinguishing
between bad connections, called intrusions or attacks,
and good normal connections.
The database contains a standard set of data to be audited,
which includes a wide variety of intrusions simulated in a
military network environment. It has 41 features of continuous
and symbolic values and two classes of normal and
attack. Also, there are 4,898,431 records in this dataset.
NSL-KDD^3 is a dataset suggested to solve some of the
inherent problems of the KDD CUP 99 dataset. It contains
essential records of the complete KDD data set. It
does not include redundant records in the train set. There
are no duplicate records in the proposed test sets and it has
6,228,096 records.
Kyoto^4 has 24 statistical, 14 conventional and 10 additional
features. Among them, the first 14 features were
extracted based on KDD Cup 99 data set, which is a very
popular and widely used performance evaluation data set in
Fig. 1 Classification of intrusion detection systems based on deep learning [19]
1 https://archive.ics.uci.edu/ml/datasets/Spambase.
2 http://archive.ics.uci.edu/ml/datasets/kdd+cup+1999+data.
3 http://www.unb.ca/cic/datasets/nsl.html.
4 http://www.takakura.com/Kyoto_data/.
intrusion detection research. Among the 41 original features
of KDD Cup 99 data set, only 14 significant and essential features
have been extracted from the raw traffic data obtained
by honeypot systems that are deployed in Kyoto University.
The CDMC 2012 dataset^5 is real traffic data collected from
several types of honeypots and a mail server over five different
networks inside and outside of Kyoto University. The
dataset is composed of 14 features, including label information,
which indicates whether each session is an attack or not.
It has 23,430,769 records.
Table 1 compares the related works in terms of feature
selection algorithms, detection methods, accuracy, detection
rate, false alarm rate, execution time, big data framework
and the dataset used for the evaluation. All the features in
Table 1 are extracted from the papers; therefore, some features,
such as time, are assigned a null value when no value is
reported in the related paper.
3 Machine learning algorithms
In this section, we briefly explain SVM and decision tree
algorithms which are used in our method.
3.1 Support vector machine (SVM)
Support vector machine [21] is a well-known classification
technique, which is based on statistical learning theory
(SLT). It is based on the idea of creating an optimal separating
hyperplane, which is as far away as possible from each
of the classes. Training data points can be viewed as a vector
in the d-dimensional feature space.
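The max-margin idea can be illustrated with a small sketch: a linear classifier trained by sub-gradient steps on the hinge loss, the loss underlying soft-margin SVM. The 2D points, learning rate, and regularization strength are all made up for illustration; this is not the solver used in this work.

```python
# Hedged sketch: a linear classifier trained with sub-gradient steps on
# the hinge loss (the loss behind soft-margin SVM). The toy 2D points
# are linearly separable by construction.

data = [([2.0, 2.0], 1), ([3.0, 3.5], 1), ([2.5, 3.0], 1),
        ([-2.0, -1.5], -1), ([-3.0, -2.0], -1), ([-2.5, -3.0], -1)]

w, b = [0.0, 0.0], 0.0
lam, lr = 0.01, 0.1          # regularization strength and learning rate
for _ in range(200):
    for x, y in data:
        margin = y * (w[0] * x[0] + w[1] * x[1] + b)
        if margin < 1:       # point inside the margin: hinge loss is active
            w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
            b += lr * y
        else:                # outside the margin: only the regularizer acts
            w = [wi - lr * lam * wi for wi in w]

def predict(x):
    """Sign of the decision function w.x + b."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

print([predict(x) for x, _ in data])
```

After training, the separating hyperplane w.x + b = 0 lies between the two clusters, and points on either side receive opposite labels.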
3.2 Decision tree
A decision tree [32] is a decision support tool that uses a treelike
graph or model of decisions and their possible consequences,
including chance event outcomes, resource costs, and
utility. The decision tree can be linearized into decision rules,
where the outcome is the contents of the leaf nodes, and the
conditions along the path form a conjunction in the if clause.
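The linearization described above can be sketched as follows; the tree structure, feature names, and thresholds are hypothetical, used only to show how each root-to-leaf path becomes one if-then rule.

```python
# Minimal sketch: linearizing a decision tree into if-then rules.
# The tree is a nested dict; leaves carry the predicted class.
# Feature names and thresholds are illustrative, not from the paper.

tree = {
    "feature": "duration", "threshold": 2.0,
    "left":  {"label": "Normal"},
    "right": {
        "feature": "src_bytes", "threshold": 500.0,
        "left":  {"label": "Normal"},
        "right": {"label": "Attack"},
    },
}

def linearize(node, conditions=()):
    """Walk the tree; each root-to-leaf path becomes one rule whose
    if-clause is the conjunction of the conditions along the path."""
    if "label" in node:
        return [(conditions, node["label"])]
    f, t = node["feature"], node["threshold"]
    rules = linearize(node["left"], conditions + (f"{f} <= {t}",))
    rules += linearize(node["right"], conditions + (f"{f} > {t}",))
    return rules

rules = linearize(tree)
for conds, label in rules:
    print("IF " + " AND ".join(conds) + f" THEN {label}")
```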
4 Proposed approach
In this section, we propose a novel hybrid approach for intrusion
detection. First, two definitions are presented. Then we
introduce our framework, which is based on a big data platform;
we call it SAE–SVM in big data. The first two phases of
our method are the same as those presented in our paper [33],
which followed binary classification.
Definition 1 (Flow of packet) Each flow of packets, which
includes a combination of packets, has several characteristics.
Every tuple of such characteristics is a feature of that
flow. Therefore, FOP is defined as in Eq. (1), where n is
the number of packets in the flow and the F_i are the features;
these can be instantiated by source IP, destination IP,
number of packets transmitted from source to destination,
number of packets transmitted from destination to source,
protocol name, start date time, stop date time, etc. The features
may be different from dataset to dataset. For instance,
in DARPA and KDD CUP 1999 datasets, which are the most
popular dataset in the field, there are several features such as
duration, protocol name, service, source bytes, etc.
Definition 2 (Attack) An attack in network security is a
deviant behavior from the normal behavior in the network
and attempts to steal or destroy information without having
authorized access or permission. There are higher-level
features that help in distinguishing normal connections from
attacks. An attack can be defined as a set of features as in
Eq. (2), where n is the number of features which help in the
recognition of attacks. For instance, one such feature A_i may
be "same host", obtained by comparing the host IPs: it examines
only the connections in the past two seconds that have the
same destination host as the current connection does, and
calculates statistics related to the protocol behavior, service,
etc. Another feature A_j can be "same service", which examines
the connections in the past two seconds that have the same service
as the current connection does. "Same host" and "same service"
features are jointly called time-based traffic features of
the connection records. Some probing attacks scan the hosts
or ports using a much larger time interval than two seconds,
for example once per minute. Therefore, we can realize that
probing attack happened by investigating the features.
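The time-based "same host" and "same service" features described above can be sketched with a few lines of Python; the connection records, timestamps, and field values below are hypothetical.

```python
# Illustrative sketch of the time-based traffic features: for each
# connection, count how many earlier connections in the past two
# seconds share its destination host (or its service).
# Record fields and values are made up for illustration.

connections = [
    # (timestamp in seconds, destination host, service)
    (0.0, "10.0.0.5", "http"),
    (0.5, "10.0.0.5", "http"),
    (1.0, "10.0.0.7", "smtp"),
    (1.8, "10.0.0.5", "ssh"),
    (4.0, "10.0.0.5", "http"),
]

def same_host_count(conns, i, window=2.0):
    t, host, _ = conns[i]
    return sum(1 for (tj, hj, _) in conns[:i]
               if t - tj <= window and hj == host)

def same_service_count(conns, i, window=2.0):
    t, _, srv = conns[i]
    return sum(1 for (tj, _, sj) in conns[:i]
               if t - tj <= window and sj == srv)

# Connection 3 (t = 1.8 s) shares its destination host with the
# connections at t = 0.0 and t = 0.5.
print(same_host_count(connections, 3))  # 2
```

A probing attack that scans once per minute would produce counts of zero in such a two-second window, which is why the paper notes that slower scans need a larger interval.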
In this paper, we want to detect anomaly behavior in networks.
To this end, we propose a hybrid model which has
four important phases. The phases are described as follows
and are visualized in Fig. 2.
4.1 First phase (data preprocessing)
This component is responsible for preprocessing the data
in order to make it ready for feature extraction. The dataset
consists of miscellaneous types of data of symbolic and
numerical representations, such as service and duration.
There are 39 continuous or discrete numerical attributes and
3 symbolic attributes.
FOP = (F1, F2, F3, …, Fn)    (1)

ATTACK = (A1, A2, A3, …, An)    (2)
5 http://www.csmining.org/index.php/cdmc-2012.html.
Table 1 Comparison of the related works in terms of their characteristics

References | Feature selection | Detection method | Accuracy (%) | Detection rate (%) | False alarm rate (%) | Time | Big data framework | Dataset
[28] | – | K-means based on PSO^a | 82 | – | 2.8 | – | – | KDD 99
[17] | – | K-means | – | 92.7 | 5.21 | – | – | KDD 99
[30] | – | K-means and NB^b | 99.6 | – | 0.5 | – | – | KDD 99
[8] | CBFS^c | NB | 89 | – | 13.4 | 4.46 s | – | KDD 99
[4] | – | Genetic algorithm | – | 99.74 | 3.74 | – | – | KDD 99
[7] | – | ISVM^d | – | 90.14 | 2.314 | 40.97 s | – | Kyoto
[41] | PCA^e | SVM^f | 99.70 | 99.85 | 0.3 | 237 s | – | KDD 99
[29] | Gain ratio | J48 tree | 93 | 99.97 | 2.15 | – | – | KDD 99
[24] | KPCA^g | SVM | – | 95.25 | 1.03 | 2.07 s | – | KDD 99
[2] | Vote and Information Gain algorithm | J48, MP^h, RT^i, REPTree, AdaBoostM1, DS^j and NB | 99.81 | 98.56 | 0.3 | – | – | NSL-KDD
[11] | – | NB and K2 | – | 89.24 | 0.58 | – | Map-reduce in big data | DARPA and KDD99
[25] | – | LR^k, SVM, RF^l, GBDT^m and NB | 91 | 89 | – | 175 s | Spark | KDD99
[16] | CBFS and Chi-squared | LR, SVM, RF, GBDT and NB | 91.56 | 89.91 | – | 289 s | Spark and Hadoop | KDD 99 and NSL-KDD
[35] | FSR^n and BER^o | J48, REPTree, SVM and NB | 99 | – | 0.001 | 10 s | Hadoop map-reduce | KDD99
[10] | Rough set theory | Sequential genetic algorithm | – | – | – | 1000 s | Hadoop map-reduce | Spambase, NSL-KDD, Kyoto and CDMC2012
[23] | – | LSTM^p to RNN | 96.93 | 98.88 | 10.04 | – | – | KDD99
[12] | – | RBM | 84 | – | – | 186.4 h | – | KDD99
[46] | ANN | SAE^q | 98.51 | – | – | – | – | KDD99
[44] | AE^r | LR | 94.82 | 93.95 | 4.13 | 1503 s | – | KDD99
[13] | – | DBN^s | 93.49 | – | 0.76 | – | – | KDD99
[3] | – | DBN | 97.5 | – | – | 0.32 s | – | KDD99
[27] | AE | DBN | 92.10 | 92.20 | 1.58 | 1.243 s | – | KDD99
[9] | Euclidean distance | SVM-RBM | 82 | – | – | – | – | KDD99
[36] | DBN | SVM | 92.84 | – | – | 3.07 s | – | KDD99
[15] | – | RBM | 78 | 63 | 10.6 | – | – | UNB ISCX 2012
[22] | PCA | GMM^t | 82.6 | 47 | 13 | – | Hadoop and Spark | UNB ISCX 2012
[39] | K-means | RF | 99.97 | 98.94 | 0.06 | 415.56 s | – | UNB ISCX 2012
[38] | Information gain, ABAB^u | LR, EGD^v and XGBoost | 99.65 | 99.40 | 0.3 | 130 s | Apache Hama | UNB ISCX 2012
[43] | – | Graphical model based detection system and SD | – | 89.30 | 10.70 | 1600 s | – | UNB ISCX 2012

^a Particle swarm optimization; ^b Naive Bayes; ^c Correlation-based feature selection; ^d Incremental support vector machine; ^e Principal component analysis; ^f Support vector machine; ^g Kernel principal component analysis
The next two phases require each record in the dataset
to be represented as a vector of real numbers. Therefore,
every symbolic feature in the input data is first converted to
a numerical value. Integers from 0 to N − 1, where N is the
number of symbols, are assigned to the symbolic features.
In order to eliminate the effect of dimension for every
single attribute, an essential step of data normalization is
performed. Each numerical value is normalized over the
range [0, 1], according to the following data smoothing
method [12].
where y is a numerical value, min is the minimum value of
the attribute that y belongs to and max is the maximum value
of that attribute.
y′ = (y − min) / (max − min)    (3)
Table 1 footnotes (continued): ^h Meta pagging; ^i Random tree; ^j Decision stump; ^k Logistic regression; ^l Random forest; ^m Gradient boosted decision trees; ^n Forward selection ranking; ^o Backward elimination ranking; ^p Long short-term memory; ^q Stacked autoencoder; ^r Autoencoder; ^s Deep belief network; ^t Gaussian mixture model; ^u Automated branch and bound algorithms; ^v Extreme gradient boosting
Fig. 2 The framework of the proposed method
4.2 Second phase (latent feature extraction)
In this step, we intend to extract the latent features, i.e.,
the variables which are not directly observed but are rather
inferred from other observed variables. Suppose we have a
matrix R of user-item interactions. The model assumption
in matrix factorization is that each cell R_ui of the matrix is
generated by p_u^T q_i, the dot product between a latent vector
p_u, describing the user u, and a latent vector q_i, describing
the item i. Intuitively, this product measures how similar the
two vectors are. During training, we want to find "good"
vectors such that the approximation error is minimized.
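The assumption can be illustrated numerically: one cell of R is approximated by the dot product of two latent vectors. The vectors below are fixed by hand, not learned.

```python
# Tiny numeric illustration of the matrix-factorization assumption:
# a cell R[u][i] is approximated by the dot product p_u . q_i of the
# user and item latent vectors. Values are made up for illustration.

p_u = [0.5, 1.0, 0.25]   # latent vector describing user u
q_i = [1.0, 0.5, 2.0]    # latent vector describing item i

r_ui = sum(a * b for a, b in zip(p_u, q_i))
print(r_ui)  # 0.5 + 0.5 + 0.5 = 1.5
```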
As said earlier, several types of networks may be used in
deep learning methods. In this work, we used autoencoder
network. In fact, the principal advantage of deep learning
is that it replaces the handcrafted features with an efficient
feature learning algorithm and hierarchal feature extraction.
Autoencoder is aimed at learning an efficient, compressed
representation for a set of data. The structure of the autoencoder
network is shown in Fig. 3.
Autoencoder is an ANN, with the number of layers
always being set to three. The difference is that the nodes in
the output layer are the same as the ones in the input layer.
The nodes in the middle layer are new features that are represented
in a lower number of dimensions. It means that data
can be reconstructed after complicated computations. Since
the training process does not involve any labels, it is unsupervised.
Generally, a dataset without labels is easy to collect;
as a result, it imposes a small workload on researchers.
In this work, we use a multilayer perceptron in the autoencoder
network. The multilayer perceptron is perhaps the
simplest class of neural networks. A multilayer network
consists of a set of neurons that are logically arranged in
layers. There are at least two layers: input and output. The
output activations of neurons in the input layer are determined
by the network’s input. Usually, one or more hidden
layers are located between the input and output layers as
shown in Fig. 4.
Data flow is unidirectional from the outputs of the previous
layer to the inputs of the next layer. Thus, the output
of the network is a function of its inputs. The output of
such a network can be computed in a single deterministic
pass. Every single neuron has n inputs x = (x1, x2, …, xn)
plus another dummy input, which is always valued at 1,
acting as a bias b. Every neuron is characterized by n
weights w = (w1, w2, …, wn) and an activation function f
that is applied to the weighted sum of the inputs, yielding
the result in Eq. (4).
The operational behavior of the network is principally determined
by weights. The shape of the activation function
minimally affects the expressive power of the network. In
addition, it influences the convergence speed of the training
procedure.
The simplest form of a nontrivial activation function
is the threshold function: f(x) = 1 if x ≥ a, and 0 otherwise.
In general, a differentiable activation function is preferred.
A common family of activation functions is the sigmoid
functions. An example is the logistic sigmoid of Eq. (5),
whose derivative is always positive and is given by the
logistic equation (6).
Further, the structures can be stacked to make up deep networks.
As shown in Fig. 5, the training results of the middle
layer are cascaded, and a new network structure, which
f(Σ_{i=1}^{n} w_i · x_i + b) = f(w^T x + b)    (4)

sigm(x) = 1 / (1 + e^(−x))    (5)

d/dx f(x) = f(x) · (1 − f(x))    (6)
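Equations (4)-(6) can be checked with a short sketch: a single neuron computes the logistic sigmoid of its weighted input sum plus bias, and the sigmoid's derivative matches the closed form f(x)(1 − f(x)). The weights and inputs are arbitrary illustrative values.

```python
import math

# Sketch of a single neuron as in Eq. (4): the logistic sigmoid of the
# weighted input sum plus bias. Also checks numerically that the
# sigmoid derivative satisfies Eq. (6): f'(x) = f(x)(1 - f(x)).

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(weights, inputs, bias):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

out = neuron([0.5, -0.2, 0.1], [1.0, 2.0, 3.0], 0.4)

# Central-difference derivative vs. the closed form at x = 0.7.
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
closed = sigmoid(x) * (1 - sigmoid(x))
print(abs(numeric - closed) < 1e-8)
```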
Fig. 3 Autoencoder architecture [1]
Fig. 4 A typical three layer network [12]
is called stacked autoencoder (SAE), is formed. Many new
features at different depths are learned using this method.
In this study, we use a stacked autoencoder, which has
input, output and two hidden layers. As mentioned above,
the input and output in autoencoder networks are the same
and it is constituted by two main parts: an encoder that maps
the input into the code, and a decoder that maps the code to
a reconstruction of the original input. After training, if we
have the same input and output layers, we will have a good
network with a low error and the neurons in the hidden layer
are the required latent features.
A stacked autoencoder is a neural network consisting of
several layers of autoencoders where output of each hidden
layer is connected to the input of the successive hidden layer.
The idea of autoencoders has been popular in the field of
neural networks for decades. Their most traditional application
was dimensionality reduction or feature learning [45].
Recent stacked autoencoder systems provide a version
of the raw data with more detailed and promising feature
information, which is used to train a classifier in a specific
context and achieves better accuracy than training with the
raw data. This is why we use a stacked autoencoder for
feature reduction.
Each record in our dataset, which has 42 features, is
defined as in Eq. (7).
The last element in the record is the label of normal or
attack, which is used for the evaluation. Therefore, in our
deep network, the input and output layers have 42 neurons,
which means that, in each iteration, one record is presented
to the input layer. The first hidden layer has 20 neurons,
meaning that all the 42 features are connected to the next
layer and converted to 20 features, which is the first encoder
part of the network; it can be defined as in Eq. (8).
FOP = (F1, F2, F3, …, F42)    (7)
The first hidden layer becomes the input layer for the second
hidden layer, which has 10 neurons. These 20 features are
converted to 10 features using the activation function. This
is the second encoder part of the network; it can be defined
as in Eq. (9).
The last part of the network is the decoder layer, which converts
the 10 features of the second hidden layer into 42 features
in the output layer; it can be defined as in Eq. (10).
This operation is done for each record. Finally, we use the
matrix of 10 features in the second hidden layer of the deep
network as our latent features.
As it was mentioned above, we use two hidden layers in
our network. The first one has 20 neurons and the second one
has 10. After optimizing the weights of these two layers, we
performed two experiments. Initially, the 20 features of the first
hidden layer were used as the inputs of the SVM algorithm
in the third phase. For the second experiment, the 10 features of the
second hidden layer were used. It was noticed that the latter
outperformed the former. This scenario was examined for
other numbers of features in each hidden layer, and we found
that the 10-feature model yields the best accuracy.
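The 42-20-10-42 layer arrangement can be sketched, shapes only, in a few lines. The weights below are random and untrained and biases are omitted, so this illustrates just the data flow through the two encoders and the decoder, not the trained network.

```python
import math
import random

# Shape-only sketch of the 42-20-10-42 stacked autoencoder: two encoder
# layers (42 -> 20, 20 -> 10) and a decoder (10 -> 42). Weights are
# random placeholders; biases are omitted for brevity.

random.seed(0)

def layer(inputs, n_out):
    """One fully connected layer with a logistic-sigmoid activation."""
    outputs = []
    for _ in range(n_out):
        weights = [random.uniform(-1, 1) for _ in inputs]
        s = sum(w * x for w, x in zip(weights, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-s)))
    return outputs

record = [random.random() for _ in range(42)]   # one 42-feature record
h1 = layer(record, 20)        # first encoder: 42 -> 20 features
latent = layer(h1, 10)        # second encoder: 20 -> 10 latent features
output = layer(latent, 42)    # decoder: 10 -> 42 reconstructed features
print(len(record), len(h1), len(latent), len(output))
```

The 10-element `latent` list corresponds to the matrix of latent features that is handed to the classifier in the third phase.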
4.3 Third phase (attack classification)
After extracting the latent features using deep learning, the
features are used in the big data framework. In fact, the main
goal of our system is intrusion detection in real time with
high accuracy and low false alarm.
Apache Spark^6 is an open-source computational infrastructure
for big data analysis. Compared with other big
data tools, such as Storm and Hadoop, Spark uses multilayer
in-memory layout, which can very quickly process a large
amount of data in parallel. Also, it can support most languages
such as Java, Scala and Python. Spark has functional
APIs for job management. It can run over Hadoop clusters,
access their data and process them. The main core of Apache
Spark consists of basic operations, such as operational planning,
memory management, error recovery and interacting
with storage systems. Resilient distributed datasets (RDDs)
are the main elements of programming in Spark, represent
items that are distributed in computer systems, and are used
for processing data in parallel.
FOP = (G1, G2, G3, …, G20)    (8)

FOP = (H1, H2, H3, …, H10)    (9)

FOP = (F1, F2, F3, …, F42)    (10)
Fig. 5 The schematic diagram of SAE [46]
6 Apache Spark: Lightning-Fast Cluster Computing, 2015, http://spark.apache.org/.
We split the dataset into training and testing sets in such a
way that 70 percent of the data is used as the training data and
30 percent for testing. The training dataset is passed to the
SVM classifier to be classified.
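The 70/30 split can be sketched with a seeded shuffle; the records below are placeholders for (feature-vector, label) pairs, not data from the paper.

```python
import random

# Sketch of a 70/30 train/test split using a seeded shuffle.
# The records are hypothetical (id, label) placeholders.

records = [(f"record_{i}", "Normal" if i % 5 else "Attack") for i in range(100)]

random.seed(42)
shuffled = records[:]          # copy so the original order is kept
random.shuffle(shuffled)

cut = int(0.7 * len(shuffled))             # 70 percent boundary
train, test = shuffled[:cut], shuffled[cut:]
print(len(train), len(test))  # 70 30
```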
In our dataset, each data point belongs to either class Normal
or Attack. Each training data point x_i can be labeled by y_i
according to Eq. (11). Thus, the training dataset can be
denoted as in Eq. (12).
4.4 Fourth phase (decision making)
After classification, we have ten categories of data consisting
of normals and attacks. Then we merge all the categories with
each other to classify the test data. However, there are some
classification errors. We use a decision tree for correcting such
errors.
We use decision tree to train the data output from the third
phase to decrease the false positive of attack detection. In this
way, we make rules to find the most accurate class of attack
for all records of data.
y_i = { 0 if x_i ∈ class Normal; 1 if x_i ∈ class Attack }    (11)

D = {(x_i, y_i) | i = 1, 2, 3, …, N}    (12)
5 Experimentation and result analysis
In this section, we first describe the dataset used to conduct
our experiments, then specify the performance metrics
used for comparison. Then, we present and discuss the
results that were obtained by using feature extraction methods
and SVM-based intrusion detection scheme in the big
data framework. We have used Apache Spark and Python
language.
All the experiments are conducted on a cluster consisting
of one master node and five slave nodes. All of them are
Intel Core i7 quad-core machines ([email protected] GHz) with a 1 TB
hard disk, 6 cores, 16 GB RAM and 100 MB/s network bandwidth,
running Ubuntu 14.04 (Trusty).
5.1 Description of datasets
5.1.1 UNB ISCX 2012 dataset
In order to verify time efficiency and the effectiveness
of the proposed cyber security intrusion detection
framework, we use a real time network traffic dataset,
UNB ISCX 2012 dataset [37], rather than the traditional
approach of using legacy KDD dataset family. The total
Table 2 Details of ISCX dataset

Date | Description | File size | Number of normals | Number of attacks | Percentage of attacks
2010/6/11 Friday | Normal activity; no malicious activity | 16.1 GB | 378,667 | 0 | 0.0
2010/6/12 Saturday | Normal activity; non-classified attacks | 4.22 GB | 131,111 | 2,082 | 1.56
2010/6/13 Sunday | Infiltrating the network from inside; normal activity | 3.95 GB | 255,170 | 20,358 | 7.4
2010/6/14 Monday | HTTP denial of service; normal activity | 6.85 GB | 167,609 | 3,771 | 2.2
2010/6/15 Tuesday | Distributed denial of service using an IRC botnet | 23.4 GB | 534,320 | 37,378 | 6.5
2010/6/16 Wednesday | Normal activity; no malicious activity | 17.6 GB | 522,263 | 0 | 0.0
2010/6/17 Thursday | Brute force SSH; normal activity | 12.3 GB | 392,392 | 5,203 | 1.31
Total | | 84.42 GB | 2,381,532 | 68,792 | 2.8
size of the dataset is 84.42 GB. The traces are obtained in
seven days under practical and systematic conditions. It is
a labeled dataset which comprises over two million traffic
packets, with attack data representing about 2% of the whole
traffic. This dataset has four types of attack scenario consisting
of Infiltrating, Brute force SSH, HTTP denial of
service (DoS) and Distributed denial of service (DDoS).
Table 2 shows finer details of the dataset.
This dataset has two xml files for training and testing.
There are 10 classes in this dataset. One of the classes
is normal and the others are attacks. Attack categories
are DoS, Backdoor, Analysis, Exploit, Fuzzers, Generic,
Shellcode, Worms and Reconnaissance. The numbers of
attacks in each class versus the normal records are shown
in Fig. 6.
5.1.2 UNB CICIDS 2017 dataset
The total size of the dataset is about 51.1 GB. For this dataset,
they built the abstract behavior of 25 users based on the
HTTP, HTTPS, FTP, SSH, and email protocols. The data-capturing
period started at 9:00 am on Monday, July 3, 2017
and ended at 17:00 on Friday, July 7, 2017, for a total of 5
days.
Monday is the normal day and only includes the benign
traffic. The implemented attacks include Brute Force FTP,
Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration,
Botnet and DDoS. They have been executed both in
the morning and in the afternoon on Tuesday, Wednesday,
Thursday and Friday. Table 3 shows finer details of the
dataset.
Fig. 6 Multiclass classification in UNB ISCX 2012 dataset
Table 3 Details of CICIDS 2017 dataset

Date | Description | File size (GB)
2017/7/3 Monday | Normal activity; no malicious activity | 11
2017/7/4 Tuesday | Normal activity; brute force FTP, SSH | 11
2017/7/5 Wednesday | Normal activity; DoS/DDoS, Heartbleed attacks, Slowloris, Slowhttptest, Hulk and GoldenEye | 13
2017/7/6 Thursday | Normal activity; web and infiltration attacks, Web BForce, XSS, SQL injection, Infiltration Dropbox download | 7.8
2017/7/7 Friday | Normal activity; DDoS LOIT, Botnet ARES, PortScans (sS, sT, sF, sX, sN, sP, sV, sU, sO, sA, sW, sR, sL and B) | 8.3
Total | | 51.1
S. N. Mighan, M. Kahani
Also, there are 8 CSV files covering these days, which consist of normal and attack records. Their total size is about 1.12 GB, and the dataset contains 2,830,743 records, each with 85 features.
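As a sketch of how such per-day CSV files could be merged into a single labeled dataset, consider the following (the column names and values below are illustrative assumptions, not the dataset's actual schema; in practice each frame would come from pd.read_csv on one of the 8 files):

```python
import pandas as pd

# Illustrative stand-ins for two of the per-day CSV files (schema assumed).
monday = pd.DataFrame({"Flow Duration": [120, 431], "Label": ["BENIGN", "BENIGN"]})
tuesday = pd.DataFrame({"Flow Duration": [87, 990], "Label": ["BENIGN", "FTP-Patator"]})

# Concatenate the day files into one dataset, as would be done when
# merging all 8 CSVs (2,830,743 records in total).
full = pd.concat([monday, tuesday], ignore_index=True)

# Derive a binary target: 1 for any attack record, 0 for benign traffic.
full["y"] = (full["Label"] != "BENIGN").astype(int)
print(len(full), full["y"].tolist())  # → 4 [0, 0, 0, 1]
```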
5.2 Evaluation metrics
Generally, the performance of anomaly detection can be evaluated using four basic counts: true positives (TP), the number of attacks correctly identified as attacks; true negatives (TN), the number of normal connections correctly identified as normal; false positives (FP), the number of normal connections incorrectly identified as attacks; and false negatives (FN), the number of attack connections incorrectly identified as normal. The following performance metrics are defined from these counts to validate the proposed system's performance.
5.2.1 Accuracy
It is the most important measure of performance and describes
the percentage of true IDS predictions.
5.2.2 Precision
It is one of the important metrics and is the percentage of
attack connections correctly classified as an intrusion compared
with the total number of attack flows.
5.2.3 Recall
It is the percentage of true positive rate in a system, which is
useful in evaluating the performance of an algorithm.
Accuracy = (TP + TN) / (TP + TN + FP + FN)  (13)
Precision = TP / (TP + FP)  (14)
Recall = TP / (TP + FN)  (15)
5.2.4 F‑measure
It is a harmonic composition of precision and recall and has
been used in some studies.
5.2.5 False alarm rate
It is the percentage of normal flows incorrectly classified as
intrusion compared with the total number of normal flows.
5.2.6 Sensitivity
It is also called true positive rate. It is used to measure the
proportion of positives that are correctly identified as such.
5.2.7 Specificity
It is also called true negative rate. It is used to measure the
proportion of negatives that are correctly identified as such.
5.2.8 Training time
Time taken to train a classifier.
5.2.9 Prediction time
This describes how much time a particular algorithm has taken
to predict all the records in a dataset as normal or as a specific
attack.
F-measure = (2 × Precision × Recall) / (Precision + Recall)  (16)
FPR = FP / (FP + TN)  (17)
Sensitivity = TP / (TP + FN)  (18)
Specificity = TN / (TN + FP)  (19)
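The metrics in Eqs. (13)–(19) can be checked with a minimal pure-Python sketch (the confusion-matrix counts below are made-up illustrative numbers, not taken from the paper's experiments):

```python
# Illustrative confusion-matrix counts (not from the paper's experiments).
TP, TN, FP, FN = 90, 85, 15, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)                  # Eq. (13)
precision = TP / (TP + FP)                                  # Eq. (14)
recall = TP / (TP + FN)                                     # Eq. (15); same as sensitivity, Eq. (18)
f_measure = 2 * precision * recall / (precision + recall)   # Eq. (16)
fpr = FP / (FP + TN)                                        # Eq. (17); the false alarm rate
specificity = TN / (TN + FP)                                # Eq. (19); the true negative rate

print(round(accuracy, 3), round(recall, 3), round(fpr, 3))  # → 0.875 0.9 0.15
```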
Table 4 Performance metrics for PCA and deep learning
Method Accuracy Precision Recall FAR ROC F-measure Kappa
PCA-SVM 0.856 0.880 0.849 0.136 0.856 0.847 0.702
DL-SVM 0.902 0.903 0.903 0.098 0.902 0.903 0.806
A novel scalable intrusion detection system based on deep learning
6 Result analysis
We conduct several experiments, each of which is described in this section. Two kinds of classification exist in this field: binary and multiclass. In the first type, we only detect whether a record is normal or an attack, while in the second one, we also identify the type of attack.
Experiment 1 This experiment is about binary classification.
In the first experiment, we apply the SVM algorithm implemented in the Weka package using the Java language. For feature extraction, a stacked autoencoder network is used. The activation function of the encoder layer is the Rectified Linear Unit (ReLU) and that of the decoder layer is the Sigmoid.
The loss function is the mean squared error and the gradient descent optimizer is the Adadelta algorithm, which decreases the learning rate monotonically and requires no learning rate to be set manually. The network is trained for 1000 epochs with a batch size of 256. This experiment was done in [33].
We map the 42 features related to each record of the data
to 10 features, each of which results from a certain combination
of some of the 42 features. Since the PCA algorithm
is one of the best methods of feature selection [41],
we compare our method with PCA and the result is shown
in Table 4.
As shown in Table 4, the performance of our method is better than that of PCA according to all the metrics.
In the second step, we apply SVM with the Radial Basis Function (RBF) kernel. The parameters c and gamma are set to 1000 and 0.001, respectively. The first experiment is done on UNB ISCX 2012 and we compare our method with the methods proposed in [6, 14, 15, 22, 34, 47], which were discussed above, as shown in Table 5.
Table 5 shows that in terms of accuracy, precision, recall, false positive rate and the other metrics, our method is better than the other references. Also, the execution time of our method is much less than that of the others.
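The classification step can be sketched with scikit-learn's SVC using the reported hyperparameters (C = 1000, gamma = 0.001). The synthetic data below is an illustrative stand-in; in the paper the input is the SAE-reduced feature representation of the ISCX records:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class stand-in for the SAE-reduced features (assumption:
# the real input is the 10-dimensional latent representation).
X, y = make_blobs(n_samples=400, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF-kernel SVM with the hyperparameters reported in the experiment.
clf = SVC(kernel="rbf", C=1000, gamma=0.001)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
```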
Experiment 2 This experiment investigates binary classification in the Apache Spark big data framework. The first step involves latent feature extraction, which is done using a stacked autoencoder, similar to the first experiment. In the second step, we apply binary SVM implemented with the Apache Spark, LIBSVM, Python Machine Learning, PySpark and Sklearn
packages. For SVM, the linear kernel is used and the parameter
c is set to 1, 10, 100, 1000 and gamma is set to 0.001,
0.0001. The best result obtained belongs to the case when
the parameters c and gamma are set to 1000 and 0.001. The
result is shown in Table 6.
As shown in Table 6, Precision [0] and Precision [1] are
the precisions respectively associated with class 0, or Normal,
and class 1, or Attack, and the Precision row indicates
the overall precision of detection. The same holds for Recall.
Experiment 3 The third experiment deals with multiclass classification in the Apache Spark big data framework. The first
step is similar to that of Experiment 2. In the second step,
we apply SVM implemented in Apache Spark, LIBSVM,
Python Machine Learning, PySpark and Sklearn packages.
For SVM, the Radial Basis Function (RBF) kernel and
the linear kernel are used and the parameter c and gamma
are, respectively, set to the values of 1, 10, 100, 1000 and
0.001, 0.0001. The best result is obtained when the kernel
is RBF and the parameters c and gamma are set to 1000 and
0.001. We run the algorithm for each class of attacks individually.
In fact, this step is executed 10 times and, in each
step, the label of the intended class is set to 1 and the labels
of the others are set to 0.
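The per-class relabeling described above (one run per attack class, with the target class labeled 1 and all others 0) can be sketched as follows; the label values are illustrative stand-ins for the ten normal/attack categories:

```python
import numpy as np

# Illustrative class labels; the paper uses the 10 normal/attack categories.
labels = np.array(["Normal", "DoS", "Worms", "DoS", "Normal"])
classes = np.unique(labels)

# One binary label vector per class, as in the one-vs-rest runs:
# the intended class is set to 1 and all the others to 0.
binary_targets = {c: (labels == c).astype(int) for c in classes}
print(binary_targets["DoS"].tolist())  # → [0, 1, 0, 1, 0]
```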
Also, we apply decision tree, decision tree one-vs-rest, gradient boosted trees, logistic regression one-vs-rest, naïve Bayes and random forest. For the decision tree, the impurity function is Gini and the parameters maxDepth and maxBins are set to 5 and 32, respectively. In the case of gradient boosted trees, the max iteration parameter is set to 10 and indexers and GBT are chained in a pipeline. For logistic regression one-vs-rest, the max iteration and tol parameters are set to 10 and 1E-6, respectively. Also, the fitIntercept parameter is set to true. For random forest, the parameter numTrees is set to 10, which means that bootstrapping is done. For the others, we set the parameters to the default values used in the Python-ML package. We compare all the algorithms in Table 7.
Table 5 Comparison of our method with the others
Method Accuracy TPR TNR Kappa Sensitivity Time (min)
PCA-GMM [22] 0.862 0.470 0.870 – – 40
RBM [15] 0.786 0.960 0.657 0.541 0.6331 –
[14] 0.818 – – 0.607 0.961 –
[6] 0.767 – 0.997 – – –
[47] 0.894 0.856 0.933 – – –
[34] 0.869 – – – – –
DL-SVM (our method) 0.902 0.903 0.902 0.806 0.902 1
We examine four machine learning algorithms to evaluate
our method. We use OVR (One-vs-Rest) to extend the binary
forms of decision tree and logistic regression to the multiclass
forms used in classification. As shown in Table 7, the
accuracy is approximately the same in the cases of decision
tree, decision tree OVR, logistic regression OVR and random
forest. However, the execution time in logistic regression
is shorter than the others by a large margin.
In the third step, we should use another method to do
all the binary classifications, which is intended to reduce
the false alarm rate and improve the accuracy. To do this,
we apply the Apriori, FP-growth and decision tree algorithms. Using Apriori, we generate 48 rules, which are not useful. As a result, we try FP-growth. In the case of FP-growth, we use the default values in the Sklearn package and acquire 9 main frequent item sets. We evaluate the performance of this algorithm in probabilistic and percentage terms; the algorithm yields the values of 42% and 57%, respectively. Also,
we apply decision tree both with and without Apache Spark.
The results of these two methods are shown in Table 8.
Experiment 4 The fourth experiment is about multiclass classification without Spark. The first step is similar to
the other experiments. In the second step, we apply SVM
implemented in Sklearn package. The best performance is
obtained for RBF kernel when the parameters c and gamma
are set to 1000 and 0.001, respectively. We run the algorithm
with 10 jobs and in 5 folds. The results are shown in Table 9.
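A hedged sketch of this evaluation setup in scikit-learn (5-fold cross-validation with 10 parallel jobs) is shown below; the synthetic data is an illustrative stand-in for the SAE-reduced records:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the reduced feature matrix (assumption: 10 features).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# RBF SVM with C=1000 and gamma=0.001, scored over 5 folds with 10 jobs.
clf = SVC(kernel="rbf", C=1000, gamma=0.001)
scores = cross_val_score(clf, X, y, cv=5, n_jobs=10)
print(len(scores))  # → 5
```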
The metrics are separately calculated for each class.
In Table 9, the labels of the classes are related to the
ten categories of normal and attacks: Normal, Analysis,
Reconnaissance, DoS, Fuzzers, Exploits, Backdoor, Generic,
Shellcode and Worms. A comparison of the methods reveals
that our method outperforms the others and the execution
time of its algorithm is 16 seconds and 390 milliseconds.
Finally, we compare the accuracy of our proposed method
with the ones of the other papers in Table 10.
Experiment 5 In this experiment, we apply our method to the UNB CICIDS 2017 dataset.7 The fifth experiment addresses multiclass classification in the Apache Spark big data framework.
The first step is similar to that of Experiment 2. In the
second step, we apply SVM implemented in Apache Spark,
LIBSVM, Python Machine Learning, PySpark and Sklearn
packages.
Table 6 Performance metrics for binary classification in spark
Metric Precision [0] Precision [1] Precision Recall [0] Recall [1] Recall F-measure Accuracy FPR TPR ROC PR Time (min:s)
Value 0.9441 0.8342 0.8848 0.8293 0.9459 0.8848 0.8846 0.8848 0.1095 0.8848 0.8876 0.9029 1:23
7 http://www.unb.ca/cic/datasets/ids-2017.html.
For SVM, the Radial Basis Function (RBF) kernel is used
and the parameters c and gamma are set to 1000 and 0.001.
We run the algorithm for each class of attacks individually.
In fact, this step is executed 14 times and, in each step, the
label of the intended class is set to 1 and the labels of the
others are set to 0.
In the third step, we apply decision tree without Spark to
reduce the false alarm rate and improve the accuracy. The
results of this experiment are shown in Table 11.
7 Conclusion
Studying the literature reveals that, as the use of the Internet increases, intrusion detection is considered an important security issue. Therefore, the main goal of this paper is to address scalable network intrusion detection in a big data framework. Besides, one has to keep in mind that detection accuracy, time and cost are of importance as well. Thus, the present study used deep learning methods to improve detection accuracy, decrease the error rate, improve prediction speed and save time and cost.
This paper proposed a hybrid SAE–SVM scheme for a
fast and efficient cyber security intrusion detection system.
In the proposed system, a stacked autoencoder network was
used as a feature extraction method and SVM as the classifier.
The deep network platform outperformed other feature
extraction methods.
Also, the performance of the proposed framework was
evaluated using the big data processing tool of Apache Spark
and machine learning algorithms. We examined the performance
of the proposed SAE–SVM scheme by reducing the
42-dimensional ISCX dataset to approximately 75% of its
original size and then classified the reduced data by SVM
in Spark.
Table 7 Performance metrics for all the algorithms
Method Accuracy Precision Recall F-Measure Test error Time (s) FPR TPR ROC
Decision tree 0.75322 0.73024 0.75322 0.71585 0.24677 49 – – –
Decision tree OVR 0.74759 0.80435 0.74759 0.764713 – 19 0.088 0.74759 0.87239
Logistic regression OVR 0.76044 0.75363 0.76044 0.7399 0.239554 12 – – –
Naive Bayes 0.65650 0.72207 0.65650 0.66572 – 14 0.083 0.65650 0.77247
Random forest 0.7526 0.70733 0.75263 0.71253 0.24737 50 – – –
Table 8 Performance metrics for decision tree in the third step
Metric DT in Spark DT
Precision 0.6301 0.9598
Recall 0.6851 –
Accuracy 0.6851 0.9598
F-measure 0.5987 –
Test error 0.3148 –
Time 144 ms 5 min
Table 9 Performance metrics for SVM without Spark
Precision Recall F-measure
[0] 0.87 0.94 0.90
[1] 0.71 0.56 0.62
[2] 0.11 0.00 0.00
[3] 0.53 0.04 0.07
[4] 0.72 0.84 0.77
[5] 0.60 0.01 0.02
[6] 0.47 0.50 0.48
[7] 0.50 0.11 0.17
[8] 0.59 0.22 0.32
[9] 0.99 0.89 0.89
Average 0.76 0.78 0.75
Table 10 Comparison of the results
Method Accuracy (%)
[15] 78.65
[22] 86.2
[43] 89.30
SAE–SVM in big data framework (our method) 95.98
Table 11 Performance metrics for CICIDS 2017 dataset
Metric Precision Recall Accuracy F-measure Time (min)
SAE–SVM in big data framework (our method) 0.9949 0.9949 0.9949 0.9941 6.15
Removing highly correlated features from ISCX 2012
dataset affects the accuracy by a low margin, but it reduces
the time taken, by all the techniques, to train the model or
predict the data. If we extract the latent features with SAE,
the accuracy increases and the execution time reduces. In the
last phase, if we use decision tree, the accuracy increases by
a large margin and we can increase the speed of execution,
as well.
Acknowledgements The authors would like to thank Mr. Behdad
Behmadi for his contribution to the English copy editing of this paper.
Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of
interest.
Ethical approval This article does not contain any studies with animals
performed by any of the authors.
Informed consent Informed consent was obtained from all individual
participants included in the study.
References
1. Abolhasanzadeh, B.: Nonlinear dimensionality reduction for intrusion
detection using auto-encoder bottleneck features. In: 2015 7th
Conference on Information and Knowledge Technology (IKT), pp.
1–5. IEEE (2015)
2. Aljawarneh, S., Aldwairi, M., Yassein, M.B.: Anomaly-based
intrusion detection system through feature selection analysis and
building hybrid efficient model. J. Comput. Sci. 25, 152–160
(2017)
3. Alom, Md.Z., Bontupalli, V., Taha, T.M.: Intrusion detection
using deep belief networks. In: 2015 National Aerospace and
Electronics Conference (NAECON), pp. 339–344. IEEE (2015)
4. Benaicha, S.E., Saoudi, L., Guermeche, S.E.B., Lounis, O.: Intrusion
detection system using genetic algorithm. In: Science and
Information Conference (SAI), pp. 564–568. IEEE (2014)
5. Bijone, M.: A survey on secure network: intrusion detection &
prevention approaches. Am. J. Inf. Syst. 4(3), 69–88 (2016)
6. Brown, J., Anwar, M., Dozier, G.: Intrusion detection using a
multiple-detector set artificial immune system. In: 2016 IEEE 17th
International Conference on Information Reuse and Integration
(IRI), pp. 283–286. IEEE (2016)
7. Chitrakar, R., Huang, C.: Selection of candidate support vectors in
incremental SVM for network intrusion detection. Comput. Secur.
45, 231–241 (2014)
8. Deshmukh, D.H., Ghorpade, T., Padiya, P.: Intrusion detection
system by improved preprocessing methods and naïve bayes classifier
using NSL-KDD 99 dataset. In: 2014 International Conference
on Electronics and Communication Systems (ICECS), pp.
1–7. IEEE (2014)
9. Dong, B., Wang, X.: Comparison deep learning method to traditional
methods using for network intrusion detection. In: 2016 8th
IEEE International Conference on Communication Software and
Networks (ICCSN), pp. 581–585 (2016)
10. El-Alfy, E.-S.M., Alshammari, M.A.: Towards scalable rough
set based attribute subset selection for intrusion detection using
parallel genetic algorithm in mapreduce. Simul. Model. Pract.
Theory 64, 18–29 (2016)
11. Essid, M., Jemili, F.: Combining intrusion detection datasets using
Mapreduce. In: 2016 IEEE International Conference on Systems,
Man, and Cybernetics (SMC), pp. 4724–4728. IEEE (2016)
12. Fiore, U., Palmieri, F., Castiglione, A., De Santis, A.: Network
anomaly detection with the restricted Boltzmann machine. Neurocomputing
122, 13–23 (2013)
13. Gao, N., Gao, L., Gao, Q., Wang, H.: An intrusion detection
model based on deep belief networks. In: 2014 Second International
Conference on Advanced Cloud and Big Data (CBD), pp.
247–252. IEEE (2014)
14. Gouveia, A., Correia, M.: Feature set tuning in statistical learning
network intrusion detection. In: 2016 IEEE 15th International
Symposium on Network Computing and Applications (NCA), pp.
68–75. IEEE (2016)
15. Gouveia, A., Correia, M.: A systematic approach for the application
of restricted Boltzmann machines in network intrusion
detection. In: International Work-Conference on Artificial Neural
Networks, Vol. 10305, pp. 432–446. Springer, Berlin (2017)
16. Gupta, G.P., Kulariya, M.: A framework for fast and efficient cyber
security network intrusion detection using Apache Spark. Procedia
Comput. Sci. 93, 824–831 (2016)
17. Han, L.: Using a dynamic k-means algorithm to detect anomaly
activities. In: 2011 Seventh International Conference on Computational
Intelligence and Security (CIS)
18. Heaton, J., Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning.
Genet. Program. Evolvable. Mach. 19(1–2), 305–307 (2018)
19. Hodo, E., Bellekens, X., Hamilton, A., Tachtatzis, C., Atkinson,
R.: Shallow and deep networks intrusion detection system: a taxonomy
and survey. CoRR, arXiv:1701.02145 (2017)
20. University of California, Irvine, Information and Computer Science:
KDD Cup 1999 Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (1999)
21. Jakkula, V.: Tutorial on Support Vector Machine (SVM), p. 37.
School of EECS, Washington State University (2006)
22. Kato, K., Klyuev, V.: Development of a network intrusion detection
system using apache Hadoop and spark. In: 2017 IEEE Conference
on Dependable and Secure Computing, pp. 416–423.
IEEE (2017)
23. Kim, J., Kim, J., Thu, H.L.T., Kim, H.: Long short term memory
recurrent neural network classifier for intrusion detection. In: 2016
International Conference on Platform Technology and Service
(PlatCon), pp. 1–5. IEEE (2016)
24. Kuang, F., Weihong, X., Zhang, S.: A novel hybrid KPCA and
SVM with GA model for intrusion detection. Appl. Soft Comput.
18, 178–184 (2014)
25. Kulariya, M., Saraf, P., Ranjan, R., Gupta, G.P.: Performance
analysis of network intrusion detection schemes using Apache
Spark. In: 2016 International Conference on Communication and
Signal Processing (ICCSP), pp. 1973–1977. IEEE (2016)
26. Laney, D.: 3d data management: controlling data volume, velocity
and variety. META Group Res. Note 6(70), 1 (2001)
27. Li, Y., Ma, R., Jiao, R.: A hybrid malicious code detection method
based on deep learning. Methods 9(5), 205–216 (2015)
28. Li, Z., Li, Y., Xu, L.: Anomaly intrusion detection method based
on k-means clustering algorithm with particle swarm optimization.
In: 2011 International Conference on Information Technology,
Computer Engineering and Management Sciences (ICM),
Vol. 2, pp. 157–161. IEEE (2011)
29. Masarat, S., Taheri, H., Sharifian, S.: A novel framework, based
on fuzzy ensemble of classifiers for intrusion detection systems.
In: 2014 4th International eConference on Computer and Knowledge
Engineering (ICCKE), pp. 165–170. IEEE (2014)
30. Muda, Z., Yassin, W., Sulaiman, M.N., Udzir, N.I.: Intrusion
detection based on k-means clustering and naïve Bayes
classification. In: 2011 7th International Conference on Information
Technology in Asia (CITA 11), pp. 1–6. IEEE (2011)
31. Mukkamala, S., Janoski, G., Sung, A.: Intrusion detection: support
vector machines and neural networks. In: Proceedings of the IEEE
International Joint Conference on Neural Networks (ANNIE), St.
Louis, MO, pp. 1702–1707 (2002)
32. Myles, A.J., Feudale, R.N., Liu, Y., Woody, N.A., Brown, S.D.:
An introduction to decision tree modeling. J. Chemom. A J. Chemom.
Soc. 18(6), 275–285 (2004)
33. Mighan, S.N., Kahani, M.: Deep learning based latent feature
extraction for intrusion detection. In: 26th Iranian Conference on
Electrical Engineering (ICEE2018) (2018)
34. Nazari, Z., Noferesti, M., Jalili, R.: DSCA: an inline and adaptive
application identification approach in encrypted network traffic.
In: Proceedings of the 3rd International Conference on Cryptography,
Security and Privacy, pp. 39–43. ACM (2019)
35. Rathore, M.M., Ahmad, A., Paul, A.: Real time intrusion detection
system for ultra-high-speed big data environments. J. Supercomput.
72(9), 3489–3510 (2016)
36. Salama, M.A., Eid, H.F., Ramadan, R.A., Darwish, A., Hassanien,
A.E.: Hybrid intelligent intrusion detection scheme. In: Gaspar-
Cunha, A., Takahashi, R., Schaefer, G., Costa, L. (eds.) Soft Computing
in Industrial Applications. Advances in Intelligent and Soft
Computing, vol. 96, pp. 293–303. Springer, Berlin (2011)
37. Shiravi, A., Shiravi, H., Tavallaee, M., Ghorbani, A.A.: Toward
developing a systematic approach to generate benchmark datasets
for intrusion detection. Comput. Secur. 31(3), 357–374 (2012)
38. Siddique, K., Akhtar, Z., Lee, H., Kim, W., Kim, Y.: Toward
bulk synchronous parallel-based machine learning techniques for
anomaly detection in high-speed big data networks. Symmetry
9(9), 197 (2017)
39. Soheily-Khah, S., Marteau, P.-F., Béchet, N.: Intrusion detection
in network systems through hybrid supervised and unsupervised
mining process: a detailed case study on the ISCX benchmark
dataset. In: 2018 1st International Conference on Data Intelligence
and Security (ICDIS). IEEE (2018)
40. Stallings, W.: Cryptography and Network Security: Principles and
Practice. Pearson, Upper Saddle River (2017)
41. Thaseen, I.S., Kumar, Ch.A.: Intrusion detection model using
fusion of PCA and optimized SVM. In: 2014 International Conference
on Contemporary Computing and Informatics (IC3I), pp.
879–884. IEEE (2014)
42. UNB-ISCX: NSL KDD Dataset. http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html (2009)
43. Wang, B., Zheng, Y., Lou, W., Hou, Y.T.: Ddos attack protection
in the era of cloud computing and software-defined networking.
Comput. Netw. 81, 308–319 (2015)
44. Wang, Y., Cai, W., Wei, P.: A deep learning approach for detecting
malicious Javascript code. Secur. Commun. Netw. 9(11), 1520–
1534 (2016)
45. Wang, Y., Yao, H., Zhao, S.: Auto-encoder based dimensionality
reduction. Neurocomputing 184, 232–242 (2016)
46. Wang, Z.: The Applications of Deep Learning on Traffic Identification.
BlackHat USA (2015)
47. Watson, G.: A Comparison of Header and Deep Packet Features
When Detecting Network Intrusions. Technical Report (2018)
48. Wu, S.X., Banzhaf, W.: The use of computational intelligence in
intrusion detection systems: a review. Appl. Soft Comput. 10(1),
1–35 (2010)
49. Zikopoulos, P., Deroos, D., Parasuraman, K., Deutsch, T., Giles,
J., Corrigan, D.: Harness the Power of Big Data: The IBM Big
Data Platform. McGraw-Hill, New York (2013)
50. Zuech, R., Khoshgoftaar, T.M., Wald, R.: Intrusion detection and
big heterogeneous data: a survey. J. Big Data 2(1), 3 (2015)
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
