Data Extraction using Gaussian Distribution Based tree Algorithm

Vijayalakshmi.R, Asst. Prof., VCET, Madurai. E-mail: [email protected]
M.Salomi, Asst. Prof., VCET, Madurai. E-mail: [email protected]
S.Kavitha, Asst. Prof., VCET, Madurai. E-mail: [email protected]

Abstract: Data mining techniques are used in fault prediction models to improve software quality. Early detection of high-risk modules can direct quality enhancement efforts to the modules that are likely to have a large number of faults. The objective of this paper is to mine the maximum values of data from the given data sets so that processor memory is used properly and efficiently. The paper discusses the improvement of memory management for a wireless environmental monitoring system using a distribution based decision tree algorithm. Continuous distribution methods are considered for the analysis, and a comparison of the Gaussian distribution method with Averaging is presented. As a result, an efficient data mining scheme for wireless sensor networks using decision tree algorithms is obtained.

Key Words: Averaging, Data mining, Decision Tree, Data sets, Gaussian Distribution.

Vijayalakshmi.R, Department of Computer Science and Engineering, Velammal College of Engineering and Technology, Madurai, Mob: 9715373764 (e-mail: [email protected]).

I. INTRODUCTION
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information, which can be used to improve decision making. Data mining is one of the most useful analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified [1]. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining consists of five major steps:
1. Extract, transform, and load the transaction data onto the data warehouse system.
2. Store and maintain the data in a multidimensional database system.
3. Provide data access to expert analysts and information technology professionals.
4. Analyze the data using application software.
5. Display the data in a useful format, such as a graph or table.

Various types of techniques are available:
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
Decision trees: Tree-based structures that represent sets of decisions. These decisions form the rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). Both are used for classification of a dataset. They provide a set of rules that can be applied to a new (unclassified) dataset to predict which records will have a given outcome [2]. CART splits a dataset by creating 2-way splits, while CHAID uses chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
Nearest neighbor method: A method that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes known as the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data sets based on statistical significance.
Data visualization: The visual representation of complex relationships in multidimensional data. Graphics tools are used to illustrate the data relationships.

II. DECISION TREES
Decision trees are often used in classification and prediction [3]. They are an easy and efficient way of representing knowledge. The models produced by decision trees are represented in the form of a tree structure. A leaf node indicates the class of the examples. Instances are classified by sorting them down the tree from the root node to some leaf node. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments [5]. These segments form an inverted decision tree that originates with a root node at the top of the tree. The object of analysis is reflected in this root node as a simple, one-dimensional display in the decision tree interface. The name of the field of data that is the object of analysis is usually shown, along with the spread or distribution of the values contained in that field.

Fig 1: Illustration of decision tree

A. FEATURES OF DECISION TREE
These trees generate simple, understandable rules. One can look into the tree, clearly understand each split, observe the impact of that split, and even compare it to alternative splits.
Decision trees are non-parametric, meaning that no specific data distribution is required.
Decision trees can handle both continuous and categorical variables.
Decision trees handle missing values as easily as any normal value of the variable.
Fine tuning is possible in decision trees. One can set the depth of the tree, the minimum number of observations needed for a split or for a leaf, and the number of leaves per split (in the case of multilevel target variables).
Among data mining algorithms, decision trees are one of the best independent variable selection algorithms. They are fast and, unlike computing simple correlations with the target, they also take into account the interactions between variables.
Weak learners are useful when many of them are combined in ensembles, because ensemble methods such as bagging, boosting, random forests and tree nets become very powerful algorithms when the individual models are weak learners.
Decision trees identify subgroups. Each terminal or intermediate leaf in a decision tree can be seen as a subgroup or segment of the population.
Decision trees work fast even with large numbers of observations and variables.
Decision trees can handle unbalanced datasets, for example when 0.1% of the targets are positive and 99.9% are negative.

B. GAUSSIAN DISTRIBUTION
The Gaussian distribution (the "bell-shaped curve", which is symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions. In general, the Gaussian distribution provides a good model for a random variable when:
1. There is a strong tendency for the variable to take a central value;
2. Positive and negative deviations from this central value are equally likely;
3. The frequency of deviations falls off rapidly as the deviations become larger.

As an underlying mechanism that produces the Gaussian distribution, we can think of an infinite number of independent random (binomial) events that bring about the values of a particular variable. For example, there are probably a nearly infinite number of factors that determine a person's height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population. The Gaussian distribution function is determined by the following formula:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where \mu is the mean and \sigma is the standard deviation.

III. ALGORITHMS
A. AVERAGING
A straightforward way to deal with uncertain information is to replace each PDF with its expected value, effectively transforming the data tuples into point-valued tuples. This reduces the problem to that for point-valued data, and hence classic decision tree algorithms such as ID3 and C4.5 [3] can be reused. This approach is called Averaging (AVG). AVG is a greedy algorithm that builds a tree top-down, from the root to the children. When processing a node, we examine a set of tuples S. The algorithm begins with the root node and with S being the set of all training tuples.

Algorithm Steps:
Step 1: Get input from the dataset, which contains n data items.
Step 2: Split the dataset into groups according to the requirement.
Step 3: Calculate the average of each group using the formula: sum of the numbers in the group / number of data items in the group.
Step 4: Change the group lengths and go to Step 2.
Step 5: Calculate the accuracy percentage.

B. DISTRIBUTION BASED TREE ALGORITHM
The key to building a good decision tree is a good choice of an attribute Ajn and a split point zn for each node n. After an attribute Ajn and a split point zn have been chosen for a node n, the set of tuples S has to be split into a number of groups. Each value is modelled by a Gaussian distribution with a chosen standard deviation. The pdf is generated using s sample points in the interval. Using this method, we transform a data set with point values into one in which each value is represented by a pdf.

Algorithm Steps:
Step 1: Get input from the dataset, which contains n data items.
Step 2: Split the dataset into groups according to the requirement.
Step 3: Calculate the Gaussian distribution of each group using the formula: sum of the numbers in the group / number of data items in the group (the group mean).
Step 4: Change the group lengths and go to Step 2.
Step 5: Calculate the accuracy percentage.

Separate columns are provided for each type of calculation and the results are obtained separately, giving a clear view for analysis and performance evaluation.
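The grouping steps listed for the two algorithms can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' Java implementation: the function names, the sample readings, the use of the population standard deviation for each group, and the evenly spaced sample points are all assumptions made here. The accuracy calculation of Step 5 is not described in the text, so it is left out.

```python
import math
from statistics import mean, pstdev


def split_into_groups(values, group_length):
    """Step 2: cut the readings into consecutive groups of at most group_length items."""
    return [values[i:i + group_length] for i in range(0, len(values), group_length)]


def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma (sigma must be > 0)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def summarise_groups(values, group_length):
    """Step 3: per group, the AVG point value and the parameters of a Gaussian.

    Assumption: the Gaussian for a group uses the group mean and the
    population standard deviation of the group.
    """
    summaries = []
    for group in split_into_groups(values, group_length):
        mu = mean(group)
        summaries.append({"avg": mu, "mu": mu, "sigma": pstdev(group)})
    return summaries


def sample_pdf(mu, sigma, low, high, s):
    """Evaluate the Gaussian density at s evenly spaced points in [low, high] (s >= 2)."""
    xs = [low + (high - low) * k / (s - 1) for k in range(s)]
    return [(x, gaussian_pdf(x, mu, sigma)) for x in xs]


if __name__ == "__main__":
    # Made-up temperature readings, for illustration only.
    readings = [20.0, 22.0, 24.0, 30.0, 30.0, 33.0]
    for length in (2, 3):  # Step 4: repeat with a different group length
        print(length, summarise_groups(readings, length))
    first = summarise_groups(readings, 3)[0]
    print(sample_pdf(first["mu"], first["sigma"], 20.0, 24.0, 5))
```

In this sketch AVG keeps only the single number `avg` per group, while the distribution based variant also keeps `sigma`, so the spread of each group remains available when split points are chosen.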

Fig 2: Flow chart of the proposed system

IV. RESULTS AND DISCUSSION
The data mining simulation was carried out in the NetBeans IDE, using a dataset of temperature, pressure and humidity readings. The main code was developed in Java, and a performance graph was obtained.

Fig 3: NetBeans IDE

The project was created using Java Swing, and developed and compiled in the IDE. The results obtained are shown in the figures below:

Fig 4: Login page

Fig 5: Main page

The tree generated as the result of the Gaussian distribution is given below:

Fig 6: Generation of tree using Gaussian distribution

From this tree, the highest node value is considered when making decisions. Thus the decision making algorithm has been successfully implemented.

V. PERFORMANCE EVALUATION

Fig 7: Performance graph

From this graph, it is clear that the Gaussian distribution based tree algorithm is more efficient than the Averaging method.

                       Accuracy Achieved (%)
                       Existing Method (AVG)       Proposed Method (GDA)
S.No  Dataset     S    5     10    15    20        5     10    15    20
1     RL Current  500  95.7  94.1  93.8  93.2      98.7  97.6  96.7  95.3
2     Temp        470  91.9  92.6  92.2  89.1      96.3  93.7  94.1  92.3
3     Pressure    400  95.9  95.5  94.3  93.2      98.4  96.2  95.7  94.8

Table 1: Comparison of AVG and GDA

Table 1 shows the results obtained from the existing method (AVG) and from the proposed method. The comparison shows that the proposed Gaussian Distribution based Tree Algorithm (GDA) gives higher accuracy than the existing method (AVG).

VI. CONCLUSION
The distribution based tree algorithm and the averaging method for building decision trees to extract the maximum values of data in datasets have been implemented, and the proposed algorithm has been shown to give better performance. It is found that, when suitable PDFs are used, the method in this paper achieves remarkably higher accuracies. We therefore recommend that data be collected and stored with the PDF information intact.
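The comparison reported in Table 1 can be checked with a few lines of arithmetic. The sketch below is a reader's aid rather than part of the original evaluation. It assumes that the flattened table is read column by column (three dataset values under each of the headings 5, 10, 15 and 20) and that the three datasets are named "RL Current", "Temp" and "Pressure"; the variable and function names are hypothetical.

```python
# Accuracy values from Table 1, listed in the column order 5, 10, 15, 20.
# Assumption: the table is read column by column, three datasets per column.
avg_accuracy = {
    "RL Current": [95.7, 94.1, 93.8, 93.2],
    "Temp": [91.9, 92.6, 92.2, 89.1],
    "Pressure": [95.9, 95.5, 94.3, 93.2],
}
gda_accuracy = {
    "RL Current": [98.7, 97.6, 96.7, 95.3],
    "Temp": [96.3, 93.7, 94.1, 92.3],
    "Pressure": [98.4, 96.2, 95.7, 94.8],
}


def mean_gain(dataset):
    """Average of (GDA accuracy - AVG accuracy) over the four table columns."""
    pairs = zip(avg_accuracy[dataset], gda_accuracy[dataset])
    gains = [gda - avg for avg, gda in pairs]
    return sum(gains) / len(gains)


def gda_always_higher():
    """True if GDA is above AVG in every cell of the table."""
    return all(
        gda > avg
        for name in avg_accuracy
        for avg, gda in zip(avg_accuracy[name], gda_accuracy[name])
    )


if __name__ == "__main__":
    for name in avg_accuracy:
        print(name, round(mean_gain(name), 2))
    print("GDA higher in every cell:", gda_always_higher())
```

Under this reading, GDA is higher than AVG in every cell of the table, which is consistent with the claim made in the text.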

REFERENCES
[1] S. Tsang, B. Kao, K.Y. Yip, W.-S. Ho, and S.D. Lee, “Decision Trees for Uncertain Data,” IEEE Transactions . Data Eng. (ICDE), pp. 441-444, Jan. 2011.
[2] M. Umanol, H. Okamoto, I. Hatono, H. Tamura, F. Kawachi, S. Umedzu, and J. Kinoshita, “Fuzzy Decision Trees by Fuzzy ID3 Algorithm and Its Application to Diagnosis Systems,” Proc. IEEE Conf. Fuzzy Systems, IEEE World Congress Computational Intelligence, vol. 3, pp. 2113-2118, June 1994.
[3] J.R. Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann, 1993.
[4] Decision Trees for Business Intelligence and Data Mining: Using SAS Enterprise Miner.
[5] N.N. Dalvi and D. Suciu, “Efficient Query Evaluation on Probabilistic Databases,” The VLDB J., vol. 16, no. 4, pp. 523-544, 2007.
[6] L. Breiman, “Technical Note: Some Properties of Splitting Criteria,” Machine Learning, vol. 24, no. 1, pp. 41-47, 1996.
[7] http://www.the-data-mine.com.
[8] W.K. Ngai, B. Kao, C.K. Chui, R. Cheng, M. Chau, and K.Y. Yip, “Efficient Clustering of Uncertain Data,” Proc. Int’l Conf. Data Mining (ICDM), pp. 436-445, Dec. 2006.
[9] C.K. Chui, B. Kao, and E. Hung, “Mining Frequent Itemsets from Uncertain Data,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), pp. 47-58, May 2007.
[10] C. Olaru and L. Wehenkel, “A Complete Fuzzy Decision Tree Technique,” Fuzzy Sets and Systems, vol. 138, no. 2, pp. 221-254, 2003.
[11] L. Breiman, “Technical Note: Some Properties of Splitting Criteria,” Machine Learning, vol. 24, no. 1, pp. 41-47, 1996.
[12] T. Elomaa and J. Rousu, “Efficient Multisplitting Revisited: Optima-Preserving Elimination of Partition Candidates,” Data Mining and Knowledge Discovery, vol. 8, no. 2, pp. 97-126, 2004.
[13] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J.S. Vitter, “Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 876-887, Aug./Sept. 2004.
[14] J. Chen and R. Cheng, “Efficient Evaluation of Imprecise Location-Dependent Queries,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 586-595, Apr. 2007.
[15] N.N. Dalvi and D. Suciu, “Efficient Query Evaluation on Probabilistic Databases,” The VLDB J., vol. 16, no. 4, pp. 523-544, 2007.
