10BM60027: WEKA Data Mining Techniques

Published on May 2016


Term paper on using data mining techniques in WEKA tool


VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

Weka: Data Mining
IT for Business Intelligence Course
Term paper on using some of the data mining techniques with Weka tool

Submitted by: Gaurav Arora 10BM60027 MBA 2010-2012

Table of Contents
- About Weka
- Features
- Data Used
  - For Regression
  - For Clustering
- Regression Analysis
- Cluster Analysis

About Weka
Weka (Waikato Environment for Knowledge Analysis) is a tool developed at the University of Waikato in New Zealand, originally for extracting information from raw data gathered in agricultural domains. Weka supports many standard data mining tasks, such as data preprocessing, classification, regression, clustering, visualization and feature selection. The premise of the application is to derive useful information, in the form of trends and patterns, from raw data. It is an open source application that is freely available under the GNU General Public License. It was originally written in C; later it was completely rewritten in Java and is now compatible with almost every computing platform. It has a user-friendly graphical interface that allows for quick setup and operation. Attribute-Relation File Format (ARFF) is the text file format used by Weka to store data sets.
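To make the ARFF layout concrete, here is a minimal Python sketch that writes a tiny data set as ARFF text. The `@relation`, `@attribute` and `@data` keywords belong to the format. The helper name `to_arff`, the relation name `demo_usage` and the two sample rows are assumptions made up for illustration; they are not taken from the data used in this paper.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, type) pairs; type is either an ARFF
    type name such as "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attributes list their allowed values in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # Each instance is one comma-separated line.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
example = to_arff(
    "demo_usage",
    [("age", "numeric"), ("internet_home", ["Yes", "No"])],
    [(23, "Yes"), (27, "No")],
)
print(example)
```

A file with this content can be opened from the Preprocess tab of the Explorer.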

Features
There are four options available on the initial screen:

- Simple CLI: gives users who are not using the graphical interface the ability to execute commands from a terminal window.
- Explorer: the graphical interface used to conduct experimentation on raw data.
- Experimenter: allows users to conduct different experimental variations on data sets and perform statistical manipulation.
- Knowledge Flow: same functionality as Explorer but with drag and drop functionality. The advantage of this option is that it supports incremental learning from previous results.

Main tabs provided in Explorer are:

- Preprocess: used to choose the data file to be used by the application.
- Classify: used to train and test different learning schemes on the preprocessed data file under experimentation.
- Cluster: used to apply different tools that identify clusters within the data file.
- Associate: used to apply different rules to the data file that identify associations within the data.
- Select attributes: used to apply different rules to reveal changes based on the inclusion or exclusion of selected attributes from the experiment.
- Visualize: used to see what the various manipulations produced on the data set in a 2D format, as scatter plot and bar graph output.

Data Used
For Regression:
Data from an open source, the World Bank website (http://data.worldbank.org/country/india), is used to perform the regression analysis. The data contains various variables linked to the GDP of India. A total of 21 records of data from Year 1999 to 2009 is used. All data are in current U.S. dollars.

- Exports_of_goods_and_services: Exports of goods and services comprise all transactions between residents of a country and the rest of the world involving general merchandise, goods sent for processing and repairs, nonmonetary gold, and services.
- Imports_of_goods_and_services: Imports of goods, services and income is the sum of goods (merchandise) imports, imports of (nonfactor) services and income (factor) payments.
- Agriculture_value_added: Agriculture includes forestry, hunting, and fishing, as well as cultivation of crops and livestock production. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
- Manufacturing_value_added: Manufacturing refers to industries belonging to ISIC divisions 15-37. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
- Industry_value_added: Industry includes manufacturing (ISIC divisions 15-37). It comprises value added in mining, manufacturing (also reported as a separate subgroup), construction, electricity, water, and gas. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
- Services_value_added: Services includes value added in wholesale and retail trade (including hotels and restaurants), transport, and government, financial, professional, and personal services such as education, health care, and real estate services. Also included are imputed bank service charges, import duties, and any statistical discrepancies noted by national compilers as well as discrepancies arising from rescaling. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
- GDP: GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products.
- Gross_savings: Gross savings are calculated as gross national income less total consumption, plus net transfers.

For Clustering:
The second data set comes from a survey of mobile subscribers, carried out to understand their SMS/GPRS pack usage and spending patterns (done for Aircel as part of a summer project). A total of 221 records are used.

Variables and the corresponding questions asked in the survey:

- internet_hrs: Daily hours spent on Internet through your Mobile?
- internet_home: Do you have Internet connection at home?
- income: Your monthly Income level?
- occupation: Your Occupation?
- age: Your Age?
- travel_hrs: Daily time spent travelling/on the move?
- sms_pack: Which SMS Pack you use/prefer to use?
- sms_bill: How much you spend for SMS Packs monthly?
- gprs_pack: Which GPRS Pack you use/prefer to use?
- gprs_bill: How much you spend for GPRS Packs monthly?
- month_bill: How much is your monthly mobile bill around?

Regression Analysis
The regression model is used to predict the value of an unknown dependent variable (GDP), given the values of the independent variables. We take a number of independent variables together and find their relation to our dependent variable (GDP).

Weka Run information

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: GDP_regression-weka.filters.unsupervised.attribute.Remove-R1,11-weka.filters.unsupervised.attribute.Remove-R1-2
Instances: 21
Attributes: 8
Test mode: split 66.0% train, remainder test

With this test mode, 14 entries are used to create the model that is evaluated and the remaining 7 to test its validity. The classifier model reported below is the one Weka builds on the full training set.
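The 14/7 split follows from the 66% setting applied to 21 instances. The sketch below reproduces that arithmetic; it assumes the training size is the percentage of the instance count rounded half up, and the function name `percentage_split` is made up for illustration.

```python
import math


def percentage_split(n_instances, train_percent):
    """Return (train_size, test_size) for a percentage split.

    Assumption: the training size is n * percent / 100, rounded half up.
    """
    train_size = math.floor(n_instances * train_percent / 100 + 0.5)
    return train_size, n_instances - train_size


# 21 * 0.66 = 13.86, which rounds to 14 training instances.
train_size, test_size = percentage_split(21, 66.0)
print(train_size, test_size)
```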

Linear Regression Model developed

GDP =
      0.4179 * Exports_of_goods_and_services +
     -0.3048 * Imports_of_goods_and_services +
      1.3542 * Agriculture_value_added +
      0.6773 * Industry_value_added +
      0.9744 * Services_value_added +
      0.464  * Gross_savings +
     -1672945075143.6054

From the above model we can draw the following conclusions:

- GDP is related to six of the seven independent variables taken for study; Manufacturing_value_added does not appear in the fitted model.
- Services, Agriculture and Industry value added all contribute positively to GDP.
- An increase in Exports or Gross Savings also raises the GDP value.
- Imports, on the other hand, are inversely related and decrease the GDP value.
- Agriculture has the largest coefficient, followed by Services and then Industry.
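The fitted equation can be evaluated directly. The following Python sketch copies the coefficients and intercept from the model above; the function names `predict_gdp` and `marginal_effect` and the all-zero record used in the test are assumptions for illustration, not part of the Weka output.

```python
# Coefficients and intercept copied from the linear regression model above.
COEFFICIENTS = {
    "Exports_of_goods_and_services": 0.4179,
    "Imports_of_goods_and_services": -0.3048,
    "Agriculture_value_added": 1.3542,
    "Industry_value_added": 0.6773,
    "Services_value_added": 0.9744,
    "Gross_savings": 0.464,
}
INTERCEPT = -1672945075143.6054


def predict_gdp(record):
    """Evaluate the linear model for one record (dict of attribute -> value)."""
    return INTERCEPT + sum(
        coefficient * record[name] for name, coefficient in COEFFICIENTS.items()
    )


def marginal_effect(name, delta):
    """Change in predicted GDP when one attribute changes by delta, others fixed."""
    return COEFFICIENTS[name] * delta


# Per unit of increase, Agriculture moves the prediction more than Services or Industry.
for name in ("Agriculture_value_added", "Services_value_added", "Industry_value_added"):
    print(name, marginal_effect(name, 1.0))
```

Comparing coefficients this way is only meaningful because all attributes are expressed in the same unit.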

Predictions on test split

inst#   actual            predicted             error
1       32422100000000    32059372648333.712    -362727351666.289
2       5696240000000     5753533243202.538     57293243202.537
3       27546200000000    27612724209197.072    66524209197.07
4       36924900000000    36784939417643.64     -139960582356.359
5       49864300000000    49276857202715.168    -587442797284.828
6       17512000000000    17546685222165.234    34685222165.234
7       55826200000000    57273793936650.44     1447593936650.438

Evaluation on test split

Correlation coefficient          0.9994
Mean absolute error              385175334646.108
Root mean squared error          609530075729.5292
Relative absolute error          2.5776 %
Root relative squared error      3.4314 %
Total Number of Instances        7

From the results we can say:

- As the correlation coefficient is high (0.9994), the model fits the test data closely and gives a good estimate of India’s GDP.
- The small root relative squared error (3.4314 %) also indicates the model's accuracy.
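Two of the reported measures can be recomputed from the error column of the test-split predictions alone. The sketch below uses the standard definitions of mean absolute error and root mean squared error; the variable names are illustrative. The relative measures are not recomputed here because they also depend on a baseline predictor that is not shown in the output.

```python
import math

# Error column (predicted minus actual) from the predictions on the test split.
errors = [
    -362727351666.289,
    57293243202.537,
    66524209197.07,
    -139960582356.359,
    -587442797284.828,
    34685222165.234,
    1447593936650.438,
]

# Mean absolute error: average magnitude of the errors.
mae = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: square root of the average squared error.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"Mean absolute error:     {mae:.3f}")
print(f"Root mean squared error: {rmse:.3f}")
```

Both values should agree with the figures Weka reports, up to rounding. The last instance alone accounts for more than half of the total absolute error.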

Cluster Analysis
Cluster analysis allows us to group data, which can be useful for many marketing applications such as segmentation and new product launches. From this survey data we can thus find SMS/GPRS usage patterns and design our future products according to market needs.

Weka Run information

Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: Aircel_RAW-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R7-weka.filters.unsupervised.attribute.Remove-R7-weka.filters.unsupervised.attribute.Remove-R3
Instances: 221
Attributes: 11
Test mode: evaluate on training data

Clustering model (full training set)

kMeans clustering to find total clusters possible

Number of iterations: 4
Within cluster sum of squared errors: 571.5049648606962

Cluster centroids

Attribute        Full Data          Cluster#0          Cluster#1           Cluster#2
                 (221)              (84)               (72)                (65)
internet_hrs     2.2527             1.764              2.4954              2.6154
internet_home    Yes                Yes                Yes                 No
income           Nil                Nil                20K-40K             Nil
occupation       Student (UG/PG)    Student (UG/PG)    Service/Employee    Student (UG/PG)
age              24.1357            22.7143            27.6111             22.1231
travel_hrs       2.6371             2.7488             2.6389              2.4909
sms_pack         Monthly            Monthly            None                Monthly
sms_bill         30-50              30-50              Nil                 30-50
gprs_pack        Monthly            None               Monthly             Monthly
gprs_bill        >80                Nil                >80                 >80
month_bill       379.8643           310.7143           469.4444            370
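The idea behind the k-means run above can be sketched in a few lines of Python. This is a simplified illustration, not Weka's SimpleKMeans: it handles numeric attributes only, whereas the survey data also contains nominal attributes, and it replaces seeded random initialisation with a deterministic farthest-first choice of starting centroids so that the example is reproducible. The sample points are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two numeric points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def initial_centroids(points, k):
    """Farthest-first initialisation (an assumption, chosen for determinism)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iterations=500):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = initial_centroids(points, k)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                # New centroid is the attribute-wise mean of the members.
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Hypothetical two-attribute points forming three well-separated groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
centroids, assignment = k_means(points, k=3)
print(assignment)
```

For nominal attributes such as occupation, the centroid table above reports the most frequent value of each cluster rather than a mean.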

Model and evaluation on training set

Cluster    Instances
0          84 (38%)
1          72 (33%)
2          65 (29%)
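The cluster shares and the full-data month_bill centroid can be cross-checked from the cluster sizes and the per-cluster month_bill centroids reported above. The following short Python sketch does this; the variable names are illustrative.

```python
# Cluster sizes and month_bill centroids copied from the Weka output above.
cluster_sizes = {0: 84, 1: 72, 2: 65}
month_bill_centroid = {0: 310.7143, 1: 469.4444, 2: 370.0}

total = sum(cluster_sizes.values())

# Share of instances per cluster, rounded to whole percent.
shares = {c: round(100 * n / total) for c, n in cluster_sizes.items()}

# The full-data mean is the size-weighted mean of the cluster means.
overall_month_bill = (
    sum(cluster_sizes[c] * month_bill_centroid[c] for c in cluster_sizes) / total
)

print(total, shares, round(overall_month_bill, 2))
```

The weighted mean lands at about Rs.380, the overall average monthly bill used as the reference point in the cluster explanations.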

Clusters Explanation

Cluster0:

- This segment comprises students with an average age of around 23 years
- They don’t have any income
- Internet connection at home
- Avid users of SMS who spend a maximum of Rs.30-50 per month on SMS packs
- Spend quite some time travelling each day
- Don’t prefer to use GPRS packs for internet surfing and have lower than average use of about 1.8 hrs per day
- Monthly bill of Rs.300+

Cluster1:

- This segment comprises working professionals with an average age of around 28 years
- They have an average monthly income in the range of 20-40K, with internet at home
- Prefer to use internet on GPRS and take a monthly unlimited pack by paying Rs.80 or more
- Don’t use any particular SMS pack and have overall low usage of this VAS
- Spend quite some time travelling each day
- Higher than average mobile internet use of about 2.5 hrs per day
- Monthly bill of Rs.470 is considerably higher than the overall average of Rs.380

Cluster2:

- This segment comprises students with an average age of around 22 years
- They don’t have any income
- No internet connection at home
- Use both GPRS and SMS packs and spend Rs.130+ on these
- Monthly bill around the same as the overall average of Rs.380

Other points to be noted:

- Cluster 0, comprising students, prefers to communicate with friends through messages and does not use an internet pack because of the Internet facility at home.
- For Cluster 1, the young working people, a GPRS pack is a must as they need to stay updated on emails and social networking. They don’t want to spend on an SMS pack separately.
- We could come up with a GPRS and SMS combo pack that would cater to both these clusters.
- The pricing of this combo should be under the total Rs.130 mark, to show the end customer the utility of having both facilities monthly with no usage limit.
- Separate limited SMS/internet-hour cards can also be introduced at a price in the range Rs.60-Rs.80, which would be attractive to both segments and useful for Cluster 2 as well.
- Even some free Internet hours with SMS packs, and vice versa with GPRS packs, can increase trial of these services.
