Data and Data Exploration


Computer Science Department
Bogor Agricultural University

Data and Data Exploration
Lecture 2
2/9/2014

Outline

Data and data types
Data quality
Summary statistics
Visualization
Measures of similarity and dissimilarity

Note: all slides are taken from

Tan P.-N., Steinbach M., & Kumar V. 2006. Introduction to Data Mining. Pearson Education, Inc.
Han J. & Kamber M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego








What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
  Examples: eye color of a person, temperature, etc.
  An attribute is also known as a variable, field, characteristic, or feature
A collection of attributes describes an object
  An object is also known as a record, point, case, sample, entity, or instance



Attributes (columns) and objects (rows):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute values

Attribute values are numbers or symbols assigned to an attribute

Distinction between attributes and attribute values
  The same attribute can be mapped to different attribute values
    Example: height can be measured in feet or meters
  Different attributes can be mapped to the same set of values
    Example: attribute values for ID and age are integers
    But the properties of the attribute values can be different: ID has no limit, but age has a maximum and minimum value

Attribute types: Categorical (qualitative)

Nominal
  Description: The values of a nominal attribute are just different names; that is, nominal values provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, gender
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, street numbers, grades
  Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute types: Numeric (quantitative)

Interval
  Description: For interval attributes, the differences between values are meaningful; that is, a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Discrete and Continuous Attributes

Discrete Attribute
  Has only a finite or countably infinite set of values
  Examples: zip codes, counts, or the set of words in a collection of documents
  Often represented as integer variables
  Note: binary attributes are a special case of discrete attributes

Continuous Attribute
  Has real numbers as attribute values
  Examples: temperature, height, or weight
  In practice, real values can only be measured and represented using a finite number of digits
  Continuous attributes are typically represented as floating-point variables

Types of data sets

Record
  Data Matrix
  Document Data
  Transaction Data
Graph
  World Wide Web
  Molecular Structures
Ordered
  Spatial Data
  Temporal Data
  Sequential Data
  Genetic Sequence Data

Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Transaction Data

A special type of record data, where each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
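The five transactions listed above can be held as sets of items. The sketch below also derives a 0/1 row per transaction; that binary layout is an assumption for illustration (a common way to turn transaction data into fixed-width records), not something the slides prescribe.

```python
# Transactions from the table above, stored as sets of items keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Fixed column order: every distinct item, sorted alphabetically.
items = sorted(set().union(*transactions.values()))

# One binary row per transaction: 1 if the item was purchased, else 0.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in binary_rows.items():
    print(tid, row)
```

Each row now has the same fixed set of attributes, so the transactions can be treated like ordinary record data.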

Ordered Data
Temporal Data

Temporal data is data whose objects have attributes that represent measurements taken over time.
For example, a financial data set may consist of time series that give the daily prices of various stocks.
A time series is a sequence of measurements of some attribute (e.g., stock price or rainfall) taken at (usually regular) points in time.

Ordered Data
Sequences of transactions

The data still consists of a set of transactions and items, but time and customer ID attributes are associated with each transaction.

[Figure: a sequence of transactions, with the labels "Items/Events" and "An element of the sequence"]


Data Quality

What kinds of data quality problems are there?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
  Noise and outliers
  Missing values
  Duplicate data

Noise

Noise refers to modification of original values
  Examples: distortion of a person’s voice when talking on a poor phone, and “snow” on a television screen

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]

Outliers

Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set

Missing Values

Reasons for missing values
  Information is not collected (e.g., people decline to give their age and weight)
  Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)

Handling missing values
  Eliminate data objects
  Estimate missing values: the missing values can be estimated (interpolated) by using the remaining values
  Ignore the missing value during analysis
    e.g., suppose that objects are being clustered and the similarity between pairs of data objects must be computed; the similarity can be calculated by using only the non-missing attributes
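The three handling strategies above can be sketched on a small, made-up attribute. The sample values are hypothetical, `None` is assumed to mark a missing value, and filling with the mean is only one simple estimate among many.

```python
from statistics import mean

# Hypothetical ages; None marks a missing value.
ages = [23, None, 31, 27, None, 35]

# 1. Eliminate data objects that have a missing value.
eliminated = [a for a in ages if a is not None]

# 2. Estimate missing values, here with the mean of the observed values.
fill = mean(eliminated)
estimated = [fill if a is None else a for a in ages]


# 3. Ignore missing values during analysis: compare two objects using
#    only the attributes that are present in both of them.
def distance_ignoring_missing(p, q):
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no attribute is present in both objects")
    return sum((a - b) ** 2 for a, b in pairs) ** 0.5


print(eliminated)
print(estimated)
print(distance_ignoring_missing((1, None, 4), (4, 2, 8)))
```

Elimination shrinks the data set, estimation keeps its size but invents values, and ignoring leaves the data untouched at the cost of comparing objects on fewer attributes.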

Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another
  Major issue when merging data from heterogeneous sources
  Example: the same person with multiple email addresses

Care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates.

Data cleaning
  Process of dealing with duplicate data issues


Summary Statistics

Summary statistics are numbers that summarize properties of the data
  Summarized properties include frequency, location, and spread
  Examples:
    location: mean
    spread: standard deviation
  Most summary statistics can be calculated in a single pass through the data
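To illustrate the single-pass claim, the sketch below computes the count, mean, and sample variance while reading each value once. It uses Welford's update rule, which is a choice made here for illustration and is not named in the slides; the function name `running_summary` is hypothetical.

```python
from statistics import variance


def running_summary(values):
    """Return (count, mean, sample variance) from one pass over values."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    sample_variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, sample_variance


# Taxable incomes (in thousands) from the record table earlier in the lecture.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
count, avg, var = running_summary(incomes)
print(count, avg, var)
print(variance(incomes))  # library result, for comparison
```

Because only three running numbers are kept, the same function works on a stream that is too large to hold in memory.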

Frequency and Mode

The frequency of an attribute value is the percentage of time the value occurs in the data set
  For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
The mode of an attribute is the most frequent attribute value
The notions of frequency and mode are typically used with categorical data
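As a small illustration, the sketch below computes frequencies and the mode for the Marital Status column of the record table shown earlier. Returning every tied value as a mode is a choice made for this example.

```python
from collections import Counter

# Marital Status column of the ten-record table (Tid 1 to 10).
marital_status = [
    "Single", "Married", "Single", "Married", "Divorced",
    "Married", "Divorced", "Single", "Married", "Single",
]

counts = Counter(marital_status)
n = len(marital_status)

# Frequency: fraction of objects that take each value.
frequency = {value: count / n for value, count in counts.items()}

# Mode: the most frequent value; ties are all reported.
top = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == top)

print(frequency)
print(modes)
```

Here "Single" and "Married" each occur four times, so the attribute has two modes.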

Measures of Location: Mean and Median

The mean is the most common measure of the location of a set of points.
However, the mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly used.
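The sensitivity of the mean can be seen by adding one extreme value to the taxable incomes from the record table. In this sketch the added value of 5000 is hypothetical, and `trimmed_mean` is an illustrative helper that drops the same proportion of values from each end.

```python
from statistics import mean, median


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return mean(ordered[k:len(ordered) - k])


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands
with_outlier = incomes + [5000]  # hypothetical extreme value

for label, data in (("original", incomes), ("with outlier", with_outlier)):
    print(label, mean(data), median(data), trimmed_mean(data, 0.1))
```

The mean jumps from 104 to roughly 549 when the extreme value is added, while the median only moves from 92.5 to 95.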

Measures of Spread: Range and Variance

Range is the difference between the maximum and the minimum
The variance or standard deviation is the most common measure of the spread of a set of points.
However, the variance is also sensitive to outliers, so other measures are often used.
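A short sketch of these spread measures on the taxable incomes from the record table follows. The slides do not say which other measures they have in mind; the median absolute deviation is used here as one assumed example of a measure that is less sensitive to outliers.

```python
from statistics import median, stdev, variance

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

value_range = max(incomes) - min(incomes)
sample_variance = variance(incomes)  # divides by n - 1
sample_std = stdev(incomes)


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median(abs(x - center) for x in values)


mad = median_absolute_deviation(incomes)
print(value_range, sample_variance, sample_std, mad)
```

The single large income of 220 contributes most of the variance, whereas the median absolute deviation is barely affected by it.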


Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.
  Can detect general patterns and trends
  Can detect outliers and unusual patterns

Example: Sea Surface Temperature

The following shows the Sea Surface Temperature (SST) for July 1982
  Tens of thousands of data points are summarized in a single figure

[Figure: Sea Surface Temperature (SST) for July 1982]

Representation

Is the mapping of information to a visual format
Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
Example:
  Objects are often represented as points
  Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
  If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.

Arrangement

Is the placement of visual elements within a display
Can make a large difference in how easy it is to understand the data
Example: [Figure: example of arrangement]

Visualization Techniques: Histograms

Histogram
  Usually shows the distribution of values of a single variable
  Divide the values into bins and show a bar plot of the number of objects in each bin
  The height of each bar indicates the number of objects
  The shape of the histogram depends on the number of bins

Example: Petal Width (10 and 20 bins, respectively)
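The binning step can be sketched without a plotting library. The helper below is hypothetical and assumes equal-width bins with the maximum value placed in the last bin; it uses the taxable incomes from the record table rather than the petal width data, which is not included in these notes.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    counts = [0] * bins
    if hi == lo:
        # All values are identical: everything lands in the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / bins
    for x in values:
        # The maximum value is placed in the last bin.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return counts


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

for bins in (2, 4):
    print(bins, "bins")
    for i, count in enumerate(histogram(incomes, bins)):
        print(" ", i, "#" * count)
```

With two bins the data look like one block plus a straggler; with four bins an empty bin appears between the bulk of the incomes and the largest one, which shows how the shape depends on the number of bins.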

Two-Dimensional Histograms



Show the joint distribution of the values of two attributes
Example: petal width and petal length


What does this tell us?

Visualization Techniques: Box Plots

Box Plots
  Invented by J. Tukey
  Another way of displaying the distribution of data
  The basic parts of a box plot, from top to bottom:
    outlier
    90th percentile
    75th percentile
    50th percentile
    25th percentile
    10th percentile

Example of Box Plots

Box plots can be used to compare attributes
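The percentile values that a box plot draws can be computed directly. This sketch makes two assumptions that are labeled here because the slides do not fix them: percentiles are found by linear interpolation between ranks (other conventions exist), and the whiskers end at the 10th and 90th percentiles, with anything outside them treated as an outlier.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) by linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # in thousands

# The five percentiles assumed to define the box and whiskers.
summary = {p: percentile(incomes, p) for p in (10, 25, 50, 75, 90)}

# Values beyond the assumed whisker ends are reported as outliers.
outliers = sorted(x for x in incomes if x < summary[10] or x > summary[90])

print(summary)
print(outliers)
```

Computing the same summary for several attributes side by side is what makes box plots convenient for comparing them.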

Visualization Techniques: Scatter Plots

Scatter plots
  Attribute values determine the position
  Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
  Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes

[Figure: Scatter Plot Array of Iris Attributes]








Similarity and Dissimilarity

Similarity
  Numerical measure of how alike two data objects are
  Is higher when objects are more alike
  Often falls in the range [0,1]

Dissimilarity
  Numerical measure of how different two data objects are
  Lower when objects are more alike
  Minimum dissimilarity is often 0
  Upper limit varies

Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

Euclidean Distance

$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.

Euclidean Distance

[Figure: scatter plot of the points p1 to p4]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
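The distance matrix for the four points can be reproduced with a few lines of code. This is a minimal sketch; the names `points`, `euclidean`, and `matrix` are chosen here for illustration.

```python
from math import sqrt

# The four points from the example, as (x, y) pairs.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def euclidean(p, q):
    """Square root of the summed squared differences per attribute."""
    return sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


names = list(points)
matrix = [[euclidean(points[a], points[b]) for b in names] for a in names]

for name, row in zip(names, matrix):
    print(name, " ".join(f"{d:.3f}" for d in row))
```

The matrix is symmetric with zeros on the diagonal, so in practice only the entries above the diagonal need to be computed.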

Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

$$\mathrm{dist} = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
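A minimal sketch of the formula for different values of r follows. Treating r = infinity as the maximum absolute difference is the standard limiting case, added here for illustration rather than taken from the slide; the function name is hypothetical.

```python
def minkowski(p, q, r):
    """Minkowski distance between p and q for a parameter r >= 1."""
    if r < 1:
        raise ValueError("r must be at least 1")
    differences = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        # Limiting case: the largest difference over all attributes.
        return max(differences)
    return sum(d ** r for d in differences) ** (1 / r)


p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))             # sum of absolute differences
print(minkowski(p1, p4, 2))             # Euclidean distance
print(minkowski(p1, p4, float("inf")))  # largest single difference
```

With r = 2 the result matches the Euclidean distance between the same two points.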

Mahalanobis Distance

$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^T$$

Σ is the covariance matrix of the input data X

$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
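The two formulas can be sketched for two-dimensional data without any external library. The sketch follows the definition as written above, so no square root is taken. The helper names are hypothetical, the closed-form inverse only handles 2x2 matrices, and reusing the four example points as the input data X is an assumption made purely for illustration.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divides by n - 1) of a list of rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in rows)
            / (n - 1)
            for k in range(dims)
        ]
        for j in range(dims)
    ]


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis(p, q, inv_cov):
    """(p - q) * inv_cov * (p - q)^T, with no square root taken."""
    diff = [pk - qk for pk, qk in zip(p, q)]
    size = len(diff)
    return sum(
        diff[j] * inv_cov[j][k] * diff[k]
        for j in range(size)
        for k in range(size)
    )


X = [(0, 2), (2, 0), (3, 1), (5, 1)]  # input data, assumed for illustration
inv_cov = inverse_2x2(covariance_matrix(X))
print(mahalanobis(X[0], X[1], inv_cov))
```

When the covariance matrix is the identity, the value reduces to the squared Euclidean distance, which is a quick sanity check for an implementation.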

Common Properties of a Distance

Distances, such as the Euclidean distance, have some well-known properties.
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is a metric
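The three properties can be checked by brute force on a small set of points. The sketch below is only a finite check: it can show that a function violates a property, but passing it does not prove that the function is a metric. The helper `is_metric_on` and the small tolerance for floating-point error are assumptions of this example.

```python
from itertools import product
from math import dist

points = [(0, 2), (2, 0), (3, 1), (5, 1)]


def is_metric_on(points, d, tol=1e-12):
    """Check the three distance properties on every combination of points."""
    for p, q in product(points, repeat=2):
        # Positive definiteness: non-negative, and zero exactly when p == q.
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False
        # Symmetry.
        if abs(d(p, q) - d(q, p)) > tol:
            return False
    for p, q, r in product(points, repeat=3):
        # Triangle inequality.
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False
    return True


def squared_euclidean(p, q):
    return sum((pk - qk) ** 2 for pk, qk in zip(p, q))


print(is_metric_on(points, dist))               # Euclidean distance
print(is_metric_on(points, squared_euclidean))  # fails the triangle inequality
```

The squared Euclidean distance fails because, for example, the squared distance between (2, 0) and (5, 1) is 10, which exceeds 2 + 4 for the route through (3, 1).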

Common Properties of a Similarity

Similarities also have some well-known properties.
  1. s(p, q) = 1 (or maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors

A common situation is that objects p and q have only binary attributes

Compute similarities using the following quantities
  M01 = the number of attributes where p was 0 and q was 1
  M10 = the number of attributes where p was 1 and q was 0
  M00 = the number of attributes where p was 0 and q was 0
  M11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients
  SMC = number of matches / number of attributes
      = (M11 + M00) / (M01 + M10 + M11 + M00)
  J = number of 11 matches / number of not-both-zero attribute values
    = (M11) / (M01 + M10 + M11)

SMC versus Jaccard: Example

p = 1000000000
q = 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
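The worked example can be reproduced in code. This is a minimal sketch in which the binary vectors are written as strings of '0' and '1'; returning 0.0 for the Jaccard coefficient when both vectors are all zeros is a choice made here, since the ratio is undefined in that case.

```python
def binary_similarity(p, q):
    """Return (SMC, Jaccard) for two equal-length strings of '0' and '1'."""
    pairs = list(zip(p, q))
    m01 = pairs.count(("0", "1"))
    m10 = pairs.count(("1", "0"))
    m00 = pairs.count(("0", "0"))
    m11 = pairs.count(("1", "1"))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    not_both_zero = m01 + m10 + m11
    # Assumption: report 0.0 when no attribute is 1 in either vector.
    jaccard = m11 / not_both_zero if not_both_zero else 0.0
    return smc, jaccard


p = "1000000000"
q = "0000001001"
smc, jaccard = binary_similarity(p, q)
print(smc, jaccard)
```

SMC counts the seven shared zeros as agreement, while the Jaccard coefficient ignores them, which is why the two measures disagree so strongly on sparse vectors.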

Next topic: Data preprocessing

Source: http://www.kdnuggets.com/2012/12/cartoon-preparing-for-big-data-flood.html
