Data and Data Exploration

Published on April 2017

Computer Science Department
Bogor Agricultural University

Data and Data Exploration
Lecture 2
2/9/2014


Outline

Data and data types
Data quality
Summary statistics
Visualization
Measures of similarity and dissimilarity

Note: all slides are taken from

Tan P.-N., Steinbach M., & Kumar V. 2006. Introduction to Data Mining. Pearson Education, Inc.
Han J. & Kamber M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego.


What is Data?

Collection of data objects and their attributes

An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, or feature
A collection of attributes describes an object
An object is also known as a record, point, case, sample, entity, or instance

Example data set (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute values

Attribute values are numbers or symbols assigned to
an attribute

Distinction between attributes and attribute values

Same attribute can be mapped to different attribute values
Example: height can be measured in feet or meters

Different attributes can be mapped to the same set of
values
Example: Attribute values for ID and age are integers
But properties of attribute values can be different

ID has no limit but age has a maximum and minimum value


Attribute types: Categorical (qualitative)

Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, gender
Operations: mode, entropy, contingency correlation, χ² test

Ordinal
Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, street numbers, grades
Operations: median, percentiles, rank correlation, run tests, sign tests


Attribute types: Numeric (quantitative)

Interval
Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, length, electrical current
Operations: geometric mean, harmonic mean, percent variation
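
The operations listed for each attribute type can be illustrated with a minimal Python sketch. The sample values are made up for illustration, and the integer encoding of the ordinal scale is an assumption, not part of the slides.

```python
import statistics

# Nominal: only equality is meaningful, so the mode is a valid summary.
eye_color = ["brown", "blue", "brown", "green", "brown"]
nominal_mode = statistics.mode(eye_color)

# Ordinal: order is meaningful, so the median is valid.
# Assumed encoding for illustration: good=1, better=2, best=3.
rank = {"good": 1, "better": 2, "best": 3}
grades = ["good", "best", "better", "good", "better"]
ordinal_median = statistics.median(rank[g] for g in grades)

# Interval: differences are meaningful, so mean and standard deviation are valid.
celsius = [20.0, 22.0, 24.0]
interval_mean = statistics.mean(celsius)
interval_sd = statistics.stdev(celsius)

# Ratio: ratios are meaningful, so the geometric mean is valid.
lengths = [1.0, 4.0, 16.0]
ratio_gmean = statistics.geometric_mean(lengths)
```

Each attribute type also supports the operations of the types listed before it, which is why the mean is not applied to the nominal or ordinal examples here.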


Discrete and Continuous Attributes
Discrete Attribute

Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes

Continuous Attribute

Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a
finite number of digits.
Continuous attributes are typically represented as floating-point
variables.


Types of data sets

Record
    Data Matrix
    Document Data
    Transaction Data
Graph
    World Wide Web
    Molecular Structures
Ordered
    Spatial Data
    Temporal Data
    Sequential Data
    Genetic Sequence Data


Record Data
Data that consists of a collection of records, each of
which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Transaction Data

A special type of record data, where each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
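
As a minimal sketch, the five transactions above can be held as sets of items keyed by TID. The binary-matrix view built afterwards is an additional assumption for illustration and is not shown on the slide.

```python
# Transactions from the slide, keyed by TID; each transaction is a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# The sorted list of distinct items gives the columns of a binary matrix.
items = sorted(set().union(*transactions.values()))

# One row per transaction: 1 if the item was purchased, else 0.
binary_rows = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}
```

Using sets reflects that a transaction records which items were bought, not in what order.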


Ordered Data
Temporal Data

Temporal data is data whose objects have attributes that represent measurements taken over time.
For example, financial data sets are time series that give the daily prices of various stocks.
A time series is a sequence of measurements of some attribute, e.g., stock price or rainfall, taken at (usually regular) points in time.


Ordered Data
Sequences of transactions

The data still consists of a set of transactions and items, but time and customer ID attributes are associated with each transaction.

[Figure: sequence of transactions, with labels "Items/Events" and "An element of the sequence"]


Data Quality

What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
missing values
duplicate data


Noise

Noise refers to modification of original values

Examples: distortion of a person’s voice when talking on a poor phone and “snow” on a television screen

[Figure: two panels, "Two Sine Waves" and "Two Sine Waves + Noise"]
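
The two-sine-waves illustration can be approximated with a short Python sketch. The frequencies, sample count, and noise level are assumptions chosen for illustration, and additive Gaussian noise is only one possible noise model.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 200
t = [i / n for i in range(n)]

# Clean signal: sum of two sine waves (illustrative frequencies 3 and 7).
clean = [math.sin(2 * math.pi * 3 * x) + math.sin(2 * math.pi * 7 * x) for x in t]

# Noisy signal: each original value is modified by additive Gaussian noise.
noisy = [v + random.gauss(0.0, 0.5) for v in clean]

# Root-mean-square difference between the noisy and the original values.
rms_noise = math.sqrt(sum((a - b) ** 2 for a, b in zip(noisy, clean)) / n)
```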


Outliers

Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set


Missing Values

Reasons for missing values

Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values

Eliminate Data Objects
Estimate Missing Values: the missing values can be estimated (interpolated) by using the remaining values.
Ignore the Missing Value During Analysis
e.g., suppose that objects are being clustered and the similarity between pairs of data objects needs to be calculated; the similarity can be calculated by using only the non-missing attributes
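
The three strategies for handling missing values can be sketched in Python as follows. The records are made up, `None` is an assumed marker for a missing value, and mean imputation is only one simple way to estimate from the remaining values.

```python
# None marks a missing value; the data are illustrative.
ages = [23, None, 31, 27, None, 35]

# 1. Eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# 2. Estimate missing values from the remaining ones (here: their mean).
mean_age = sum(complete) / len(complete)
imputed = [mean_age if a is None else a for a in ages]


def squared_distance_ignoring_missing(p, q):
    """3. Ignore missing values: use only attributes present in both objects."""
    pairs = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no attribute is present in both objects")
    return sum((a - b) ** 2 for a, b in pairs)


d = squared_distance_ignoring_missing([1.0, None, 3.0], [2.0, 5.0, None])
```

In the last call only the first attribute is present in both objects, so the result depends on that attribute alone.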


Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates, of one another

Major issue when merging data from heterogeneous sources
Example: same person with multiple email addresses

Care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates.

Data cleaning

Process of dealing with duplicate data issues


Summary Statistics

Summary statistics are numbers that summarize
properties of the data

Summarized properties include frequency, location and spread

Examples:

location - mean
spread - standard deviation

Most summary statistics can be calculated in a single pass
through the data
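
As a sketch of the single-pass idea, the mean and standard deviation can be updated one value at a time. The update rule below is Welford's online algorithm, chosen here for illustration; the slides do not name a method. The input is the Taxable Income column (in thousands) of the record table shown earlier.

```python
import math


def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) in one pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, sd


count, mean, sd = single_pass_summary([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])
```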


Frequency and Mode

The frequency of an attribute value is the percentage of time the value occurs in the data set

For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

The mode of an attribute is the most frequent attribute value
The notions of frequency and mode are typically used with categorical data
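
A minimal Python sketch of frequency and mode, using the Marital Status column of the record table shown earlier. Returning all tied values as modes is a design choice made here, not something the slides specify.

```python
from collections import Counter

# Marital Status column, rows 1 to 10 of the record table.
marital_status = ["Single", "Married", "Single", "Married", "Divorced",
                  "Married", "Divorced", "Single", "Married", "Single"]

counts = Counter(marital_status)

# Frequency as the fraction of objects taking each value.
frequency = {value: c / len(marital_status) for value, c in counts.items()}

# Mode(s): the most frequent value(s); ties are possible.
top = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == top)
```

In this small table "Married" and "Single" each occur four times, so the mode is not unique.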

Measures of Location: Mean and Median

The mean is the most common measure of the location of
a set of points.
However, the mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly
used.
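
The sensitivity of the mean to outliers can be seen in a short Python sketch on the Taxable Income values (in thousands) from the record table. The helper `trimmed_mean` is a hypothetical name, and dropping the same fraction from each end is an assumed definition of the trimmed mean.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
trimmed_income = trimmed_mean(incomes, 0.1)  # drops 60 and 220
```

The single large value 220 pulls the mean above both the median and the trimmed mean.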

Measures of Spread: Range and Variance

Range is the difference between the max and min
The variance or standard deviation is the most common
measure of the spread of a set of points.

However, this is also sensitive to outliers, so that other
measures are often used.
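
A small Python sketch of range and variance on the same Taxable Income values (in thousands), computed with and without the largest value to show how strongly one outlier affects both measures. Removing the largest value is done here only for comparison.

```python
import statistics

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

value_range = max(incomes) - min(incomes)
variance = statistics.variance(incomes)  # sample variance, divides by n - 1
std_dev = statistics.stdev(incomes)

# The same measures without the largest value (220).
rest = sorted(incomes)[:-1]
range_rest = max(rest) - min(rest)
variance_rest = statistics.variance(rest)
```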


Visualization
Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data and
the relationships among data items or attributes can be
analyzed or reported.

Visualization of data is one of the most powerful and
appealing techniques for data exploration.

Can detect general patterns and trends
Can detect outliers and unusual patterns

Example: Sea Surface Temperature

The following shows the Sea Surface Temperature (SST)
for July 1982

Tens of thousands of data points are summarized in a single
figure


Representation

Is the mapping of information to a visual format
Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
Example:

Objects are often represented as points
Their attribute values can be represented as the position of the
points or the characteristics of the points, e.g., color, size, and
shape
If position is used, then the relationships of points, i.e., whether
they form groups or a point is an outlier, are easily perceived.

Arrangement

Is the placement of visual elements within a display
Can make a large difference in how easy it is to understand
the data
Example:

Visualization Techniques: Histograms

Histogram

Usually shows the distribution of values of a single variable
Divide the values into bins and show a bar plot of the number of objects in
each bin.
The height of each bar indicates the number of objects
Shape of histogram depends on the number of bins

Example: Petal Width (10 and 20 bins, respectively)
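
The binning step behind a histogram can be sketched in plain Python. The values are illustrative petal-width-like numbers in cm, not the actual Iris data, and equal-width bins over [min, max] are an assumption.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        i = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[i] += 1
    return counts


widths = [0.2, 0.2, 0.4, 0.3, 1.3, 1.5, 1.4, 1.0, 2.5, 2.1, 1.8, 2.3]
coarse = histogram(widths, 3)
fine = histogram(widths, 6)
```

With three bins the data look evenly spread, while six bins reveal an empty bin between the smallest values and the rest, which shows how the shape depends on the number of bins.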


Two-Dimensional Histograms

Show the joint distribution of the values of two attributes
Example: petal width and petal length

What does this tell us?

Visualization Techniques: Box Plots

Box Plots

Invented by J. Tukey
Another way of displaying the distribution of data
The following figure shows the basic parts of a box plot

[Figure: box plot with labels, top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
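
The quantities a box plot displays can be computed with a short Python sketch on the Taxable Income values (in thousands) from the record table. It assumes whiskers at the 10th and 90th percentiles and uses linear interpolation between sorted values, which is one of several common percentile definitions.

```python
def percentile(values, p):
    """p-th percentile (0 to 100) by linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

box = {p: percentile(incomes, p) for p in (10, 25, 50, 75, 90)}

# Values beyond the whiskers would be drawn as individual outlier points.
outliers = [v for v in incomes if v < box[10] or v > box[90]]
```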

Example of Box Plots

Box plots can be used to compare attributes


Visualization Techniques: Scatter Plots

Scatter plots

Attribute values determine the position
Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
Arrays of scatter plots are useful because they can compactly summarize the relationships of several pairs of attributes

See example on the next slide

Scatter Plot Array of Iris Attributes


Similarity and Dissimilarity

Similarity

Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]

Dissimilarity

Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies

Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

Euclidean Distance

$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.


Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
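
The distance matrix for the four points p1 to p4 can be reproduced with a minimal Python sketch. The function and variable names are illustrative.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


# Points (x, y) from the slide.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = list(points)

# Pairwise distances, rounded to three decimals as on the slide.
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names] for a in names]
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal carry information.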

Minkowski Distance
Minkowski Distance is a generalization of Euclidean Distance

$$\mathrm{dist} = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r}$$

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
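
A minimal Python sketch of the Minkowski distance for a few values of r. Treating r = infinity as the maximum absolute difference is the standard limiting case and is added here as an assumption, since the slide only gives the general formula.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r >= 1; r = float('inf') gives the max difference."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p1, p4 = (0, 2), (5, 1)

d_r1 = minkowski(p1, p4, 1)              # sum of absolute differences: 5 + 1
d_r2 = minkowski(p1, p4, 2)              # Euclidean distance: sqrt(26)
d_inf = minkowski(p1, p4, float("inf"))  # largest absolute difference: 5
```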

Mahalanobis Distance

$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^{T}$$

Σ is the covariance matrix of the input data X

$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$
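
The two formulas can be sketched in plain Python for the two-dimensional case. The code follows the formula as written on the slide, without a square root. The covariance matrix and the points are illustrative assumptions, and the explicit 2x2 inverse is used only to keep the sketch free of external libraries.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divides by n - 1) of a list of equal-length rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
         for k in range(dims)]
        for j in range(dims)
    ]


def mahalanobis_2d(p, q, cov):
    """(p - q) * inverse(cov) * (p - q)^T for 2-D points, with no square root."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)


cov = [[0.3, 0.2], [0.2, 0.3]]  # illustrative covariance matrix
d_ab = mahalanobis_2d((0.5, 0.5), (0.0, 1.0), cov)
d_ac = mahalanobis_2d((0.5, 0.5), (1.5, 1.5), cov)
sample_cov = covariance_matrix([(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)])
```

Although the second pair of points is farther apart in Euclidean terms, its Mahalanobis value is smaller because the difference lies along the direction in which the attributes vary together.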

Common Properties of a Distance

Distances, such as the Euclidean distance, have some well known properties.

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is a metric


Common Properties of a Similarity

Similarities also have some well known properties.

1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.

Similarity Between Binary Vectors

Common situation is that objects, p and q, have only binary attributes

Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)

SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
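
The SMC and Jaccard calculation above can be checked with a minimal Python sketch. The function name is illustrative, and returning 0.0 for the Jaccard coefficient when both vectors are all zeros is a convention assumed here, since the ratio is undefined in that case.

```python
def binary_similarity(p, q):
    """Return (SMC, Jaccard) for two equal-length sequences of 0/1 values."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    # Jaccard ignores 0-0 matches; it is undefined if no attribute is non-zero.
    nonzero = m01 + m10 + m11
    jaccard = m11 / nonzero if nonzero else 0.0
    return smc, jaccard


p = [int(c) for c in "1000000000"]
q = [int(c) for c in "0000001001"]
smc, jaccard = binary_similarity(p, q)
```

The gap between the two results comes entirely from the seven 0-0 matches, which count toward SMC but not toward Jaccard.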


Next topic: Data preprocessing

Source:
http://www.kdnuggets.com/2012/12/cartoon-preparing-for-big-data-flood.html
