Business Intelligence & Data Mining 12-13

Published on June 2016

Blog and Social Media Mining

What is a weblog / blog?
– a (more or less) frequently updated
publication on the Web, sorted in (usually
reverse) chronological order of the constituent
blog posts.
– The content may reflect any interest
including personal, journalistic or corporate.
– Usually textual, but multimedia forms exist
(photoblog etc.)

Blog as an emerging new data…


An Example of Blog Article
[Figure: a blog article annotated with its location info, blog contents and time stamp]

Characteristics of blogs
• Interlinking and forming communities
• Highly personal
• With opinions
• Immediate response to events
• With mixed topics
• Associated with time and location
[Figure: a blog article surrounded by these characteristics, with its time and location marked]

Blogs are Bursty

Blogs contain Theme Patterns
• Theme life cycles
– [Figure: discussion about “Release of iPod Nano” in articles about “iPod Nano”; axes: Time vs. Strength; series: United States, China, Canada]
• A theme snapshot
– [Figure: discussion about “Government Response” in articles about Hurricane Katrina; axis: Locations; period label: 09/20/05 – 09/26/05]

Existing Work on Weblog Analysis
• Interlinking and Community Analysis
– Identifying communities
– Monitoring the evolution and bursting of communities
– E.g., [Kumar et al. 2003]
– [Figure: # of communities and # of nodes in communities]

• Content Analysis
– Blog level topic analysis
– Information diffusion through blogspace
– Use topic bursting to predict sales spikes
– E.g., [Gruhl et al. 2005]
– [Figure: blog mentions vs. sales rank]

Applications of Blog Theme Mining
• Help answer questions like
– Which country responded first to the release of iPod
Nano? China, UK, or Canada?
– Did people in different states (e.g., Illinois vs. Texas)
respond differently/similarly to the increase of commodity
prices during Hurricane Katrina?

• Potentially useful for
– Summarizing topics
– Monitoring public opinions
– Business Intelligence


Consumers are more informed, more demanding than ever
• 92% of respondents said they had more confidence in information they
seek out online than anything coming from a salesclerk or other source

That’s where our best consumers are!
• A Large and Growing Population
– Over 184 million people currently maintain a blog / are active in social media
– ...about 20% of the Internet population
– Over 60% read posts in blogs / social media

• Trend Setters
– 74% with college degrees
– 42% have post-graduate degrees

• That are More Diverse
– 58% over age of 35
– 51% of household incomes > $75K

• Sharing Opinions and Ideas
– Unaided, natural conversations
– Rich in content

• In Real Time
– There are over a million new blog / social media posts every day

Source: Technorati, State of the Blogosphere, 2008 and Pew Research.

Analysis of text content has many
applications from tactical to strategic
Web Intelligence
• Product Innovation
• Consumer Insight
• Trend Insight
• Brand Insight
• Blogger Outreach
• Buzz & Sentiment
• Crisis Communication
[Figure: these applications placed on a scale from Tactical to Strategic]

If you have a crisis, definitely
use it to listen and reach out
June 21, 2005
Dell lies. Dell sucks.
I just got a new Dell laptop and paid a fortune for the four-year, in-home
service.
The machine is a lemon and the service is a lie.
I'm having all kinds of trouble with the hardware: overheats, network
doesn't work, maxes out on CPU usage. It's a lemon.
But what really irks me is that they say if they sent someone to my
home -- which I paid for -- he wouldn't have the parts, so I might as
well just send the machine in and lose it for 7-10 days -- plus the time
going through this crap. So I have this new machine and paid for
them to FUCKING FIX IT IN MY HOUSE and they don't and I lose it
for two weeks.
DELL SUCKS. DELL LIES. Put that in your Google and smoke it,

Facts and Opinions
• There are two main types of textual information:
facts and opinions
• Most current information processing
techniques (e.g., search engines) work
with facts (assume they are true)
• Facts can be expressed with topic
keywords

Opinions
• In real life, facts are important, but opinion also
plays a crucial role. A computer manufacturer,
disappointed with low sales, asks itself: Why
aren’t consumers buying our laptop? A political
party, disappointed with the last election, wants
to know on an on-going basis: What is the
reaction in the press, newsgroups, chat rooms,
and blogs to latest policy decisions?

Opinions in posts
• Analysis of Posts (Tasks)
– Perform subjectivity and polarity classification
on blog posts
– Discover irregularities in temporal mood
patterns (fear, excitement, etc) appearing in a
large corpus of posts
– Use link polarity information to model trust
and influence in the blogosphere
– Analyze sentiments about products and
correlate it with its sales

Challenges
• Determine whether a document or portion
(e.g. paragraph or statement) is
subjective.
• Example: “the battery lasts 2 hours” vs.
“the battery lasts only 2 hours”

Challenges
• The difficulty lies in the richness of human
language use.
Example:
1. This is a great camera.
2. A great amount of money was spent for
promoting this camera.
3. One might think this is a great camera.
Well think again, because.....
• A single keyword (“great”) can be used to convey three
different opinions: positive, neutral and negative,
respectively.

Challenges
• In order to arrive at sensible conclusions,
sentiment analysis has to understand
context. For example, “fighting” and
“disease” are negative in a war context but
positive in a medical one.
• Different mining conditions for different
domains.

Sentiment Classification
• There are two main techniques for
sentiment classification:
– The symbolic technique uses manually
crafted rules and lexicons.
– The machine learning approach uses
unsupervised or supervised learning to
construct a model from a large training
corpus.

Subjectivity
• Find relevant words, phrases, patterns that
can be used to express subjectivity
• Determine the polarity of subjective
expressions

Words
• Adjectives
– positive: honest important mature large patient
   – Ron Paul is the only honest man in Washington.
   – Kitchell’s writing is unbelievably mature and is only likely to
     get better.
   – To humour me my patient father agrees yet again to my
     choice of film.
– negative: harmful hypocritical inefficient insecure
   – It was a macabre and hypocritical circus.
   – Why are they being so inefficient?

Words
• Verbs
– positive: praise, love
– negative: blame, criticize

• Nouns
– positive: pleasure, enjoyment
– negative: pain, criticism

Phrases
• Phrases containing adjectives and
adverbs
– positive: high intelligence, low cost, better
performance
– negative: little variation, many troubles,
several excuses
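
The word lists above are the raw material of the symbolic (lexicon-based) technique. A minimal sketch, assuming a tiny lexicon copied from these slides and a one-word negation rule that is purely illustrative (the names POSITIVE, NEGATIVE, NEGATORS and lexicon_polarity are hypothetical):

```python
import re

# Seed lexicon copied from the word lists above; a real lexicon would be far larger.
POSITIVE = {"honest", "important", "mature", "large", "patient",
            "praise", "love", "pleasure", "enjoyment"}
NEGATIVE = {"harmful", "hypocritical", "inefficient", "insecure",
            "blame", "criticize", "pain", "criticism"}
NEGATORS = {"not", "never", "no"}  # assumption: negation only affects the next word


def lexicon_polarity(text):
    """Return +1, -1 or 0 from a count of lexicon hits, flipping a hit that follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        value = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return (score > 0) - (score < 0)
```

For example, lexicon_polarity("Why are they being so inefficient?") yields -1, while a purely factual sentence with no lexicon words yields 0. A rule set this small cannot handle the context problems listed under Challenges.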

Supervised Methods
• In order to train a classifier for sentiment
recognition in text, classic supervised learning
techniques (e.g. Support Vector Machines, naive
Bayes, Maximum Entropy) can be used. A
supervised approach entails the use of a
labelled training corpus to learn a certain
classification function. Support Vector Machine
classifiers have been found to have the greatest
accuracy.
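
To make "learning a classification function from a labelled corpus" concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is chosen only because it fits in a few lines, not because the slides prefer it. The training sentences are made up, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (list_of_tokens, label). Returns the model parts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Pick the label with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log prior + sum of log likelihoods with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training corpus (not from any study cited here).
TRAIN = [
    ("clean room great food".split(), "pos"),
    ("nice location good stay".split(), "pos"),
    ("dirty room bad smell".split(), "neg"),
    ("poor service worst food".split(), "neg"),
]
MODEL = train_naive_bayes(TRAIN)
```

Calling classify("great clean stay".split(), MODEL) returns "pos" on this toy corpus. A Support Vector Machine would replace the scoring step with a learned separating hyperplane over the same bag-of-words features.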

Unsupervised Learning
Clustering algorithms can be used to partition the
adjectives into two subsets
[Figure: adjectives to be partitioned into a positive (+) and a negative (-) subset: slow, scenic, nice, terrible, handsome, painful, fun, expensive, comfortable]
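
One way to picture such a partition is sketched below. It assumes we already have pairwise hints that two adjectives share an orientation (for instance, joined by "and") or oppose each other (joined by "but"), then spreads group labels across the graph. This simple label propagation is a stand-in for a real clustering algorithm, and the links are hypothetical.

```python
from collections import deque


def partition_adjectives(links):
    """links: iterable of (adj_a, adj_b, same_orientation: bool).

    Returns a dict adjective -> 0 or 1 (group id). Which group is the
    positive one is NOT decided here; that needs extra evidence.
    """
    graph = {}
    for a, b, same in links:
        graph.setdefault(a, []).append((b, same))
        graph.setdefault(b, []).append((a, same))
    group = {}
    for start in graph:
        if start in group:
            continue
        group[start] = 0
        queue = deque([start])
        while queue:
            word = queue.popleft()
            for other, same in graph[word]:
                wanted = group[word] if same else 1 - group[word]
                if other not in group:  # first evidence wins; conflicts are ignored
                    group[other] = wanted
                    queue.append(other)
    return group


# Hypothetical links, as if harvested from phrases like "nice and comfortable"
# (same orientation) or "scenic but expensive" (opposite orientation).
LINKS = [
    ("nice", "comfortable", True),
    ("nice", "handsome", True),
    ("scenic", "expensive", False),
    ("scenic", "nice", True),
    ("terrible", "painful", True),
    ("fun", "painful", False),
    ("fun", "scenic", True),
    ("slow", "expensive", True),
]
GROUPS = partition_adjectives(LINKS)
```

With these links, "nice", "scenic", "fun" and "comfortable" land in one group and "expensive", "painful", "terrible" and "slow" in the other. No labelled training data is used, which is what makes the approach unsupervised.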

Applications / Caselets
Sentiment Analysis for Mining
Marketing Intelligence

Sentiment Analysis for Mining Marketing Intelligence
• This case study demonstrates the application of sentiment
analysis and opinion mining for extracting marketing
intelligence from online reviews
• Different studies have confirmed the importance of online
reviews for consumers and product manufacturers
• Users’ opinions expressed in reviews are important for
potential consumers to make well informed purchase
decisions
• Product manufacturers, in turn, need the same opinions to gain
insights about their products’ strengths and weaknesses, and
to collect product benchmarking information

Marketing Intelligence
• “MI is the process of acquiring and analyzing information in
order to understand the market (both existing and potential
customers); to determine the current and future needs and
preferences, attitudes and behavior” (Cornish, 1997)
• In consonance with Cornish’s definition, we take the view that
consumer sentiments and opinions can be useful for elicitation
of their preferences

Traditional Methods for Collecting Consumer
Preference
• Typically, consumer preferences are estimated by means of
conjoint analysis of data from online or paper-and-pencil
surveys
• However, this type of preference elicitation can easily become
expensive in terms of time and money
• Moreover, the quality of the data resulting from surveys
depends on the willingness of the respondents to participate
in the study and the length (complexity) of the questionnaire
• Data collection methods such as opinion polls, field interview
or purchasing costly point-of-sale data are found to be
expensive and time consuming

Objective
• To discover marketing intelligence like Feature Buzz related to
products and to analyze feature level opinion by sentiment
analysis and opinion mining
• Developing novel approaches for analysis of opinionated text
information by bridging the gap among text mining, machine
learning and natural language processing techniques

The Framework

Online Reviews Text Corpus
• The study used online product reviews as the text corpus
• The online reviews were collected from the Internet
• The dataset was generated by collecting a total of 2,010 hotel
reviews for 102 hotels (across 11 popular travel destinations
in India) from Tripadvisor.com and Yatra.com

Credibility of Opinion Source: Online Reviews
• TripAdvisor is the largest travel community in the
world, with more than 60 million unique monthly visitors, and
over 75 million reviews and opinions (comScore Media
Report, 2012). (World's most trusted travel advice)
• Yatra.com is India’s leading online travel website, which was
recently voted ‘Most Trusted Brand of India’ in the online
travel category by Brand Equity (CNBC Report, 2011)
• Free text and user ratings format enables easy check of
content face validity
• Each review undergoes genuine opinion checks and both sites
follow zero-tolerance policy on fake reviews

Preparation of Text Corpus
• The hotel reviews were downloaded between June
2012 and December 2012
• The reviews were classified in terms of their overall sentiment
orientation and then divided into training and test datasets
• Hotel review annotation: a rating of more than 3 stars was treated as
positive and a rating of less than 3 stars as negative
• Reviews with 3 stars (neutral) were discarded to restrict the
task to binary sentiment analysis
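
The annotation rule above can be sketched in a few lines. The reviews, the 80/20 split ratio, the fixed seed and the function names are assumptions for illustration. The slides do not state how the training and test sets were cut.

```python
import random


def label_by_rating(reviews):
    """reviews: list of (text, stars). Drops 3-star reviews, labels the rest."""
    labelled = []
    for text, stars in reviews:
        if stars > 3:
            labelled.append((text, "positive"))
        elif stars < 3:
            labelled.append((text, "negative"))
        # stars == 3 is treated as neutral and discarded
    return labelled


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy and cut it; the 80/20 ratio is an assumption."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up reviews for illustration only.
REVIEWS = [("Clean room, great food", 5), ("Average stay", 3),
           ("Damp and smelly", 1), ("Nice location", 4), ("Rude staff", 2)]
LABELLED = label_by_rating(REVIEWS)
TRAIN, TEST = train_test_split(LABELLED)
```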

Textual Pre-processing
• The opinionated text documents were collected and then,
pre-processed to remove any non-textual information
• The Vector Space Model (VSM) was adopted in order to
generate the bag of words for each document
• Stemming was done to reduce words to their common
root or stem
• Some of the stop words were removed, but we preserved
some useful sentiment-expressing terms such as “ok” and
“not”
• Top n-ranked terms were selected using Information Gain
feature selection
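
A minimal sketch of the last two steps, assuming binary term presence as the bag-of-words representation and the usual entropy-based definition of Information Gain. Stemming is left out for brevity. The stop-word list, example documents and function names are made up.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "was", "a", "is", "and", "but"}  # tiny illustrative list
KEEP = {"ok", "not"}  # sentiment-bearing words kept even if a stop list has them


def tokenize(text):
    return [t for t in text.lower().split() if t in KEEP or t not in STOP_WORDS]


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(term, docs):
    """docs: list of (set_of_tokens, label). IG of the term's presence/absence."""
    labels = [label for _, label in docs]
    with_term = [label for toks, label in docs if term in toks]
    without = [label for toks, label in docs if term not in toks]
    remainder = sum(len(part) / len(docs) * entropy(part)
                    for part in (with_term, without) if part)
    return entropy(labels) - remainder


def top_terms(docs, n):
    """Rank the vocabulary by information gain (ties broken alphabetically)."""
    vocab = set().union(*(toks for toks, _ in docs))
    return sorted(vocab, key=lambda t: (-information_gain(t, docs), t))[:n]


RAW = [("the room was clean", "pos"), ("the food was not good", "neg"),
       ("clean and ok", "pos"), ("the room was not clean", "neg")]
DOCS = [(set(tokenize(text)), label) for text, label in RAW]
```

On this toy data the term "not" separates the two classes perfectly, so it receives the maximum gain of 1 bit and ranks first. That is a small illustration of why it is worth keeping out of the stop-word list.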

Opinion Related Resource Generation
• The opinion related resource generation involves identifying
product features (attributes), extracting the associated
opinions (positive or negative) and annotating text documents
for training the machine learning classifiers
• Statistical patterns like frequent nouns, adjectives and other
phrases, association rules based frequent n-grams, manual
extraction rules, sentiment and domain knowledge
dictionaries can be used for extracting features and opinion
words
• Rule-based part-of-speech (POS) tagging was adopted to
identify features (as noun phrases) and opinion words
(adjectives and adverbs) in the text
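
As an illustration of what a feature-opinion extraction rule can look like, the sketch below pairs each adjective with the nearest noun in an already tagged sentence. The rule, the window size and the hand-tagged input are assumptions. They are not the study's actual rules, and a real pipeline would obtain the tags from a POS tagger.

```python
def extract_tuples(tagged, window=3):
    """tagged: list of (word, tag) with Penn Treebank-style tags.

    Pairs every adjective (JJ*) with the nearest noun (NN*) at most `window`
    tokens away. The rule and the window size are illustrative assumptions.
    """
    nouns = [(i, w) for i, (w, t) in enumerate(tagged) if t.startswith("NN")]
    tuples = []
    for i, (word, tag) in enumerate(tagged):
        if not tag.startswith("JJ"):
            continue
        near = [(abs(i - j), noun) for j, noun in nouns if abs(i - j) <= window]
        if near:
            tuples.append((min(near)[1], word))
    return tuples


# Hand-tagged example sentence; a real pipeline would run a POS tagger first.
TAGGED = [("the", "DT"), ("room", "NN"), ("was", "VBD"), ("clean", "JJ"),
          ("but", "CC"), ("the", "DT"), ("food", "NN"), ("was", "VBD"),
          ("cold", "JJ")]
```

Here extract_tuples(TAGGED) gives the tuples ("room", "clean") and ("food", "cold"), each a (feature, opinion word) pair.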

Example of Feature-Opinion Tuple Extraction Rules

Feature-Opinion Tuple Extraction
• Redundancy pruning was done to remove non-candidate and
redundant features
• Pointwise mutual information (PMI) based scores were used
to group features that have similar meanings or that co-occur
• Pointwise mutual information is a measure of association used
in information theory
• Finally, phrase similarity was used to eliminate or merge
similar product features
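
For reference, the standard definition is PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). It is positive when two terms co-occur more often than chance would predict. A minimal sketch, assuming probabilities are estimated at review level from made-up feature sets (the function and variable names are hypothetical):

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """documents: list of sets of feature terms found in each review.

    Returns {(a, b): PMI} for every pair that co-occurs at least once,
    using document-level probabilities and log base 2.
    """
    n = len(documents)
    single = Counter()
    pair = Counter()
    for feats in documents:
        single.update(feats)
        pair.update(combinations(sorted(feats), 2))
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Made-up feature sets, one per review.
REVIEW_FEATURES = [{"room", "bed"}, {"room", "bed"}, {"food", "breakfast"},
                   {"food", "breakfast", "room"}]
SCORES = pmi_scores(REVIEW_FEATURES)
```

In this toy data "breakfast" and "food" always appear together and score 1.0, so a grouping step would be more inclined to merge them than "food" and "room", whose score is negative.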

Sentiment based Classification
• Feature-level sentiment analysis aims to find what
people like and dislike about a given object (product
feature)
• Product review polarity classification involves discovering
whether a product was recommended or not recommended in a review
• We applied a supervised machine learning approach
to feature-level sentiment classification of online
reviews
• A support vector machine (SVM) was the machine learning
model used

Feature-level Opinion Mining
• Product features are attributes that provide functionality to
products and play a crucial role in distinguishing similar
products of different brands
• Feature-level opinion mining provides deep analysis of online
reviews by identifying different features of products that
consumers are concerned about
• By mining product features and their associated opinion,
feature-level buzz monitoring and feature-level opinion
summarization can be done
• Buzz - a term used in word-of-mouth marketing defined as a
vague but positive (may be negative on rare occasions)
association or anticipation about a product or service
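
Once feature-opinion tuples are available, feature-level buzz reduces to counting. A minimal sketch, assuming each tuple already carries a polarity of +1 or -1 (the tuples and the function name are hypothetical):

```python
from collections import Counter


def feature_buzz(tuples, top_n=3):
    """tuples: list of (feature, opinion_word, polarity) with polarity +1/-1.

    Returns [(feature, mentions, positive_mentions, negative_mentions)] for
    the top_n most mentioned features.
    """
    mentions = Counter(f for f, _, _ in tuples)
    positive = Counter(f for f, _, p in tuples if p > 0)
    negative = Counter(f for f, _, p in tuples if p < 0)
    return [(f, n, positive[f], negative[f])
            for f, n in mentions.most_common(top_n)]


# Hypothetical tuples, shaped like the output of a feature-opinion extractor.
TUPLES = [("room", "clean", 1), ("room", "small", -1), ("room", "nice", 1),
          ("food", "tasty", 1), ("food", "cold", -1), ("price", "costly", -1)]
```

The mention count is the buzz, and the positive/negative split is a simple form of feature-level opinion summarization.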

Top 100 Frequent Features Extracted from 2000
Hotel Reviews

Feature Buzz with Top 30 Features in Online Hotel
Reviews

Overall Positive and Negative Sentiment Words in
Feature-Opinion Tuple

Feature-Opinion Tuple for Top 5 Features
(top 15 positive and top 15 negative opinion words per feature, with counts)

Room
Positive: Clean (482), Good (370), Like (238), Nice (210), Comfort (102), Better (81), Great (72), Excel (67), Big (62), Best (59), Beautiful (50), Love (45), Decent (38), Worth (28), Modern (26)
Negative: Small (139), Hot (92), Bad (82), Smell (60), Cold (52), Problem (39), Poor (37), Stink (30), Costly (28), Worst (27), Damp (27), Dark (26), Complain (18), Broken (18), Leak (15)

Food
Positive: Good (243), Excellent (76), Great (54), Tasty (47), Delight (45), Like (44), Delicious (41), Enjoy (39), Nice (37), Decent (30), Awesome (27), Best (26), Love (22), Fine (19), Better (18)
Negative: Bad (104), Worst (78), Dislike (69), Wait (62), Cold (58), Disappoint (49), Poor (42), Expensive (39), Horrible (39), Late (33), Worse (31), Smell (27), Refuse (26), Complain (21), Pathetic (17)

Stay Experience
Positive: Good (94), Comfortable (67), Nice (52), Great (44), Enjoy (30), Pleasant (30), Deluxe (27), Wonderful (20), Homestay (20), Luxury (18), Memorable (16), Relax (15), Incredible (14), Royal (12), Romantic (11)
Negative: Bad (57), Worst (43), Horrible (38), Disappoint (29), Problem (23), Poor (22), Difficult (21), Disliked (20), Nightmare (18), Costly (17), Expensive (14), Terrible (14), Pain (12), Avoid (11), Mistake (11)

Price
Positive: Good (38), Nice (28), Great (24), Reasonable (22), Recommend (20), Worth (19), Decent (17), Budget (17), Free (16), Standard (16), Fine (14), Quite (12), Ideal (11), Best (10), Ok (10)
Negative: Expensive (38), Overpriced (28), Cost (25), Waste (21), Costly (19), Fail (14), Joke (14), Limited (13), More (13), High (12), Feel (10), Cheat (9), Poorly (8), Wrong (8), Con (7)

Location
Positive: Near (52), Beautiful (35), Convenient (26), Good (22), Peace (21), Agree (19), Easily (18), Walkable (16), Wait (16), Short (14), Easy (12), Ideal (11), Lavish (10), Accessible (10), Popular (8)
Negative: Far (28), Distance (24), Away (23), Remote (21), Crowded (20), Lost (17), Problem (14), Issue (11), Out (11), Busy (10), Noisy (10), Mislead (8), Bitter (8), Hectic (7), Long (7)

Summary
• The study has demonstrated methods for
automatically extracting consumer opinions from
online reviews of hotels
• It has shown that aggregated consumer sentiment
as well as specific opinion about product features
can be extracted using sentiment analysis
techniques

More Advanced:
Spatiotemporal Theme Mining
• Given a collection of posted articles about a topic with
time and location information
– Discover multiple themes (i.e., subtopics) being discussed in
these articles
– For a given location, discover how each theme evolves over
time (generate a theme life cycle)
– For a given time, reveal how each theme spreads over
locations (generate a theme snapshot)
– Compare theme life cycles in different locations
– Compare theme snapshots in different time periods
–…
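
The two outputs named above, a theme life cycle and a theme snapshot, are two views of the same table of theme strength indexed by location and time. The sketch below computes that table with plain keyword matching, which is a deliberately simple stand-in for a learned theme model. The posts, theme words and function names are made up.

```python
from collections import defaultdict


def theme_strength(posts, theme_words):
    """posts: list of dicts with 'date', 'location' and 'text' keys.

    Returns {(location, date): fraction of that cell's tokens that are theme
    words}. Keyword matching is a stand-in for a learned theme model.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for post in posts:
        key = (post["location"], post["date"])
        tokens = post["text"].lower().split()
        totals[key] += len(tokens)
        hits[key] += sum(tok in theme_words for tok in tokens)
    return {key: hits[key] / totals[key] for key in totals if totals[key]}


def life_cycle(strength, location):
    """Theme strength over time for one location."""
    return sorted((d, s) for (loc, d), s in strength.items() if loc == location)


def snapshot(strength, date):
    """Theme strength across locations for one time period."""
    return {loc: s for (loc, d), s in strength.items() if d == date}


# Made-up posts; dates are ISO strings so they sort chronologically.
POSTS = [
    {"date": "2005-09-07", "location": "US", "text": "apple release nano today"},
    {"date": "2005-09-08", "location": "US", "text": "my nano screen scratched"},
    {"date": "2005-09-08", "location": "China", "text": "nano release here soon"},
]
STRENGTH = theme_strength(POSTS, {"release", "nano"})
```

Comparing life_cycle(STRENGTH, "US") with life_cycle(STRENGTH, "China") is the kind of operation behind questions such as which country responded first to a product release.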

Challenges in
Spatiotemporal Theme Mining
• How to represent a theme?
• How to model the themes in a collection?
• How to model their dependency on time and
location?
• How to compute the theme life cycles and
theme snapshots?
• All these must be done in an unsupervised
way…

How?
• Time-stamped data sets of weblogs, each about one
event (broad topic):

Data Set    # docs   Time Span (2005)   Query
Katrina     9377     08/16 - 10/04      Hurricane Katrina
Rita        1754     08/16 - 10/04      Hurricane Rita
iPod Nano   1720     09/02 - 10/26      iPod Nano

• Extract location information from author profiles
• Isolate by location
• On each data set, we extract a set of salient themes
and their life cycles / theme snapshots

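Theme Life Cycles for iPod Nano
[Figure: life cycles of the theme “Release of Nano”, with panels for the United States, China, Canada and the United Kingdom]
Word weights shown with the figure: ipod 0.2875, nano 0.1646, apple 0.0813, september 0.0510, mini 0.0442, screen 0.0242, new 0.0200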


Applications / Caselets
Identifying the Target Segment

CASE STUDY | Enabling your passionates
COMPANY BACKGROUND
– Maker of pruning shears
for gardening and scissors
for crafts

NEED
– Wanted to build a
marketing campaign to
recruit brand advocates
into an online
community

ASSUMPTIONS
– Knew Boomer Females
were great target for
sewing and crafts

Surprising findings
SOLUTION
– Baseline read for online chatter
– Identify demographics

FINDINGS
– Found that Gen Y females were actually the right target
– AND, big issue was online crafters could be ‘mean’

Adjusting the game plan
RESULTS
– Adjusted strategy for new demographics and new voices
– Created ambassador program which has helped grow
Fiskateers to more than 6,000 active members
– Members invite others
– In first 3 months, increased online mentions by 341%
– Sales grew by 20%

Applications / Caselets
Trend and Segmentation Analysis
Are Consumers Buying Green?

Trend analysis
[Figure: monthly trend chart covering 2007 and 2008; data labels 37,944, 51,638, 71,882, 98,148 and 156,177; annotation: 160%]

Early 2007 was dominated by the Negators and the
“I just don’t know what to think…” crowd
[Figure: segment map; axis and region labels: ACTION, INACTION, AGREEMENT, DISAGREEMENT, Social, Personal]
Negator 22%, Activist 9%, Shifter 8%, Rejecter 14%, Uncertain 24%, Idler 5%, Skeptic 12%, Guilty 6%, Apathetic (not measured)

By late 2007, momentum had swung to
agreement
[Figure: segment map; axis and region labels: ACTION, INACTION, AGREEMENT, DISAGREEMENT, Social, Personal]
Negator 17%, Activist 10%, Shifter 16%, Rejecter 12%, Uncertain 9%, Idler 13%, Skeptic 11%, Guilty 14%, Apathetic (not measured)

Concern about the environment continued
to gain momentum in early 2008
[Figure: segment map; axis and region labels: ACTION, INACTION, AGREEMENT, DISAGREEMENT, Social, Personal]
Negator 14%, Activist 8%, Shifter 19%, Rejecter 8%, Uncertain 10%, Idler 15%, Skeptic 13%, Guilty 13%, Apathetic (not measured)

By 2010, more than 7 out of 10 were concerned, and
almost half were actively doing something about it
[Figure: segment map; axis and region labels: ACTION, INACTION, AGREEMENT, DISAGREEMENT, Social, Personal]
Negator 3%, Activist 18%, Shifter 27%, Rejecter 5%, Uncertain 10%, Skeptic 10%, Idler 21%, Guilty 6%, Apathetic (not measured)
