
Text mining and the Semantic Web

Dr Diana Maynard
NLP Group
Department of Computer Science
University of Sheffield

http://nlp.shef.ac.uk

Structure of this lecture

• Text Mining and the Semantic Web
• Text Mining Components / Methods
• Information Extraction
• Evaluation
• Visualisation
• Summary

University of Manchester – 15 March


Introduction to Text Mining and
the Semantic Web


What is Text Mining?
• Text mining is about knowledge discovery from
large collections of unstructured text.
• It’s not the same as data mining, which is
more about discovering patterns in structured
data stored in databases.
• Similar techniques are sometimes used; however, text mining has many additional constraints caused by the unstructured nature of the text and the use of natural language.
• Information extraction (IE) is a major
component of text mining.
• IE is about extracting facts and structured
information from unstructured text.

Challenge of the Semantic Web
• The Semantic Web requires machine-processable, repurposable data to complement hypertext
• Such metadata can be divided into two types of
information: explicit and implicit. IE is mainly
concerned with implicit (semantic) metadata.
• More on this later…


Text mining components and
methods


Text mining stages
• Document selection and filtering (IR
techniques)
• Document pre-processing (NLP
techniques)
• Document processing (NLP / ML /
statistical techniques)


Stages of document processing
• Document selection involves identification and retrieval of
potentially relevant documents from a large set (e.g. the
web) in order to reduce the search space. Standard or
semantically-enhanced IR techniques can be used for this.
• Document pre-processing involves cleaning and preparing
the documents, e.g. removal of extraneous information,
error correction, spelling normalisation, tokenisation, POS
tagging, etc.
• Document processing consists mainly of information
extraction
• For the Semantic Web, this is realised in terms of metadata
extraction


Metadata extraction
• There are two types of metadata extraction:
• Explicit metadata extraction involves information
describing the document, such as that contained
in the header information of HTML documents
(titles, abstracts, authors, creation date, etc.)
• Implicit metadata extraction involves semantic
information deduced from the material itself, i.e.
endogenous information such as names of entities
and relations contained in the text. This essentially
involves Information Extraction techniques, often
with the help of an ontology.
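Explicit metadata extraction of the kind described above can be sketched with the Python standard library alone. The sample document, the class name, and the choice of metadata fields below are illustrative assumptions, not part of any real system:

```python
# Sketch of explicit metadata extraction from an HTML header,
# collecting the <title> text and <meta name=... content=...> pairs.
from html.parser import HTMLParser

class HeaderMetadata(HTMLParser):
    """Collect explicit metadata from an HTML document header."""
    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data

# Invented sample document for illustration.
doc = ('<html><head><title>Text mining and the Semantic Web</title>'
       '<meta name="author" content="Diana Maynard">'
       '</head><body>...</body></html>')
parser = HeaderMetadata()
parser.feed(doc)
print(parser.metadata)
```

Implicit metadata, by contrast, requires the IE techniques discussed in the next section, since it must be deduced from the body text rather than read off the header.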

Information Extraction (IE)


IE is not IR
IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.

IE pulls facts and structured information from the content of large text collections. You analyse the facts.

IE for Document Access
• With traditional query engines, getting the facts can
be hard and slow
• Where has the Queen visited in the last year?
• Which places on the East Coast of the US
have had cases of West Nile Virus?
• Which search terms would you use to get this kind
of information?
• How can you specify you want someone’s home
page?
• IE returns information in a structured way
• IR returns documents containing the relevant
information somewhere (if you’re lucky)


IE as an alternative to IR
• IE returns knowledge at a much deeper
level than traditional IR
• Constructing a database through IE and
linking it back to the documents can
provide a valuable alternative search tool.
• Even if results are not always accurate,
they can be valuable if linked back to the
original text

Some example applications
• HaSIE
• KIM
• Threat Trackers


HaSIE
• Application developed by the University of Sheffield which aims to find out how companies report on health and safety information
• Answers questions such as:
“How many members of staff died or had accidents
in the last year?”
“Is there anyone responsible for health and safety?”
“What measures have been put in place to improve
health and safety in the workplace?”

HaSIE
• Identification of such information is too
time-consuming and arduous to be done
manually
• IR systems can’t cope with this because
they return whole documents, which could
be hundreds of pages
• System identifies relevant sections of each
document, pulls out sentences about
health and safety issues, and populates a
database with relevant information

HaSIE (screenshot)


KIM
• KIM is a software platform developed by
Ontotext for semantic annotation of text.
• KIM performs automatic ontology
population and semantic annotation for
Semantic Web and KM applications
• Indexing and retrieval (an IE-enhanced
search technology)
• Query and exploration of formal
knowledge

KIM
Ontotext’s KIM query and results


Threat tracker
• Application developed by Alias-I which finds and
relates information in documents
• Intended for use by Information Analysts who
use unstructured news feeds and standing
collections as sources
• Used by DARPA for tracking possible
information about terrorists etc.
• Identification of entities, aliases, relations etc.
enables you to build up chains of related people
and things

Threat tracker


What is Named Entity Recognition?
• Identification of proper names in texts,
and their classification into a set of
predefined categories of interest
• Persons
• Organisations (companies, government
organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
• Various other types as appropriate
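The simplest form of NE recognition is gazetteer lookup, which appears again in the pipeline later in this lecture. A minimal sketch, with an invented gazetteer (the entries and categories below are illustrative only):

```python
# Toy gazetteer-based named entity lookup: tag any token that
# appears in a predefined list of known names.

GAZETTEER = {
    "Sheffield": "Location",
    "Manchester": "Location",
    "Ontotext": "Organisation",
    "DARPA": "Organisation",
}

def tag_entities(text):
    """Return (token, NE type) pairs for tokens found in the gazetteer."""
    entities = []
    for token in text.replace(",", " ").split():
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
    return entities

print(tag_entities("DARPA funded work in Sheffield and Manchester"))
```

Real systems combine such lookup with grammar rules or learned models, since a list alone cannot handle unseen names or ambiguity.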

Why is NE important?
• NE provides a foundation from which to build
more complex IE systems
• Relations between NEs can provide tracking,
ontological information and scenario building
• Tracking (co-reference) “Dr Head, John, he”
• Ontologies “Manchester, CT”
• Scenario “Dr Head became the new director
of Shiny Rockets Corp”

Two kinds of approaches
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• require only small amount of training data
• development can be very time consuming
• some changes may be hard to accommodate

Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• require large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus

Typical NE pipeline
• Pre-processing (tokenisation, sentence
splitting, morphological analysis, POS
tagging)
• Entity finding (gazetteer lookup, NE grammars)
• Coreference (alias finding, orthographic
coreference etc.)
• Export to database / XML
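The stages above can be sketched as composable functions. Everything here (the regex tokeniser, the sentence splitter, the lookup table) is a toy stand-in showing only the data flow, not a real pipeline component:

```python
# Sketch of an NE pipeline: sentence splitting -> tokenisation ->
# gazetteer-based entity finding, chained together.
import re

def sentence_split(text):
    """Naive splitter: break after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    """Split into word tokens and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def find_entities(tokens, gazetteer):
    """Tag tokens found in the gazetteer with their NE type."""
    return [(t, gazetteer[t]) for t in tokens if t in gazetteer]

def pipeline(text, gazetteer):
    results = []
    for sentence in sentence_split(text):
        results.extend(find_entities(tokenise(sentence), gazetteer))
    return results

gaz = {"Sheffield": "Location", "GATE": "Software"}
print(pipeline("GATE was built in Sheffield. It is widely used.", gaz))
```

A real pipeline would add morphological analysis, POS tagging, and coreference between these stages, and export the results to a database or XML rather than printing them.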

GATE and ANNIE
• GATE (General Architecture for Text Engineering) is a framework for language processing
• ANNIE (A Nearly New Information Extraction system)
is a suite of language processing tools, which
provides NE recognition
GATE also includes:
• plugins for language processing, e.g. parsers,
machine learning tools, stemmers, IR tools, IE
components for various languages etc.
• tools for visualising and manipulating ontologies
• ontology-based information extraction tools
• evaluation and benchmarking tools

GATE (screenshot)


Information Extraction for the Semantic Web
• Traditional IE is based on a flat structure, e.g.
recognising Person, Location, Organisation,
Date, Time etc.
• For the Semantic Web, we need information in a
hierarchical structure
• Idea is that we attach semantic metadata to the
documents, pointing to concepts in an ontology
• Information can be exported as an ontology
annotated with instances, or as text annotated
with links to the ontology
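The export step described above can be sketched as turning IE annotations into subject/predicate/object triples that point into a domain ontology. The namespace, predicate names, and annotation format here are invented for illustration:

```python
# Sketch of exporting IE results as ontology-linked metadata:
# each instance found in the text becomes triples attaching it to
# a concept in a (hypothetical) domain ontology.

ONTOLOGY_NS = "http://example.org/ontology#"  # illustrative namespace

def to_triples(annotations):
    """Turn (instance, concept, offset) annotations into
    (subject, predicate, object) triples."""
    triples = []
    for instance, concept, offset in annotations:
        subject = instance.replace(" ", "_")
        triples.append((subject, "rdf:type", ONTOLOGY_NS + concept))
        triples.append((subject, "annotatedAt", str(offset)))
    return triples

annotations = [("Cambridge", "City", 17), ("Diana Maynard", "Person", 42)]
for triple in to_triples(annotations):
    print(triple)
```

Keeping the character offset alongside the concept link is what allows the exported knowledge to point back into the original text, as well as forward into the ontology.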

Richer NE Tagging
• Attachment of
instances in the text to
concepts in the
domain ontology
• Disambiguation of
instances, e.g.
Cambridge, MA vs
Cambridge, UK
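Disambiguation of this kind can be sketched as comparing the words around a mention with a profile of context words for each candidate instance. The candidate profiles below are invented for illustration:

```python
# Toy instance disambiguation: choose between candidate ontology
# instances for "Cambridge" by counting context-word overlap.

CANDIDATES = {
    "Cambridge, MA": {"mit", "harvard", "massachusetts", "usa"},
    "Cambridge, UK": {"cam", "england", "uk", "punting"},
}

def disambiguate(mention_context):
    """Pick the candidate whose profile shares the most words
    with the context of the mention."""
    context = set(mention_context.lower().split())
    return max(CANDIDATES, key=lambda c: len(CANDIDATES[c] & context))

print(disambiguate("Cambridge is home to MIT and Harvard in Massachusetts"))
```

Real systems use richer evidence (co-occurring entities, ontology relations, document topic) rather than bare word overlap, but the principle is the same.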


Magpie
• Developed by the Open University
• Plugin for standard web browser
• Automatically associates an ontology-based
semantic layer to web resources, allowing
relevant services to be linked
• Provides the means for structured and informed exploration of web resources
• e.g. looking at a list of publications, we can find
information about an author such as projects
they work on, other people they work with, etc.

MAGPIE in action


MAGPIE in action


Evaluation


Evaluation metrics and tools
• Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard
• Scoring program implements the metric and
provides performance measures
– for each document and over the entire corpus
– for each type of NE
– may also evaluate changes over time

• A gold standard reference set also needs to be
provided – this may be time-consuming to produce
• Visualisation tools show the results graphically and
enable easy comparison

Methods of evaluation
• Traditional IE is evaluated in terms of Precision
and Recall
• Precision - how accurate were the answers the
system produced?
correct answers/answers produced
• Recall - how good was the system at finding
everything it should have found?
correct answers/total possible correct answers
• There is usually a tradeoff between precision and recall, so a weighted average of the two (F-measure) is generally also used.

GATE AnnotationDiff Tool


Metrics for Richer IE
• Precision and Recall are not sufficient for
ontology-based IE, because the distinction
between right and wrong is less obvious
• Recognising a Person as a Location is clearly
wrong, but recognising a Research Assistant as a
Lecturer is not so wrong
• Similarity metrics need to be integrated
additionally, such that items closer together in the
hierarchy are given a higher score, if wrong
• Also possible is a cost-based approach, where
different weights can be given to each concept in
the hierarchy, and to different types of error, and
combined to form a single score
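The similarity idea can be sketched as follows; the toy ontology and the decay-by-distance scoring formula are assumptions for illustration, not a standard metric:

```python
# Hierarchy-aware scoring: a wrong answer close to the gold concept
# in the ontology scores higher than one far away.

ONTOLOGY = {  # child -> parent, a toy concept hierarchy
    "Lecturer": "AcademicStaff",
    "ResearchAssistant": "AcademicStaff",
    "AcademicStaff": "Person",
    "Person": "Entity",
    "Location": "Entity",
}

def ancestors(concept):
    """Chain from a concept up to the root, inclusive."""
    chain = [concept]
    while concept in ONTOLOGY:
        concept = ONTOLOGY[concept]
        chain.append(concept)
    return chain

def similarity_score(gold, predicted):
    """1.0 for an exact match, decaying with the number of
    hierarchy edges to the closest common ancestor."""
    if gold == predicted:
        return 1.0
    gold_chain = ancestors(gold)
    pred_chain = ancestors(predicted)
    for depth, concept in enumerate(pred_chain):
        if concept in gold_chain:
            distance = depth + gold_chain.index(concept)
            return max(0.0, 1.0 - 0.25 * distance)
    return 0.0

print(similarity_score("Lecturer", "ResearchAssistant"))  # sibling concepts
print(similarity_score("Lecturer", "Location"))           # only share the root
```

Under this scheme, mistaking a Research Assistant for a Lecturer retains partial credit, while mistaking a Person for a Location scores zero, matching the intuition on this slide.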

Visualisation of Results


Visualisation of Results
• Cluster Map example
• Traditionally used to show documents classified
according to topic
• Here shows instances classified according to
concept
• Enables analysis, comparison and querying of
results
• Examples here created by Marta Sabou (Free
University of Amsterdam) using Aduna software

The principle – Venn Diagrams
Documents classified according to topic


Jobs by region

Instances classified by concept


Concept distribution

Shows the relative importance of different concepts

Correct and incorrect instances attached to concepts


Summary
• Introduction to text mining and the
semantic web
• How traditional information extraction techniques, including visualisation and evaluation, can be extended to deal with the complexity of the Semantic Web
• How text mining can help the progression
of the Semantic Web

Research questions
• Automatic annotation tools are currently
mainly domain and ontology-dependent,
and work best on a small scale
• Tools designed for large scale applications
lose out on accuracy
• Ontology population works best when the
ontology already exists, but how do we
ensure accurate ontology generation?
• Need large scale evaluation programs

Some useful links
• NaCTeM (National Centre for Text Mining)
http://www.nactem.ac.uk
• GATE
http://gate.ac.uk
• KIM
http://www.ontotext.com/kim/
• h-TechSight
http://www.h-techsight.org
• Magpie
http://www.kmi.open.ac.uk/projects/magpie
