
BIOL3004 Genomics & Bioinformatics

Lecture 8
Finding genes and other features in genomes II:
Ab initio methods
Mark Ragan
Institute for Molecular Bioscience University of Queensland

27 March 2014

© Mark Ragan 2005-14
except as indicated

Key concepts
Gene finding: beyond simple sequence matching

Genomic features other than genes
The challenge of finding other features in genomes
Machine learning approaches

Objective functions
Maximum-likelihood estimation

Finding genes ab initio: ORF scanning
Genes that code for proteins contain one or more ORFs
The ORF starts with an initiation codon (usually but not always ATG) and ends with a termination codon (TAA, TAG or TGA)

Searching for stretches of DNA that begin with ATG and end with a termination codon is therefore one way of looking for genes
Complication: six (potential) reading frames

Finding genes ab initio: ORF scanning (2)
Usually productive in bacterial genomes
Problematic in eukaryotic nuclear genomes:
- Long intergenic regions
- Exons can be short & separated by long introns
- Typical threshold for ORFs: > 300 bp
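To make the idea concrete, here is a minimal sketch of an ORF scanner in Python. It checks all six reading frames for stretches that begin with ATG and end with a stop codon, keeping only candidates above a length threshold. The function names and the 300 bp default are illustrative choices for this sketch, not a specific published tool; a real ab initio gene finder would combine such candidates with the codon-bias and signal evidence discussed below.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    complement = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(complement.get(base, "N") for base in reversed(seq))

def find_orfs(seq, min_length=300):
    """Return (strand, frame, start, end) tuples for ORFs >= min_length nt.

    Coordinates on the '-' strand refer to the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # ORF closed by a stop codon
    return orfs
```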

Codon bias

- Codons occur with different frequencies in genes of a particular organism (“codon bias”)
- Example: leucine is specified by six codons (TTA, TTG, CTT, CTC, CTA, CTG), but in human nuclear genes it is most frequently coded by CTG and only rarely by TTA or CTA

- Most genomes exhibit codon bias, which often differs from species to species
- Real exons are expected to display this bias; chance series of triplets are not
- ORF-scanning programs are either told (by the operator) what bias to look for, or can learn this bias from a training set (known exons in a particular organism)
- Codon bias is a special case (N=3, in-frame) of sequence bias (N=1 to N=large)
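A sketch of how an ORF-scanning program could learn codon bias from a training set of known in-frame coding sequences; the function name and the toy training set are illustrative. Candidate ORFs whose codon composition matches this usage table more closely than a background model would then be favoured.

```python
from collections import Counter

def codon_usage(coding_seqs):
    """Relative codon frequencies learned from in-frame coding sequences."""
    counts = Counter()
    for cds in coding_seqs:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy training set (made-up sequences); a real one would be the known
# exons/CDSs of the organism of interest.
usage = codon_usage(["ATGCTGCTGAAGTAA", "ATGTTACTGGAGTGA"])
```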

CpG islands, (G+C)-rich regions and genes
For slightly more than half of human genes, the transcriptional start site occurs in a CpG-rich region
This is particularly true for “housekeeping” genes:
the first coding exon usually occurs in a CpG island

Mammalian genomes show elevated gene density in (G+C)-rich regions
Brown, Genomes 2
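A hedged sketch of how CpG-rich regions can be flagged: slide a window along the sequence and keep windows whose G+C fraction and observed/expected CpG ratio exceed commonly cited thresholds (roughly 50% and 0.6 over at least 200 bp). The exact criteria vary between published definitions and are assumptions of this sketch.

```python
def cpg_stats(window):
    """G+C fraction and observed/expected CpG ratio for one window."""
    g, c = window.count("G"), window.count("C")
    observed_cpg = window.count("CG")
    expected_cpg = g * c / len(window)              # standard obs/exp estimate
    gc_fraction = (g + c) / len(window)
    obs_exp = observed_cpg / expected_cpg if expected_cpg else 0.0
    return gc_fraction, obs_exp

def cpg_island_windows(seq, window=200, gc_min=0.5, obs_exp_min=0.6):
    """Yield start positions of windows meeting the (assumed) criteria."""
    seq = seq.upper()
    for i in range(len(seq) - window + 1):
        gc_fraction, obs_exp = cpg_stats(seq[i:i + window])
        if gc_fraction >= gc_min and obs_exp >= obs_exp_min:
            yield i
```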

Gene features again (in a little more detail)
[Figure: gene structure from 5′ to 3′, showing promoter signals and the TATA box, the initiation codon, exons 1-3 with donor/acceptor splice signals, poly(A) signal(s), and the resulting mRNA.]

Additional features typically analysed:
- G+C% composition
- N-mer or codon composition (coding versus non-coding)
- Repetitive elements (masked)

In genomes of complex eukaryotes, regulatory promoter elements occur in conserved locations

- Regulatory sequences have distinctive sequence features related to their role as recognition signals for the DNA-binding proteins involved in gene expression
- Only a pattern (motif), not a precise sequence of nucleotides, may be conserved
- In each species, different sets of genes have different motifs
- Motifs may be only weakly conserved between species (e.g. human vs. mouse)
Brown, Genomes 2

Features of exon-intron splice sites
The GT-AG rule

5′ splice site (donor):   Exon …A₅₆ G₇₃ G₁₀₀ T₁₀₀ A₆₂ A₆₈ G₈₄ T₆₃… Intron
3′ splice site (acceptor): Intron …[12 × C/T] N C₆₅ A₁₀₀ G₁₀₀ N… Exon

Subscripts give the frequency (percent), within a given sample of eukaryotic genes, with which the indicated nucleotide appears at that position relative to the nearby intron-exon junction.

Brown, Genomes 2
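Position-specific frequencies like those above can be turned into a simple log-odds score for candidate donor sites. The sketch below is illustrative only: it uses the quoted percentages for the consensus nucleotide at each of the eight positions, spreads the remaining probability evenly over the other three nucleotides (an assumption made here), and scores against a uniform background; real splice-site models use full position-frequency tables.

```python
import math

# Consensus nucleotide and its frequency at each of eight positions around
# the 5' (donor) splice site, taken from the percentages quoted above.
DONOR_CONSENSUS = [("A", 0.56), ("G", 0.73), ("G", 1.00), ("T", 1.00),
                   ("A", 0.62), ("A", 0.68), ("G", 0.84), ("T", 0.63)]

def position_prob(consensus_base, consensus_freq, base):
    if base == consensus_base:
        return consensus_freq
    return max((1.0 - consensus_freq) / 3, 1e-4)    # avoid log(0) at the invariant GT

def donor_score(site):
    """Log-odds score of an 8-mer against a uniform (0.25) background."""
    return sum(math.log2(position_prob(cb, cf, b) / 0.25)
               for (cb, cf), b in zip(DONOR_CONSENSUS, site.upper()))

print(donor_score("AGGTAAGT"))   # consensus-like site scores highly
print(donor_score("AGCCAAGT"))   # no GT at the intron start: heavily penalised
```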

In eukaryotes, the 3′ end of most mRNAs is polyadenylated
Polyadenylation sites are specified by a motif similar to AAUAAA

Brown, Genomes 2
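A minimal sketch of scanning genomic DNA for poly(A) signal candidates. For illustration it matches only the two most common hexamer variants (AATAAA and ATTAAA in DNA coordinates); real predictors also weigh the surrounding sequence context.

```python
import re

def polya_signal_positions(seq):
    """Start positions of AATAAA / ATTAAA hexamers on the given strand."""
    return [m.start() for m in re.finditer(r"A[AT]TAAA", seq.upper())]
```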

Machine learning
Many important / interesting gene features

- show regularities, but don't have identical sequences
- are embedded in a complex background
- locations of true instances (positives) are known

Our goal

- extract and generalise information from known true instances, then
- apply it to find (all) true unknown instances in a different dataset

Data mining: discovery of unrecognised properties of the data
Knowledge discovery (KDD): discovery of new knowledge

Machine learning: an area of computer science that focuses on predicting new knowledge from the properties of known data. A learning algorithm generalises from experience (i.e. from the probability distribution of instances in the training set); its performance is evaluated by its success in recovering known knowledge from the test set. The machine-learning algorithm is typically a mathematical or statistical model whose parameters, once optimally estimated, allow the model to perform KDD with the maximum possible success.
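To make the training/test distinction concrete, here is a hedged sketch (assuming scikit-learn is available) that represents sequences by trinucleotide counts, trains a support vector machine on labelled positive/negative examples, and evaluates it on a held-out test set. The feature choice and function names are illustrative for this sketch, not a published gene finder.

```python
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]   # 64 trinucleotides

def kmer_features(seq):
    """Fixed-length feature vector of trinucleotide counts."""
    seq = seq.upper()
    return [sum(seq[i:i + 3] == k for i in range(len(seq) - 2)) for k in KMERS]

def train_feature_classifier(sequences, labels):
    """Fit an SVM on labelled sequences and report held-out test accuracy."""
    X = [kmer_features(s) for s in sequences]
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
    model = SVC().fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```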

Machine learning approach to ab initio gene finding (1)

Features of known genes
- Promoter-element motif scores & positions
- Transcriptional start site in CpG island
- Codon bias & correspondence with ORF
- Splice-site motif scores & positions
- Poly(A) signal motif scores & positions
- Presence of ORF
- ORFs in G+C islands
- (…)

These features are used to train (condition) a machine-learning model, e.g. a support vector machine, artificial neural network, genetic algorithm or Bayesian network.

Machine learning approach to ab initio gene finding (2)
The trained (conditioned) machine-learning model, e.g. a support vector machine, artificial neural network, genetic algorithm or Bayesian network, is applied to unexplored genomic sequence, with the result that... genes are found!

Machine learning approach to ab initio gene finding (3)
Features of known genes:
- Promoter-element motif scores & positions
- Transcriptional start site in CpG island
- Codon bias & correspondence with ORF
- Splice-site motif scores & positions
- Poly(A) signal motif scores & positions
- Presence of ORF
- ORFs in G+C islands
- (…)

Training or conditioning of the machine-learning algorithm uses both these features of known genes and background sequence (ideally with no genes).

One class of machine-learning algorithm:

Hidden Markov Model (HMM)
Consider a random variable x existing at a particular time t: x(t).

x(t) can adopt any of a number of (usually discrete) values that depend only on the value of the immediately preceding state x(t-1).

More precisely, given the values of x at all times t, the conditional probability distribution of x(t) depends only on the value of state x(t-1); values at other times have no effect on x(t). This is the Markov property of an HMM.

x(t-1) → x(t)

The arrow represents the transition probability from x(t-1) to x(t).

HMMs (continued)
We can interpret {... t-1, t, t+1...} as a succession of adjacent nucleotide positions in a genome sequence, or of adjacent amino acid positions in a protein.

x(t-1) → x(t) → x(t+1)

We can’t directly observe these conditional probability distributions. However, we can consider that each gives rise to (emits), with a defined probability, an observable state y.

Hidden:    x(t-1) → x(t) → x(t+1)
              ↓       ↓       ↓
Observed:  y(t-1)   y(t)    y(t+1)

Each observable state {… y(t-1), y(t), y(t+1) …} corresponds to a nucleotide or amino acid.
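To make the hidden/observed distinction concrete, here is a minimal two-state HMM sketch in which the hidden states are "coding" and "noncoding" and the observed symbols are nucleotides. All probabilities are made up for illustration (not the profile HMM discussed next); the Viterbi recursion recovers the most probable hidden path for an observed sequence.

```python
import math

# Hidden states and illustrative start/transition/emission probabilities
# (all made up for this sketch). Observed symbols are nucleotides.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding":    {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT  = {"coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(obs):
    """Most probable hidden-state path for an observed nucleotide string."""
    # V[t][state] = (best log-probability ending in state, path achieving it)
    V = [{s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), [s]) for s in STATES}]
    for symbol in obs[1:]:
        row = {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: V[-1][p][0] + math.log(TRANS[p][s]))
            score = (V[-1][best_prev][0] + math.log(TRANS[best_prev][s])
                     + math.log(EMIT[s][symbol]))
            row[s] = (score, V[-1][best_prev][1] + [s])
        V.append(row)
    best_last = max(STATES, key=lambda s: V[-1][s][0])
    return V[-1][best_last][1]

print(viterbi("GCGCGCATATAT"))   # a run of 'coding' then 'noncoding' labels
```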

HMMs (continued)
The succession of values described by the model

x(t-1) → x(t) → x(t+1)

can describe an actual sequence region if we allow deletions (D) and insertions (I). By convention, insertion (but not deletion) states are allowed to extend (loops, below).
The blue arrows represent the path that best captures relationships within this feature in a particular dataset. All of this is hidden from the user.
Zvelebil & Baum, Figure 6.7(A), page 183

Matches (rectangles) and insertions (diamonds), but not deletions (circles), are considered to emit an observable state (in this HMM = a residue).

HMMs (continued)
How is the best path determined? By optimising a performance measure (quality score). The components of this score, in turn, are based on an objective function.
The general problem: we want to estimate the value of a parameter θ. We have a set of independent observations (x₁ … xₙ) that we can think of as having been drawn from a distribution whose probability density function f is unknown but has certain well-behaved properties. Maximum likelihood estimates the true value of θ via

L(θ | x₁ … xₙ) = f(x₁ … xₙ | θ) = ∏ᵢ f(xᵢ | θ)   (product over all i)

i.e. the likelihood of a parameter value, given the observed data, equals the probability density function for those data, given that parameter value.

We often define an estimator of average likelihood as (1/N) L. Because the numerical value of L is often tiny, we often work with ln L.
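A small worked example of the idea, under the assumption that the parameter θ is the probability that a position is G or C and the observations are independent. The likelihood then factorises as above, so the maximum-likelihood estimate is simply the observed G+C fraction, and we evaluate ln L rather than L to avoid numerical underflow.

```python
import math

def log_likelihood(seq, theta):
    """ln L(theta | seq) for independent positions that are G/C with probability theta."""
    return sum(math.log(theta if base in "GC" else 1.0 - theta) for base in seq)

seq = "ATGCGCGATATGCCG"
theta_hat = sum(base in "GC" for base in seq) / len(seq)   # analytic maximum
print(theta_hat, log_likelihood(seq, theta_hat))
```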

HMMs (continued)

Internally, the HMM finds the set of transition probabilities that maximises the likelihood of the observed data, given the model and the probability density function.
- Insertion probabilities are usually drawn from a background expectation (nucleotide or amino acid frequencies)
- Transition probabilities into & out of each state must sum to 1
- Overall the problem cannot be solved exactly, but there are good (numerical) estimation procedures
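When the hidden path is actually known for the training data (e.g. positions already annotated as coding or non-coding), the maximum-likelihood transition probabilities reduce to normalised transition counts, as in the sketch below; when the path is genuinely hidden, iterative numerical procedures such as Baum-Welch are used instead. The function name and labels here are illustrative.

```python
from collections import defaultdict

def estimate_transitions(labelled_path):
    """Maximum-likelihood transition probabilities from a known state path."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(labelled_path, labelled_path[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
            for a, nexts in counts.items()}

path = ["noncoding"] * 5 + ["coding"] * 8 + ["noncoding"] * 4
print(estimate_transitions(path))   # probabilities out of each state sum to 1
```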

HMMs (continued)

Zvelebil & Baum, Figure 6.7, page 183

The HMM thus captures probabilistic relationships associated with the feature (here, four successive amino acids) and outputs them simply (as above) or in a more complex way. For more details on HMMs and other machine-learning approaches in bioinformatics, see Zvelebil & Baum chapters 6 & 10.

HMMs (continued)

Training or conditioning the HMM on data known to represent the class of feature you want to find (e.g. known bacterial ORFs) enables it to find instances of this feature class elsewhere (e.g. ORFs in newly sequenced bacterial genomes). Some classes of machine-learning algorithm benefit from training not only on positive cases but also on negatives (i.e. data known to lack instances of the feature).

Other popular machine-learning approaches include:
- Neural networks (NNs)
- Other NN-based algorithms (e.g. profile HMMs, self-organising maps)
- Inductive logic programming
- Genetic algorithms (various types)
- Support vector machines
- Random Forests™
- Bayesian networks (various types)

“Grail II” neural network gene-finder

Note that this NN has two hidden layers.
Uberbacher, Xu & Mural, Methods in Enzymology 266:259-281 (1996); from Zvelebil & Baum, Figure 10.15, page 387

Let’s revisit a question from Lecture 6:

Why is it sometimes difficult to locate genes?
Is it a problem for...

POSSIBLE REASON                                     SEQUENCE COMPARISON   MACHINE LEARNING
Alternative splicing (eukaryotes)                   YES                   NO
Low-abundance or no cDNA / EST                      YES                   NO
Exons can be short & introns very long              YES                   MOSTLY NO
Incomplete conservation of exon-intron junctions    YES                   MOSTLY NO
Poorly conserved or no ortholog in other species    YES                   NO
Weak or multiple types of motifs                    USUALLY               SOMETIMES

Figure © Garland Publishing (1998)

Classes of features often discovered via machine learning

Features of genes in prokaryotes, eukaryotes & viruses (Lecture 6)

Sequence-based signals of all types
Inter- and intra-molecular interaction, epigenetics, subcellular localisation...

Small RNAs (tRNAs, miRNAs...)
Features of protein structure, e.g. folds, binding / interaction sites

Tissue-specific expression patterns
Disease-specific expression patterns
Pathways and networks

(Some of the) perils of using machine learning to find features in genomes
Machine learning doesn’t transport us to a parallel, statistics-free universe

Is a positive training set available? Is it large enough?
Is a negative training set available? Is it large enough?

How can we validate our model?
What does it mean to validate a computational model?

What performance measure should we use?
How can we avoid overfitting? (see the cross-validation sketch below)

Does our model have enough power for the application?
We can’t deal with 10 million potential matches

Computational complexity & heuristics
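One standard guard against overfitting, sketched here assuming scikit-learn and the k-mer feature vectors from the earlier example: k-fold cross-validation scores the model only on data it was not trained on, giving a more honest performance estimate than accuracy on the training set.

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cross_validated_accuracy(X, y, folds=5):
    """Mean held-out accuracy over k folds (X, y as built by kmer_features)."""
    return cross_val_score(SVC(), X, y, cv=folds).mean()
```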

Recommended reading
Finding genes & other features in genomes
Zvelebil & Baum, Understanding Bioinformatics – Chapters 6, 9 & 10

General background
Primrose & Twyman, Principles of Genome Analysis and Genomics – Chapters 9, 16 & 18
Brown, Genomes 3 – Chapters 1 & 5 (plus bits of Chapters 7 & 8)

[email protected]
