Bioinformatics

Published on May 2017 | Categories: Documents | Downloads: 40 | Comments: 0 | Views: 555

Bioinformatics – Supervision 2
General questions
1. What problem does CLUSTALW address, and how does it work?
Extending dynamic programming alignment algorithms to solve the
multiple alignment problem (i.e. finding the best alignment between
more than two sequences) can be very expensive - for two sequences we
compute a 2D matrix, while for n sequences we must construct an
n-dimensional hypercube and evaluate $2^n - 1$ predecessors for each
cell.
CLUSTALW is perhaps the most widely used progressive alignment
algorithm. Progressive alignment is a family of algorithms that solve
the multiple alignment problem by building a multiple alignment out of
a number of pairwise alignments. As this approach doesn’t necessarily
give the best alignment of the n sequences (as the dynamic programming
approach would), it is heuristic in nature, but much more efficient.
Outline of the CLUSTALW algorithm:
• pairwise alignment: align all possible pairs of sequences against
each other, obtaining a similarity matrix, such that
$\text{similarity}_{i,j} = \frac{\text{exactMatches}}{\text{alignmentLength}}$

• guide tree creation: use the similarity matrix and a clustering
method such as UPGMA or neighbour-joining to create a guide
tree.
• carry out a multiple alignment: merge sub-alignments, guided
by the guide tree, until a final alignment of all initial sequences is
achieved. However, not all pairwise alignments fit well into
the final alignment, so some of them have to be discarded.
[Figure: an example of one such situation.]

2. How can we measure the quality of a multiple sequence alignment?
We can calculate the entropy of the alignment by summing the
entropies of its columns; the lower the total, the more conserved the
columns are. The entropy of a column is the negated sum, over all
possible bases, of the relative frequency of that base at that
position in the aligned sequences multiplied by the logarithm of that
frequency. That is, the entropy $E_i$ of column i is
$E_i = -\sum_{x \in \{A,C,G,T\}} p_{i,x} \log(p_{i,x})$
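The entropy score can be sketched in a few lines of Python. This is only an illustration under some assumptions that are not in the notes: the logarithm is taken base 2, a gap character is counted like any other symbol, and the function names are made up for the example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (base 2, an assumption) of one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log(p) summed over the symbols that actually occur in the column
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropy(sequences):
    """Sum of the column entropies of equal-length aligned sequences."""
    # zip(*sequences) walks the alignment column by column
    return sum(column_entropy(col) for col in zip(*sequences))
```

A fully conserved column such as "AAAA" scores 0, while a column with all four bases equally frequent scores 2 bits, so a lower total means a more consistent alignment.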

3. Discuss three ways to reconstruct DNA sequences if we only get to see
(partly overlapping) fragments of DNA.
To reconstruct a DNA sequence when we only see partly overlapping
fragments of it, we can use:
• Hamiltonian graphs using “reads”:
– nodes: reads.
– edges: if two reads overlap by more than a certain threshold
we put an edge between them.
– algorithm: find a Hamiltonian cycle (visit each node exactly
once) to reconstruct the sequence.
– complexity: NP-complete.
• Hamiltonian graphs using “k-mers”:
– nodes: k-mers. Construct a node for every k-mer appearing
as a substring in some set of reads. Define the suffix of a k-mer
to be the string formed by all its nucleotides except the first
one, and its prefix to be all the nucleotides except the last one.
– edges: directed. There is an edge from k-mer A to k-mer
B if the suffix of A = the prefix of B.
– algorithm: find a Hamiltonian cycle.
– complexity: NP-complete.
• Eulerian graphs using “k-mers”:
– nodes: affixes, i.e. the (k−1)-mers that occur as the prefix or
the suffix of some k-mer.
– edges: directed. Instead of having the k-mers be nodes, we
have the k-mers as edges here. There is an edge K (some k-mer
generated from the input reads) from node A to node B
if the affix A is the prefix of K and the affix B is the suffix of K.
This construction is called a de Bruijn graph.
– algorithm: we want to find an Eulerian cycle (visit each edge
exactly once). Such a cycle exists if and only if the de Bruijn
graph is connected and balanced, that is, for each node its
indegree (number of edges that go into that node) equals its
outdegree (number of edges that leave the node).
– complexity: determining whether the graph is balanced
is O(n), where n is the number of edges. However, I can’t
understand how we find the actual cycle (if, say, the genome
is not cyclic, so that it does matter which node we start the
cycle from)...
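One standard way to find the actual walk is Hierholzer's algorithm, which runs in time linear in the number of edges. The sketch below is not taken from the notes; it assumes that an Eulerian path or cycle exists, and the function names are illustrative. For a linear genome the walk starts at the node whose outdegree exceeds its indegree by one; in a balanced graph any node works.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Each k-mer becomes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path or cycle exists."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    # Linear genome: start where outdegree = indegree + 1.
    # Balanced graph: fall back to an arbitrary node with outgoing edges.
    start = next((u for u in graph if out_deg[u] - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the node
    return path[::-1]


def spell(path):
    """Glue the (k-1)-mers of a walk back into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For example, `spell(eulerian_path(de_bruijn(["GCG", "ATG", "CGA", "TGC"])))` rebuilds the sequence from its shuffled 3-mers. When (k−1)-mers repeat, several Eulerian walks may exist and this sketch returns just one of them.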
4. What is the ”additivity” property for a matrix, and why is it useful to
have?
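A matrix D is additive if any four indices can be labelled i, j, k, l
so that the following holds (the four-point condition):
$D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$
Additivity is useful because an additive distance matrix can be
represented exactly by a tree: there is a tree with the species as
leaves in which the path length between leaves i and j equals $D_{ij}$.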
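The condition can be checked by brute force over all quadruples of indices. A minimal sketch, assuming D is a symmetric list-of-lists matrix; the name `is_additive` and the tolerance `tol` are illustrative and not from the notes.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point condition: for every quadruple, the two largest of the
    three pairwise sums must be equal (the third is then the smallest)."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        if abs(sums[1] - sums[2]) > tol:
            return False
    return True
```

Sorting the three sums takes care of the labelling: whichever pairing gives the smallest sum plays the role of $D_{ij} + D_{kl}$. The check costs $O(n^4)$ time, which is fine for small matrices.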

5. Explain what microarray data is.
Microarray data describes the activity of genes (their
expression level – the amount of mRNA for that particular gene) in
different cells, possibly under varying conditions and collected
at different time points. These data sets are very large and need to be
carefully processed.
There are two typical experiments that produce microarray data:
• differentiation – compare activity under different conditions.
• temporal expression – explore how activity changes with time.

Algorithms
We would like to compare DNA sequences from different species, so we can
understand evolution. We do that by building trees that represent the
hierarchy of organisms. These are called phylogeny trees; they are labelled
(hypothetical ancestors are shown) and consist of nodes (called taxonomic
units) that split into:
• leaves - each leaf is a different existing species.
• internal nodes - called hypothetical taxonomic units; these are species
that do not exist now, but that we think existed before and evolved into
other species.
Building phylogeny trees can be split into two phases - building an
unlabelled tree and labelling it:
• parsimony based methods – we choose the simplest scientific explanation
that fits the evidence – we allow as few mutations as possible
(as we assume that fewer mutations are more likely). Parsimony based
methods work on already constructed, unlabelled trees to produce a final,
labelled phylogeny tree.
• distance based methods – cluster nodes in such a way that there is
minimal distance between nodes within a cluster and maximal distance
between clusters. These methods produce an unlabelled tree.
*Comment:* I wrote all of these, because I feel slightly confused about
all these terms...
1. Fitch parsimony
• abstract problem – this is a parsimony based method for labelling
an already constructed tree to produce a phylogeny tree.
• practical use – understanding evolution.
• outline of the algorithm
• space and time complexity

2. Sankoff parsimony
• abstract problem – this is a parsimony based method for labelling
an already constructed tree to produce a phylogeny tree.
• practical use – understanding evolution.
• outline of the algorithm

• space and time complexity

3. UPGMA
• abstract problem – this is a distance based method for constructing an unlabeled tree. This tree is rooted and it’s ultrametric
– the distance from the root to any leaf is the same. It is also a
hierarchical clustering algorithm.
• practical use – understanding evolution.
• outline of the algorithm –
(a) initialisation:
– for each species i we create a new cluster $C_i = \{i\}$.
– construct a new tree, where each cluster is a leaf.
(b) iteration:
– find i and j, such that the distance $d(C_i, C_j)$ between
clusters $C_i$ and $C_j$ is minimal.
– create a new cluster $C_k = C_i \cup C_j$; its distance to any
other cluster is the average distance between their members.
– create a corresponding node in the tree, at height $d(C_i, C_j)/2$.
– remove $C_i$ and $C_j$.
– if there is only one cluster left - terminate.

• space and time complexity – there are n − 1 iterations (n is the
number of species): on each iteration we create a new cluster and
remove two clusters, so we decrease the number of clusters by one.
Finding the closest pair by scanning the distance matrix takes
$O(n^2)$ per iteration, so a naive implementation runs in $O(n^3)$
time. The distance matrix needs $O(n^2)$ space; the tree itself is a
binary tree with n leaves and 2n − 1 nodes, i.e. $O(n)$ space.
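The UPGMA outline can be sketched as follows. This is a naive illustration, not an optimised implementation: it scans all remaining distances on every iteration, represents the tree as nested tuples of labels, and uses names (`upgma`, `clusters`, `dist`) that are assumptions made for the example.

```python
def upgma(D, labels):
    """Naive UPGMA on a symmetric distance matrix D (list of lists).
    Returns (tree, root_height), the tree being nested tuples of labels."""
    # cluster id -> (subtree, number of leaves, height of its node)
    clusters = {i: (labels[i], 1, 0.0) for i in range(len(labels))}
    dist = {(i, j): D[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    while len(clusters) > 1:
        # closest pair of clusters (keys are stored as (smaller, larger) id)
        (i, j), d = min(dist.items(), key=lambda item: item[1])
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        for m in clusters:
            dim = dist.pop((min(i, m), max(i, m)))
            djm = dist.pop((min(j, m), max(j, m)))
            # size-weighted mean = average distance between members
            dist[(m, next_id)] = (ni * dim + nj * djm) / (ni + nj)
        del dist[(i, j)]
        clusters[next_id] = ((ti, tj), ni + nj, d / 2)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height
```

With distances 2 within {a, b}, 4 within {c, d} and 8 between the two groups, the sketch returns the tree `(('a', 'b'), ('c', 'd'))` with the root at height 4.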
4. Neighbor Joining
• abstract problem – this is a distance based method for constructing an unlabeled tree. This tree is unrooted, binary and it’s
additive. It is also a hierarchical clustering algorithm.
• practical use – understanding evolution.
• outline of the algorithm –
(a) initialisation:
– start from completely unresolved ”star” tree.
– calculate all pairwise distances, filling in the matrix Dij =
distance between sequence i and sequence j.
(b) iteration:
– for the current number of taxa r, we compute a new matrix
Q:
$Q_{ij} = (r - 2) D_{ij} - \sum_{k=1}^{r} D_{ik} - \sum_{k=1}^{r} D_{jk}$
*Comment:* I don’t understand why we compute that
matrix exactly.
– find i and j, such that $Q_{ij}$ is minimal.
– create a new node u, pairing i and j.
– calculate the distance from i and j to the new node u.
– update the distance matrix.
– if the number of nodes is 3, terminate.

• space and time complexity – I really don’t think I understand this
algorithm, but: the sums $\sum_k D_{ik}$ can be computed once per
iteration in $O(n^2)$, after which each element of Q takes constant
time, so one iteration is $O(n^2)$. With O(n) iterations, I
suppose the time complexity is $O(n^3)$ and the space complexity
$O(n^2)$.
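One iteration of the pairing step can be sketched like this. The Q formula is the one given above; the branch-length formula $\delta(i,u) = D_{ij}/2 + (S_i - S_j)/(2(r-2))$, with $S_i = \sum_k D_{ik}$, is the usual neighbour-joining one and is an addition here, not something stated in the notes. Function names are illustrative.

```python
def q_matrix(D):
    """Q_ij = (r - 2) * D_ij - sum_k D_ik - sum_k D_jk (diagonal set to 0)."""
    r = len(D)
    row_sums = [sum(row) for row in D]  # computed once, O(r^2) overall
    return [[(r - 2) * D[i][j] - row_sums[i] - row_sums[j] if i != j else 0
             for j in range(r)] for i in range(r)]


def nj_step(D):
    """Pick the pair (i, j) minimising Q and the distances to the new node u."""
    r = len(D)
    Q = q_matrix(D)
    i, j = min(((a, b) for a in range(r) for b in range(a + 1, r)),
               key=lambda pair: Q[pair[0]][pair[1]])
    s_i, s_j = sum(D[i]), sum(D[j])
    # assumed standard formula for the branch lengths to the new node
    d_iu = D[i][j] / 2 + (s_i - s_j) / (2 * (r - 2))
    d_ju = D[i][j] - d_iu
    return i, j, d_iu, d_ju
```

Subtracting the row sums is what distinguishes Q from D: a pair is favoured when its two members are close to each other and, at the same time, far from everything else, so two long neighbouring branches are not split up just because their leaves are far apart.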
5. K-means clustering (Lloyd Algorithm)
• abstract problem – a partitioning clustering algorithm.
• practical use – analyse changes of activity in genes and functional
similarity among genes.
• outline of the algorithm –
(a) initialisation:
– choose k, the number of clusters to partition the data into.
– arbitrarily assign k centres - these will be updated to the
mean values of the clusters - $\mu_1, \mu_2, \ldots, \mu_k$
(b) iteration (while cluster centres are changing):
– assign each data point to the cluster with the closest centre
to that data point. In other words, assign a data point
$x_i$ to cluster $C_j$, s.t. $j = \arg\min_{1 \le l \le k} d(x_i, \mu_l)$

– after all data points are assigned, calculate the new mean
value (centre) of each cluster. That is
$\mu_i = \frac{1}{|C_i|} \sum_{d \in C_i} d$

• space and time complexity – each iteration is O(nk), where
n is the number of data points and k the number of clusters, as for
each data point we need to compare the distances to all k centres
(in one dimension a binary search over sorted centres would give
O(n log k), but in general we have to look at every centre). Thus
the overall time complexity is O(Ink), where I is the number of
iterations. This raises an interesting question - is the assignment
guaranteed to converge (and thus the algorithm to terminate)? It is:
no step increases the sum of squared distances from the points to
their centres and there are only finitely many partitions, although
the result may be just a local optimum.
Space complexity is O(n + k).
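A compact sketch of the Lloyd iteration, assuming points and centres are tuples of coordinates and Euclidean distance is used (`math.dist`, Python 3.8+). The `max_iter` cap and the choice to keep the old centre of an empty cluster are assumptions made for the example.

```python
import math


def lloyd(points, centres, max_iter=100):
    """Lloyd's algorithm. Returns (centres, assignment)."""
    centres = list(centres)
    assignment = None
    for _ in range(max_iter):
        # assignment step: index of the closest centre for every point
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the centres would not change either
        assignment = new_assignment
        # update step: each centre becomes the mean of its cluster
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centre
                centres[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centres, assignment
```

The assignment step is the O(nk) part: every point is compared against every centre.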
6. Progressive greedy K-means
• abstract problem – a partitioning clustering algorithm.
• practical use – analyse changes of activity in genes and functional
similarity among genes.
• outline of the algorithm – works in a similar way to Lloyd’s algorithm, but assigns points to clusters in a different way:
(a) initialisation:
– choose k, the number of clusters to partition the data into,
and an arbitrary initial partitioning P.
(b) iteration:
– for each point-cluster pair x-C we calculate the “moving gain”,
that is cost(P) − cost($P_{x \to C}$), where P is the current
partitioning and $P_{x \to C}$ is the partitioning after moving x to
cluster C.
– if for all point-cluster pairs this gain is ≤ 0 then terminate.
– pick the point-cluster pair x-C for which the “moving gain”
is the largest and move x to C.
• space and time complexity – time complexity is O(Ink), where I
is the number of iterations, n the number of data points and k the
number of clusters, assuming each moving gain can be evaluated in
constant time. Space complexity is O(n).
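The greedy variant can be sketched as below. Two assumptions are made that the notes do not spell out: the cost of a partitioning is taken to be the sum of squared distances to the cluster means, and each moving gain is obtained by recomputing the whole cost, which is simple to read but slower than the O(Ink) bound.

```python
def cost(points, assignment, k):
    """Sum of squared distances of the points to their cluster means."""
    total = 0.0
    for j in range(k):
        members = [p for p, a in zip(points, assignment) if a == j]
        if members:
            centre = [sum(c) / len(members) for c in zip(*members)]
            total += sum(sum((x - m) ** 2 for x, m in zip(p, centre))
                         for p in members)
    return total


def greedy_kmeans(points, assignment, k):
    """Repeatedly apply the single move with the largest positive gain."""
    assignment = list(assignment)
    while True:
        current = cost(points, assignment, k)
        best_gain, best_move = 0.0, None
        for i in range(len(points)):
            for j in range(k):
                if j == assignment[i]:
                    continue
                trial = assignment[:i] + [j] + assignment[i + 1:]
                gain = current - cost(points, trial, k)  # the moving gain
                if gain > best_gain:
                    best_gain, best_move = gain, (i, j)
        if best_move is None:  # every gain is <= 0: terminate
            return assignment
        i, j = best_move
        assignment[i] = j
```

Since every accepted move strictly lowers the cost, the loop cannot revisit a partitioning and has to stop.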
7. Markov Clustering
• abstract problem – a partitioning clustering algorithm. Unlike
the two k-means clustering algorithms above, Markov clustering
does not require specifying the number of clusters in advance.
• practical use – analyse changes of activity in genes and functional
similarity among genes.
• outline of the algorithm –
(a) initialisation:
– create the adjacency matrix M associated with the input graph.
– create a transition matrix M′ out of the adjacency matrix
M - i.e. we want matrix entry $M'_{ij}$ to show the probability
that i will be reached from j in one step. Thus M′ is just a
column-normalised version of M:
$M'_{ij} = \frac{M_{ij}}{\sum_k M_{kj}}$

(b) iteration (until convergence):
– expand by taking the e-th power of the matrix, $M'^e$, where
e is the expansion parameter.
– inflate by computing $M'_{ij} = \frac{(M'_{ij})^r}{\sum_k (M'_{kj})^r}$,
where r is the inflation parameter.
• space and time complexity – space complexity is $O(n^2)$ - the space
that we need for the transition matrix M′. The inflation step of the
iteration takes $O(n^2)$ time, as we have to fill in each element of
the matrix. However, the expansion step takes $O(n^3 \log e)$, as it
involves O(log e) matrix multiplications. This gives $O(I n^3 \log e)$
overall complexity, where I is the number of iterations.
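The expansion and inflation steps can be sketched in plain Python. Several things here are assumptions made for the example: every node has at least one edge (typically a self-loop) so that no column sums to zero, the power is computed with e − 1 plain multiplications rather than repeated squaring, and clusters are read off as the sets of non-zero entries in each non-empty row of the converged matrix.

```python
def normalise(M):
    """Divide every column by its sum, giving a column-stochastic matrix."""
    n = len(M)
    col_sums = [sum(M[k][j] for k in range(n)) for j in range(n)]
    return [[M[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, e=2, r=2, max_iter=100, tol=1e-9):
    M = normalise(adjacency)
    for _ in range(max_iter):
        expanded = M
        for _step in range(e - 1):          # expansion: M to the power e
            expanded = matmul(expanded, M)
        # inflation: entrywise power r, then renormalise the columns
        inflated = normalise([[x ** r for x in row] for row in expanded])
        converged = all(abs(a - b) < tol
                        for ra, rb in zip(inflated, M)
                        for a, b in zip(ra, rb))
        M = inflated
        if converged:
            break
    return M


def clusters(M, eps=1e-6):
    """Each non-empty row lists the nodes of one cluster."""
    return {frozenset(j for j, x in enumerate(row) if x > eps)
            for row in M if max(row) > eps}
```

On a graph made of two disconnected triangles (with self-loops) the sketch reports the two triangles as the two clusters, without being told how many clusters to look for.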
