
Fast approximation algorithms for finding node-independent paths in networks

Douglas R. White, Department of Anthropology, University of California, Irvine, CA 92697

M. E. J. Newman, Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501

(Dated: June 30, 2001)

A network is robust to the extent that it is not vulnerable to disconnection by removal of nodes. The minimum number of nodes that need be removed to disconnect a pair of other nodes is called the connectivity of the pair. It can be proved that the connectivity is also equal to the number of node-independent paths between nodes, and hence we can quantify network robustness by calculating numbers of node-independent paths. Unfortunately, computing such numbers is known to be an NP-hard problem, taking exponentially long to run to completion. In this paper, we present an approximation algorithm which gives good lower bounds on numbers of node-independent paths between any pair of nodes on a directed or undirected graph in worst-case time which is linear in the graph size. A variant of the same algorithm can also calculate all the $k$-components of a graph in the same approximation. Our algorithm is found empirically to work with better than 99% accuracy on random graphs and for several real-world networks is 100% accurate. As a demonstration of the algorithm, we apply it to two large graphs for which the traditional NP-hard algorithm is entirely intractable: a network of collaborations between scientists and a network of business ties between biotechnology firms.

I. INTRODUCTION

The logical connections between the structural properties of a graph, such as its robustness to the removal of vertices, and properties of graph traversal, such as numbers of paths between vertices, have been widely studied and many rigorous results are known from graph theory (Harary 1969, Chartrand and Lesniak 1996). An example of particular interest in the study of social networks is that of node-independent paths on graphs: sets of paths between specified pairs of vertices that share no vertices in common other than their starting and ending points. Greater numbers of such paths between nodes provide graphs with greater cohesion. The number of node-independent paths between an initial vertex $i$ and a final vertex $f$ is also the minimum number of vertices which must be removed from the graph in order to disconnect $i$ and $f$ from one another, and is therefore a direct measure of the resilience of the graph to vertex deletion. Unfortunately, the calculation of numbers of node-independent paths between pairs of vertices is computationally difficult, taking time exponential in the size of the graph in the most general case. For all but the smallest or sparsest graphs, this makes exact counting of node-independent paths intractable. In this paper, we present some fast and accurate approximation algorithms which give good lower bounds on the numbers of node-independent paths in large graphs in reasonable time.

Some definitions from the theory of graphs will help to define the concepts required for this approach (White and Harary 2001). Consider a graph $G$ that has $n$ vertices and $m$ edges. Node-independent paths from an initial vertex $i$ to a final vertex $f$ in $G$ are paths from $i$ to $f$ with no vertices in common other than $i$ and $f$. For each pair of distinct vertices $(i, f)$ in $G$, let $K(i, f)$ be the maximum number of such node-independent paths. The local connectivity $\kappa(i, f)$ of two distinct and nonadjacent vertices $i$ and $f$ in $G$ is the minimum number of vertices that must be removed (minimum separating cutset) to disconnect them. In particular, if $\kappa(i, f) = 0$ then $i$ and $f$ are disconnected in $G$. For any two distinct and nonadjacent vertices $(i, f)$ it is intuitively reasonable, but nontrivial to prove, that $K(i, f) = \kappa(i, f)$ (Menger 1927). Technically, the local connectivity of $i$ and $f$ is not defined if they are adjacent, but for simplicity we will define local connectivity in this case to be equal to the maximum number $K(i, f)$ of node-independent paths between $i$ and $f$. This allows us to state that for any distinct $(i, f)$ in $G$, $K(i, f) = \kappa(i, f)$, and if $G$ is a complete graph, then $K(i, f) = \kappa(i, f) = n - 1$.

If the connectivity $\kappa(i, f)$ is greater than or equal to some number $k$, then $i$ and $f$ are said to be $k$-connected. A maximal subset $S$ of $G$ such that all vertex pairs $(i, f)$ are $k$-connected via paths within $S$ is called a $k$-component. White and Harary (2001) call a $k$-component a cohesive block. A maximal subset $S$ of vertices each pair of which is $k$-connected via paths which

are allowed to include vertices not in $S$ we will call an extracohesive block. Globally, a graph $G$ is said to have connectivity $\kappa(G)$ if no removal of fewer than $\kappa$ vertices will increase the number of components of $G$. Again it is reasonable but nontrivial that $G$ has connectivity $\kappa(G)$ if and only if between any two distinct vertices $(i, f)$ of $G$ there are at least $\kappa$ node-independent paths (Menger 1927). Thus, if $K(G)$ is the pairwise minimum of $K(i, f)$ over all distinct $(i, f)$ in $G$, then for any graph $G$, $\kappa(G) = K(G)$.

In Section II we present an exact algorithm for calculating $K(i, f)$, the number of node-independent paths between a specified pair of vertices, that requires exhaustive back-tracking and computation time that scales exponentially with graph size. In Section III we give a family of approximation algorithms for computing a lower bound on $K(i, f)$ that can be used either for undirected or directed graphs. The computation time taken by the latter algorithms scales linearly with graph size and hence the algorithms are not limited to use with small graphs. For small graphs, the exact algorithm serves as a baseline for evaluating the accuracy of the approximation algorithms, and we perform such an evaluation in Section IV, using both artificial (computer generated) graphs, and graphs from real-world applications. In Section V we then apply the approximation algorithms to large graphs, for which the exact algorithm is computationally intractable. In Section VI we discuss briefly the application of our algorithm to the calculation of the $k$-components of a graph, and in Section VIII we give our conclusions.

II. AN EXACT ALGORITHM FOR NODE-INDEPENDENT PATHS

The standard algorithm for finding all node-independent paths from an initial vertex $i$ to a final vertex $f$ in a graph is the exhaustive back-tracking algorithm, as follows.

1. With each vertex in the graph we associate a label which may take the value “available,” meaning it is available to be incorporated into a path, “unavailable,” meaning it is never to be used as part of a path, or an integer numerical value $k$, indicating that it is a part of the $k$th node-independent path found between $i$ and $f$. Initially all vertices are labeled “available” except the initial vertex $i$, which is labeled “unavailable.” We also keep a record of the current number $n$ of node-independent paths on the graph, whose value is initially zero.

2. Within the set of vertices which are currently marked “available,” we search for a new path from $i$ to $f$ (i.e., one which has not previously been considered as the $n$th path). If one exists, we increase the value of $n$ by one and then mark all the vertices along the chosen path with the numerical value $k = n$, to indicate that they have been used as part of the $n$th path. Then we repeat from step 2.

3. If no path of available vertices exists, we take all the vertices which belong to the $n$th path and mark them “available” again. Then we decrease $n$ by one, and repeat from step 2.

4. When there are no more possible candidates for the first path from $i$ to $f$, the algorithm ends. The number of node-independent paths on the graph is the highest value obtained by $n$ during the run.

This algorithm can be used for either directed or undirected graphs. It will always correctly give the number of node-independent paths between $i$ and $f$, but it is in general very slow. In the typical case the total number of paths between two vertices scales exponentially with graph size, and running time is thus also at least exponential in graph size. Improvements on the algorithm are possible. For example, the algorithm as we have described it separately considers configurations in which the same set of paths between $i$ and $f$ are present on the graph, but are just labeled in a different order. Removing this redundancy can speed the calculation by a substantial factor. However, overall scaling is still at least exponential in graph size.

In practice this limitation means that the algorithm can be used for graphs up to a few tens of vertices, more in the case of graphs with low average degree than ones with high average degree. For large graphs of hundreds or thousands of nodes, which have become increasingly common in recent years, the exhaustive algorithm is entirely impractical. Instead, therefore, we turn to approximation methods to estimate numbers of node-independent paths.

III. APPROXIMATION ALGORITHMS

For many problems whose exact numerical solution is slow, faster algorithms exist which will give approximate answers. In the best case, these algorithms give an absolute bound on the true answer to the problem, and we here present a one-parameter family of approximation algorithms for counting node-independent paths which all do exactly this. The one parameter in this family of algorithms allows us to tune the time taken by the algorithm, with the trade-off being that quicker members of the family provide poorer bounds on the number of node-independent paths. Our algorithm runs in time linear in the number of vertices in the graph, with only the leading constant varying with the value of the parameter.

The fundamental idea behind our method is as follows. As before we find node-independent paths by selecting one path from the initial vertex of interest $i$ to the final vertex $f$, marking the vertices along that path as taken, so that they cannot be used by any other path, and then looking for another path among the remaining vertices.

This process is repeated until no more paths from $i$ to $f$ exist. Instead of running through all possible paths from $i$ to $f$ as before, however, we note that longer paths are less likely to belong to large sets of node-independent paths than shorter ones. If we choose a very long and circuitous path from $i$ to $f$ as our first path, for example, then that choice removes from the graph a large number of vertices, which makes it hard to find many other node-independent paths among the remaining vertices. Thus our chances of finding many node-independent paths will on average be improved if we favor short paths from $i$ to $f$. Our algorithm takes this idea to its most extreme conclusion, and considers only shortest paths from $i$ to $f$. In its simplest form, our algorithm is as follows.

1. Initially label all vertices “available.” Set $n = 0$.

2. Search for the shortest path of available vertices from $i$ to $f$. If such a path exists, increase $n$ by one, label all vertices along this path as belonging to the $n$th path and hence no longer available, and repeat from step 2.

3. If no path exists from $i$ to $f$, the algorithm ends.

The final value of $n$ is a rigorous lower bound on the number of node-independent paths from $i$ to $f$. If the final value of $n$ is equal to either $d_i$ or $d_f$, the degrees of the initial and final vertices, then this bound is in fact the exact number of node-independent paths, since the smaller of the two degrees provides an upper bound on the number of paths. Note that, even though we consider only shortest paths from $i$ to $f$, this does not mean that all paths found are the same length, since the shortest path composed of “available” vertices is not necessarily the same length as the shortest path on the complete graph.
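The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes an undirected graph stored as a dictionary of neighbor sets, and the names `make_graph`, `shortest_path`, and `count_paths_lower_bound` are hypothetical. The handling of a direct edge between $i$ and $f$ (counted once and then removed, since it has no interior vertices to mark) is also an assumption that the text does not spell out.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path(adj, src, dst, taken):
    """Breadth-first search for a shortest src-dst path that avoids 'taken' vertices.

    Returns the path as a list of vertices, or None if no such path exists.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and v not in taken:
                parent[v] = u
                queue.append(v)
    return None


def count_paths_lower_bound(adj, i, f):
    """Return (n, exact): n is a lower bound on the number of node-independent
    paths from i to f, and exact is True when n equals min(d_i, d_f)."""
    d_min = min(len(adj[i]), len(adj[f]))          # degree-based upper bound
    adj = {u: set(vs) for u, vs in adj.items()}    # private copy; edges may be removed
    taken = set()                                  # vertices no longer "available"
    n = 0
    while True:
        path = shortest_path(adj, i, f, taken)
        if path is None:
            break
        n += 1
        if len(path) == 2:
            # Assumption: a direct i-f edge is used once and then removed.
            adj[i].discard(f)
            adj[f].discard(i)
        taken.update(path[1:-1])                   # mark interior vertices as used
    return n, n == d_min


# A graph on which the unique shortest path i-a-d-f blocks both longer paths,
# so the bound (1) falls short of the true value (2).
blocking = make_graph([("i", "a"), ("i", "b"), ("a", "x"), ("x", "c"), ("c", "f"),
                       ("b", "y"), ("y", "d"), ("d", "f"), ("a", "d")])
print(count_paths_lower_bound(blocking, "i", "f"))
```

In the second return value the sketch reports whether the degree test certifies the result as exact; when it is False the count may still be correct, but the algorithm cannot tell.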
The shortest path between two vertices on any graph can be found in time $O(m)$ by breadth-first search, where $m$ is the number of edges in the graph, and since $\min(d_i, d_f)$ is an upper bound on the number of times we have to search for the shortest path, worst-case total running time is $O(m \min(d_i, d_f))$.

In Fig. 1 we illustrate the application of this algorithm to pairs of vertices on three different undirected graphs. In the first graph, Fig. 1a, the algorithm works perfectly: it not only gives the correct result, it also tells us that the result is correct, since the number of paths found is equal to the degree of, in this case, both $i$ and $f$. In the second graph, Fig. 1b, the algorithm again gives the correct result, but cannot tell that the result is correct, since the number of paths found is less than the degree of either $i$ or $f$. In the third graph, Fig. 1c, the algorithm does not give the correct result, but still gives a correct lower bound on the number of node-independent paths.

FIG. 1: Examples of the application of our algorithm to find node-independent paths between vertices $i$ and $f$ in three simple graphs. In each case the first path found by the algorithm is colored red and the second one (if it exists) green.

The algorithm as we have described it is not yet complete. Some graphs have more than one shortest path of the same length between a specified pair of points. What do we do in this case? The most thorough approach would be to go through each shortest path in turn, marking their vertices as taken, and searching for all shortest paths between $i$ and $f$ in the remaining subset of vertices, and repeating until no more paths exist from $i$ to $f$. This algorithm however is considerably slower than the one described above. In the typical case, the number of shortest paths between two points $i$ and $f$ is bounded by an increasing polynomial $n^\alpha$ in the number of vertices and, allowing for the $O(m)$ time taken by the breadth-first search procedure, the total running time of the algorithm is $O(n^\alpha m)$, which is slower than linear in system size, even for sparse graphs.

We can however recover linear performance by a slight modification of the algorithm. For given initial and final vertices $i$ and $f$ we calculate all shortest paths between them, then choose $p$ of those paths to pursue further, or fewer if there were not $p$ found. The value of $p$ can be tuned to vary the running time of the algorithm. Each of the $p$ paths chosen is eliminated from the graph in turn, and the shortest path calculation is repeated on the remaining subgraph, just as before. It is easy to see that the running time of this algorithm is bounded above by $O(p^{\min(d_i, d_f)} m)$, and in the extreme case where $p$ is set to 1, is just $O(m)$, which is fast enough for even the largest graphs found in real-world applications.

And how do we choose the particular set of $p$ shortest paths that we use in this algorithm? Many strategies are possible, but perhaps the simplest is to choose them at random making use of a suitable random number generator, and this is what we do in the present work. Repeated runs of the algorithm on the same data may give different results, although all runs will give correct lower bounds on the numbers of node-independent paths. In this case, clearly the highest of these lower bounds is our best bound on the actual number of paths. This behavior is typical of many stochastic approximation algorithms.
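The $p$-parameter variant described above might be sketched as follows. This is one possible reading of the text, not the authors' code: the function names are hypothetical, the graph is assumed to be an undirected dictionary of neighbor sets, and the sketch enumerates every shortest path before sampling $p$ of them, which is simple but can be costly when shortest paths are numerous. A direct edge between $i$ and $f$ is again counted once and removed, as an assumption.

```python
import random
from collections import deque


def all_shortest_paths(adj, src, dst, blocked):
    """List every shortest src-dst path that avoids the 'blocked' vertices."""
    dist = {src: 0}
    preds = {src: []}                     # predecessors along shortest paths
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break                         # all predecessors of dst are known by now
        for v in adj[u]:
            if v in blocked:
                continue
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    if dst not in dist:
        return []
    paths = []

    def walk(v, suffix):
        # Walk backwards from dst to src through the predecessor lists.
        if v == src:
            paths.append([src] + suffix)
            return
        for u in preds[v]:
            walk(u, [v] + suffix)

    walk(dst, [])
    return paths


def best_lower_bound(adj, i, f, p=1, rng=None):
    """Best lower bound on node-independent i-f paths found by pursuing
    up to p randomly chosen shortest paths at each stage."""
    rng = rng or random.Random()
    adj = {u: set(vs) for u, vs in adj.items()}   # private copy
    direct = 0
    if f in adj[i]:
        # Assumption: the direct edge is the unique shortest path; use it once.
        direct = 1
        adj[i].discard(f)
        adj[f].discard(i)

    def explore(blocked):
        candidates = all_shortest_paths(adj, i, f, blocked)
        if not candidates:
            return 0
        chosen = rng.sample(candidates, min(p, len(candidates)))
        return 1 + max(explore(blocked | set(path[1:-1])) for path in chosen)

    return direct + explore(frozenset())


# Three shortest paths of equal length; choosing i-a-d-f first blocks the other two.
edges = [("i", "a"), ("i", "b"), ("a", "c"), ("c", "f"),
         ("b", "d"), ("d", "f"), ("a", "d")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)
print(best_lower_bound(graph, "i", "f", p=3))
```

With `p=1` the result on this small graph depends on which shortest path the random draw selects, whereas larger `p` explores enough alternatives to recover both paths; this mirrors the trade-off between running time and quality of the bound described in the text.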

p    total incorrect    percentage error
1    732                3.85
2    166                0.87
3    102                0.54
4    93                 0.49
5    86                 0.45
6    88                 0.46

TABLE I: Comparison of the results of the approximation algorithm, for various values of the parameter $p$, against exhaustive enumeration results for 100 random graphs $G_{n,m}$ with $n = 20$ vertices and $m = 40$ edges. The second column indicates for how many of the 19 000 distinct pairs of vertices in the 100 graphs the number of node-independent paths was incorrectly calculated. The third column is the corresponding percentage of the time the approximation algorithm is in error.
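As a quick consistency check on Table I, the percentage column can be recomputed from the counts, assuming (as the caption states) 100 graphs of 20 vertices each, so that the total number of distinct vertex pairs is 19 000. The variable names below are illustrative only.

```python
# Counts of incorrectly calculated pairs from Table I, indexed by p.
incorrect = {1: 732, 2: 166, 3: 102, 4: 93, 5: 86, 6: 88}

n_graphs, n_vertices = 100, 20
pairs_per_graph = n_vertices * (n_vertices - 1) // 2   # distinct pairs in one graph
total_pairs = n_graphs * pairs_per_graph               # distinct pairs over all graphs

# Percentage of pairs for which the approximation disagrees with exact enumeration.
percentage_error = {p: round(100 * count / total_pairs, 2)
                    for p, count in incorrect.items()}

for p, err in percentage_error.items():
    print(p, incorrect[p], err)
```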

IV. TESTS OF THE ALGORITHM

In this section we test our algorithm on a number of graphs which are small enough that we can also perform the exact exhaustive enumeration of node-independent paths described in Section II. This allows us to compare the results from the approximation algorithm with the known correct results for the same graph and hence determine how well the algorithm performs in a variety of cases. We first test the algorithm on a set of computer-generated random graphs. These have the advantage that we can generate many of them with statistically similar properties, allowing us to gauge quantitatively how often the results given by our approximation algorithm differ from the exact answer. We then further test the algorithm on two real-world graphs, for both of which it turns out to work perfectly.

A. Random graphs

As our first test of the algorithm we have generated one hundred random graphs, of the type normally denoted $G_{n,m}$, i.e., graphs with a fixed number of vertices $n$ and a fixed number of edges $m$, with edges placed between pairs of vertices chosen uniformly and independently at random (Bollobas 1985). In this test we used $n = 20$ vertices and $m = 40$ edges, so that the average vertex degree is four. These graphs are small enough and sparse enough that exhaustive enumeration of node-independent paths between each of the $\frac{1}{2} n(n - 1) = 190$ pairs of vertices can be performed in just a few seconds. In Table I we show a comparison of this exhaustive enumeration with runs of the approximation algorithm. The table shows that, out of the 19 000 distinct pairs of vertices in the one hundred graphs, 732 of them were accorded the wrong number of node-independent paths by our approximation algorithm, when the parameter $p$ was set to its minimum value of 1. This corresponds to 3.85% of all pairs. The other 18 268, the vast majority, were calculated correctly by the algorithm. The number of errors falls sharply as the value of $p$ is increased, eventually leveling off at about 90 when $p$ reaches 4, which corresponds to a percentage error of about 0.5%. For values of $p$ higher than this, no improvement is seen in the number of pairs calculated correctly, which presumably indicates that the remaining incorrect pairs are of the type seen in the third panel of Fig. 1, for which no value of $p$, however large, will ever give the correct answer.

Of course, random graphs are not generic; they are a very special subset of all graphs. However, these results indicate that, under appropriate conditions, our algorithm can achieve an accuracy of better than 99%.

B. Zachary's karate club

For our second test of the algorithm, we use real-world data, drawn from Zachary's well-known “karate club” study (Zachary 1975, 1977). During two years of ethnographic observation of 34 members of a karate club, a karate teacher (T, #1, Mr. Hi) and a club administrator (A, #34, John) were in dispute about whether to improve the club's solvency by raising fees (teacher) or by holding down costs (administrator). This resulted in each calling meetings at which they hoped to pass self-serving resolutions by encouraging attendance by their own supporters. The formation of factions was visible to the ethnographer and evident in meeting attendance, which varied in factional proportions according to the convener. Ultimately the teacher was fired and set up a separate club, and the factional split became the basis for each person's choice of which of the new clubs they would join.

Zachary collected data on friendships between pairs of individuals within the club. He constructed networks in which friendships were weighted according to the number of contexts (karate and other classes, tournaments, bars and hangouts) in which the individuals in question met. Here we consider only the unweighted version of the friendship network, which has 34 nodes including the teacher and administrator.

Applying our approximation algorithm to the karate club data and comparing the results with exhaustive enumeration of node-independent paths, we find again that with $p = 1$, the algorithm makes a moderate number of errors: typically about 10 pairs of vertices out of 561 are accorded the wrong number of node-independent paths, about a 2% error. However, this falls off sharply as $p$ is increased, and for $p > 4$ we find that the algorithm performs the calculation perfectly on almost all runs: all 561 path counts are exactly the same as those from the exhaustive enumeration.
In the left panel of Fig. 2 we show a hierarchical clustering tree for the karate club data, where the distance between vertices in the dataset is the reciprocal of the number of node-independent paths between them. As the figure shows, vertices 1, 2, 3, 33, and 34 are at the core of the club, with other club members belonging to the community through their connections with one of these. In the right panel of Fig. 2, we show the minimum spanning tree for the same calculation, which reveals indeed that the network splits roughly into two parts, one centered around vertices 1 and 3, and one around vertices 33 and 34. The minimum spanning tree of a component of $n$ vertices within a graph is the set of $n - 1$ edges which connects all vertices in the component while having the maximal weight, where the weight in this case is the number of node-independent paths. In cases where pairs of vertices tied for number of node-independent paths, we broke the tie in favor of the pair separated by the shortest geodesic distance.

FIG. 2: Left: hierarchical clustering of the karate club dataset, based on a distance between pairs of vertices equal to the inverse of the number of node-independent paths between them. Right: minimum spanning tree for the same.

This calculation is a good example of the speed of the algorithm also. The exact enumeration of all node-independent paths for this graph took six hours on a current workstation (circa 2001). With $p = 5$, our approximation algorithm took 28 seconds on the same computer to get identical answers.
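The spanning-tree construction described above can be sketched with Kruskal's algorithm, taking the heaviest pairs first. The text does not say which algorithm was actually used, so this is an assumed choice; the function name and the input layout (dictionaries keyed by vertex pairs, holding path counts and geodesic distances for all pairs) are likewise hypothetical.

```python
def max_weight_spanning_forest(vertices, weight, distance=None):
    """Greedy (Kruskal-style) selection of pairs with the largest path counts.

    weight:   dict mapping (u, v) -> number of node-independent paths
    distance: optional dict mapping (u, v) -> geodesic distance, used only to
              break ties in favor of the closer pair
    Returns the chosen pairs; each connected component contributes a tree.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def sort_key(pair):
        tie = distance[pair] if distance else 0
        return (-weight[pair], tie)       # heaviest first, then shortest distance

    tree = []
    for pair in sorted(weight, key=sort_key):
        if weight[pair] <= 0:
            continue                      # pairs with no connecting path are skipped
        u, v = pair
        root_u, root_v = find(u), find(v)
        if root_u != root_v:              # joining two different subtrees
            parent[root_u] = root_v
            tree.append(pair)
    return tree


example_weight = {("a", "b"): 3, ("b", "c"): 2, ("a", "c"): 1}
print(max_weight_spanning_forest("abc", example_weight))
```

For the hierarchical clustering in the left panel, the same pairwise counts would instead be converted to distances by taking reciprocals, as the text describes.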

C. Taro exchange network

Our third test of the algorithm uses data from a network with very sparse and uniform links and much local structure. Among the Orokaiva of Papua New Guinea, Schwimmer (1973) collected data from the village of Sivepe on taro exchange between 20 households. At feasts, raw taro is given by men to start a social relationship or to transform an existing one into one where the giver attains “Big Man” status. Small but frequent gifts of cooked taro by women express a desire to maintain intimate relations between households. We use Hage and Harary's construction of the taro graph, in which edges correspond to first or second choices of exchange partner or reciprocated third choices (Hage and Harary 1991). Applying our algorithm with $p = 5$ to this network, we again find that the numbers of node-independent paths calculated are in exact agreement with the exhaustive calculation for all pairs of vertices. Once again, the algorithm provides not just a lower bound, but a perfect enumeration of node-independent paths, in a fraction of the time taken by the exhaustive algorithm.

V. APPLICATIONS

In this section we give two examples of applications of our approximation algorithm to networks for which exact enumeration of node-independent paths is impossible, because the networks' size and density make the exhaustive backtracking algorithm of Section II computationally intractable.

A. Collaboration network

We have applied our algorithm to the collaboration network of 271 scientists at the Santa Fe Institute, an interesting case study since the institute focuses on interdisciplinary research. Actors in the network are scientists in residence at the Santa Fe Institute during any part of calendar year 1999 or 2000, and their collaborators. Two actors are considered to have a tie between them (a collaboration) if they coauthored one or more scientific papers together during the same period. The

data were compiled by S. Knutson from publicly available bibliometric sources, and from the institute's technical reports. With $p = 5$, our approximation algorithm took about 70 seconds to calculate numbers of node-independent paths for all pairs of vertices. In Fig. 3 we show the resulting minimum spanning tree for the network. The structure visible in this figure reflects closely the known scientific organization of the institute. The largest component, shown at the top of the figure, comprises 118 vertices, or about 44% of the total, and represents three subject areas, demarcated by the dotted lines. The uppermost of the three is a group of researchers working predominantly on the structure of RNA, and is spearheaded by the scientists represented by vertices 3, 4, 5, 34, and 51. Below that on the left is a group working on mathematical models in ecology, spearheaded by the scientist numbered 2. To the right is a group working in statistical physics, spearheaded by the scientists numbered 1, 12, and 16. The two next largest components of the graph also represent known groupings within the institute. The second largest, shown at the bottom left of the figure, is a group working on HIV, led by the researcher represented by vertex 74. The smaller component to the right of that is a group working on immunology, led by scientist 26.

FIG. 3: Minimum spanning tree of the Santa Fe Institute collaboration network discussed in the text. Dotted lines denote known communities within the institute.

FIG. 4: Minimum spanning tree of the network of biotech companies discussed in the text.

Thus it appears that the numbers of node-independent paths, and our approximation to them using our algorithm, are extracting a significant amount of structure from this particular network. It may appear strange that the research community of an institute which is ostensibly interdisciplinary divides so clearly along lines of research topic. One might think that the point of interdisciplinary research is precisely to avoid such divisions. This however is fallacious: it is a mistake to assume that working in interdisciplinary research necessarily means you have to work in all disciplines. In fact, most researchers concentrate on only one or two areas. What makes the work interdisciplinary is that those areas are often not the areas in which the scientist in question received their original training. Of course, information about scientists' training is not contained in the collaboration network; one would need other data, such as educational records, to detect interdisciplinary work. Nonetheless, the Santa Fe Institute is an interesting subject for study because there are collaborative ties between people with widely different interests, something which is rare in more traditional research environments.

B. Biotechnology firms

Our second example application is to a network of formal inter-organizational relationships made by dedicated biotechnology firms (DBFs) focusing on research on medicines for humans. Collaborative ties of finance, R&D, commercialization, and licensing, connecting biotech firms, pharmaceuticals, finance and venture capital, universities, research institutes, and government agencies, underwent considerable growth during the period 1988–1999. The biotech industry emerged from this period with an intensively networked form of organization (Powell 1996, Powell et al. 2001). The data used here were compiled by W. Powell and K. Koput from the industry journal BioScan. Two firms are considered to have a symmetric tie if a contract is reported by a DBF. We have selected for analysis the 310 connected firms from among the 445 DBFs in the network. Figure 4 shows the minimal spanning tree. The firm at the center of the figure is Genentech (36, red), one of the industry leaders, immediately surrounded by other major players (green) such as Genzyme (41), Chiron (22) and, somewhat further away, Genetics Institute (38), CellTech (84), and then ArQule (255) and Amgen (5). This concentric structure of major stars with retinue surrounded by secondary stars and their retinue, and so on out to the margins of the graph, correlates well with descriptions of the cohesive core of the industry given by Powell et al. (2001). It also contrasts markedly with the collaboration network of the previous section, in which, rather than a concentric layered structure, we saw clear separate work-groups surrounding individual group leaders.

VI. FINDING K-COMPONENTS

We can use the algorithm described here to find subgraphs of nodes within a network in which all nodes are connected by at least k node-independent paths. Once we have the numbers of node-independent paths between all pairs, finding such subgraphs is a trivial matter of creating the graph in which there is an edge connecting only those pairs with at least k paths on the original graph, and then finding all components. These subgraphs are the extracohesive blocks discussed in the introduction (White and Harary 2001). They are similar but not identical to the k-components of the graph. The difference is that a k-component is a subgraph in which all pairs of nodes are connected by at least k node-independent paths which each run entirely within the subgraph. The extra condition that the paths must run within the subgraph makes k-components harder to calculate than extracohesive blocks. However, our algorithm can be modified to calculate them also, as follows. Every k-component S of a graph will be a subgraph of an extracohesive block B at level k. Hence, by taking the subgraph B and applying the algorithm again, a smaller subgraph is found for which node-independent paths are computed internally. By successive iterations of this procedure, k-components can be found. The success of our algorithm when applied to the test graphs of Section IV suggests that in fact these subgraphs should be a good approximation to the true k-components of the network.

VII. CONCLUSIONS

In this paper we have introduced a fast new algorithm for computing lower bounds on the numbers of node-independent paths between the nodes of a network. In practice, the algorithm appears to give excellent bounds on path counts, giving path counts which are in error less than 1% of the time on any of the graphs tested, and in many cases agreeing perfectly with exhaustive enumeration methods. The algorithm runs in time linear in the size of the network, a huge improvement over the exhaustive enumeration methods which take time exponential in network size. This makes possible the calculation of numbers of node-independent paths on much larger networks than have previously been feasible. We have given two applications of the algorithm to networks of moderate size (circa 300 nodes), for which it showed short computation times (≲ 100 seconds) and produced useful results that appear to capture the cohesive structures of the networks. The algorithm also has potential applications in the calculation of k-components for networks.
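The construction of Section VI (link every pair of nodes joined by at least k node-independent paths, take connected components, then repeat inside each component) can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the function names are hypothetical, the graph is taken to be an undirected dict mapping each node to a set of neighbours, and `greedy_path_count` is only a simple stand-in that returns a lower bound on the path count. It is not claimed to be the algorithm of this paper, and any other pairwise path counter can be passed in its place.

```python
from collections import deque
from itertools import combinations


def greedy_path_count(adj, s, t):
    """Lower bound on node-independent s-t paths (illustrative stand-in).

    Repeatedly finds a shortest path by breadth-first search and then
    forbids its interior nodes, so the paths found never share a node.
    """
    used = set()          # interior nodes consumed by earlier paths
    direct_used = False   # the direct edge s-t may be used only once
    count = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v in parent or v in used:
                    continue
                if u == s and v == t and direct_used:
                    continue
                parent[v] = u
                queue.append(v)
        if t not in parent:
            return count
        count += 1
        v = parent[t]
        if v == s:
            direct_used = True
        while v != s:         # mark interior nodes of the new path
            used.add(v)
            v = parent[v]


def extracohesive_blocks(adj, k, count_paths=greedy_path_count):
    """Components of the graph that links pairs with >= k paths in adj."""
    linked = {u: set() for u in adj}
    for u, v in combinations(adj, 2):
        if count_paths(adj, u, v) >= k:
            linked[u].add(v)
            linked[v].add(u)
    blocks, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        block, stack = set(), [root]
        while stack:
            u = stack.pop()
            if u not in block:
                block.add(u)
                stack.extend(linked[u] - block)
        seen |= block
        if len(block) > 1:    # ignore isolated nodes
            blocks.append(block)
    return blocks


def approx_k_components(adj, k, count_paths=greedy_path_count):
    """Re-apply the block step inside each block until it stops shrinking."""
    result, pending = [], [set(adj)]
    while pending:
        nodes = pending.pop()
        sub = {u: adj[u] & nodes for u in nodes}   # induced subgraph
        for block in extracohesive_blocks(sub, k, count_paths):
            if block == nodes:
                result.append(block)   # paths now run inside the block
            else:
                pending.append(block)  # strictly smaller, so this terminates
    return result
```

As a small usage example, take two nodes `s` and `t` joined only through three intermediate nodes `x1`, `x2`, `x3`. At k = 3 the pair {s, t} forms an extracohesive block, because the three paths between them exist in the full graph, but it is not a 3-component, because none of those paths runs within the subgraph {s, t}. The iteration therefore discards it, while at k = 2 the whole graph survives as a single component. The all-pairs loop in this sketch is quadratic in the number of nodes and is meant only to make the definitions concrete.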

ACKNOWLEDGMENTS

The authors would like to thank Michelle Girvan, Frank Harary, and Woody Powell for useful comments and suggestions, and Sarah Knutson and Woody Powell for providing the network data used in Sections V A and V B. DRW thanks John Padgett and the SFI working group on Coevolution of States and Markets for hospitality and funding while this work was carried out. The work reported here was funded in part by the National Science Foundation.

REFERENCES

[1] Bollobás, B. 1985 Random Graphs, Academic Press, New York.
[2] Chartrand, G. and L. Lesniak 1996 Graphs and Digraphs, 3rd edition, Chapman and Hall, London.
[3] Hage, Per and Frank Harary 1991 Exchange in Oceania, Clarendon Press, Oxford.
[4] Harary, F. 1969 Graph Theory, Addison-Wesley, Reading, MA.
[5] Menger, Karl 1927 “Zur allgemeinen Kurventheorie,” Fundamenta Mathematicae 10:96–115.
[6] Powell, Walter W. 1996 “Inter-organizational collaboration in the biotechnology industry,” Journal of Institutional and Theoretical Economics 120:197–215.
[7] Powell, Walter W., Douglas R. White, Kenneth W. Koput, and Jason Owen-Smith 2001 “Evolution of a science-based industry: Dynamic analyses and network visualization of biotechnology,” Santa Fe Institute working paper.
[8] Schwimmer, Erik 1973 Exchange in the social structure of the Orokaiva: traditional and emergent ideologies in the Northern District of Papua New Guinea, St. Martin’s Press, New York.
[9] White, Douglas R. and Frank Harary 2001 “The cohesiveness of blocks in social networks: Connectivity and conditional density,” Sociological Methodology, in press.
[10] Zachary, Wayne W. 1975 “The cybernetics of conflict in a small group: An information flow model,” unpublished masters thesis, Department of Anthropology, Temple University.
[11] Zachary, Wayne W. 1977 “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research 33:452–473.
