Multivariate Tobit

Published on February 2017

 

TOBIT MODELS FOR MULTIVARIATE, SPATIO-TEMPORAL AND COMPOSITIONAL DATA

Chris Glasbey, Dave Allcroft and Adam Butler

Biomathematics & Statistics Scotland

 

QUESTION 1: How to analyse data with lots of zeros, such as:

[Figure: winter wheat showing lodging]

Crop lodging (%)

                          Trial
Variety      1      2      3      4      5      6      7
   1         0      0      0    0.3    7.7      0    0.4
   2         0      0      0      0    1.7      0      0
   3      66.7    1.3    0.7    1.0    6.7      0      0
   4         0      0      0      0      0      0      0
   5         0      0      0      0    2.7      0      0
   6         0      0    0.7    0.3   10.0      0      0
   7         0      0      0      0    5.0      0      0
   8       3.3      0      0    1.7   28.3    0.3      0
   9         0      0      0      0   37.7      0      0
  10         0      0      0      0    1.0      0      0
  ...      ...    ...    ...    ...    ...    ...    ...
  30       3.3    3.0      0    2.0   11.0      0    0.2
  31         0    0.3    0.3      0    9.3    0.3      0
  32      30.0    1.3      0    0.3    8.3      0      0

 

QUESTION 2: How to summarise high-dimensional food intake data?

[Figure: 2-dimensional marginal plot, weekly intakes of 2200 adults; brown bread (g) against white bread (g), both axes from 0 to 3000]

 

QUESTION 3: What can be done if rainfall is needed at a finer spatial scale than recorded?

[Figure: rainfall on 40² km squares ⇒ disaggregation ⇒ rainfall on 8² km squares, with a common scale]

 

QUESTION 4: Do compositions of beef and pork differ?

[Figure: ternary plots of protein, carbs and fat for four food groups: Beef, Pork, Fish and Beverages]

Gaussian models are the motorway network of statistics!

Binary data (Z) can be modelled by Gaussians, using the Probit model:

$$ Z = \begin{cases} 0 & \text{if } Y \le 0 \\ 1 & \text{otherwise} \end{cases} \qquad \text{where } Y \sim N(\alpha + \beta x, \sigma^2) $$

[Figure: two panels plotting Y and Z against x, for x from 0 to 10]

So can non-negative data (Z), using the Tobit (or Latent Gaussian) model:

$$ Z = \begin{cases} 0 & \text{if } Y \le 0 \\ f(Y) & \text{otherwise} \end{cases} \qquad \text{where } Y \sim N(\alpha + \beta x, \sigma^2) $$

[Figure: two panels plotting Y and Z against x, for x from 0 to 10]
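The Tobit construction can be made concrete by simulating from it. The sketch below is illustrative only: the function name `simulate_tobit`, the parameter values and the choice f(Y) = Y² are assumptions made for the example, not values taken from the slides.

```python
import random

def simulate_tobit(xs, alpha, beta, sigma, f=lambda y: y * y, seed=0):
    """For each x draw latent Y ~ N(alpha + beta*x, sigma^2); Z = 0 if Y <= 0, else f(Y)."""
    rng = random.Random(seed)
    zs = []
    for x in xs:
        y = rng.gauss(alpha + beta * x, sigma)   # latent Gaussian value
        zs.append(0.0 if y <= 0 else f(y))       # censor at zero, transform otherwise
    return zs

xs = [i * 0.1 for i in range(101)]               # x from 0 to 10
zs = simulate_tobit(xs, alpha=-1.0, beta=0.4, sigma=1.0)
share_zero = sum(z == 0.0 for z in zs) / len(zs)
print(f"proportion of exact zeros: {share_zero:.2f}")
```

Exact zeros arise wherever the latent value falls at or below zero, so the proportion of zeros shrinks as α + βx grows.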

James Tobin (Econometrica, 1958)


PLAN

0. Introduction
1. Univariate data – crop lodging
2. Multivariate data – food intake
3. Spatio-temporal data – rainfall
4. Compositional data – food composition
5. Summary

1. UNIVARIATE DATA – CROP LODGING

Crop lodging (Z)

                          Trial
Variety      1      2      3      4      5      6      7
   1         0      0      0    0.3    7.7      0    0.4
   2         0      0      0      0    1.7      0      0
   3      66.7    1.3    0.7    1.0    6.7      0      0
   4         0      0      0      0      0      0      0
   5         0      0      0      0    2.7      0      0
   6         0      0    0.7    0.3   10.0      0      0
   7         0      0      0      0    5.0      0      0
   8       3.3      0      0    1.7   28.3    0.3      0
   9         0      0      0      0   37.7      0      0
  10         0      0      0      0    1.0      0      0
  ...      ...    ...    ...    ...    ...    ...    ...
  30       3.3    3.0      0    2.0   11.0      0    0.2
  31         0    0.3    0.3      0    9.3    0.3      0
  32      30.0    1.3      0    0.3    8.3      0      0

A square-root transformation normalises the non-zero data, so assume:

$$ Z_{ij} = \begin{cases} 0 & \text{if } Y_{ij} \le 0 \\ Y_{ij}^2 & \text{otherwise} \end{cases} \qquad \text{where } Y_{ij} \sim N(v_i + t_j, \sigma^2) $$

Estimate $v$, $t$ and $\sigma^2$ by numerically maximising the likelihood

$$ \prod_{Z_{ij}=0} \Phi(e_{ij}) \prod_{Z_{ij}>0} \phi(e_{ij}) \qquad \text{where } e_{ij} = \frac{\sqrt{Z_{ij}} - v_i - t_j}{\sigma} $$
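The likelihood above is straightforward to evaluate. The following is a minimal sketch, not the authors' code: the function name `tobit_loglik` is hypothetical, the data are three rows of the crop lodging table and the estimates are those quoted on the slides, as best they can be read from the extracted text. One assumption is made explicit: the uncensored terms include the 1/σ factor of the Gaussian density, which the compact formula on the slide does not show.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tobit_loglik(z, v, t, sigma):
    """Log-likelihood for Z_ij = Y_ij^2 if Y_ij > 0, else 0, with Y_ij ~ N(v_i + t_j, sigma^2).

    z is a list of rows (varieties), each a list over columns (trials).
    """
    total = 0.0
    for i, row in enumerate(z):
        for j, z_ij in enumerate(row):
            e = (math.sqrt(z_ij) - v[i] - t[j]) / sigma
            if z_ij == 0:
                total += math.log(STD_NORMAL.cdf(e))     # P(Y_ij <= 0)
            else:
                # density of Y_ij at sqrt(z_ij); assumption: 1/sigma factor included
                total += math.log(STD_NORMAL.pdf(e)) - math.log(sigma)
    return total

# Varieties 1-3, trials 1-7, with the estimates reported for them
z = [[0, 0, 0, 0.3, 7.7, 0, 0.4],
     [0, 0, 0, 0, 1.7, 0, 0],
     [66.7, 1.3, 0.7, 1.0, 6.7, 0, 0]]
v_hat = [-0.5, -2.8, 1.7]
t_hat = [0.9, -1.5, -1.4, 0.4, 3.5, -0.7, -0.9]
print(tobit_loglik(z, v_hat, t_hat, sigma=1.6))
```

In practice this function would be passed (negated) to a numerical optimiser over v, t and σ; the choice of optimiser is left open here.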

Square-root crop lodging (√Z), with estimates v̂ (varieties), t̂ (trials) and σ̂ = 1.6

                                   Trial
Variety     v̂        1      2      3      4      5      6      7
            t̂ =    0.9   –1.5   –1.4    0.4    3.5   –0.7   –0.9
   1      –0.5       0      0      0    0.5    2.8      0    0.6
   2      –2.8       0      0      0      0    1.3      0      0
   3       1.7     8.2    1.1    0.8    1.0    2.6      0      0
   4         –       0      0      0      0      0      0      0
   5      –2.6       0      0      0      0    1.6      0      0
   6      –0.3       0      0    0.8    0.5    3.1      0      0
   7      –2.3       0      0      0      0    2.2      0      0
   8       0.6     1.8      0      0    1.3    5.3    0.5      0
   9      –0.8       0      0      0      0    6.1      0      0
  10      –3.0       0      0      0      0    1.0      0      0
  ...      ...     ...    ...    ...    ...    ...    ...    ...
  30       0.9     1.8    1.7      0    1.4    3.3      0    0.4
  31       0.2       0    0.5    0.5      0    3.0    0.5      0
  32       0.9     5.5    1.1      0    0.5    2.9      0      0

Diagnostic plots using standardised residuals:

$$ \hat e_{ij} = \frac{\sqrt{Z_{ij}} - \hat v_i - \hat t_j}{\hat\sigma} $$

[Figures: censored scatter plot; Kaplan-Meier estimator & Φ]

2. MULTIVARIATE DATA – FOOD INTAKE

[Figure: brown bread (g) against white bread (g), both axes from 0 to 3000]

UK Data Archive (Essex University): weekly intakes of 51 food types by 2200 adults.

Model intake of food j by adult i by:

$$ Z_{ij} = \begin{cases} 0 & \text{if } Y_{ij} \le 0 \\ f_j(Y_{ij}) & \text{otherwise} \end{cases} \qquad \text{where } Y_{ij} \sim N(\mu_j, 1) $$

and $f_j^{-1}$ is a quadratic power transformation through the origin:

$$ Y = f_j^{-1}(Z) = \alpha_1 Z^{\gamma} + \alpha_2 Z^{2\gamma} $$

Model fitting step 1: Estimate $\mu_j$, $\alpha$ and $\gamma$ by regressing non-zero Z's on normal scores
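One way to read step 1 is sketched below. It rests on assumptions that the slides do not spell out: µ is set so that Φ(−µ) equals the observed share of zeros, the normal scores of the non-zero intakes are µ plus standard normal quantiles at their ranks, and γ is chosen from a small hypothetical grid by least squares through the origin. The function name `fit_power_transform` and the synthetic data are illustrative.

```python
import random
from statistics import NormalDist

def fit_power_transform(z, gammas=(0.2, 0.3, 0.4, 0.5, 0.75, 1.0)):
    """Return (mu, alpha1, alpha2, gamma) for Y = alpha1*Z^gamma + alpha2*Z^(2*gamma)."""
    nd = NormalDist()
    n = len(z)
    positives = sorted(v for v in z if v > 0)
    n0 = n - len(positives)
    mu = -nd.inv_cdf(n0 / n)      # P(Y <= 0) = Phi(-mu) matches the share of zeros
    # normal scores of the non-zero values on the latent scale (all positive)
    scores = [mu + nd.inv_cdf((n0 + r + 0.5) / n) for r in range(len(positives))]
    best = None
    for g in gammas:
        a = [v ** g for v in positives]
        b = [v ** (2 * g) for v in positives]
        # least squares without intercept: solve the 2x2 normal equations
        saa = sum(x * x for x in a)
        sab = sum(x * y for x, y in zip(a, b))
        sbb = sum(y * y for y in b)
        say = sum(x * s for x, s in zip(a, scores))
        sby = sum(y * s for y, s in zip(b, scores))
        det = saa * sbb - sab * sab
        a1 = (say * sbb - sby * sab) / det
        a2 = (sby * saa - say * sab) / det
        rss = sum((s - a1 * x - a2 * y) ** 2 for s, x, y in zip(scores, a, b))
        if best is None or rss < best[0]:
            best = (rss, a1, a2, g)
    return mu, best[1], best[2], best[3]

# Synthetic intakes: latent Y ~ N(0.5, 1), Z = Y^2 when Y > 0
rng = random.Random(1)
z = [y * y if y > 0 else 0.0 for y in (rng.gauss(0.5, 1.0) for _ in range(2000))]
mu, a1, a2, gamma = fit_power_transform(z)
print(mu, a1, a2, gamma)
```

Several values of γ can fit almost equally well, because the two power terms overlap (γ = 0.25 with α1 = 0 reproduces γ = 0.5 with α2 = 0), so the fitted curve is better determined than the individual parameters.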

For example, for intake of white bread:

[Figures: untransformed (Y); transformed (Z = f(Y))]

Further assume $Y_{i.} \sim \mathrm{MVN}(\mu, V)$ ($V_{ii} = 1$, so also correlation matrix)

Model fitting step 2: Estimate $V_{jk}$ by maximising the pairwise likelihood:

$$ \prod_i p(Z_{ij}, Z_{ik}) $$

where

$$ p(Z_{ij}, Z_{ik}) = \begin{cases} \Phi_2(-\mu_j, -\mu_k; V_{jk}) & \text{if } Z_{ij} = 0,\ Z_{ik} = 0 \\ \phi(Y_{ij} - \mu_j)\, \Phi\!\left( \dfrac{-\mu_k - V_{jk}(Y_{ij} - \mu_j)}{\sqrt{1 - V_{jk}^2}} \right) & \text{if } Z_{ij} > 0,\ Z_{ik} = 0 \\ \phi(Y_{ik} - \mu_k)\, \Phi\!\left( \dfrac{-\mu_j - V_{jk}(Y_{ik} - \mu_k)}{\sqrt{1 - V_{jk}^2}} \right) & \text{if } Z_{ij} = 0,\ Z_{ik} > 0 \\ \phi_2(Y_{ij} - \mu_j, Y_{ik} - \mu_k; V_{jk}) & \text{otherwise} \end{cases} $$
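The four cases of the pairwise likelihood can be coded directly. This is a sketch under stated assumptions: the bivariate normal CDF Φ2 is approximated here by Simpson's rule on a one-dimensional integral (a choice made for self-containment, not the authors' method), and the names `phi2_cdf` and `pair_density` are hypothetical. A latent value is passed as `None` when the corresponding intake is zero.

```python
import math
from statistics import NormalDist

ND = NormalDist()

def phi2_cdf(a, b, rho, steps=2000, lower=-8.0):
    """P(U <= a, W <= b) for standard bivariate normal with correlation rho.

    Uses Simpson's rule on the integral of phi(u) * Phi((b - rho*u)/sqrt(1 - rho^2))
    over u in [lower, a]; steps must be even.
    """
    if a <= lower:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    h = (a - lower) / steps

    def g(u):
        return ND.pdf(u) * ND.cdf((b - rho * u) / s)

    total = g(lower) + g(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(lower + k * h)
    return total * h / 3.0

def pair_density(y_j, y_k, mu_j, mu_k, rho):
    """p(Z_ij, Z_ik) for one adult, with rho = V_jk."""
    s = math.sqrt(1.0 - rho * rho)
    if y_j is None and y_k is None:                      # both intakes zero
        return phi2_cdf(-mu_j, -mu_k, rho)
    if y_k is None:                                      # only food k is zero
        return ND.pdf(y_j - mu_j) * ND.cdf((-mu_k - rho * (y_j - mu_j)) / s)
    if y_j is None:                                      # only food j is zero
        return ND.pdf(y_k - mu_k) * ND.cdf((-mu_j - rho * (y_k - mu_k)) / s)
    u, w = y_j - mu_j, y_k - mu_k                        # both intakes positive
    return math.exp(-(u * u - 2 * rho * u * w + w * w) / (2 * s * s)) / (2 * math.pi * s)
```

Summing `math.log(pair_density(...))` over adults gives the log pairwise likelihood for one pair of foods, to be maximised over rho.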

[Figure: estimated correlation matrix $\hat V$, with foods re-ordered]

We prefer to have fewer than $N(N-1)/2 = 1275$ parameters in $V$

In Factor Analysis

$$ V = B B^T + \Sigma = \sum_{l=1}^{L} \beta_l \beta_l^T + \mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2) $$

Equivalently

$$ Y_{ij} = \mu_j + \sum_{l=1}^{L} B_{jl} f_{il} + e_{ij} $$

where $f_{i1}, f_{i2}, \ldots, f_{iL} \sim N(0, 1)$ are latent variables and $e_{ij} \sim N(0, \sigma_j^2)$

Model fitting step 3: Estimate $B$ and $\Sigma$ using the maximum likelihood algorithm due to Joreskog (1967), modified by using $\hat V$ in place of sample covariance matrix

To maximise:

$$ \mathcal{L} = -\log|BB^T + \Sigma| - \mathrm{trace}[(BB^T + \Sigma)^{-1} \hat V] $$

1. Obtain initial estimate of $\Sigma$: $\hat\sigma_j^2 = 1 - \max_{k \ne j} |\hat V_{jk}|$
2. $\hat B = \hat\Sigma^{1/2} \Omega (\Theta - I)^{1/2}$, where $\Theta$ is $L \times L$ diagonal of largest eigenvalues of $\hat\Sigma^{-1/2} \hat V \hat\Sigma^{-1/2}$ and $\Omega$ is the $N \times L$ matrix of corresponding eigenvectors
3. Numerically maximise $\mathcal{L}$ with respect to $\Sigma$
4. Repeat steps 2 and 3 until convergence
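Steps 1 and 2 can be illustrated for the single-factor case (L = 1), where only the leading eigenpair is needed. The sketch below is a simplification rather than the full algorithm: it uses power iteration in place of a general eigen-decomposition, omits the numerical maximisation of step 3, and the names `initial_sigma2` and `single_factor_loadings` are hypothetical.

```python
import math

def initial_sigma2(V):
    """Step 1: sigma_j^2 = 1 - max over k != j of |V_jk|."""
    n = len(V)
    return [1.0 - max(abs(V[j][k]) for k in range(n) if k != j) for j in range(n)]

def single_factor_loadings(V, sigma2, iters=500):
    """Step 2 for L = 1, with S = diag(sigma2).

    Finds the leading eigenpair (theta, omega) of S^{-1/2} V S^{-1/2} by power
    iteration, then returns B = S^{1/2} * omega * sqrt(theta - 1).
    """
    n = len(V)
    sd = [math.sqrt(s) for s in sigma2]
    M = [[V[j][k] / (sd[j] * sd[k]) for k in range(n)] for j in range(n)]
    omega = [1.0 / math.sqrt(n)] * n            # starting vector
    theta = 0.0
    for _ in range(iters):
        w = [sum(M[j][k] * omega[k] for k in range(n)) for j in range(n)]
        theta = math.sqrt(sum(x * x for x in w))   # converges to the largest eigenvalue
        omega = [x / theta for x in w]
    scale = math.sqrt(max(theta - 1.0, 0.0))
    return [sd[j] * omega[j] * scale for j in range(n)]

# Check on a matrix built from known loadings (illustrative values)
b_true = [0.8, 0.6, 0.5, 0.3]
s2_true = [1.0 - b * b for b in b_true]
V = [[b_true[j] * b_true[k] + (s2_true[j] if j == k else 0.0) for k in range(4)]
     for j in range(4)]
print(single_factor_loadings(V, s2_true))
```

When Σ is the true diagonal, the scaled matrix equals the identity plus a rank-one term, so the loadings are recovered exactly (up to sign); with only an initial estimate of Σ, steps 2 and 3 would be alternated as in the list above.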

[Figure: $\hat V$ and fitted $V$, with panels for L = 1, L = 2, L = 3 and L = 4]

[Figure: factor loadings $\hat B$ (L = 2)]

3. SPATIO-TEMPORAL DATA – RAINFALL

We have 12 hourly arrays (1200km × 600km) of a storm in Arkansas, USA. Here are hours 3-5:

[Figure: rainfall arrays for hours 3, 4 and 5]

We will build a model using fine-resolution data, then use it to disaggregate data at a coarser scale and see how well we recover the fine scale.

Similar to the multivariate model:

Step 1: We transform rainfall to a censored Gaussian variable (Y) via a quadratic power transformation

Step 2: We estimate autocorrelations (V) at a range of spatial and temporal lags by maximising pairwise likelihoods

$\hat V$ by spatial lag (rows and columns) and time lag

Time lag 0

  4                              .49
  3                        .57   .52
  2                  .68   .62   .56
  1            .83   .73   .65   .58
  0      1     .89   .75   .66   .59
         0      1     2     3     4

Time lag 1 hour

  4                              .44
  3                        .50   .47
  2                  .57   .53   .49
  1            .63   .59   .55   .51
  0     .68    .65   .60   .55   .51
         0      1     2     3     4

To model $V$ we use a spatio-temporal Gaussian Markov Random Field (GMRF), because rainfall disaggregation requires simulation from conditional distributions

Therefore

$$ p(Y) \propto \frac{1}{|V|^{1/2}} \exp\left\{ -\tfrac{1}{2} (Y - \mu)^T V^{-1} (Y - \mu) \right\} $$

where $V^{-1}$ is the precision matrix, with non-zero entries specifying the conditional dependencies between elements in $Y$

For example, a 3 × 3 × 3 neighbourhood:

[Figure: neighbourhood at times t−1, t and t+1]

requires 5 parameters, if we allow for symmetries

Extending Rue and Tjelmeland (2002), we approximate both space and time by a torus. Therefore, all matrices are Toeplitz block circulant (TBC), and

• the first row summarises a matrix
• we can compute $V$ from $V^{-1}$ via two 3-D Fourier transforms:

$$ V^*_{ijt} = \sum_{k=0}^{N_i-1} \sum_{l=0}^{N_j-1} \sum_{s=0}^{N_t-1} V^{-1}_{000,kls} \exp\left\{ -2\pi\iota \left( \frac{ik}{N_i} + \frac{jl}{N_j} + \frac{ts}{N_t} \right) \right\} $$

then

$$ V_{000,kls} = \frac{1}{N_i N_j N_t} \sum_{i=0}^{N_i-1} \sum_{j=0}^{N_j-1} \sum_{t=0}^{N_t-1} \frac{1}{V^*_{ijt}} \exp\left\{ 2\pi\iota \left( \frac{ik}{N_i} + \frac{jl}{N_j} + \frac{ts}{N_t} \right) \right\} $$
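The same idea can be seen in one dimension, where a circulant matrix is summarised by its first row and inverted by two discrete Fourier transforms. The sketch below is a 1-D analogue only (the slides use 3-D transforms); the ring of 8 sites, the precision values and the function names are hypothetical, and a plain O(N²) transform is used to keep the example self-contained.

```python
import cmath

def dft(x, sign):
    """Discrete Fourier transform with exponent sign * 2*pi*i*k*m/N (no scaling)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def circulant_inverse_first_row(q_row):
    """First row of V = Q^{-1} for a symmetric circulant precision matrix Q."""
    n = len(q_row)
    eig = dft(q_row, -1)                       # eigenvalues of Q (the V* of the slide)
    back = dft([1.0 / e for e in eig], +1)     # transform the reciprocals back
    return [(val / n).real for val in back]

# Precision with one neighbour on each side, on a ring of 8 sites
q = [1.0, -0.4, 0.0, 0.0, 0.0, 0.0, 0.0, -0.4]
v = circulant_inverse_first_row(q)
print([round(x, 4) for x in v])
```

The first transform gives the eigenvalues of the precision matrix, and the second turns their reciprocals into the first row of the covariance matrix, so no matrix is ever formed or inverted explicitly.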

Model fitting step 3: We estimate GMRF parameters by minimising

$$ \sum_i \sum_j \sum_t \frac{1}{i^2 + j^2 + t^2} \left( \hat V_{ijt} - V_{ijt} \right)^2 $$

For neighbourhood size 5 × 5 × 3:

[Figures: Time lag 0; Time lag 1 hour]

Model diagnostics: Bivariate histogram of pairs of wet locations at a spatial separation of 8km

[Figures: observed and expected histograms, axes from 0 to 50mm]

Model diagnostics: Histogram of rainfall for locations for which the adjacent location was dry

[Figure: solid line observed, dashed line expected]

Disaggregation

Gibbs sampling to update blocks of 5 × 5 pixels ($Y_A$)

Conditional distribution is multivariate normal, obtained from

$$ \begin{pmatrix} Y_A \\ Y_B \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} V_{AA} & V_{AB} \\ V_{BA} & V_{BB} \end{pmatrix} \right) $$

where dimension of neighbourhood $Y_B$ is $(3 \times 9^2 - 5^2) = 218$

Use rejection sampling to constrain $Y_A$ such that $Z_A$ matches observed rainfall
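The conditional distribution follows from the standard partitioned-normal result: the mean is µ_A + V_AB V_BB⁻¹ (Y_B − µ_B) and the variance is V_AA − V_AB V_BB⁻¹ V_BA. The sketch below simplifies to a scalar Y_A (the slides update a 25-pixel block) and uses a small hand-written linear solver; the function names and example values are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]     # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def conditional_normal(mu_a, mu_b, v_aa, v_ab, v_bb, y_b):
    """Mean and variance of scalar Y_A given the vector Y_B = y_b."""
    w = solve(v_bb, v_ab)                                # w = V_BB^{-1} V_BA
    mean = mu_a + sum(wi * (yi - mi) for wi, yi, mi in zip(w, y_b, mu_b))
    var = v_aa - sum(wi * vi for wi, vi in zip(w, v_ab))
    return mean, var

# Size of the conditioning set quoted above: three 9 x 9 time slices minus the 5 x 5 block
neighbourhood_size = 3 * 9**2 - 5**2
mean, var = conditional_normal(0.0, [0.0, 0.0], 1.0, [0.5, 0.5],
                               [[1.0, 0.5], [0.5, 1.0]], [1.0, 1.0])
print(neighbourhood_size, mean, var)
```

A draw from this conditional distribution would then be accepted only if the implied rainfall agrees with the observed coarse-scale value, which is the rejection step described above.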

Which are the 2 simulated disaggregations?

[Figure: three rainfall maps with a common scale, shown first without labels and then labelled Simulation 1, Observed and Simulation 2]

4. COMPOSITIONAL DATA – FOOD COMPOSITION

[Figure: ternary plots of protein, carbs and fat for Beef, Pork, Fish and Beverages]

USDA Nutrient Database: composition of 7270 foods in 25 food groups

We model food compositions by:

$$ Z = \arg\min_X \{ \|X - Y\| : X \in \triangle \} \qquad \text{where } Y \sim \mathrm{MVN}(\mu, V) $$

(Where we ensure $Y^T 1 = 1$, by constraining $\mu^T 1 = 1$ and $VJ = 0$)

In $D$ dimensions, if $Y_1 \le Y_2 \le \cdots \le Y_D$,

$$ Z_l = \begin{cases} 0 & \text{if } l \le L \\ Y_l + \dfrac{1}{D - L} \sum_{i=1}^{L} Y_i & \text{otherwise} \end{cases} $$

where $L$ is smallest integer s.t. $Z \ge 0$

Model fitting

For $D \le 3$, we compute likelihoods analytically

For $D > 3$, we use MCMC: a Gibbs sampler alternately simulating:

• $(Y \mid Z)$ by rejection sampling
• $\mu$ and $V$
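The mapping from Y to Z above is a projection onto the unit simplex and can be written in a few lines. This sketch assumes, as the model does, that the components of Y sum to 1; the function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(y):
    """Closest point of the unit simplex to y, assuming the components of y sum to 1."""
    d = len(y)
    order = sorted(range(d), key=lambda i: y[i])     # indices in ascending order of y
    removed = 0.0                                    # sum of the L smallest components
    for L in range(d):
        shift = removed / (d - L)
        if y[order[L]] + shift >= 0:                 # smallest retained component is non-negative
            z = [0.0] * d
            for i in order[L:]:
                z[i] = y[i] + shift
            return z
        removed += y[order[L]]
    raise ValueError("components of y must sum to 1")

print(project_to_simplex([0.5, 0.7, -0.2]))
```

The loop tries L = 0, 1, 2, ... in turn, so it returns the result for the smallest L that leaves every component non-negative; components set to zero correspond to nutrients absent from the food.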

Maximum likelihood estimates:

[Figure: ternary plots of protein, carbs and fat for Beef, Pork, Fish and Beverages]

Likelihood ratio test shows beef and pork to be different

5. SUMMARY

We have developed Tobit models for data that are:

1. Univariate – crop lodging – additive model
2. Multivariate – food intake – Latent Factors model
3. Spatio-temporal – rainfall – GMRF model
4. Compositional – food composition – bivariate normal model

Issues remaining:

• Efficient estimation
• Model diagnostics
• Generalisations when model does not fit

Further details are in papers on http://www.bioss.sari.ac.uk/staff/chris.html
