256x Filetype PDF File size 0.16 MB Source: www.pauldickman.com
computer methods and programs in biomedicine 81(2006)272–278
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
Relative survival analysis in R
MajaPohara,∗, Janez Stareb
a Department of Medical Informatics, University of Ljubljana, Vrazov trg 2, SI-1000 Ljubljana, Slovenia
b Department of Medical Informatics, University of Ljubljana, Slovenia
article info abstract
Article history: Relative survival techniques are used to compare the survival experience in a study cohort
Received 7 March 2005 with the one expected should they follow the background population mortality rates. The
Received in revised form 12 January techniquesareespeciallyusefulwhenthecause-specicdeathinformationisnotaccurate
2006 ornotavailablesincetheyprovideameasureofexcessmortalityinagroupofpatientswith
Accepted12January2006 acertaindisease.Thereareseveralapproachestomodelingrelativesurvival,butthereisno
widely used statistical package that would incorporate the relevant techniques. The exist-
Keywords: ingsoftwarewasmostlywrittenbytheauthorsofdifferentmethods,indifferentcomputer
Relative survival languages and with different requirements for the data input, which makes it almost im-
Rsoftware possibleforausertochoosebetweenavailablemodels.WedescribeourRpackagerelsurv
Regression that provides functions for easy and exible tting of several relative survival regression
models.
©2006ElsevierIrelandLtd. All rights reserved.
1. Introduction population life tables. Obviously, r(t) can be any non-negative
number,althoughthemethodsaremostoftenappliedtodata
Survivalanalysisisconcernedwithstudyingtherandomvari- wherer(t) is less than 1.
able T, representing the time between entry to a study and Severalapproachestomodellingrelativesurvivalexist,but
some event of interest (e.g. death, recurrence of disease ...). all of the existing programmes (for example: Surv3 [2], SAS
InsteadofthecumulativedistributionfunctionF(t),weusually macrosandStatafunctions[3],RSurvRfunction[4])focuson
estimate the cumulative survival function S(t), which is de- only one of the models and use specically organised general
nedasS(t)=P(T>t)=1F(t).Inthecaseofthenalevent populationtables, makingitdifcultfortheuserstocompare
beingdeath,S(t)istheprobabilityofbeingaliveattimet.When different methods.
the nal event of interest is death due to a certain disease, We present three R [5] functions organised as a package
but the causes of death are not known, it is not possible to called relsurv that largely simplify the usage of relative sur-
directly estimate the proportion of dead due to the disease vival regression models. All the functions require the same
in question. We then resort to the methods of relative sur- basic organisation of the data and can be used with any for-
vival. The cumulative relative survival function r(t) is dened matofthepopulationmortalitydata.
[1] as Section 2 briey describes the three most commonly used
regression approaches and gives an outline of the theory for
r(t):= SO(t) (1) the ve tting methods used in the relsurv functions. Section
S (t)
P 3 describes the functions, their arguments, the preparation
of the data and the returned values. The usage is further ex-
whereS (t)denotesobservedsurvivalandSP(t)standsforpop- plainedthroughanexampleinSection4.Incasetheuserdoes
O
ulationorexpectedsurvival,whichisestimatedonthebasisof not yet have the relevant population mortality tables organ-
∗ Corresponding author. Tel.:+386 1 543 7785; fax: +386 1 543 7771.
E-mail address: maja.pohar@mf.uni-lj.si (M. Pohar).
0169-2607/$ – see front matter © 2006 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.cmpb.2006.01.004
computer methods and programs in biomedicine 81(2006)272–278 273
isedintherequiredratetableformat,theappendixprovides the form E(t) = exp{′z+
′fu(t)}, where z is the vector of
somequickhelp. the predictor variables, fu the vector of the follow-up in-
tervalindicators(withcomponentsfu,wherefu(t) = 1on
i i
interval i and 0 elsewhere) and and
are the vectors of
2. Relative survival regression models regression coefcients. If we denote the interval-specic
∗
observedandexpectedsurvivalproportionsbyp andp
ki ki
The relative survival literature most frequently refers to the respectively, Eqs. (2)–(4) lead us to
additive models, dominating especially in cancer research. The
mainassumptionisthatthehazardofeveryindividualcanbe ln ln pki =′z+
, (6)
split into two additive components: p∗ i
ki
(t) = (t) + (t), (2) which is a generalized linear model with a binomial er-
O P E
ror structure and a complementary log–log link function
whereP(t)isthehazardeverypatienttakesbecauseofhisage, combinedwithadivisionbyp∗.
ki
sex and cohort year (or any other combination of covariates (ii) The glm model with the Poisson error structure [6]: the
included in the population mortality data). (t) denotes the methodissimilartoHakulinen–Tenkanenanditrequires
E
excesshazard,specicforthediseaseinquestionand (t)is the same grouping of the data. Again the is assumed
O E
usedfor the observed hazard - the one we can estimate from to be of the (t) = exp{′z+
′fu(t)} form. The number of
E
our data. Using the equality deaths dki is assumed to be distributed as Poisson(ki),
where = · y and y is the person-time at risk
ki E,ki ki ki
(t)dt for group k of observations on time interval i. Denoting
S(t) = e , (3) ∗
the expected number of deaths as d , Eq. (2) can thus be
ki
wecanwrite writen as
SO(t) ∗
S (t) = S (t) × S (t) =⇒ S (t) = , (4) d
O P E E ki ki ′z+
S (t) = +e i (7)
P y y
ki ki
givingthesameexpressionas1.Severalttingapproachesfor ln( d∗)=′z+
+ln(y ). (8)
the additive model exist and a review of the three methods ki ki i ki
described in this paper is given in [6]. Thisimpliesageneralizedlinearmodelwithoutcome
Themultiplicativemodelsassumethehazardcomponentsto ∗
dki, Poisson error structure, link ln(ki d ) and offset
ki
bemultiplicative: ln(yki).
Themodelcanberewritten[6]toyieldthesamelike-
(t) = (t) · (t). (5) `
O P E lihood function as the Esteve model, meaning that the
grouping of the variables is not crucial. The idea of split-
Such a model does not assume that the observed hazard is ting the observations into subject-band observationsand
always greater than the population hazard, but has a less ob- taking d∗ /y as a kind of an average of individual val-
ki ki P
vious interpretation as the additive model. ues however causes the estimates to slightly differ from
Thethirdoptionarethetransformationmodels[7]thatmake `
those in the Esteve model.
noassumptionabouttherelationship between the observed, `
(iii) TheEsteveadditivesurvivalmodel[10]:themethoduses
the population and the excess hazard. All the individual sur- individual data and estimates the coefcients using the
vival times are rst transformed to a different scale (by taking maximumlikelihoodapproach.Thelikelihoodforthead-
intoaccountthegeneralpopulationmortality),wheretheycan ditive model is
befurther analysed by any of the ordinary survival models.
Ourprogramsprovideforttingthetransformationmodel, n tj n
the Andersen multiplicative model [8], and three differ- dj
L() = ( (t )) exp (s)ds = ( (t )
O j O P j
ent approaches to tting the additive model. Only the last j=1 0 j=1
four have been widely used in the past, while the trans-
formation approach has only recently been published and tj
′z+
′fu(t ) ′z+
′fu(s)
+e j )d exp ( (s) + e )ds ,
given its simplicity, it might become popular in the fu- j P
0
ture, especially in studies where long-term observations are (9)
common.
where t is the event time for person j and d the status
j j
(i) Hakulinen–Tenkanen additive survival method [9]:inor- indicator. The log-likelihood function is then
der to t the model, the patients must rst be grouped
n
into K strata, indexed by k, with one stratum for each ′z+
′fu(tj)
l() = d ln(P(t ) +e )
combinationoftherelevantpredictorvariables (age, sex, j j
cohort year, stage ...)andalife table must be estimated j=1
for each stratum, with intervals indexed by i. The (t)is n tj n tj
E e′z+
′fu(s)ds (s)ds. (10)
assumedtobeamultiplicativefunctionofthecovariates P
and constant within each time interval i. We write it in j=1 0 j=1 0
274 computer methods and programs in biomedicine 81(2006)272–278
Thelast term on the right-hand side in Eq. (10) does not changes—forexampleon1Januaryandontheindividual’s
dependonor
andcanbeomitted.Itfollowsthatonly birthday.
theexpectedhazardsattimeofeventareneededforeach rstrans ts the transformation model as described
individual. in (v). If only the transformation times are needed,
(iv) The Andersen multiplicative model [8]: the model as- this can be done directly by the survexp function
sumes (t) = (t)exp{′z}, which is the same as in the (survival package) or by function rstrans, where
E 0
Coxmodel.Theobservedhazardthusbecomes the transformed times are returned in output value
y(fit<-rstrans(...), y<-fit$y).
′z
(t) = (t) (t)e . (11)
O P 0
The package also includes the functions for testing
This model can be rewritten in the form goodness-of-t and plotting relevant graphs for all the above
modelswithmethodsdescribedin[11].Additionally,twodata
′z+1 ln( (t))
(t) = (t)e P , (12)
O 0 setsareincludedinthepackage.Oneiscalledrdataandcon-
fromwhichitisobviousthatthisisaCoxmodelwithan tainssurvivaldatathatcanbeusedasanexample(seeSection
additional time-dependent variable with the coefcient 4 for more details). The slopop data set contains the popula-
xedto1.Though (t)isacontinuousprocess,thepop- tion mortality tables for Slovenia.
P
ulation mortality tables are usually organised in yearly 3.1. Usage
intervals. In order to t the model using the Cox model
procedures, data can be split in yearly intervals, with the All the functions are called in the same way, for example:
(t) updated on each of the intervals.
P
(v) The transformation approach [7]: the individual survival rstrans(formula, data, ratetable, int, na.action,
times t are rst transformed as init, control)
y = FP(t), (13) Twodatasetsarerequiredforanyrelative survival model.
where F is the cumulative distribution function of a Oneisourobserveddataset, which we will pass to the func-
P tion as argument data. The other is the population mortality
person of a certain age, sex and cohort year (or any tablethatwewanttocompareourobserveddatato,andthisis
othercombinationincludedinthepopulationtables)that speciedintheargumentratetable.Thepopulationmortal-
would apply if the person is a typical representative of ity tables havetobeorganisedasanobjectofclassratetable
the general population. This distribution function is cal- (dened in package survival), default is the survexp.us ta-
culated from the general population mortality data. The ble that contains the US data (also in the survival package).
valuesycanthusbeinterpretedastheachievedvalueson Themodelisspeciedwithintheargumentformula.Itis
theexpectedcumulativedistributionfunctionforeachin- organised following the syntactic rules of an R language for-
dividual.Bytransformingtothenewscale,thepopulation mulaandthussimilartoanyothersurvivalpackageformula.
hazardisautomaticallytakenintoaccount,consequently Theleft-handsidemustbeaSurvobject.Forexample,iftime
all what is left is precisely the disease-specic hazard, and status are the survival time variable and the censoring
which we can thus directly model. One of the possibili- indicator, and x is a covariate, then the command may be
ties is to use the Cox model
′z rstrans(Surv(time,status)∼x)
(y) = 0(y)e . (14)
If the variables that are used for the expected survival
3. Therelsurv package calculation (for example age, sex and year) are not organised
and not named in the same way as in the population tables,
Thecoreofthepackagearethreefunctionsthattthemodels aspecialtermratetableistobeincludedintheformula,for
described in the previous section: example:
rsadd ts an additive model. As described in Sec- Surv(time,status)∼x+ratetable(age=age,sex="male",
tion 2, different methods of estimation exist, and the year=diag)
user can choose among them through the method ar-
`
gument. The default is the Esteve method (iii) which Argument int species the number of yearly intervals to be
is specied by "max.lik", the other two options are usedinrsaddfunction.intcaneitherbeasinglenumberora
method="glm.bin" and method="glm.poi" for models vector.Thelatterisusedwhentheintervalsarenotalloneyear
described in (i) and (ii) of Section 2. When using one long, for example int=c(0,0.5,1,5,10) means a couple of
of the glm methods, the observed and expected sur- shortintervalsatthebeginningandlongerintervalsthereafter
vival proportions for each group are returned as object with the maximum follow-up time of 10 years. Each interval
groups. includestherightendpoint,butnottheleftone,exceptforthe
rsmul ts the multiplicative model as described in (iv). rstintervalwhichalsoincludestheleftendpoint.Inthiscase
A more computationally intensive alternative is cho- the third interval would therefore be (1,5].
sen by specifying method="mul1"that splits the intervals This option can also be used in rsmul and rstrans func-
at every time point in which the population mortality tions, serving simply as the maximum follow-up time after
computer methods and programs in biomedicine 81(2006)272–278 275
which all the data are censored, enabling in this way a more rsadd(Surv(time,status)∼x,ratetable=slopop,
straightforward comparison of the models. data=my.data)
Missing data are handled in the usual way through the ar-
gument na.action and the initial values for the model can assumesthatthedatasetmy.datahasvariablesage,sexand
bespeciedviatheargumentinit.Thenumberofiterations yearorganisedintheformatthatmatchesslopopratetable.
and other tting specications can be specied through the
control argument, which follows the glm.control function 3.3. Values
in the rsadd function and the coxph.control function in
rsmulandrstrans. The rsadd function returns an object of class rsadd, while
rsmulandrstransreturnacoxphobject.Allthereturnedval-
3.2. Preparing data uescanbeviewedassuchorbythehelpofsummaryfunction.
Apartfromthecoefcients,theirstandardizedvaluesandthe
The functions follow syntactic conventions of the R package other usually returned values, the common object returned
survivalandthedatashouldbeorganisedasrequiredbythe byallthefunctionsisnamedyandcontaintsthesurvivaland
survivalfunctions. In particular, this means that we need a censoring times for each individual, organised as a Surv ob-
timeandacensoringvalueforeachsubjectandanyadditional ject. In the case of rstrans function, these are already the
numberofcovariateswewouldliketoincludeinthemodel.All transformed times and this function can thus also be used
the functions also support time-dependent data, with times just for making the transformation.
andeventsorganisedinthestandardwayinaSurvobject. The max.lik method of rsadd function is based on max-
The main feature of any relative survival method is the imum likelihood and therefore the output also contains the
comparison between the observed and population mortality log-likelihoodvaluesstoredasloglik.Theothertwomethods
data. The latter have to be organized as a ratetable object are based on the glm function and the returned objects pre-
(see Appendix A for details) and specied in the ratetable serveallitsvalues.Theoriginalandthegroupeddataareboth
argument of the used function. The only thing that remains stored in the output object as data and groups, respectively.
to be done is matching the variables in our data set with the The original data set contains the demographic variables in
namesofthevariablesthattheratetableisgroupedby.Age the format that matches the rate table (stored as X1, X2,...)
intheratetableobjectisalwaysexpressedindaysandthere- and the covariates in the format in which they were used in
fore the samemustbedonewiththeageintheobserveddata themodel.Foreachofthegroupsinthemodel,objectgroups
set. The calendar year must be specied in the date format contains the number at risk (ld), number of events (nd), and
andthecodingofthesexvariablemustbedoneinthematch- ∗ ∗
the values p (ps), d (dstar) and ln(yki)(lny) dened in Sec-
ingway(intheUSandSlovenetablessexisdenedasafactor ki ki
tion 2. This object can be useful in a more detailed analysis of
variable with levels “male” and “female”, so either a factor the data. Additionally, a note about the number of the groups
variable or an integer with values 1 and 2 will work). in which the observed number of deaths is smaller than the
Theuser, however, does not have to change the data set— expected is reported (this gives a very basic indication about
all the matching can be done by the help of a special term goodnessoft).
in the formula named ratetable. In order to construct this
term, rst check the names of the ratetable variables. For
the slopop rate table the names are 4. Example
>attributes(slopop)$dimid
"age" "year" "sex" Toillustrate the usage of the program, we use a subset of data
from the study of survival of patients after acute myocardial
These names are now to be matched with the variable infarctionthatisincludedinthepackageasthelerdata.The
names(oranyfunctionofthem),forexample: data were collected in the study carried out at the University
Clinical Center in Ljubljana and contain 1040 patients diag-
Surv(time,status)∼x+ratetable(age=age*365.24, nosedbetween1982and1986andfollowedupuntil1997.Dur-
sex="male",year=diag) ing this time 547 deaths occurred and as the causes of death
are not given, this is a good example of the need of the rela-
Thenamesontheleftofeachequalitysignarethedimen- tive survival methodology. The organisation of the data is as
sionnamesoftheratetablenames,whilethenamesonthe follows:
right are the variable names in the observed data set. This ex- >rdata[1:2,]
ampleisforadatasetinwhichnovariablesexisincludedas time cens age sex year agegr
all the patients are male, the calendar year variable is stored 1 2657 1 68 2 24Jun82 62-70
as diag (diagnosis year), and age is in years and must there- 2 1097 1 63 2 31Aug82 62-70
forebemultipliedby365.24tomatchthedataintheratetable Time is measured in days and year of infarction is ex-
(note that .24 is used in all R calculations as there are 24 leap pressed in R date format. Age is measured in years and a cat-
years per century). egoricalvariableagegrcontainingfouragecategories(“under
If the variables in the observed data set are of the same 54”, “54–61”, “62–70”, “71–95”) is formed. The censoring indi-
type and have the same names as the ratetable variables, the cator is specied in variable cens and is coded 0 (censoring)
termratetablecanbeomittedfromtheformula,i.e.thecall and1(event).
no reviews yet
Please Login to review.