Title: | Datasets from "Modelling Survival Data in Medical Research" by Collett |
---|---|
Description: | Datasets for the book entitled "Modelling Survival Data in Medical Research" by Collett (2023) <doi:10.1201/9781003282525>. The datasets provide extensive examples of time-to-event data. |
Authors: | Mark Clements [aut, cre] |
Maintainer: | Mark Clements <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-01-10 10:22:39 UTC |
Source: | https://github.com/mclements/collett |
Clinical trial of 44 patients with chronic active hepatitis randomised to either the drug prednisolone or an untreated control group.
active_hepatitis
A data frame with 44 rows and 3 variables:
treatment
integer treatment (1=prednisolone, 2=control)
time
integer survival time from admission to study (months)
status
integer event indicator (1=event, 0=right censored)
See Collett (2023)
Survival data for female breast cancer patients from Middlesex Hospital. The dataset includes the result of staining with Helix pomatia agglutinin (HPA).
bcancer
A data frame with 45 rows and 3 variables:
stain
integer staining result (1=negative, 2=positive)
time
integer survival time in months
status
integer for status at end of follow-up (0=censored, 1=death)
For details about the study design, see Leathem and Brooks (1987).
The dataset is described in Example 1.2 and Table 1.2 (Collett, 2023, pages 6-7).
Leathem AJ, Brooks S. Predictive value of lectin binding on breast-cancer recurrence and survival. The Lancet. 1987 May 9;329(8541):1054-6. doi:10.1016/S0140-6736(87)90482-X
library(survival)
plot(survfit(Surv(time, status) ~ stain, data = bcancer), col = 1:2,
     xlab = "Survival time (months)", ylab = "Survival")
legend("topright", legend = c("Negative staining", "Positive staining"),
       col = 1:2, lty = 1, bty = "n")
Placebo-controlled trial of bladder cancer patients randomised to thiotepa or to placebo.
bladder
A data frame with 86 rows and 6 variables:
patient
integer patient number (1-86)
time
integer survival time in months
status
integer status of patient (0=censored, 1=recurrence)
treat
integer treatment group (1=placebo, 2=thiotepa)
init
integer initial number of tumours
size
integer diameter of the largest initial tumour in cm
See Collett (2023)
A study of 37 patients with leukaemia in complete remission who received a non-depleted allogeneic bone marrow transplant.
bone_marrow
A data frame with 37 rows and 9 variables:
patient
integer patient number (1-37)
time
integer survival time in days
status
integer status of patient (0=alive, 1=dead)
rage
integer age of patient in years
dage
integer age of donor in years
type
integer type of leukaemia (1=AML, 2=ALL, 3=CML)
preg
integer Donor pregnancy (0=no, 1=yes)
index
double index of cell-lymphocyte reactions
gvhd
integer graft-versus-host disease (0=no, 1=yes)
See Collett (2023)
Patient outcome following bone marrow transplantation
bone_marrow_tx
A data frame with 2204 rows and 9 variables:
id
integer patient id
leukaemia
character type of leukaemia (CML,ALL,AML)
age
character age group of patient in years (<=20, 21-40, >40)
match
integer indicator for whether there was a donor gender match (0=no, 1=yes)
tcell
integer indicator for whether there was T-cell depletion (0=no, 1=yes)
ptime
integer time to platelet recovery (days)
pcens
integer event indicator for platelet recovery (1=event, 0=censored)
rdtime
integer time to relapse or death (days)
rdcens
integer event indicator for relapse or death (1=event, 0=censored)
See Collett (2023)
Recurrence free survival in breast cancer patients
breast_rfs
A data frame with 686 rows and 11 variables:
id
integer patient id
treat
integer hormonal treatment (0=no tamoxifen, 1=tamoxifen)
age
integer patient age (years)
men
integer menopausal status (1=premenopausal, 2=postmenopausal)
size
integer tumour size (mm)
grade
integer tumour grade (1,2,3)
nodes
integer number of positive lymph nodes
prog
integer progesterone receptor status (femtomoles)
oest
integer oestrogen receptor status (femtomoles)
time
integer recurrence-free survival time (days)
status
integer event indicator (0=censored, 1=relapse or death)
See Collett (2023)
The datasets are based on the official .zip file. A table for the dataset names and file names sorted by file name is here:
Dataset name | File name |
-------------------- | ----------------- |
illustration | "A numerical illustration.dat" |
leukaemia | "Bone marrow transplantation in the treatment of leukaemia.dat" |
bone_marrow | "Bone marrow transplantation.dat" |
ovarian | "Chemotherapy in ovarian cancer patients.dat" |
active_hepatitis | "Chronic active hepatitis.dat" |
granulomatous | "Chronic granulomatous disease.dat" |
tamoxifen | "Clinical trial of tamoxifen in breast cancer patients.dat" |
prostatic | "Comparison of two treatments for prostatic cancer.dat" |
kidneytx | "Comparisons between kidney transplant centres.dat" |
liverbase | "Data from a cirrhosis study (baseline).dat" |
liver_counting | "Data from a cirrhosis study (in counting process format).dat" |
lbrdata0 | "Data from a cirrhosis study (lbr data).dat" |
HELP | "Health evaluation and linkage to primary care.dat" |
dialysis | "Infection in patients on dialysis.dat" |
bone_marrow_tx | "Patient outcome following bone marrow transplantation.dat" |
bcancer | "Prognosis for women with breast cancer.dat" |
pulmonary | "Pulmonary metastasis.dat" |
breast_rfs | "Recurrence free survival in breast cancer patients.dat" |
ulcer | "Recurrence of an ulcer.dat" |
bladder | "Recurrence of bladder cancer.dat" |
mammary | "Recurrence of mammary tumours in female rats.dat" |
valve | "Survival following aortic valve replacement.dat" |
tplant | "Survival following kidney transplantation.dat" |
ducks | "Survival of black ducks.dat" |
mice | "Survival of laboratory mice.dat" |
liver | "Survival of liver transplant recipients.dat" |
myeloma | "Survival of multiple myeloma patients.dat" |
lung | "Survival of patients registered for a lung transplant.dat" |
gcancer | "Survival of patients with gastric cancer.dat" |
melanoma | "Survival times of patients with melanoma .dat" |
livertx | "Time to death while waiting for a liver transplant.dat" |
IUD | "Time to discontinuation of the use of an IUD.dat" |
kidney | "Treatment of hypernephroma.dat" |
And now sorted by the dataset names:
Dataset name | File name |
-------------------- | ----------------- |
active_hepatitis | "Chronic active hepatitis.dat" |
bcancer | "Prognosis for women with breast cancer.dat" |
bladder | "Recurrence of bladder cancer.dat" |
bone_marrow | "Bone marrow transplantation.dat" |
bone_marrow_tx | "Patient outcome following bone marrow transplantation.dat" |
breast_rfs | "Recurrence free survival in breast cancer patients.dat" |
dialysis | "Infection in patients on dialysis.dat" |
ducks | "Survival of black ducks.dat" |
gcancer | "Survival of patients with gastric cancer.dat" |
granulomatous | "Chronic granulomatous disease.dat" |
HELP | "Health evaluation and linkage to primary care.dat" |
illustration | "A numerical illustration.dat" |
IUD | "Time to discontinuation of the use of an IUD.dat" |
kidney | "Treatment of hypernephroma.dat" |
kidneytx | "Comparisons between kidney transplant centres.dat" |
lbrdata0 | "Data from a cirrhosis study (lbr data).dat" |
leukaemia | "Bone marrow transplantation in the treatment of leukaemia.dat" |
liver | "Survival of liver transplant recipients.dat" |
liver_counting | "Data from a cirrhosis study (in counting process format).dat" |
liverbase | "Data from a cirrhosis study (baseline).dat" |
livertx | "Time to death while waiting for a liver transplant.dat" |
lung | "Survival of patients registered for a lung transplant.dat" |
mammary | "Recurrence of mammary tumours in female rats.dat" |
melanoma | "Survival times of patients with melanoma .dat" |
mice | "Survival of laboratory mice.dat" |
myeloma | "Survival of multiple myeloma patients.dat" |
ovarian | "Chemotherapy in ovarian cancer patients.dat" |
prostatic | "Comparison of two treatments for prostatic cancer.dat" |
pulmonary | "Pulmonary metastasis.dat" |
tamoxifen | "Clinical trial of tamoxifen in breast cancer patients.dat" |
tplant | "Survival following kidney transplantation.dat" |
ulcer | "Recurrence of an ulcer.dat" |
valve | "Survival following aortic valve replacement.dat" |
As an alternative to using the R datasets, the collett_data
function allows for reading from the original .dat files that
are stored in the package.
collett_data(name)
name |
Character string with the original filename |
A data-frame
head(collett_data("A numerical illustration.dat"))
## which is equivalent to:
head(illustration)
Time from dialysis to infection for patients with diseases of the kidney.
dialysis
A data frame with 13 rows and 5 variables:
patient
integer patient id
time
integer time to infection (days)
status
integer event indicator (0=censored, 1=infection)
age
integer age in years
sex
integer sex of the patient (1=male, 2=female)
See Collett (2023)
Black ducks, Anas rubripes, were followed by the US Fish and Wildlife Service.
ducks
A data frame with 50 rows and 6 variables:
duck
integer duck indicator
time
integer survival time in days
status
integer status of bird (0=alive or missing, 1=dead)
age
integer age group (0=hatch-year bird, 1=bird aged >= 1 year)
weight
integer weight of bird in g
length
integer length of wing in mm
See Collett (2023)
Survival of patients with gastric cancer
gcancer
A data frame with 90 rows and 4 variables:
patient
integer patient id
time
integer survival time in days
status
integer event indicator (0=censored, 1=dead)
treat
integer treatment arm (0=chemotherapy alone, 1=chemotherapy and radiotherapy)
See Collett (2023)
Trial comparing interferon with a placebo.
granulomatous
A data frame with 128 rows and 12 variables:
patient
integer patient number (1-128)
time
integer time to first infection (days)
status
integer status of patient (0=censored, 1=infection)
centre
integer treatment centre; see Collett (2023, page 504)
treat
integer treatment group (0=placebo, 1=interferon)
age
integer age in years
sex
integer sex (1=male, 2=female)
height
double height in cm
weight
double weight in kg
pattern
integer pattern of inheritance (1=X-linked, 2=autosomal recessive)
cort
integer use of corticosteroids at trial entry (1=used, 2=not used)
anti
integer Use of antibiotics at trial entry (1=used, 2=not used)
See Collett (2023)
A clinical trial for patients in a residential detoxification programme. Patients were randomised to either get a referral to a HELP clinic or not.
HELP
A data frame with 447 rows and 7 variables:
subject
integer subject id
days
integer time to linkage to primary care in days
status
integer event indicator (0=no linkage, 1=linkage)
age
integer age of patient in years
gender
integer gender of the patient (0=female, 1=male)
housing
integer Homelessness status (0=homeless, 1=housed)
linkage
integer assistance to linking to healthcare (0=no, 1=yes)
Collett (2023) names this dataset "help", but that name clashes with R's help system, so the dataset has been renamed "HELP". Moreover, the book uses the variables "Time" and "Help", whereas the dataset uses "days" and "linkage", respectively.
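Code written against the book's variable names can be adapted with a simple rename. The sketch below is illustrative only: `help_toy` is a made-up data frame (hypothetical values) that borrows a few of the column names documented above, and `book_names` is a helper introduced here for the mapping.

```r
# Made-up stand-in for HELP: only the column names follow the documentation.
help_toy <- data.frame(
  subject = 1:3,
  days    = c(12L, 365L, 40L),
  status  = c(1L, 0L, 1L),
  linkage = c(1L, 0L, 1L)
)

# Map the package's column names back to those used in the book.
book_names <- c(days = "Time", linkage = "Help")
help_book <- help_toy
idx <- match(names(book_names), names(help_book))
names(help_book)[idx] <- unname(book_names)

names(help_book)
```

The same mapping should apply to the packaged `HELP` data frame, assuming its columns are named as documented.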
Artificial data on patient survival classified according to factors a and b
illustration
A data frame with 37 rows and 4 variables:
a
integer factor a
b
integer factor b
time
integer event time
status
integer event status (1=event, 0=right censored)
See Collett (2023).
A very simple dataset showing potential right censoring for time to discontinuation of the use of an IUD.
IUD
A data frame with 18 rows and 2 variables:
time
integer Time in weeks to discontinuation of the use of an IUD
status
integer Indicator for whether the IUD was discontinued: 0=No, 1=Yes
These data are reported in Table 1.1 (Collett, 2023, page 6).
This study was undertaken at the University of Oklahoma Health Sciences Center to investigate survival among 36 patients with a kidney tumour (hypernephroma). Standard treatment included chemotherapy and immunotherapy, with some patients also having a nephrectomy, or surgical removal of the kidney. For further details, see Lee and Wang (2013).
kidney
A data frame with 36 rows and 4 variables:
nephrectomy
integer indicator for nephrectomy (0=No; 1=Yes)
age
integer age group (1=<60; 2=60-70; 3=>70)
time
integer for the follow-up time in months
status
integer for status at the end of follow-up (1=died; 0=censored)
Lee ET, Wang J. Statistical Methods for Survival Data Analysis. New York, NY: John Wiley & Sons; 2013, fourth edition. https://www.wiley.com/en-sg/Statistical+Methods+for+Survival+Data+Analysis%252C+4th+Edition-p-9781118095027
Transplant survival for recipients of kidneys from deceased donors, recorded across several transplant centres. Patients who were alive with a functioning graft at the last known follow-up are treated as censored.
kidneytx
A data frame with 1439 rows and 9 variables:
patient
integer patient id
centre
integer transplant centre (1-8)
tsurv
integer transplant survival time (days)
tcens
integer event indicator (0=censored, 1=transplant failure)
dage
integer donor age (years)
dtype
integer donor type (0=deceased following brain death, 1=circulatory death)
rage
integer recipient age (years)
diab
integer diabetic status (0=absent, 1=present)
cit
double cold ischaemic time (hours)
See Collett (2023). Thirty-five patients had tsurv==0 (that is, the transplanted kidney did not function).
Log bilirubin measurements over time from a cirrhosis study (lbr data).
lbrdata0
A data frame with 42 rows and 3 variables:
patient
integer patient id
time
integer date of measurement (days)
lbr
double log bilirubin level
See Collett (2023)
Bone marrow transplantation in the treatment of leukaemia
leukaemia
A data frame with 23 rows and 8 variables:
patient
integer patient id
time
integer survival time in days
status
integer event indicator (0=alive, 1=dead)
group
integer disease group (1=ALL, 2=low-risk AML, 3=high-risk AML)
page
integer age of patient in years
dage
integer age of donor in years
precovery
integer platelet recovery indicator (0=no, 1=yes)
ptime
character time in days to return of platelets to normal level (if precovery=1)
See Collett (2023). Note that ptime is stored as character and will need conversion to a numeric type before analysis.
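One way to do the conversion is sketched below. The data frame `leuk_toy` is made up for illustration, and the "." used for patients without platelet recovery is an assumption: check the actual values in `leukaemia$ptime` before relying on it.

```r
# Made-up stand-in for leukaemia: only the column names and types follow
# the documentation; "." as the missing-value marker is an assumption.
leuk_toy <- data.frame(
  patient   = 1:4,
  precovery = c(0L, 1L, 1L, 0L),
  ptime     = c(".", "13", "18", "."),
  stringsAsFactors = FALSE
)

# Strings that are not numbers become NA; the coercion warning is expected.
leuk_toy$ptime_days <- suppressWarnings(as.integer(leuk_toy$ptime))

# Sanity check: a recovery time should only be present when precovery is 1.
all(is.na(leuk_toy$ptime_days) == (leuk_toy$precovery == 0))
```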
Survival of liver transplant recipients
liver
A data frame with 1761 rows and 7 variables:
patient
integer patient id
age
integer patient age in years
gender
integer patient gender (1=male, 2=female)
disease
integer primary disease (1=PBC, 2=PSC, 3=ALD)
time
integer time to event (days)
status
integer event indicator, defined as cof>0 (0=functioning graft, 1=graft failure from any cause)
cof
integer cause of graft failure (0=functioning graft, 1=rejection, 2=thrombosis, 3=recurrent disease, 4=other)
See Collett (2023)
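The relationship between `status` and `cof` can be made explicit, and cause-specific indicators derived in the same way. This is a minimal sketch on a made-up vector of `cof` codes (`liver_toy` and the derived column names are hypothetical), not the packaged data.

```r
# Made-up cause-of-failure codes using the coding documented above.
liver_toy <- data.frame(cof = c(0L, 1L, 4L, 0L, 2L))

# Any-cause graft failure indicator: 1 whenever a cause is recorded.
liver_toy$status <- as.integer(liver_toy$cof > 0)

# Cause-specific indicator, here for rejection (cof == 1); failures from
# other causes are treated as censored for this cause.
liver_toy$rejection <- as.integer(liver_toy$cof == 1)

# Labelled factor, convenient for tabulation.
liver_toy$cause <- factor(
  liver_toy$cof, levels = 0:4,
  labels = c("functioning graft", "rejection", "thrombosis",
             "recurrent disease", "other")
)
table(liver_toy$cause)
```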
Artificial data
liver_counting
A data frame with 54 rows and 7 variables:
patient
integer patient id
start
integer start time (days)
stop
integer stop time (days)
status
integer event indicator (0=censored, 1=uncensored)
treat
integer treatment group (0=placebo, 1=Liverol)
age
integer age of the patient at start of study (years)
lbrt
double logarithm of bilirubin level
See Collett (2023). Note that the log bilirubin variable is named "lbrt" here, whereas it is named "lbr" in "liverbase".
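In counting process format, each row describes one interval (start, stop] for a patient, with the time-varying covariate (here lbrt) held fixed over that interval and the event indicator referring to the end of the interval. The sketch below builds the corresponding response with `Surv()` from the survival package; the data frame `cp_toy` and its values are made up for illustration.

```r
library(survival)

# Made-up rows in the documented layout: two patients, several intervals each.
cp_toy <- data.frame(
  patient = c(1L, 1L, 2L, 2L, 2L),
  start   = c(0L, 47L, 0L, 30L, 61L),
  stop    = c(47L, 184L, 30L, 61L, 251L),
  status  = c(0L, 1L, 0L, 0L, 0L),
  lbrt    = c(3.2, 3.8, 3.9, 4.0, 3.3)
)

# Three-argument Surv() creates a counting process response.
y <- with(cp_toy, Surv(start, stop, status))
attr(y, "type")
```

A response built this way can be used on the left-hand side of a `coxph()` formula, with lbrt entering as a time-varying covariate.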
Artificial data
liverbase
A data frame with 12 rows and 6 variables:
patient
integer patient id
time
integer survival time in days
status
integer event indicator (0=censored, 1=uncensored)
age
integer age of the patient (years)
treat
integer treatment group (0=placebo, 1=Liverol)
lbr
double logarithm of bilirubin level
See Collett (2023)
Time to death while waiting on the liver transplant list.
livertx
A data frame with 281 rows and 7 variables:
patient
integer patient id
time
integer time on the list
status
integer event indicator (0=censored, including having a transplant, 1=died on the list)
age
integer patient age in years
gender
integer patient gender (1=male, 0=female)
bmi
double body mass index (kg/m^2)
ukeld
integer UK end-stage liver disease (UKELD) score
See Collett (2023). A higher UKELD is associated with worse disease severity.
Survival of patients registered for a lung transplant
lung
A data frame with 196 rows and 7 variables:
patient
integer patient id
time
integer time from registration to the earliest of removal from list, last known follow-up date, 30 April 2012, or death (days)
status
integer event indicator (0=censored, 1=dead)
age
integer age in years
gender
integer gender (1=male, 2=female)
bmi
double body mass index
disease
integer disease (1=COPD, 2=fibrosis, 3=suppurative, 4=other)
See Collett (2023)
An animal experiment comparing retinyl acetate (related to vitamin A) given throughout the study (treatment) with retinyl acetate given for the first 60 days followed by no further treatment (control). The female rats all had mammary tumours.
mammary
A data frame with 254 rows and 4 variables:
rat
integer id for each rat
treatment
integer treatment arm indicator (1=treatment, 0=control)
time
double follow-up time (days)
status
integer recurrence indicator (0=no, 1=yes)
See Collett (2023)
Comparing two immunotherapy treatments for patients with melanoma
melanoma
A data frame with 30 rows and 4 variables:
age
integer age group (1=21-40, 2=41-60, 3=61+)
treatment
integer treatment arm (1=BCG, 2=C. parvum)
time
integer survival time (months)
status
integer event indicator (0=censored, 1=dead)
See Collett (2023)
Laboratory study of survival for two groups of mice exposed to radiation.
mice
A data frame with 181 rows and 3 variables:
environment
integer type of environment (1=standard, 2=germ-free)
causeofdeath
integer cause of death (1=thymic lymphoma, 2=reticulum cell sarcoma, 3=other causes)
time
integer survival time (days)
See Collett (2023). Note that there are no censored event times.
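Because every mouse has an observed death time, the cumulative incidence of a given cause by time t reduces to the proportion of mice that died of that cause on or before t. The sketch below illustrates this with a made-up data frame; `mice_toy` and the helper `cuminc_uncensored()` are hypothetical names, not part of the package.

```r
# Made-up stand-in for mice: only the column names and coding follow
# the documentation above.
mice_toy <- data.frame(
  causeofdeath = c(1L, 2L, 3L, 1L, 3L, 2L),
  time         = c(159L, 317L, 40L, 189L, 42L, 430L)
)

# With no censoring, cumulative incidence is an empirical proportion.
cuminc_uncensored <- function(data, cause, t) {
  mean(data$causeofdeath == cause & data$time <= t)
}

# Proportion of toy mice that died of thymic lymphoma (cause 1) by day 200.
cuminc_uncensored(mice_toy, cause = 1, t = 200)
```

With censored data this shortcut no longer holds, and an estimator that accounts for censoring would be needed.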
Patients aged 50-80 years who were diagnosed with multiple myeloma and treated with alkylating agents at West Virginia University Medical Center.
myeloma
A data frame with 48 rows and 10 variables:
patient
integer for a patient identifier
time
integer survival time in months
status
integer for status at follow-up (0=Alive, 1=Dead)
age
integer age at diagnosis in years
sex
integer for sex of the patient (1=male, 2=female)
bun
integer level of blood urea nitrogen at diagnosis (unit assumed to be mg/dL based on the normal range for adults reported by https://en.wikipedia.org/wiki/Blood_urea_nitrogen)
ca
integer serum calcium at diagnosis in mg/dL
hb
double for serum hemoglobin level at diagnosis in g/dL (equivalently, grams per 100 mL)
pcells
integer percent of plasma cells in the bone marrow at diagnosis
protein
integer indicator for whether or not the Bence-Jones protein was present in the urine at diagnosis (0=absent, 1=present)
Krall et al (1975) did not provide the units for all of these measurements, and their analyses used some data transformations, such as log(bun). Collett (2023) converted the data from Krall et al (1975): BUN is reported by Krall and colleagues as X1=log(BUN), although the log base and unit are unclear. Krall and colleagues reported data for 65 individuals, including those younger than 50 and older than 80.
Krall JM, Uthoff VA, Harley JB. A step-up procedure for selecting variables associated with survival. Biometrics. 1975 Mar 1:49-57. doi:10.2307/2529709
## To be completed.
Trial for treatment of ovarian cancer patients comparing cyclophosphamide alone with cyclophosphamide combined with adriamycin.
ovarian
A data frame with 26 rows and 7 variables:
patient
integer patient identifier
time
integer survival time from randomisation in days
status
integer event indicator (0=right censored, 1=event)
treat
integer treatment (1=single, 2=combined)
age
integer age of patients in years
rdisease
integer extent of residual disease (1=incomplete, 2=complete)
perf
integer performance status (1=good, 2=poor)
See Collett (2023)
Randomised controlled trial from the Veteran's Administration Cooperative Urological Research Group. The data include patients with stage III cancer who were randomised to placebo or daily oral treatment with 1.0 mg of diethylstilbestrol (DES).
prostatic
A data frame with 38 rows and 8 variables:
patient
integer patient identifier
treatment
integer treatment indicator (1=placebo; 2=daily treatment with 1.0 mg of diethylstilbestrol (DES))
time
integer survival time from trial entry to end of follow-up in months
status
integer for follow-up status (0=alive or died from other causes, 1=died from prostate cancer)
age
integer age at trial entry in years
shb
double serum hemoglobin at trial entry in g/dL
size
integer size of the primary tumour in cm^3
index
integer Gleason index based on histopathology
TBC.
Andrews DF, Herzberg AM. Data: a collection of problems from many fields for the student and research worker. Springer Series in Statistics; Springer New York, NY; 1985. doi:10.1007/978-1-4612-5098-2
A very simple dataset with no censoring
pulmonary
A data frame with 11 rows and 1 variable:
time
integer survival time from pulmonary metastasis to death in months
See Collett (2023)
Clinical trial for breast cancer patients comparing combined tamoxifen and radiotherapy with tamoxifen alone.
tamoxifen
A data frame with 641 rows and 18 variables:
id
integer patient identifier
treat
integer treatment group (0=tamoxifen+radiotherapy, 1=tamoxifen)
age
integer patient age at study entry (years)
size
double tumour size (cm)
hist
integer tumour histology (1=ductal, 2=lobular, 3=medullary, 4=mixed, 5=other)
hr
integer hormone receptor level (0=negative, 1=positive)
hb
integer Haemoglobin level (g/l)
andis
integer axillary node dissection (0=no, 1=yes)
lsurv
integer time to local relapse or last follow-up (days)
ls
integer local relapse (0=no, 1=yes)
asurv
integer time to axillary relapse or last follow-up (days)
as
integer axillary relapse (0=no, 1=yes)
dsurv
integer Time to distant relapse or last follow-up (days)
ds
integer distant relapse (0=no, 1=yes)
msurv
integer time to second malignancy or last follow-up (days)
ms
integer second malignancy (0=no, 1=yes)
tsurv
integer time from randomisation to death or last follow-up (days)
ts
integer status at last follow-up (0=alive, 1=dead)
See Collett (2023)
Survival following kidney transplantation
tplant
A data frame with 434 rows and 7 variables:
patient
integer patient id
donor
integer donor id
time
integer survival time in days
status
integer event indicator (0=censored, 1=graft failure or death with a functioning graft)
age
integer patient age (years)
diabetes
integer diabetes status (0=absent, 1=present)
cit
double cold ischaemic time, the time in hours between retrieval of the kidney from the donor and the transplantation
See Collett (2023)
A double-blind trial comparing two treatments for ulcers. Data from Belgium.
ulcer
A data frame with 43 rows and 6 variables:
patient
integer patient id
age
integer age at the end of the trial in years
duration
integer duration of verified disease (1: <5 years, 2: >=5 years)
treatment
integer treatment arm (1=A,2=B)
time
integer time since last visit (months)
result
integer result of the last visit (1=no ulcer detected, 2=ulcer detected)
See Collett (2023)
Patients following an aortic valve replacement are measured for left ventricular mass index (LVMI).
valve
A data frame with 988 rows and 11 variables:
id
integer patient id
futime
double total follow-up time from date of surgery (years)
status
integer event indicator (0=censored, 1=death)
time
double time of LVMI measurement after surgery (years)
lvmi
double standardised LVMI
age
integer age of patient in years
sex
integer sex of patient (0=male, 1=female)
redo
integer previous cardiac surgery (0=no, 1=yes)
emerg
integer operative urgency (0=elective, 1=urgent or emergency)
dm
integer preoperative diabetes mellitus (0=no, 1=yes)
type
integer type of valve (1=human tissue, 2=porcine tissue)
See Collett (2023)
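The variable list suggests a long format with one row per LVMI measurement, so that futime and status repeat across a patient's rows; this layout is an assumption and should be checked against the data. Under that assumption, a one-row-per-patient data frame for survival analysis can be extracted as sketched below, using a made-up data frame `valve_toy` with hypothetical values.

```r
# Made-up stand-in for valve in the assumed long format: one row per
# LVMI measurement, with futime and status repeated within patient.
valve_toy <- data.frame(
  id     = c(1L, 1L, 1L, 2L, 2L),
  futime = c(4.9, 4.9, 4.9, 2.1, 2.1),
  status = c(0L, 0L, 0L, 1L, 1L),
  time   = c(0.1, 1.2, 3.5, 0.2, 1.0),
  lvmi   = c(0.4, 0.7, 1.1, -0.3, -0.1)
)

# Keep the first row for each patient and only the patient-level columns.
patient_cols <- c("id", "futime", "status")
valve_surv <- valve_toy[!duplicated(valve_toy$id), patient_cols]

# Number of LVMI measurements contributed by each patient.
n_measurements <- table(valve_toy$id)
valve_surv
```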