
Data Analysis and Statistical Inference
Chris Harris
20 April 2015

Summary

Lung cancer has the highest death rate of any cancer across all demographic profiles. Because symptoms often do not appear until it is too late for effective medical intervention, people with a history of smoking or coal mining, genetic markers, or other lung predispositions should consider occasional screening. Using graphical analysis, this study produces compelling evidence that a new treatment, introduced in the late 1970s, shows promise for certain types of lung cancer. Because of the variability across the entire spectrum of lung cancers, however, the difference fails the hypothesis test under the normal model.

Introduction

The purpose of this study is to compare two treatment methods for lung cancer and determine whether there is any rational justification for concluding that the test treatment yields better results than the standard treatment. Both methods involve chemotherapy but apply different drug combinations: a ‘standard’ mix, which serves as the control, and an alternate mix, referred to as ‘novel’ throughout the rest of the paper. This data set is so popular in statistical circles that it has become dissociated from the original supporting article, even though that article was probably a landmark finding in the 1970s. The research front has since moved well beyond this benchmark, as the discussion section reveals, but for now, take a look at the data[1]:

137 obs of 8 var...

V1 = Treatment denotes the type of lung cancer chemotherapy: 1 (standard), 2 (test)
V2 = CellType denotes the type of cell involved: 1 (squamous), 2 (small cell), 3 (adeno), 4 (large)
V3 = Survival is the survival time in days since the treatment
V4 = Status denotes the status of the patient as dead or alive: 1 (dead), 0 (alive)
V5 = Karnofsky is the Karnofsky score: measure of the patient's functional performance status (0 to 100, higher is better)
V6 = Diag is the time since diagnosis in months
V7 = Age is the age in years
V8 = Therapy denotes any prior therapy: 0 (none), 10 (yes)

Since the ultimate goal of any medical procedure is to extend life, the analysis focuses on V1 (Treatment) and V3 (Survival).

Results

Graphical Interpretation

Overall distribution

summary(lung$V3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    25.0    80.0   121.6   144.0   999.0

Segregated distribution

summary(standard$V3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    25.0    97.0   115.1   153.0   553.0

summary(novel$V3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   24.75   52.50  128.21  117.25  999.00
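The `standard` and `novel` objects summarized above are not defined in the text. A minimal sketch of how such a split could be made is shown here, assuming the groups are simply the rows with V1 equal to 1 and 2. The data frame `lung_demo` and its values are hypothetical stand-ins, not the study data.

```r
# Hypothetical stand-in for the `lung` data frame: same column names (V1, V3),
# made-up values, used only to illustrate the subsetting step.
lung_demo <- data.frame(
  V1 = c(1, 1, 1, 2, 2, 2),         # 1 = standard, 2 = test ("novel")
  V3 = c(30, 97, 150, 12, 52, 400)  # survival in days (toy values)
)

# Split by treatment arm
standard_demo <- subset(lung_demo, V1 == 1)
novel_demo    <- subset(lung_demo, V1 == 2)

# Five-number summary plus mean for each arm
summary(standard_demo$V3)
summary(novel_demo$V3)

# The same comparison in a single call, without creating the subsets
tapply(lung_demo$V3, lung_demo$V1, summary)
```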

Graphical Analysis

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     -59.445        1.607

Statistical Interpretation

n.standard + n.novel = N = 137

n.standard      = 69
mean.standard   = 115.1
st.dev.standard = 112.7

n.novel         = 68
mean.novel      = 128.2
st.dev.novel    = 193.8

delta = mean.novel - mean.standard = 128.2 - 115.1 = 13.1

hypothesis null: delta  = 0
           alt:  delta != 0

st.err          = sqrt(193.8^2/68 + 112.7^2/69) = 27.14
marg.err(95)    = 1.96*27.14 = 53.2
pt.est(95)      = 13.1 +/- 53.2
conf.int(95)    = (-40.1,66.3)

pnorm(-13.1,0,27.14) = 0.315
alpha(95-2sides)     = 0.025

pnorm > alpha : fail to reject null : no significance given the variability

check:
Z = (13.1 - 0)/27.14 = 0.483
pnorm(-abs(0.483))   = 0.315
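The hand calculation above can be re-run in R as a check. This is a minimal sketch that uses only the summary statistics quoted in this section; the variable names are illustrative and are not taken from the original analysis.

```r
# Summary statistics as reported in the text above
n_standard <- 69; mean_standard <- 115.1; sd_standard <- 112.7
n_novel    <- 68; mean_novel    <- 128.2; sd_novel    <- 193.8

# Difference in means and its standard error (unpooled, two independent samples)
delta  <- mean_novel - mean_standard
st_err <- sqrt(sd_novel^2 / n_novel + sd_standard^2 / n_standard)

# 95% confidence interval under the normal model
z_crit   <- qnorm(0.975)   # about 1.96
conf_int <- delta + c(-1, 1) * z_crit * st_err

# Test statistic, single-tail probability, and two-sided p-value
z          <- delta / st_err
p_one_tail <- pnorm(-abs(z))
p_two_tail <- 2 * p_one_tail

round(c(delta = delta, st_err = st_err, z = z,
        p_one_tail = p_one_tail, p_two_tail = p_two_tail), 3)
round(conf_int, 1)
```

The single-tail probability is compared against 0.025 in the text; equivalently, the two-sided p-value (about 0.63) can be compared against 0.05, and the decision is the same.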

Discussion

In the graphical analysis section, the data points in each group were sorted in ascending order and plotted on the same graph to show how one data set compared to the other. To take things one step further, the novel data were plotted against the standard data, yielding a linear regression slope of 1.61, which suggests that the novel treatment might be superior to the standard from a graphical perspective. However, when a normal statistical model is applied to the data set, the noise is so great that there is no statistical basis for rejecting the null hypothesis.
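The text does not say how the 69 sorted standard values were paired with the 68 sorted novel values before fitting. One possible approach, sketched here as an assumption rather than as the method actually used, is to compare the two groups at a common set of quantiles and regress one set of quantiles on the other. The vectors `standard_demo` and `novel_demo` hold toy values, not the study data.

```r
# Toy survival times (hypothetical values); the groups differ in size on purpose
standard_demo <- seq(10, 100, by = 10)              # 10 values
novel_demo    <- 2 * seq(10, 100, length.out = 9)   # 9 values, twice as long-lived

# Evaluate both groups at the same probabilities so the points pair up
probs <- seq(0.05, 0.95, by = 0.05)
x <- quantile(standard_demo, probs)
y <- quantile(novel_demo, probs)

# Regress novel quantiles on standard quantiles
fit <- lm(y ~ x)
coef(fit)

# Quantile-quantile plot with the fitted line and the y = x reference line
plot(x, y, xlab = "standard quantiles (days)", ylab = "novel quantiles (days)")
abline(fit)
abline(0, 1, lty = 2)
```

A slope above 1 means the upper quantiles of the novel group stretch further than those of the standard group. It describes the shape of the two distributions and is not by itself a test of significance.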

What could have caused such high variability? Examine V2, described as ‘CellType’, a potential explanatory variable:

table(lung$V1,lung$V2)

##    
##      1  2  3  4
##   1 15 30  9 15
##   2 20 18 18 12

barplot(table(lung$V1,lung$V2))

Without access to the original paper published by the Veteran Administration to establish the experimental design, it looks as though the researchers may have steered the tougher cases toward the novel treatment, dooming its success in the clinical trial. To compare the two treatments fairly, each should have the same distribution of cancer types, consistent with a stratified sampling method. Perhaps one group would respond to one method while another group would respond to a different method, depending on the cancer mechanism, providing a niche application for the novel treatment. With only 8 variables recorded per patient, confounding variables not captured in the data should also be taken into consideration. When a successful outcome occurs, doctors and patients should be interviewed to uncover factors not included in the study.
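Whether the two arms really differ in their mix of cell types can be checked directly from the counts in the table above. The following is a minimal sketch, assuming a chi-square test of independence is an acceptable check; the test is an addition for illustration and is not part of the original analysis. The object names are hypothetical.

```r
# Counts copied from the table(lung$V1, lung$V2) output above
cell_counts <- matrix(
  c(15, 30,  9, 15,
    20, 18, 18, 12),
  nrow = 2, byrow = TRUE,
  dimnames = list(Treatment = c("standard", "novel"),
                  CellType  = c("squamous", "small cell", "adeno", "large"))
)

# Share of each cell type within each treatment arm
round(prop.table(cell_counts, margin = 1), 2)

# Chi-square test of independence between treatment arm and cell type
balance_test <- chisq.test(cell_counts)
balance_test$expected   # counts expected if the arms were perfectly balanced
balance_test
```

A small p-value would indicate that the cell types are not spread evenly across the two arms. A large one would mean the imbalance visible in the bar plot is within the range expected from chance allocation.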

The current literature offers some clues[2]:

“Small-cell lung cancer (SCLC) accounts for 15%–18% of all cases. In recent years the incidence of SCLC has decreased. SCLC is strongly associated with tobacco smoking… Staging [within SCLC] has been performed according to a two-stage system developed by the Veteran Administration Lung Cancer Study Group (VALSG) in the USA dividing patients into limited and extensive disease. Limited disease was defined as tumour tissue that could be encompassed in a single radiation port and extensive disease was defined as any tumour that extended beyond the boundaries of a single radiation port.”

SCLC, a single category in the initial study, has since branched into 2 groups. Furthermore, SCLC grades 1, 2, and 3 (recurrence after remission) each require unique chemotherapy / radiation combinations[3].

Conclusion

Through inferential evidence gathered from literature review, graphic interpretation, and statistical analysis, an experimental design template emerges:

1. Break the cancer types into different levels, based on known mechanisms.
2. Choose a standard approach in each level, which represents the best efficacy for a given level.
3. Try one unique competing method across all the levels, in anticipation some levels reflect improvement, while other levels deteriorate.
4. After each round, choose the best approach for a particular cancer level.
5. When new strategies arise, iterate through steps 1 to 5, recognizing cancer levels themselves could evolve, as patients live longer, treatment resistance develops, or more information becomes available.
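The comparison within each level, followed by the choice of the better approach, can be sketched as follows. This is an illustration under stated assumptions: the data frame `trial_demo` holds one made-up record per treatment and level, mean survival stands in for "best efficacy", and none of the names or values come from the study.

```r
# Hypothetical toy records using the V1/V2/V3 coding from the data description
trial_demo <- data.frame(
  V1 = c(1, 2, 1, 2, 1, 2, 1, 2),              # 1 = standard, 2 = novel
  V2 = c(1, 1, 2, 2, 3, 3, 4, 4),              # cancer level (cell type code)
  V3 = c(100, 150, 40, 30, 60, 90, 120, 110)   # survival in days (toy values)
)

# Mean survival for each treatment within each level (rows = level, cols = treatment)
level_means <- tapply(trial_demo$V3,
                      list(CellType = trial_demo$V2, Treatment = trial_demo$V1),
                      mean)
level_means

# For each level, keep the treatment code with the higher mean survival
best <- colnames(level_means)[apply(level_means, 1, which.max)]
names(best) <- rownames(level_means)
best
```

In a real trial the choice in each level would rest on a significance test with adequate sample size per level, not on a raw difference in means.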

References

[1] Kalbfleisch, J. and Prentice, R., “Veteran Administration Lung Cancer Study Group Data” in “The Statistical Analysis of Failure Time Data”, pp 223-224, Wiley: New York (1980).

[2] M. Sørensen, M. Pijls-Johannesma, and E. Felip, on behalf of the ESMO Guidelines Working Group, “Small-cell lung cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up”, Annals of Oncology 21, Supplement 5: v120–v125 (2010).

[3] Cristian Rapicetta, Sara Tenconi, Tommaso Ricchetti, Sally Maramotti, and Massimiliano Paci, “Ch 13: Surgery in Small-Cell Lung Cancer: Past, Present and Future” in “Lung Diseases - Selected State of the Art Reviews”, Elvisegran Malcolm Irusen, Ed, InTech: Rijeka, Croatia (2012).

Veteran Administration Lung Cancer Study
