
“Only those samples which fit the model best in cross validation were included” (whistleblower) “I suspect that we likely disagree with what constitutes validation” (Potti and Nevins)

[Image: “toilet fireworks,” by stephenthruvegas on Flickr]

Potti (part 2)

So it turns out there was an internal whistleblower in the Potti scandal at Duke after all (despite denials by the Duke researchers involved). It was a medical student, Brad Perez. It’s in the Jan. 9, 2015 Cancer Letter. Ever since my first post on Potti last May (part 1), I’ve received various e-mails and phone calls from people wishing to confide their inside scoops and first-hand experiences working with Potti (in a statistical capacity), but I was waiting for some published item. I believe there’s a court case still pending (anyone know?).

Now here we have a great example of something I am increasingly seeing: challenges to the scientific credentials of data analysis are dismissed as mere differences in statistical philosophies, or as understandable disagreements about the stringency of data validation.[i] This is further enabled by conceptual fuzziness as to what counts as meaningful replication, validation, or legitimate cross-validation.
If so, then statistical philosophy is of crucial practical importance.[ii]
Here’s the bulk of Perez’s memo (my emphasis in bold), followed by an even more remarkable reply from Potti and Nevins.

Bradford Perez Submits His Research Concerns

The Med Student’s Memo

This document was written by Bradford Perez, then a third-year medical student, in late March or early April 2008. Working in the laboratory of Anil Potti, Perez presented what biostatisticians describe as an excellent critique of the flawed methodology employed by Duke genomics researchers.

I want to address my concerns about how my research year has been in the lab of Dr. Anil Potti. As a student working in this laboratory, I have raised my serious issues with Dr. Potti and also with Dr. Nevins in order to clarify how I might be mistaken. So far, no sincere effort to address these concerns has been made and my concerns have been labeled a “difference of opinion.” I respectfully disagree. In raising these concerns, I have nothing to gain and much to lose.

In fact, in raising these concerns, I have given up the opportunity to be included as an author on at least 4 manuscripts. I have also given up a Merit Award for a poster presentation at this year’s annual ASCO meeting. I have also sacrificed 7 months of my own hard work and relationships that would likely have helped to further my career. Making this decision will make it more difficult for me to gain a residency position in radiation oncology. …

I joined the Potti lab in late August of last year and I cannot tell you how excited I was to have the opportunity to work in a lab that was making so much progress in oncology. The work in the laboratory uses computer models to make predictions of individual cancer patient’s prognosis and sensitivity to currently available chemotherapies. It also works to better understand tumor biology by predicting likelihood of cancer pathway deregulation. Over the course of the last 7 months, I have worked with feverish effort to learn as much as possible regarding the application of genomic technology to clinical decision making in oncology. As soon as I joined the lab, we started laying the ground work for my own first author publication submitted to the Journal of Clinical Oncology and I found myself (as most students do) often having questions about the best way to proceed. The publication involved applying previously developed predictors to a large number of lung tumor samples from which RNA had been extracted and analyzed to measure gene expression. Our analysis for this project was centered on looking at differences in characteristics of tumor biology and chemosensitivity between males and females with lung cancer. I felt lucky to have a mentor who was there in the lab with me to teach me how to replicate previous success. I believed the daily advice on how to proceed was a blessing and it was helping me to move forward in my work at an amazingly fast rate. As we were finishing up the publication and began writing the manuscript, I discovered the lack of interest in including the details of our analysis. I wondered why it was so important not to include exactly how we performed our analysis. I trusted my mentor because I was constantly reminded that he had done this before and I didn’t know how things worked. We submitted our manuscript with a short, edited methods section and lack of any real description for how we performed our analysis. I felt relieved to be done with the project, but I found myself concerned regarding why there had been such a pushback to include the details of how we performed our analysis. An updated look at previous papers published before I joined the lab showed me that others were also concerned with the methods of our lab’s previous analyses. This in conjunction with my mentor’s desire to not include the details of our analysis was very concerning. I received my own paper back with comments from the editor and 4 reviewers. These reviewers shared some criticisms regarding our findings and were concerned about the lack of even the option to reproduce our findings since we had included none of the predictors, software, or instructions regarding how we performed this analysis. The implication in the paper was that the study was reproducible using publicly available datasets and previously published predictors even though this was not the case. While I still maintained respect for my mentor’s experience, I felt strongly that we needed to include all the details. Ultimately, I decided that I was not comfortable resubmitting the manuscript even with a completely transparent methods section because I believe that we have no way of knowing whether the predictors I was applying were meaningful. In addition to the red flags with regard to lack of transparency that I mentioned already, I would like to share some of the reasons that I find myself very uncomfortable with the work being done in the lab.

When I returned from the holidays after submitting my manuscript, I started work on a new project to develop a radiation sensitivity predictor using methods similar to those previously developed. I realized for the first time how hard it was to actually meet with success in developing my own prediction model. No preplanned method of separation into distinct phenotypes worked very well. After two weeks of fruitless efforts, my mentor encouraged me to turn things over to someone else in the lab and let them develop the predictor for me. I was gladly ready to hand off my frustration with the project but later learned the methods of predictor development to be flawed. Fifty-nine cell line samples with mRNA expression data from NCI-60 with associated radiation sensitivity were split in half to designate sensitive and resistant phenotypes. Then in developing the model, only those samples which fit the model best in cross validation were included. Over half of the original samples were removed. It is very possible that using these methods two samples with very little if any difference in radiation sensitivity could be in separate phenotypic categories. This was an incredibly biased approach which does little more than give the appearance of a successful cross validation. While this predictor has not been published yet, it was another red flag to me that inappropriate methods of predictor development were being implemented.

After this troubling experience, I looked to other predictors which have been developed to learn if in any other circumstances samples were removed for no other reason than that they did not fit the model in cross validation. Other predictors of chemosensitivity were developed by removing samples which did not fit the cross validation results. At times, almost half of the original samples intended to be used in the model are removed. Once again, this is an incredibly biased approach which does little more than give the appearance of a successful cross validation. These predictors are then applied to unknown samples and statements are made about those unknowns despite the fact that in some cases no independent validation at all has been performed.
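
[To see how strongly this kind of sample culling biases things, here is a minimal simulation sketch of my own; the classifier, sample sizes, and data are stand-ins, not the lab’s actual code. Even when the “genes” are pure noise and the “sensitive”/“resistant” labels are arbitrary, keeping only the samples that cross validation classifies correctly and then re-running the cross validation on the survivors yields accuracy well above chance.]

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))   # 60 samples x 500 "genes" of pure noise
y = np.repeat([0, 1], 30)        # arbitrary "sensitive" / "resistant" labels

clf = NearestCentroid()

# Honest cross validation on all samples: accuracy hovers around chance (~0.5).
print("CV accuracy, all samples:", cross_val_score(clf, X, y, cv=5).mean())

# Biased procedure: keep only the samples the cross-validated model got right...
pred = cross_val_predict(clf, X, y, cv=5)
keep = pred == y
print("samples kept:", keep.sum(), "of", len(y))

# ...and report a fresh cross validation on the survivors. The apparent accuracy
# is now well above chance, although the data never contained any signal.
print("CV accuracy, culled samples:",
      cross_val_score(clf, X[keep], y[keep], cv=5).mean())
```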

A closer look at some of the other methods used in the development of the predictors is also concerning. Applying prior multiple T-tests to specifically filter data being used to develop a predictor is an inappropriate use of the technology as it biases the cross validation to be extremely successful when the T-tests are performed only once before development begins. This bias is so great that accuracy exceeding 90% can be achieved with random samples. …
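
[The same point can be made for the t-test filtering Perez describes here, again with a toy sketch of my own rather than the lab’s code: on pure noise, screening the “genes” with t-tests on the full data set before cross validating gives accuracies in the neighborhood of the 90% figure he mentions, while performing the identical screening inside each training fold brings the estimate back to chance.]

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5000))   # 60 samples x 5000 "genes" of pure noise
y = np.repeat([0, 1], 30)

# Biased: pick the 20 most "significant" genes using every sample's label first,
# then cross validate using only those genes.
_, pvals = ttest_ind(X[y == 0], X[y == 1])   # one t-test per gene
top = np.argsort(pvals)[:20]
biased = cross_val_score(NearestCentroid(), X[:, top], y, cv=5).mean()

# Honest: the same filter, but refit within each training fold of the CV.
pipe = make_pipeline(SelectKBest(f_classif, k=20), NearestCentroid())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print("CV accuracy, genes filtered before CV:", biased)   # far above chance
print("CV accuracy, genes filtered inside CV:", honest)   # about 0.5
```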

My efforts in the lab have led me to have concerns about the robustness of these prediction models in different situations. Over time, different versions of software which apply these predictors have been developed. In using some of the different versions of software, I found that my results were drastically different despite the fact that I had been previously told that the different versions of the classifier code yielded almost exactly the same results. The results from the different versions are so drastically different that it is impossible for all versions to be accurate. Publications using different versions have been published and predictions are claimed to be accurate in all circumstances. If a predictor is being applied in a descriptive study or in a clinical trial for any reason, it should be confirmed that the version of software that is being used to apply that predictor yields accurate predictions in independent validation.

….Some other predictors which have been developed in the lab claim to predict likelihood of tumor biology deregulation. The publication which reports the development of these predictors was recently accepted for publication in JAMA. The cancer biology predictors were developed by taking gene lists from prominent papers in the literature and using them to generate signatures of tumor biology/microenvironment deregulation. The problem is in the methods used to generate those predictors. A dataset consisting of a conglomerate of cancer cell lines (which we refer to as IJC) was used for each predictor’s development; an in-house program, Filemerger, was used to bring the gene list of the IJC down to include only the relevant genes for a given predictor. At that point, samples were sorted using hierarchical clustering and then removed one by one and reclustered at each step until two distinct clusters of expression were shown. This step in and of itself biases the model to work successfully in cross validation although an argument could be made that this is acceptable because the gene list is already known to be relevant. The decision regarding how to identify one group of samples as properly regulated and the other as deregulated is where the methods become unclear. There is no way to know if the phenotypes were assigned appropriately, backwards, or if the two groups accurately represent the two phenotypes in question at all. …
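
[The cluster-then-cull step invites the same demonstration. The sketch below is my own stand-in, not Filemerger or the lab’s pipeline: starting from pure noise, cluster the samples into two groups, repeatedly discard whichever sample fits its cluster worst, and recluster. The “distinctness” of the two clusters, measured here by the silhouette score, tends to climb steadily, so the procedure will eventually display two seemingly distinct clusters whether or not any real structure exists.]

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 20))    # 80 samples x 20 "relevant genes", no real clusters

clusterer = AgglomerativeClustering(n_clusters=2)
labels = clusterer.fit_predict(X)
print("silhouette, all samples:", round(silhouette_score(X, labels), 3))

# Greedily remove the worst-fitting sample and recluster, forty times (half the data).
for _ in range(40):
    worst = np.argmin(silhouette_samples(X, labels))
    X = np.delete(X, worst, axis=0)
    labels = clusterer.fit_predict(X)

print("silhouette, after culling half:", round(silhouette_score(X, labels), 3))
```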

After an earlier publication which claimed to make extremely accurate predictions of chemosensitivity (Potti et al., Nature Medicine, 2006), I think that it was assumed that it was easy to generate predictors. More recent events have shown that the methods were more complicated and perhaps different than first described. Given the number of errors that have already been found and the contradicting methods for this paper that have been reported, I think it would be worthwhile to attempt to replicate all the findings of that paper (including methods for development AND claimed validations) in an independent manner. More recently, when we’ve met with trouble in predictor development we’ve resorted to applying prior multiple T-tests or simply removing multiple samples from the initial set of phenotypes as we find that they don’t fit the cross validation model. These methods which bias the accuracy of the cross validation are not clearly (if at all) reported in publications and in most situations the accuracy of the cross validation is being used as at least one measure of the validity of a given model. …

At this point, I believe that the situation is serious enough that all further analysis should be stopped to evaluate what is known about each predictor and it should be reconsidered which are appropriate to continue using, and under what circumstances. By continuing to work in this manner, we are doing a great disservice to ourselves, to the field of genomic medicine, and to our patients. I would argue that at this point nothing should be taken for granted. All claims of predictor validations should be independently and blindly performed. …

I have had concerns for a while; however, I waited to be absolutely certain that they were grounded before bringing them forward. As I learn more and more about how analysis is performed in our lab, the stress of knowing these problems exist is overwhelming. Once again, I have nothing to gain by raising these concerns. In fact, I have already lost. …

Read the full memo by Perez here. My earlier post, “What have we learned from the Anil Potti training and test data fireworks,” is here.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In their remarkable letter (in reply to Perez) below, Potti and Nevins admit that they had conveniently removed data points that disagreed with their model. In their view, however, the cherry-picked data that do support their model give grounds for ignoring the anomalies. Since the model checks out in the cases it checks out, it is reasonable to ignore those annoying anomalous cases that refuse to get in line with their model.[ii]

 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Extracts from:
“Nevins and Potti Respond To Perez’s Questions and Worries” (the full letter is here)

Dear Brad,

We regret the fact that you have decided to terminate your fellowship in the group here and that your research experience did not turn out in a way that you found to be positive. We also appreciate your concerns about the nature of the work and the approaches taken to the problems. While we disagree with some of the measures you suggest should be taken to address the issues raised, we do recognize that there are some areas of the work that were less than perfect and need to be rectified.

…….. I suspect that we likely disagree with what constitutes validation…..

We recognize that you are concerned about some of the methods used to develop predictors. As we have discussed, the reality is that there are often challenges in generating a predictor that necessitate trying various methods to explore the potential. Clearly, some instances are very straightforward such as the pathway predictors since we have complete control of the characteristics of the training samples. But, other instances are not so clear and require various approaches to explore the potential of creating a useful signature including in some cases using information from initial cross validations to select samples. If that was all that was done in each instance, there is certainly a danger of overfitting and getting overly optimistic prediction results. We have tried in all instances to make use of independent samples for validation, which then puts the predictor to a real test. This has been done in most such cases but we do recognize that there are a few instances where there was no such opportunity. It was our judgment that since the methods used were essentially the same as in other cases that were validated, it was then reasonable to move forward. You clearly disagree and we respect that view but we do believe that our approach is reasonable as a method of investigation.

……We don’t ask you to condone an approach that you disagree with but do hope that you can understand that others might have a different point of view that is not necessarily wrong.

Finally, we would like to once again say that we regret this circumstance. We wish that this would have worked out differently but at this point, it is important to move forward.

Sincerely yours,

Joseph Nevins

Anil Potti

[i] I have recently received letters from people who tell me that any attempt to improve on statistical methodology, or to critically evaluate, in a serious manner, people’s abuses of statistical concepts, is an utter waste of time and tantamount to philosophical navel gazing. Why? Because everyone knows, according to them, that statistics is just so much window dressing and that political/economic expediency is what drives kicking data into line to reach their pre-ordained conclusions. On this view, criticizing Potti and Nevins falls under the navel-gazing umbrella. I wonder how these people would feel were they the ones who signed up for personalized trials based on Potti and Nevins’ cutting-edge model.

Here’s my reply to them: If you want to come out as a social constructivist (it’s all a matter of social negotiation), data nihilist, dadaist, irrationalist, fine. But if you put up your shingle purporting to be a statistical advisor or reformer, as someone who deserves to criticize other people’s interpretations of tests, as one who might issue a ‘friend of the court’ brief to the Supreme Court on interpreting statistical significance tests–hiding your true data nihilism–then you’re being dishonest, misleading, and acting in a fraudulent manner. See my response to a comment by Deirdre McCloskey.

[ii] The model, to them, seems plausible; besides, they’re trying for a patent, and it’s only going to be the basis of your very own “personalized” cancer treatment!

A Timeline of The Duke Scandal

The Cancer Letter’s Previous Coverage 

My earlier Potti Post.


Filed under: evidence-based policy, junk science, PhilStat/Med Tagged: Potti scandal
