Is there a bug in the toxicogenetics challenge leaderboard? We did cross-validation to get an estimate of the RMSE we should obtain on the leaderboard. Our RMSE estimate and the one we get on the leaderboard differ greatly. We argue such a discrepancy is very unlikely. In addition, most teams seem to be in the 0.2983-0.30 range. Again, somewhat unlikely, unless everyone is using the same approach? Finally, the random prediction submission of the organising team achieving an RMSE of 0.0 is just ... weird!
Anyone else sharing our opinion?

Hi Sebastien,
Thanks for pointing this out. We also noticed it and we are looking into it.
By the way, the RMSE of 0 is not a random prediction (as the file name suggests) but a test with the Gold Standard, so a 0 is expected. Incidentally, how are you computing the RMSE? And what value do you typically get in cross-validation?
Thanks!
Gustavo 



I also got both training and cross-validation RMSE values (less than 0.2) much smaller than those on the leaderboard (around 0.3). Do others get something similar?


Most of us have this problem. You might read through this thread for some more information:
http://support.sagebase.org/sagebase/... 

Hi everyone!
I just made my first submission today, and I noticed that the problem Sébastien pointed out 16 days ago still seems to be there. Most RMSEs are within the range 0.29-0.3. Since Gustavo said they are looking into this, would the organizers please clarify whether there is any problem with the calculation of the RMSE on the leaderboard?
Thanks a lot!
Tao 

I think the idea of a bug should be seriously considered.
In my tests, with several types of cross-validation, the better my models appear in cross-validation, the worse they appear on the leaderboard! This goes against everything I have ever worked on and is totally inconsistent with most statistical theory.
I consider 3 distinct possibilities:
a) The data on the leaderboard is markedly different from the training data available, so it is impossible to make reliable inferences
b) The leaderboard engine is broken and the scores provided are meaningless
c) Statistics is wrong
If we dismiss option c), options a) and b) are the only valid explanations. In either case, we the participants cannot do anything. If the data is from a different origin and not randomly selected from the population, it is pointless to attempt any inference. And if the leaderboard engine is broken, we cannot get any reliable feedback on our models.
Thanks for any help on this 

I don't think there is anything wrong with the leaderboard. I think people are underestimating the effect of sampling variability on the RMSE. They are asking us to predict 397 cell lines using 487 cell lines' worth of data, i.e. predict ~1/2 (397/884) of the cell lines using ~1/2 (487/884) of the observations. I continue to say that the *real* problem is the small intracellular IC10 variation relative to the sampling and interpolation noise.
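Andrew's point about sampling variability can be illustrated with a short simulation. This is only a sketch with invented numbers (an assumed per-cell-line error spread of 0.25), not the challenge data:

```r
# Sketch: how much does the RMSE move purely because of which ~400 of
# ~884 cell lines end up in the scored half? All numbers are invented.
set.seed(42)
errors <- rnorm(884, sd = 0.25)   # hypothetical per-cell-line prediction errors

rmse_half <- replicate(1000, {
  idx <- sample(884, 397)         # a random 397-cell-line "test set"
  sqrt(mean(errors[idx]^2))
})

range(rmse_half)                  # spread caused by sampling alone
```

Even so, the spread this produces is on the order of a few hundredths of RMSE, not the 0.1 gap people are reporting.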


Just a little more on this topic, with some concrete information:
My first submission to the leaderboard was a test. The "model" just output the average toxicity for each molecule and applied a tiny random perturbation of 0.1*std to differentiate between cell lines. The RMSE of this trivial model should in fact be about 0.2 for the training data. Now what is bizarre is that for the data in the leaderboard, the RMSE calculated by the leaderboard procedure is 0.3. This does not make sense: either there is a bug or the data come from a different population than the training data.
This effect can easily be confirmed with a little R script. Note that it does not involve any cross-validation and has no perceivable modeling bias.

Hi Tao, Andre and Andrew,
I apologise for my late response; I've been away for the past week. Two weeks ago I exchanged a number of emails with the organisers to verify there was no bug in the leaderboard. They have double-checked the code and are confident the leaderboard is bug-free.
If you are concerned about the computation of the RMSE, Federica told me that "the RMSE that is shown is the one computed on the whole matrix (all compounds and all individuals)". They also kindly computed the RMSE on the training-set predictions of one of our models as a sanity check. We obtained the same numerical result. More on the RMSE can be found here: http://en.wikipedia.org/wiki/Rootmea...
That leaves us with two problems:
1) The discrepancy between the learning data and the testing data, as you and others pointed out: http://support.sagebase.org/sagebase/...
I believe the stratified sampling method used to split the data is the cause here.
2) The low signal-to-noise ratio, as pointed out by Andrew: http://support.sagebase.org/sagebase/...
I believe the challenge can still be saved with minor consequences:
Step 1) Replace the EC10 by the EC50. EC50 values have a higher signal-to-noise ratio: being at the inflection point of the dose/response curve, they are independent of the slope of the curve. The experiment carried out by Andrew also supports that statement.
Step 2) Randomly split the training / leaderboard / final test sets. Some cell lines, for which we knew the EC10, will end up in the final test set. This is not a perfect situation, but since the EC50 cannot easily be inferred from the EC10, this should have a minor effect.
Sébastien Giguère 

You can easily convert an EC10 to an EC50 and vice versa, if you have all of the other parameters of the curve. I'm guessing they have an EC50 file lying around somewhere, because the curves are usually fit with respect to the EC50, and then the EC10 is calculated off this value. Until we see some sort of variability data for these parameters, I don't know if moving to an EC50 will help. I asked a leading question in another forum: if they could define meaningful differences, we could turn this into a classification problem, i.e. predict low/medium/high responses instead of predicting the raw EC10. I messed with this approach a little myself and couldn't get anything of much use, but it seems like this problem would have a greater chance of success.
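One way to sketch that classification idea in R is to bin each compound's values into tertiles. Everything below is illustrative (the toy matrix stands in for the real 487 x 106 training data; the cut points are an assumption, not the organizers' definition of "meaningful differences"):

```r
# Bin a vector of toxicity values into low/medium/high using tertiles
to_classes <- function(x) {
  cut(x,
      breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
      labels = c("low", "medium", "high"),
      include.lowest = TRUE)
}

# Toy stand-in for the training matrix (5 cell lines x 3 compounds)
set.seed(1)
train <- matrix(runif(15), nrow = 5, ncol = 3)

# Apply per compound (column), since toxicity scales differ by compound
classes <- apply(train, 2, function(col) as.character(to_classes(col)))
```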


Hi Sebastien and Andrew,
Andrew might be right about using the EC10, but this is not the main issue; I believe the challenge may still be solvable with the EC50.
But continuing on the above: check this out in R
The data frame 'train' contains the training data (487 x 106). The vector 'means' contains the mean value for each compound, and 'stds' the respective standard deviations.
# test is a copy of train that will be filled with new data
test <- train
# this is deliberately verbose, avoiding R vectorization, to make the point clear
for (i in 1:nrow(test)) {
  for (j in 1:ncol(test)) {
    test[i, j] <- rnorm(1, mean = means[j], sd = 0.1 * stds[j])
  }
}
# now the computation of the RMSE in detailed form
# train is the truth, test is the "modeled" data
SE <- (test - train)^2
MSE <- mean(SE)
RMSE <- sqrt(MSE)
RMSE
[1] 0.2000049
There: RMSE ~ 0.2 for the full training set. How can this same procedure give a score of 0.3 according to the leaderboard scoring engine?
[edit: for some obscure forum-formatting reason the R script does not appear online. Perhaps you may receive it by email. In any case, I can post an image if you want - SOLVED]

Andrew:
The compounds were screened at 8 concentrations (0.33 nM to 92.2 microM) using multi-well plates. They fitted a curve to the measured concentrations and provided the EC10 on that curve, see: https://www.synapse.org/#!Wiki:syn176...
It should be straightforward to replace the EC10 values by EC50 values.
Andre:
In fact, the answer is quite simple. For every compound, the mean toxicity across cell lines in the training data differs a lot from that in the leaderboard. See my comment (the third one) in this discussion: http://support.sagebase.org/sagebase/...
Both problems must be addressed. 

Glad to see all the discussion here. In short, our analyses indicate that the leaderboard is a good representation of the training data; the data are highly comparable across the two. The script used to calculate the RMSE on the leaderboard is correct; there is not a bug.
We are hosting a webinar on Thursday to discuss these issues with all interested participants. Please check out more information on the webinar through this forum conversation:
http://support.sagebase.org/sagebase/...
thanks,
Lara 

Hi Lara,
How is it possible that the script I provided gives totally different results on the training set and the leaderboard? Note that I'm not making any inference; each "prediction" is just the average toxicity of each compound, so it is roughly equal for all instances.
Using this approach, the RMSE for the training set is 0.2; for the leaderboard data it is 0.3. How can a 50% increase occur when no information whatsoever about the individual cell lines is even used? This is important: I'm not talking about prediction models, but about the statistical properties of the two datasets and/or the evaluation procedures.
Only two explanations are possible: either there is a bug or the data come from different populations.
Please try the script I gave, send a submission to the leaderboard, and compare the result you get to the RMSE obtained on the training set.
Thanks for your help! 

Hi again,
This is my last post on this topic, I hope, and I will stop nagging the followers of this thread. I will attend the webinar and wait for the answers.
I'm posting this because I would like others to verify the results I'm getting. I have simplified my previous R script to concentrate on the essentials. Please bear with me as I go through it all in detail; it's only a 10-line script.
1. Read the data:
train <- read.table("ToxChallenge_CytotoxicityData_Train.txt")
lead <- read.table("ToxSubchallenge_1_Leaderboard_Submission_File_Format.txt")
2. Compute the means for all molecules:
means <- colMeans(train)
3. Compute the RMSE for the training set if the predictions were fixed for all cell lines and equal to the mean toxicity of each molecule:
preds <- train
for (i in 1:nrow(preds)) preds[i, ] <- means
SE <- (preds - train)^2
MSE <- mean(SE)
(RMSE <- sqrt(MSE))
This should give a result of 0.1990314, which is the RMSE for the training set if the predictions are fixed and equal to the mean of each compound. There is no stochastic element here, so anyone who tries this should get exactly the same result.
4. With this same procedure one can make a prediction for the leaderboard:
for (i in 1:nrow(lead)) lead[i, ] <- means
write.table(lead, file = "leadeboard_prediction.txt", quote = F)
If you submit this file ("leadeboard_prediction.txt") to the leaderboard, you will receive an RMSE of 0.300629, which means something is wrong. Of course I'm not expecting the RMSEs to be identical, but a 50% increase is unheard of and shows that either the populations (training set and leaderboard data) are distinct or there is a mistake in the computation of the RMSEs.
Best,
Andre 

Read Sébastien's comment in this very thread:
"If you are concerned about the computation of the RMSE, Federica told me that "the RMSE that is shown is the one computed on the whole matrix (all compounds and all individuals)". They also kindly computed the RMSE on the training-set predictions of one of our models as a sanity check. We obtained the same numerical result."
The problem is not with how the RMSE is computed. The problem arises from statistical considerations regarding how the training and leaderboard sets were created and how RMSE penalizes large errors. 

Hi Andrew,
Yes, I read those, and also Lara's reply in which she says that the leaderboard data is good:
"In short, our analyses indicate that the leaderboard is a good representation of the training data; the data are highly comparable across the two. The script used to calculate the RMSE on the leaderboard is correct; there is not a bug."
So, after the test I presented, what exactly is wrong?
(And the RMSE is a fixed procedure; it penalizes larger errors the same way the standard deviation penalizes an outlier.)

Hi again
Yes, I said I would not nag you again, but I believe this is really important. I now have PROOF that something is really wrong with the leaderboard evaluation. It is NOT a data problem; it is an evaluation problem. And I'm pretty sure I nailed it.
I submitted 3 different files with all values equal for each cell line, set to the mean toxicity of each compound, so all rows are identical within each submission. What distinguishes the submissions is that for some selected columns (14, 25, 28, 31, 38, 46, 47, 62, 63, 67, 80, 83, 85, 91 and 101) I multiplied the "prediction" scores by a factor:
The submissions were:
* ch1_sub_pred_fac01.txt: values were not changed
* ch1_sub_pred_fac02.txt: these columns' predictions multiplied by 0.8
* ch1_sub_pred_fac03.txt: these columns' predictions multiplied by 0.5
Now the leaderboard gives these strikingly different predictions THE EXACT SAME SCORE! [You can check this on the leaderboard.] This would be impossible if the evaluation system were running adequately. The result strongly suggests that the leaderboard evaluation system is broken and is not evaluating a large number of predictions.
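For anyone who wants to reproduce this, here is a hedged sketch of how the three files could be generated. The toy 'lead' and 'means' below stand in for the real leaderboard template and per-compound training means from the earlier script; only the column indices and file names come from the post itself:

```r
# Columns suspected of being ignored, as listed in the post
cols <- c(14, 25, 28, 31, 38, 46, 47, 62, 63, 67, 80, 83, 85, 91, 101)
factors <- c(1, 0.8, 0.5)

# Toy stand-ins for the real leaderboard template and training means
set.seed(1)
means <- runif(106)
lead  <- as.data.frame(matrix(0, nrow = 397, ncol = 106))

for (k in seq_along(factors)) {
  sub <- lead
  for (i in 1:nrow(sub)) sub[i, ] <- means   # same mean-based "prediction"
  sub[, cols] <- sub[, cols] * factors[k]    # scale only the selected columns
  write.table(sub,
              file = sprintf("ch1_sub_pred_fac%02d.txt", k),
              quote = FALSE)
}
```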
Thanks for bearing with me.
Now I hope the organizers of this subchallenge notice this. I, and many others, have lost a huge amount of time over this issue.
best,
Andre 

Would it be possible for the organizers to release the leaderboard test data earlier, so that we have more time to investigate the problem?


Andre 
How did you select the 15 columns for which you altered the prediction score? As it turns out, you selected exactly the columns that the organizers determined to contain 'non-toxic compounds'. This means that we do not believe these compounds exhibit cytotoxic effects, and as such we omitted exactly those columns from the RMSE calculations and the mean ranking shown on the leaderboard. Thus, you get exactly the same score regardless of what you predict in those 15 columns.
Lara 

All 
We plan to discuss the data partitioning at tomorrow's webinar. In the meantime, here is a description meant to outline the selection of the training vs test sets. Please bring your comments regarding this to tomorrow's webinar; we will not be answering individual questions regarding this issue within the forum, as it will be more efficient to address everyone's concerns at the same time tomorrow:
Assume we have 90 samples to be subdivided into 60 training and 30 test samples. A clustering of the 90 samples gives us 3 clusters, with EC10 values centered at 0, 1 and 2, and with a distribution of samples as
N(0) = 20
N(1) = 50
N(2)= 20
Now for the test set we take 10 samples from each of the clusters. In this way, the distribution of samples for the training (N_train) and for the test (N_test) are
N_train(0) = 10; N_test(0) = 10;
N_train(1) = 40; N_test(1) = 10;
N_train(2) = 10; N_test(2) = 10;
Therefore, the mean and standard deviation in the train and test sets are
mean_train=1; mean_test=1
stdev_train=0.577; stdev_test=0.816
If we create a prediction using the mean of the training set (plus/minus a small error), on average the RMSE on the test set will exceed the one on the training set by roughly
stdev_test - stdev_train = 0.816 - 0.577 ~ 0.24.
However, this doesn't mean that there is not enough information in the training about the test. Simply that the distributions are not the same.
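The worked example above can be checked with a few lines of R. This is a sketch using population-style standard deviations (dividing by N rather than N - 1), which is what matches the 0.577 / 0.816 figures quoted:

```r
# Three clusters of EC10 values centred at 0, 1 and 2
train <- rep(c(0, 1, 2), times = c(10, 40, 10))  # N_train = 60
test  <- rep(c(0, 1, 2), times = c(10, 10, 10))  # N_test  = 30

# Population-style standard deviation (divide by N, not N - 1)
sd_pop <- function(x) sqrt(mean((x - mean(x))^2))

mean(train)    # 1
mean(test)     # 1
sd_pop(train)  # 0.577
sd_pop(test)   # 0.816

# Predicting the training mean everywhere:
rmse_train <- sqrt(mean((train - mean(train))^2))  # 0.577
rmse_test  <- sqrt(mean((test  - mean(train))^2))  # 0.816
rmse_test - rmse_train                             # ~0.24
```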
Lara
on behalf of the Challenge organizers 

Hi Lara,
Thank you for your feedback.
To answer your question: I suspected, and still do, that the leaderboard scoring is wrong. I had an idea that the problem had to do with the way it evaluates the data, and the only information I had was the toxicity distribution, so I started with the compounds that seemed the most easily isolated (mean > 1.9). It was by luck that the system is apparently not accounting for them.
Now in your answer there is some troubling information. You mention that you are not accounting for those molecules, but that information was never given to the participants! So in fact there are not 106 compounds but rather 91! This issue alone changes many things, and it is disturbing to learn it only now. I certainly would have changed my approach had I known this (as would others).
This also raises the suspicion of what else is not being accounted for. Perhaps some cell lines are "outliers" and also excluded? Or are there other "weird" molecules?
On your second post: I hope the sampling for the test set was not done like that, as that procedure is obviously flawed! If you take 10 samples from each cluster, you over-represent clusters N(0) and N(2) and under-represent N(1) in the test set! This means that the test set is biased, which severely limits the analyses and inferences that can be made with the training data.
Random sampling of all instances would naturally have been preferable, but if cluster-based sampling is desired, the correct procedure is as follows. As you are using a 1/3 sampling fraction, the correct allocation would be:
N_test(0) = 6 (or 7 because of rounding)
N_test(1) = 16 (or 17, likewise)
N_test(2) = 6 (or 7, likewise)
Notice that the same proportions as in the original data are maintained in the test set, as well as the mean and standard deviation!
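A hedged sketch of that proportional allocation in R, using the same three-cluster example:

```r
# 90 samples in clusters of sizes 20/50/20, drawing a 1/3 test fraction
pop <- rep(c(0, 1, 2), times = c(20, 50, 20))
sd_pop <- function(x) sqrt(mean((x - mean(x))^2))  # population-style sd

# Proportional allocation: each cluster contributes ~1/3 of its members
n_test <- round(c(20, 50, 20) / 3)     # 7 17 7
test   <- rep(c(0, 1, 2), times = n_test)

c(mean(pop), sd_pop(pop))      # 1, ~0.667
c(mean(test), sd_pop(test))    # 1, ~0.672  (proportions preserved)
```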
Please check Mittag's "Statistical Methods of Quality Assurance" in which these aspects are well covered.
To conclude: apparently some data were excluded from the evaluation (without the participants knowing), and a flawed sampling procedure was used to separate the test and training sets. This is getting more and more frustrating, and I'm starting to regret the investment I have made in this challenge.
Andre 

Some more aspects of this issue.
On further consideration, what your answer implies is that the bias between the test and training sets may simply impair any possible inference if we have no idea of the priors!
To get an idea of the severity of the problem, take this example. Imagine you are trying to predict a genetic disease with a 1:100 prevalence. If your population has 20,000 cases, you should have roughly 200 with the disease and 19,800 without it. Now suppose you take a sample of 100 individuals for testing from this population as has been suggested: you would get
N(Pop_Train_Disease)=150
N(Pop_Train_NoDisease)=19750
N(Pop_Test_Disease)=50
N(Pop_Test_NoDisease)=50
As you may imagine, it is simply not possible to predict the test set with the training set (or the reverse, by the way), particularly if the experimenters are not aware of this fact! The minuscule errors in precision and recall will be multiplied 100-fold and will impair any type of inference. One can have an almost perfect model and the prediction results on the test set will still be dismal. There is unfortunately no way around this.
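The prevalence shift in the numbers above can be made explicit with a couple of lines (just the example's arithmetic, nothing more):

```r
# Population: 1:100 prevalence among 20,000 cases
pop_diseased <- 200
pop_healthy  <- 19800

# Test set sampled as in the example: 50 diseased, 50 healthy
test_diseased <- 50
test_healthy  <- 50

prev_pop   <- pop_diseased / (pop_diseased + pop_healthy)        # 0.01
prev_train <- (pop_diseased - test_diseased) /
              (pop_diseased + pop_healthy - 100)                 # ~0.0075
prev_test  <- test_diseased / (test_diseased + test_healthy)     # 0.5

prev_test / prev_train   # the test prevalence is ~66x the training one
```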
Andre 

Hi Andre
I have the same problem. I use cross-validation to optimize my model, but it performs worse than a random one. All in all, my random model performs best in all my tries.

Hi Andre
Your point is well made! There is an obvious problem with the method used to split the whole dataset into test and training sets.
Lara, would you please clarify what you mean by "clustering"? Based on what features did you do the clustering? How many clusters did you find? Are the EC10s for each cluster the pooled EC10s of all drugs?
Could the organizers release the scripts used for the clustering/selection and the script used to calculate the RMSE? That way we could figure out for ourselves the details missing from the challenge description. And I really hope we can get the leaderboard test data sooner.
Tao 

I disagree that Andre's example is impossible to predict.
In the training phase, the important thing is to have some diseased and some non-diseased subjects, and some data that allow discrimination (at least somewhat) between the two classes. A predictive model can then be built from these data, and it does not seem critical whether the training set as a whole is a random sample of the whole population, provided the diseased and non-diseased sets capture a reasonable amount of the variation in the two groups in the population.
In the prediction phase, a prediction can be made separately for each individual, so the composition of the test set does not seem relevant at all; hence I have no problem with compositional bias in the test set.
The question for me is whether the training set contains enough information to discriminate between diseased and nondiseased individuals. 

Jonathan,
The reverend Thomas Bayes would strongly disagree with you, and so do I.
We have written a paper on that exact topic in which we demonstrate why that approach is not correct and propose solutions to address it, but only if the priors are known, which in our case were inferred from the literature:
Martins, I.F., Teixeira, A.L., Pinheiro, L., Falcao, A.O. 2012. A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling 52 (6), 1686-1697 [http://pubs.acs.org/doi/abs/10.1021/c...]
best
Andre
[edit: and if you do not trust me, at least trust Richard Duda, who wrote the reference work on pattern classification: [http://eu.wiley.com/WileyCDA/WileyTit...] The first 2 chapters prove exactly what I said]

Andre, your paper looks very interesting. I certainly take your point that bias in the training set will lead to a model with less than the theoretical maximum predictive power, and that, if the aim is to fit a model to the training data which can best predict for the test set, then they should be compositionally similar. I'm still not sure I agree that all power will be lost, but I guess that depends on the bias and variance.
My thought is that there is a wider aim here, perhaps unspoken and not captured in the leaderboard metric, which is to make a model which is more generally applicable, so that it has potential to be applied to arbitrary test subjects. Alternatively, to answer the question of whether such a model is possible given the data in hand.
I think we are in agreement that the composition of the training set in either case is a key question. 

Lara and the organisers:
You just confirmed the point I made 27 days ago. All this time you denied there could be a flaw in the split of the data. Worse, you misled us by giving false answers to our questions. This is not the scientific way of running a competition, and it is inexcusable.
Andre and others:
I do advocate for randomly splitting the data.
Bin:
We also have that problem: during cross-validation, the best model is the one predicting the average for each compound. I suspect all participants have this problem. I proposed two solutions (http://support.sagebase.org/sagebase/...) that may alleviate the low signal-to-noise ratio at the origin of the problem.
Andre is right regarding learnability when the distributions differ. Any discrepancy between the training and test sets makes the problem much harder to learn. For the sake of transparency, the organisers should redo the splits using a simple random approach.
Finally,
I urge the organisers of the competition to take our comments more seriously; in its current state, the competition is not looking good. I also feel betrayed at having been intentionally misinformed and at having information withheld. I too have invested a lot of time and will regret that investment if no major changes are made to the competition.

There is a webinar today where I'm hoping the organizers will address these concerns. Let's all try to keep a cool head until then.




Yes, basically what Lara told us was correct, and so were my suspicions. About this issue:
a) They are going to use additional metrics to better discriminate the entries on the leaderboard
b) Forget all compounds with average values > 1.9 (the columns I mentioned above); they are not being accounted for
c) The sampling was not random, and the test set is biased relative to the original population. The good news is that the final test set has the same properties as the leaderboard set, so this gives us room for improvement as soon as the leaderboard data are released on Aug 30
All in all, it was a nice webinar, and I believe the most critical aspects of this challenge were clarified.



Here's a link to the webinar: https://www.youtube.com/watch?v=rQ7NO4...
More information by email later today. We are actively working on updating the leaderboard scoring harness to include the additional scoring metrics discussed on yesterday's call. Leaderboard will be open until August 30th at noon pacific, for everyone to assess their models with these new metrics. The leaderboard data will be released at that time.
Lara 

Hi Lara
Thanks for the webinar yesterday! I noticed that the leaderboard is now ranked by a z-score. Could you please give us some details on how this z-score is calculated and how predictions are ranked?
Tao 

Tao 
The leaderboard has now been updated to include the following metrics:
- Mean ranking based on the RMSE
- Overall RMSE
- Mean ranking based on the coefficient of determination
- Mean coefficient of determination
- Mean ranking based on Pearson correlation
- Mean Pearson correlation
We are no longer including the z-score because we did not find it to be as informative.
Lara 

Hi Lara
Thanks for the update! I just have one more question. Which of the 6 metrics will be used to rank the predictions?
Tao 

Tao 
Apologies for the delayed response; this is an active topic of internal discussion. The final scoring will not directly mirror the metrics provided in the leaderboard, although those metrics will be taken into consideration. The leaderboard is intended as a mechanism to help participants gauge the predictive power of their models, not to demonstrate the exact method of final scoring.
Lara 