Thanks to the organizers, and congratulations to everyone who did well in the final results. I have a few questions that I hope will aid in subsequent collaboration:
@Organizers: Would it be possible to get a breakdown of final scores by percept, as done on the leaderboards? This would help us determine to what degree different approaches make sense for each percept.
@Clara/Yuanfang Guan: Would you be willing to share your feature set B, which consists of indicators of quadruplets? (I may have missed it, but I do not see these data anywhere in your Synapse folders.) If I am understanding your description correctly, this is a function of the row number of the molecular descriptors, which raises a question about generalizability to future chemicals.
@IKW AllStars/Rick Gerkin: How exactly did you do your cross-validation to optimize your forest models, and how did your CV results compare to your results on the leaderboards? I'd like to try to replicate some of your findings and explore ensembling your best forest models with some of the best models I found, like ridge regression (a rough sketch of what I have in mind follows). And, just for fun, what does "IKW" stand for?
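For reference, here is a rough sketch of the kind of ensembling I have in mind (scikit-learn; X and y are placeholder feature/target arrays, and all model settings are illustrative starting points, not anyone's actual configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def ensemble_cv_score(X, y, n_splits=5):
    """Cross-validated correlation of a simple ridge + random forest blend.

    X, y are placeholder arrays; the 50/50 average and hyperparameters
    are illustrative, not anyone's actual settings.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        ridge = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        forest = RandomForestRegressor(n_estimators=500, random_state=0)
        forest.fit(X[train_idx], y[train_idx])
        # simple 50/50 average of the two models' test-fold predictions
        blended = 0.5 * ridge.predict(X[test_idx]) + 0.5 * forest.predict(X[test_idx])
        scores.append(np.corrcoef(blended, y[test_idx])[0, 1])
    return float(np.mean(scores))
```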
Looking forward to further discussions about this project!
Both my feature set A and feature set B are included in my single folder.
This feature set can be generalized.
But you misunderstood it; or, more likely, my description was not clear. For example, in my set B, a chemical with the name propane will have the following features labeled as 1: prop, ropa, opan, pane. Of course, a chemical with a longer name will have more features set to 1. Such names were included in the training set.
It can easily be applied to any chemical, as ALL chemicals existing in PubMed have names.
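As a minimal sketch, such quadruplet features can be computed directly from a chemical's name (the helper below is illustrative; the 4-character window matches the propane example and the "quadruplets" mentioned above):

```python
def name_ngram_features(name, n=4):
    """Return the set of consecutive n-character substrings of a chemical
    name; each substring becomes a binary (0/1) feature. The window n=4
    matches the propane example above but is otherwise an assumption."""
    name = name.lower()
    return {name[i:i + n] for i in range(len(name) - n + 1)}

# Features labeled 1 for "propane":
print(sorted(name_ngram_features("propane")))
# -> ['opan', 'pane', 'prop', 'ropa']
```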
I believe my submission was close to the best that could be gotten out of this training data. Tuning may result in slightly better performance by chance, but to substantially improve over this, one would need external data/software.
Thank you, and thanks to all participants for taking part.
Update: my two cents: I think what created the two-standard-deviation gap between me and the second-place team is the 0.2 + 0.8 weighting between the global and individual predictions, because my submission didn't have much of an advantage in sub-challenge 2 for the average predictions. I think the top four in sub-challenge 2 are very similar, with less than 0.1 standard deviations of difference; number five might be slightly further away. But I am sure the organizers will give a better answer to this question from a more statistical/scientific point of view.
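Concretely, the blend is something along the lines of this sketch (NumPy; putting 0.8 on the global average and 0.2 on the individual predictions is one reading of the description above, and the assignment could be the reverse):

```python
import numpy as np

def blend_predictions(individual_pred, weight_individual=0.2):
    """Shrink per-subject predictions toward the global average.

    individual_pred: array of shape (n_subjects, n_samples), one row of
    predictions per subject; the "global" prediction is their mean.
    NOTE: assigning 0.8 to the global term is an assumption; only the
    0.2 + 0.8 split itself is stated in the post above.
    """
    global_pred = individual_pred.mean(axis=0, keepdims=True)
    return weight_individual * individual_pred + (1.0 - weight_individual) * global_pred
```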
Thank you, Yuanfang, for the clarification on your feature set B--I understand it now as a kind of semantic fingerprint, and it is indeed generalizable and clever! I'd like to give it a try. If I am understanding your description correctly, the prediction from this feature set comprises half of your full prediction, which is a distinguishing characteristic of your approach.
I fully agree with the strategy of shrinking the individual predictions toward the mean values--this is a classical approach to reducing mean squared prediction error, which is closely related to Pearson correlation. It would be good to learn how you arrived at the 0.2 and 0.8 weights, and also more details on how you implemented your decision trees.
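For example, one simple way to pick such a weight (not necessarily what was actually done) is to scan a grid and keep the value that maximizes held-out Pearson correlation:

```python
import numpy as np
from scipy.stats import pearsonr

def choose_shrinkage_weight(individual_pred, global_pred, y_holdout):
    """Grid-search the shrinkage weight on held-out data.

    All inputs are 1-D arrays over the same held-out samples. This is an
    illustrative procedure, not the method used in the winning entry.
    """
    best_w, best_r = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, 21):  # candidate weights 0.00, 0.05, ..., 1.00
        blended = w * individual_pred + (1.0 - w) * global_pred
        r, _ = pearsonr(blended, y_holdout)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r
```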
Interesting that both you and Rick Gerkin stuck strictly with tree methods--he appears to have grown some thick random forests.