Basic procedure for building the prediction model

The data we have will be a mixture of two (sometimes three) different populations.
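As a toy illustration of such mixed data (all numbers here are invented, not taken from the real samples), one could simulate measurements drawn from two populations with different means and proportions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: one sample's measurements drawn from a
# mixture of two populations (weights, means, and spreads are invented).
n = 1000
weights = np.array([0.8, 0.2])   # mixing proportions
means = np.array([1.0, 2.0])     # component means
sds = np.array([0.1, 0.3])       # component standard deviations

component = rng.choice(2, size=n, p=weights)
x = rng.normal(means[component], sds[component])

print(x.shape)
```

The overall mean of such data sits between the component means, weighted by the proportions, which is what makes separating the populations nontrivial.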

  • Our goal is to fit the mixture model
  • Get the critical measurement of any given sample: a vector of “density/frequency/proportion” along the possible x coordinates
  • Then use such a vector as the independent variable
  • Build a statistical model with clinical outcomes as the dependent variable
  • Fit the model on the training data and predict on the unknown data
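The "vector of density/frequency/proportion" step above can be sketched as a fixed-bin histogram turned into proportions; the bin range and count here are assumptions for illustration only:

```python
import numpy as np

# Hypothetical sketch: turn one sample's raw measurements into the
# "density/frequency/proportion" feature vector over fixed x bins.
def sample_to_features(values, bins):
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()         # proportions sum to 1

bins = np.linspace(0.5, 4.0, 36)         # 35 bins along the x axis (assumed range)
rng = np.random.default_rng(1)
values = rng.normal(1.0, 0.2, size=500)  # fake sample
features = sample_to_features(values, bins)
print(features.shape)
```

Using the same bins for every sample gives every sample a feature vector of the same length, which is what a downstream statistical model needs.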
  • Challenges exist in this research project:

  • Of the three main clusters, information from the normal group masks the other two
  • Two clinical diagnoses, OLK vs. normal, can have similar group-three cell populations, i.e., a few cells with D.I. > 2.3, or no cells with D.I. > 2.3 at all. These samples were confirmed by further histopathology diagnosis
  • So it is challenging, but this can also become an opportunity
  • Can we show a higher correlation between OLK and OSCC than that between normal and OSCC?
  • Are the measurements we “extract” critical? Does this analysis violate any statistical assumptions?
  • Opportunities exist in this research project, and my hope is to:

  • Get a good fit on data from a given sample
  • Identify critical points (first or second derivative) if any, zero otherwise
  • Build SVM or other models based on these transformed data
  • Evaluate the model with cross-validation, 10-fold or leave-one-out
  • Create a ROC curve for model evaluation
  • Test on an independent dataset (hopefully coming soon)
  • It turns out that there is much I need to work out for this project
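The modeling-and-evaluation hopes above (SVM, cross-validation, ROC) could be sketched with scikit-learn; the feature matrix, sample size, and outcome below are all simulated, not the real clinical data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 35))            # 60 samples x 35 histogram bins (fake)
y = (X[:, 0] + rng.normal(scale=0.5, size=60) > 0).astype(int)  # fake outcome

clf = SVC(kernel="rbf", probability=True)

# 10-fold cross-validated accuracy (LeaveOneOut() could replace cv=10)
acc = cross_val_score(clf, X, y, cv=10).mean()

# ROC AUC from probabilities on the training data (optimistic; illustration only)
probs = clf.fit(X, y).predict_proba(X)[:, 1]
auc = roc_auc_score(y, probs)
print(round(acc, 3), round(auc, 3))
```

For an honest ROC curve the probabilities would need to come from held-out folds or the independent test set, not from refitting on the same data as shown here.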

  • Other than the mixture model, which may not apply
  • I need to choose alternative approaches
  • Fit a predominant simple model and determine the excess, which is caused by other phenomena
  • From the data, we can easily get the CDF and compare the fit to the original data
  • Get the deciles, maybe?
  • Thanks go to David Umbach, but the reality is that this is easier said than done. I need to think about it and understand it more.
  • I found a very interesting PIA at UChicago, and next I need to find out what software they are using and whether it is free. It turns out that the webpage was out of date, and here is the new one. Talked to Lei-Ann and got great information.
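The CDF-and-deciles fallback above is straightforward to sketch; the sample below is simulated and the grid is arbitrary:

```python
import numpy as np

# Sketch of the fallback idea: get the empirical CDF of the data
# (to compare against a fitted simple model) and summarize with deciles.
def empirical_cdf(values, grid):
    values = np.sort(values)
    return np.searchsorted(values, grid, side="right") / len(values)

rng = np.random.default_rng(3)
data = rng.gamma(shape=5.0, scale=0.2, size=1000)  # fake sample

grid = np.linspace(0, 3, 7)
cdf = empirical_cdf(data, grid)
deciles = np.quantile(data, np.linspace(0.1, 0.9, 9))
print(len(deciles))
```

Comparing this empirical CDF against the CDF of the fitted simple model (e.g., with a Kolmogorov-Smirnov-style maximum gap) would quantify the "excess" that the simple model fails to explain.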

    Mixture model by a Canadian

    I think I have found the original post from a Canadian group.

    Besides Peter Macdonald and his contribution to the MIX software, I found a Master’s thesis by Juan Du. And there is the mixdist R package.

    I am amazed to find that mixture models have been intensively studied by so many researchers for so many years (David Dowe, for example). It is often a paradox how deep an individual can go into a single topic. I guess that is how a person becomes an expert.

    Now, with the most superficial approach, I need to clear some basic roadblocks:

  • Detailed properties of the normal and gamma distributions
  • How a gamma distribution approaches a normal distribution
  • The chi-square goodness-of-fit test and its degrees of freedom
  • The general MIX program fits a set of data given “mixparameters” and proposed “kernels”, then comes back with a fit to the “histogram” and a chi-square test of the fit. It should report the parameters of the distributions that make up the mixture. It sounds like a good approach:

  • Start with a set of data coming from a mixture of distributions
  • Fit with MIX or mixdist and assess the fit with a chi-square test
  • Pick whichever model wins and extract the parameters of its component distributions
  • Then restore the mixture distribution with the known parameters (and/or the proportions??)
  • In the end, take the (second) derivatives and finish the data transformation
  • The next topic will be SVM or other clustering procedures for building the prediction model.
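The fit-restore-differentiate steps above can be sketched end to end. MIX/mixdist live in R; here a minimal two-component Gaussian EM in Python stands in for them, on simulated data with invented component parameters:

```python
import numpy as np
from scipy.stats import norm

# Minimal EM for a two-component Gaussian mixture (stand-in for MIX/mixdist).
def em_two_gaussians(x, iters=200):
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()])
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * norm.pdf(x[:, None], mu, sd)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update proportions, means, and standard deviations
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(1.0, 0.1, 800), rng.normal(2.0, 0.3, 200)])
w, mu, sd = em_two_gaussians(x)

# Restore the mixture density on a grid, then take the second derivative
grid = np.linspace(0, 3, 301)
density = (w * norm.pdf(grid[:, None], mu, sd)).sum(axis=1)
second_deriv = np.gradient(np.gradient(density, grid), grid)
print(np.round(np.sort(mu), 2))
```

The recovered proportions and component parameters restore the fitted density, and the second derivative of that density is the transformed representation that would feed the SVM step.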