Example for cross-validation

In this example we have data on obesity, inactivity, and diabetes with 354 data points; with these data we are going to plot a 3D graph.

Test error: this method divides the data into 5 equal parts (folds) so that each part can be held out in turn, giving a more reliable estimate of accuracy. We use this test-error method to evaluate the obesity, inactivity, and diabetes data.

labelleddata = Partition[Flatten[Riffle[Range[Length[data]], data]], 4, 4];
labelleddata[[1 ;; 5]]

Here we split the labelled data into five random samples of 71 points each:

L = labelleddata;
n = 1;
While[n <= 4,
 sample[n] = RandomSample[L, 71];
 L = Complement[L, sample[n]];
 n++];
sample[5] = L;

datasample[n_] := sample[n][[All, {2, 3, 4}]]

We now establish 5 sets of training data and test data:

Alldatasamples = Table[sample[k], {k, 1, 5}];

TrainingData[k_] := Partition[Flatten[Drop[Alldatasamples, {k}]], 4, 4][[All, {2, 3, 4}]]
TestData[k_] := datasample[k]

Estimating prediction error

prediction error (%) = (measured value − predicted value) / measured value × 100.
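The formula above can be sketched in a few lines of Python (the function name and the example values are hypothetical, chosen only for illustration):

```python
def prediction_error_pct(measured, predicted):
    """Percent prediction error: (measured - predicted) / measured * 100."""
    return (measured - predicted) / measured * 100.0

# Hypothetical example: a measured diabetes rate of 10.0% vs. a predicted 9.2%
error = prediction_error_pct(10.0, 9.2)
print(round(error, 2))  # 8.0
```

A positive value means the model under-predicted; a negative value means it over-predicted.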

Example of estimating the prediction error

To predict heart disease, take data on a person's chest pain, blood circulation, blocked arteries, and weight. To model these data we can use the k-nearest-neighbors (KNN) algorithm, support vector machines, and many other machine-learning methods; cross-validation tells us which method is most suitable by repeatedly splitting the data into train and test sets. In this example we divided the data into 4 blocks, so it is called four-fold cross-validation; in this way we can estimate how well each method performs.

CROSS VALIDATION AND ITS TYPES

CROSS VALIDATION: It is a statistical technique used in machine learning and data analysis to estimate the accuracy of a model and how well it will predict future data, which helps detect underfitting and overfitting.

Leave-one-out cross-validation: In this validation one data point is taken as the test set and the remaining points form the training set. For example, if we have 100 data points, in the first round the first point is the test set and the remaining 99 points are the training set; the same process is repeated for every point. It consumes a lot of time, and models chosen this way can still show low accuracy and high error on genuinely new data.
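As a quick illustration (in Python, since the concept is language-agnostic), here is a minimal leave-one-out loop; the "model" is simply the mean of the training points, an assumption made only to keep the sketch short:

```python
def loocv_mse(data):
    """Leave-one-out cross-validation: each point is the test set exactly once."""
    errors = []
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]   # all points except the i-th
        test = data[i]                    # the single held-out point
        prediction = sum(train) / len(train)  # toy "model": mean of training set
        errors.append((test - prediction) ** 2)
    return sum(errors) / len(errors)      # mean squared error over n fits

print(loocv_mse([1.0, 2.0, 3.0, 4.0]))
```

Note that with n points the model is refit n times, which is why the method is slow on large datasets.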

K-Fold cross-validation: In this validation a value k is chosen and the dataset is divided into k equal folds. The model is then trained and evaluated k times, each time holding out a different fold as the test set; the mean of the k accuracy scores gives the cross-validated estimate.
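The fold bookkeeping can be sketched as follows (a minimal Python version; the helper name is hypothetical, and the fold sizes plus the averaging of the k scores are the essential points):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Each fold serves as the test set once; the remaining folds are training data.
for test_fold in kfold_indices(10, 5):
    train = [i for i in range(10) if i not in test_fold]
    # ... fit on `train`, score on `test_fold`, then average the k scores
```

In practice the indices are usually shuffled first; contiguous folds are used here only to keep the example deterministic.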

Stratified K-Fold cross-validation: This validation is similar to k-fold cross-validation, but it ensures that each fold has a class distribution similar to the original dataset. This validation is useful for imbalanced datasets.

Time-series cross-validation: In this validation future data are predicted from current data, keeping the time order intact. For example, with stock-price data, if we have the previous 5 days of prices and need to predict the next day's price, time-series cross-validation is used: each training window contains only days that come before the test day.
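A walk-forward (expanding-window) split can be sketched like this; the prices and the naive "tomorrow = today" forecast are assumptions for illustration only:

```python
def expanding_window_errors(series, min_train):
    """Walk-forward evaluation: train on days 0..t-1, predict day t.

    The forecast is a naive last-value model (an assumption for brevity);
    any real time-series model could be substituted."""
    errors = []
    for t in range(min_train, len(series)):
        forecast = series[t - 1]           # predict tomorrow = today
        errors.append(series[t] - forecast)
    return errors

# Hypothetical closing prices for 6 consecutive days
prices = [101.0, 102.5, 101.8, 103.2, 104.0, 103.5]
print(expanding_window_errors(prices, min_train=3))
```

Unlike ordinary k-fold, no future observation ever leaks into the training window.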

T-Test and crab molt model

T-Test: It is used to determine whether a particular variable is statistically significant in the model by comparing the means of groups of samples. There are three common types of T-Tests:

(1) One-sample T-Test

(2) Two-sample T-Test

(3) Two-sample paired T-Test

The Crab Molt model looks at measurements of crabs before and after molting. Our main goal was to predict crab size before molting based on post-molt measurements. Using a simple model, we obtained an impressive R-squared value of 0.98, indicating the model predicts well from these data. We also analyzed the pre-molt and post-molt data: they were similar in distribution, with a mean difference of about 14.7 units.
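Since pre-molt and post-molt sizes come from the same crabs, the paired T-Test is the relevant one; its statistic can be sketched in Python as below. The sizes shown are hypothetical numbers (the real crab data are not reproduced here), chosen only to mirror a mean difference near the 14.7 units reported above:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(before, after):
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = after - before."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical pre-molt and post-molt crab sizes (mm)
pre = [113.6, 118.1, 142.3, 125.4, 135.0]
post = [127.7, 133.2, 156.1, 140.3, 150.5]
print(round(paired_t_statistic(pre, post), 2))
```

A large |t| (compared against the t distribution with n − 1 degrees of freedom) indicates the mean pre/post difference is statistically significant.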


QUADRATIC MODEL AND OVERFITTING

QUADRATIC MODEL: A quadratic model, also known as a quadratic equation or quadratic function, describes the relation between a dependent variable and an independent variable using a second-degree polynomial.

The equation for the quadratic model is y = ax² + bx + c, where a, b, and c are constants with a ≠ 0, x is the independent variable, and y is the dependent variable.
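The equation translates directly into code; a minimal Python sketch (function name and sample coefficients are illustrative only):

```python
def quadratic(x, a, b, c):
    """Evaluate y = a*x**2 + b*x + c; a must be nonzero for a true quadratic."""
    if a == 0:
        raise ValueError("a must not be zero")
    return a * x * x + b * x + c

# y = x^2 - 3x + 2 has roots at x = 1 and x = 2
print(quadratic(2.0, 1.0, -3.0, 2.0))  # 4 - 6 + 2 = 0.0
```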

OVERFITTING: Overfitting is a common problem in machine learning and statistical modeling, where a model learns the training data too well and captures noise or random fluctuations in the data rather than the underlying patterns or relationships.

Key characteristics of overfitting:

  1. High Training Accuracy, Low Test Accuracy: An overfit model will perform extremely well on the training data, often achieving close to 100% accuracy or very low error. However, when tested on new data (validation or test set), its performance significantly degrades.
  2. Excessive Complexity: Overfit models are often overly complex, with too many parameters or too much flexibility. They may have intricate decision boundaries or functions that try to fit every data point precisely.
  3. Noise Capture: Overfitting models tend to capture the noise in the training data, which includes random variations or outliers that are not representative of the underlying patterns.


    Ways to mitigate overfitting:

    1. Simplify the Model: Reduce the complexity of the model by using fewer parameters or features. For example, in the case of deep neural networks, you can decrease the number of layers or neurons.
    2. Increase Training Data: Gathering more training data can help the model generalize better, as it has a larger sample to learn from.
    3. Cross-Validation: Use techniques like k-fold cross-validation to assess the model’s performance on multiple subsets of the data, which can provide a more robust estimate of its generalization performance.
    4. Regularization: Apply regularization techniques such as L1 or L2 regularization to penalize overly complex models and encourage simpler solutions.
    5. Feature Selection: Carefully choose and engineer relevant features, discarding those that do not contribute to the model’s predictive power.

CHI-SQUARE

The chi-square (χ²) statistic is a statistical test used to determine whether there is a significant association between two categorical variables. It is useful when assessing relationships between nominal or ordinal variables. The test compares observed and expected frequencies in a contingency table. The formula for calculating the chi-square statistic is χ² = Σ [(O − E)² / E], where O represents an observed frequency and E the corresponding expected frequency. The chi-square test involves:

  1. Formulating null and alternative hypotheses.
  2. Collecting data and creating a contingency table.
  3. Calculating expected frequencies.
  4. Calculating the chi-square statistic.
  5. Determining degrees of freedom.
  6. Looking up the critical value or finding the p-value.
  7. Comparing the calculated statistic to the critical value (or the p-value to the significance level).

If the calculated statistic is greater than the critical value, or if the p-value is less than the chosen significance level, the null hypothesis is rejected. The chi-square test is widely used in biology, social sciences, and market research to analyze categorical data and assess independence or association between variables.
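The χ² sum itself is a one-liner; here is a Python sketch using a hypothetical 2×2 contingency table flattened into a list of four cells:

```python
def chi_square(observed, expected):
    """chi^2 = sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 2x2 table flattened to four cells, with equal expected counts
observed = [30, 20, 10, 40]
expected = [25, 25, 25, 25]
print(chi_square(observed, expected))  # (25 + 25 + 225 + 225) / 25 = 20.0
```

The resulting value would then be compared against the χ² critical value for the table's degrees of freedom, (rows − 1) × (columns − 1).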

09/11/2023-LINEAR REGRESSION


It helps us to find the best fit line for the given points.

Linear regression assumes a linear relationship between the dependent and independent variables and finds the best-fitting line that describes this relationship.

Equation for Linear regression is Y=b0+b1X+error
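The least-squares estimates of b0 and b1 can be sketched in a few lines of Python (the function name is illustrative; the sample points are chosen to lie exactly on a line so the fit is easy to check):

```python
def fit_line(xs, ys):
    """Least-squares estimates of b0 (intercept) and b1 (slope) in Y = b0 + b1*X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Points lying exactly on y = 1 + 2x, so the fit recovers b0 = 1, b1 = 2
print(fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```

The error term in the equation is whatever remains after subtracting the fitted line from each observed Y.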

OBSERVATION FROM THE GIVEN TABLE

The dataset contains samples of obesity, inactivity, and diabetes for people living in each state of the United States for the year 2018.

The dataset also has Federal Information Processing Standards (FIPS) codes for each of the variables % obesity, % inactivity, and % diabetes. Using these data we will explore the CDC 2018 diabetes, inactivity, and obesity data.