description of project 2

The data set offered comes from the data repository of the Washington Post on fatal shootings by police in the US

The dataset contains 17 features, including:

ID number: the case's file ID (numerical values, 8002 entries)

name: the involved individual's name (strings, 7548 entries)

date of occurrence: the date the shooting took place, in DD/MM/YYYY format (8002 entries)

manner of death: whether the individual was shot, or shot and tasered (8002 entries)

age: age of the individual involved (numerical, 7499 entries)

gender: M/F (7971 entries)

race: W/A/H/B for White, Asian, Hispanic, or Black (8002 entries)

state (8002 entries)

signs of mental illness: boolean, true/false (8002 entries)

threat level: attack, other, or unknown (8002 entries)

flee: on foot, by car, or not fleeing (7037 entries)

body camera: boolean value

PROJECT UPDATE

Our analysis of the data shows that a regression model can be applied to it in a number of different ways.

details of the data


The report included 2018 obesity, diabetes, and inactivity rates, broken down by county and state, with a distinct FIPS code for each county. The website also offered a number of additional criteria, such as economic and health-related factors, but these have far fewer data points than the main categories, so including them would be a poor choice.

The number of samples differed across the three factors in the dataset provided, and only 354 samples had values for all three. There are several ways to handle this.

One option is to rescale the groups, duplicating points in the smaller groups or dropping points from the larger ones so that all features have the same number of sample points. However, by duplicating values we would feed the model fabricated information, and it may perform poorly as a result.

The other option is to choose a more manageable number of points: keep only the 354 or so instances that have all three attributes listed. It is often said that the more samples, the better a model can be trained, but obtaining additional data in such a short period of time is not feasible. We therefore chose this second alternative; with less data available to the machine the predictions may be less accurate, but at least we will not be feeding the model made-up data.

There are various ways to extract those roughly 354 common data points that have all three values; we used Excel to do it.

Process:

We already knew the FIPS codes of the counties that had all three values, which made this task a little easier. We copied all the common FIPS codes into a separate column, compared the original FIPS column against this new common-FIPS column, marked the duplicates, gave the common entries a colour code, and filtered on that colour. We repeated this process for all three sheets to build the dataset that will be used in Python to perform the analysis.
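For anyone who prefers to do this matching step in code instead of Excel, here is a minimal pandas sketch. The file names (obesity.csv, inactivity.csv, diabetes.csv) and column names are assumptions for illustration, not the actual workbook names.

import pandas as pd

# Assumed file and column names; one file per factor, each with a FIPS column.
obesity = pd.read_csv("obesity.csv")        # columns: FIPS, % OBESE
inactivity = pd.read_csv("inactivity.csv")  # columns: FIPS, % INACTIVE
diabetes = pd.read_csv("diabetes.csv")      # columns: FIPS, % DIABETIC

# Inner joins keep only the counties whose FIPS code appears in all three files,
# reproducing the roughly 354 common rows found with the Excel duplicate marking.
common = obesity.merge(inactivity, on="FIPS").merge(diabetes, on="FIPS")
print(len(common))
common.to_csv("common_fips.csv", index=False)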

Numerous scatter plots of the data were created to get a general concept of how the regression line might appear.
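As a rough illustration, a pairwise scatter plot such as the following (using the merged frame and assumed column names from the sketch above) gives a first impression of how the regression line might look.

import matplotlib.pyplot as plt

plt.scatter(common["% INACTIVE"], common["% DIABETIC"], s=10)
plt.xlabel("% inactivity")
plt.ylabel("% diabetes")
plt.title("Inactivity vs. diabetes by county (2018)")
plt.show()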

diabetes-2018 cdc (1)

The link to the sheet that we will use is above.

example for cross-validation

In this example we have obesity, inactivity, and diabetes data with 354 data points; with these data we are going to plot a 3D graph.

Test error: this method divides the data into 5 roughly equal parts so that the accuracy of the output can be estimated. We use this test-error method for evaluating the obesity, inactivity, and diabetes data.

(* data is the list of 354 rows {obesity, inactivity, diabetes}; each row of labelleddata is {index, obesity, inactivity, diabetes} *)
labelleddata = Partition[Flatten[Riffle[Range[Length[data]], data]], 4, 4];
labelleddata[[1 ;; 5]]

Here we split the labelled rows into five random groups of roughly 71 points each:

L = labelleddata;
n = 1;
While[n <= 4, sample[n] = RandomSample[L, 71];
 L = Complement[L, sample[n]]; n++];
sample[5] = L;
datasample[n_] := sample[n][[All, {2, 3, 4}]]

We now establish 5 sets of training data and test data:

Alldatasamples = Table[sample[k], {k, 1, 5}];
TrainingData[k_] := Partition[Flatten[Drop[Alldatasamples, {k}]], 4, 4][[All, {2, 3, 4}]]
TestData[k_] := datasample[k]

estimating prediction error

prediction error = (measured value - predicted value) / measured value × 100.

example for estimating the prediction error

To find heart disease, let us take data on a person's chest pain, blood circulation, blocked arteries, and weight. To work on these data we could use the KNN algorithm, support vector machines, and many other machine learning methods; cross-validation will suggest which method is most suitable by comparing how each performs on train/test splits. In this example we divided the data into 4 blocks, so it is called four-fold cross-validation; in this way we can get a reliable estimate of each method's accuracy from the data.
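A minimal scikit-learn sketch of this idea is below. The feature names and the random placeholder data are made up for illustration; the point is only to show four-fold cross-validation scoring two candidate methods so the better one can be chosen.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder data: 200 patients, 4 features (chest pain, blood circulation,
# blocked arteries, weight) and a 0/1 heart-disease label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

cv = KFold(n_splits=4, shuffle=True, random_state=0)  # four-fold cross-validation
for name, model in [("KNN", KNeighborsClassifier()), ("SVM", SVC())]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, "mean accuracy:", scores.mean())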

CROSS VALIDATION AND ITS TYPES

CROSS VALIDATION: It is a statistical technique used in machine learning and data analysis to estimate how accurately a model will predict new data, which helps in diagnosing underfitting and overfitting.

Leave-one-out cross-validation: In this validation one data point is taken as the test set and the remaining points form the training set. For example, if we have 100 data points, in the first experiment the first point is the test set and the remaining 99 points are the training set; the same process is repeated for every point. It consumes a lot of time, and when a new dataset is run through this process there is a high chance of low accuracy and high error.

K-fold cross-validation: In this validation a value k is chosen and the dataset is divided into k folds; the model is then trained and tested k times, holding out a different fold each time, and the mean of the k accuracy scores gives the validation estimate.

Stratified k-fold cross-validation: This validation is similar to k-fold cross-validation, but it ensures that each fold has a class distribution similar to the original dataset. It is useful for imbalanced datasets.

Time-series cross-validation: In this validation future data are predicted from current data. For example, with stock prices, if we have the previous 5 days of data and need to predict the next day's price, time-series cross-validation is used.
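A short scikit-learn sketch of how these variants are set up (the data here are placeholders):

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)   # 10 placeholder samples
y = np.array([0, 1] * 5)           # alternating class labels

splitters = {
    "leave-one-out": LeaveOneOut(),
    "5-fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "stratified 5-fold": StratifiedKFold(n_splits=5),
    "time series (5 splits)": TimeSeriesSplit(n_splits=5),
}
for name, splitter in splitters.items():
    print(name, "-", splitter.get_n_splits(X, y), "train/test splits")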

T-Test and crab molt model

T-Test: It is used to determine whether a particular variable is statistically significant in a model by comparing the means of two groups of samples. There are three types of tests (a short sketch follows the list).

(1) One-sample t-test

(2) Two-sample t-test

(3) Two-sample paired t-test
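A minimal SciPy sketch of the three types, using small made-up samples:

import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 5.4, 5.0, 5.2])  # made-up sample 1
b = np.array([4.6, 4.9, 4.7, 5.0, 4.5])  # made-up sample 2 (same length, for the paired case)

print(stats.ttest_1samp(a, popmean=5.0))  # (1) one-sample t-test against a hypothesised mean
print(stats.ttest_ind(a, b))              # (2) two-sample (independent) t-test
print(stats.ttest_rel(a, b))              # (3) two-sample paired t-test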

The Crab Molt model looks at measurements of crabs before and after molting. Our main goal was to predict the crab size before molting based on the post-molt measurements. Using a simple model, we got an impressive R-squared value of 0.98, indicating the model predicts well based on the data. We also analyzed the pre-molt and post-molt data; they were similar in distribution, with a small mean difference of about 14.7 units.
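A sketch of how such a fit can be reproduced, assuming a CSV with columns presize and postsize (hypothetical names, not the actual file we used):

import pandas as pd
import statsmodels.api as sm

crabs = pd.read_csv("crabmolt.csv")     # assumed columns: presize, postsize
X = sm.add_constant(crabs["postsize"])  # predict pre-molt size from post-molt size
model = sm.OLS(crabs["presize"], X).fit()
print(model.rsquared)                                  # about 0.98 on our data
print((crabs["postsize"] - crabs["presize"]).mean())   # mean difference, about 14.7 on our data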


QUADRATIC MODEL AND OVERFITTING

QUADRATIC MODEL: A quadratic model, also known as a quadratic equation or quadratic function, describes the relationship between a dependent variable and an independent variable using a second-degree (quadratic) polynomial.

The equation for the quadratic model is y = ax² + bx + c,

where a, b, and c are constants, with a not equal to zero,

x is the independent variable, and

y is the dependent variable.
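A small NumPy sketch of fitting such a model to made-up points:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # made-up x values
y = np.array([1.1, 3.0, 9.2, 19.1, 33.0])  # made-up y values, roughly 2x^2 + 1

a, b, c = np.polyfit(x, y, deg=2)   # least-squares fit of y = a*x^2 + b*x + c
print(a, b, c)
print(np.polyval([a, b, c], 5.0))   # predicted y at x = 5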

OVERFITTING: Overfitting is a common problem in machine learning and statistical modeling, where a model learns the training data too well and captures noise or random fluctuations in the data rather than the underlying patterns or relationships.

Key characteristics of overfitting:

  1. High Training Accuracy, Low Test Accuracy: An overfit model will perform extremely well on the training data, often achieving close to 100% accuracy or very low error. However, when tested on new data (validation or test set), its performance significantly degrades.
  2. Excessive Complexity: Overfit models are often overly complex, with too many parameters or too much flexibility. They may have intricate decision boundaries or functions that try to fit every data point precisely.
  3. Noise Capture: Overfitting models tend to capture the noise in the training data, which includes random variations or outliers that are not representative of the underlying patterns.


    Ways to mitigate overfitting:

    1. Simplify the Model: Reduce the complexity of the model by using fewer parameters or features. For example, in the case of deep neural networks, you can decrease the number of layers or neurons.
    2. Increase Training Data: Gathering more training data can help the model generalize better, as it has a larger sample to learn from.
    3. Cross-Validation: Use techniques like k-fold cross-validation to assess the model’s performance on multiple subsets of the data, which can provide a more robust estimate of its generalization performance.
    4. Regularization: Apply regularization techniques such as L1 or L2 regularization to penalize overly complex models and encourage simpler solutions (a short sketch follows this list).
    5. Feature Selection: Carefully choose and engineer relevant features, discarding those that do not contribute to the model’s predictive power.
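As one concrete illustration of points 1, 3, and 4, the sketch below fits a deliberately flexible polynomial model on made-up data, with and without L2 (ridge) regularization, and compares them by cross-validation.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(30, 1))                          # made-up inputs
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)   # noisy made-up targets

degree = 12  # deliberately too flexible, inviting overfitting
plain = make_pipeline(PolynomialFeatures(degree), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3))

# Compare 5-fold cross-validated R^2; regularization usually tames the
# wild high-degree fit and generalizes better on the held-out folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print("no regularization:", cross_val_score(plain, X, y, cv=cv).mean())
print("ridge (L2):", cross_val_score(ridge, X, y, cv=cv).mean())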

CHI-SQUARE

The chi-square (χ²) statistic is a statistical test used to determine if there is a significant association between two categorical variables. It is useful when assessing relationships between nominal or ordinal variables. The test compares observed and expected frequencies in a contingency table. The formula for calculating the chi-square statistic is χ² = Σ [(O – E)² / E]. O represents observed frequency, and E represents expected frequency. The chi-square test involves formulating null and alternative hypotheses, collecting data and creating a contingency table, calculating expected frequencies, calculating the chi-square statistic, determining degrees of freedom, looking up critical values or finding p-values, and comparing the calculated statistic to the critical value or p-value. If the calculated statistic is greater than the critical value or if the p-value is less than the chosen significance level, the null hypothesis is rejected. The chi-square test is widely used in biology, social sciences, and market research to analyze categorical data and assess independence or association between variables.
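A brief SciPy sketch of the procedure on a made-up 2x2 contingency table:

from scipy.stats import chi2_contingency

# Made-up observed frequencies: rows are groups, columns are outcome categories.
observed = [[30, 10],
            [20, 40]]

chi2, p, dof, expected = chi2_contingency(observed)
print("chi-square statistic:", chi2)
print("degrees of freedom:", dof)
print("p-value:", p)  # reject the null hypothesis of independence if p < 0.05
print("expected frequencies:", expected)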

09/11/2023-LINEAR REGRESSION

LINEAR REGRESSION

Linear regression helps us find the best-fit line for the given points.

It assumes a linear relationship between the dependent and independent variables and finds the best-fitting line that describes this relationship.

The equation for linear regression is Y = b0 + b1X + error.
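A minimal sketch of fitting this equation to the merged CDC data, with b0 the intercept and b1 the slope; the file and column names are the assumed ones from the merge sketch above.

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("common_fips.csv")      # assumed output of the merge step above
X = sm.add_constant(data["% INACTIVE"])    # adds the intercept column for b0
fit = sm.OLS(data["% DIABETIC"], X).fit()  # Y = b0 + b1 * (% inactivity) + error
print(fit.params)    # estimated b0 and b1
print(fit.rsquared)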

OBSERVATION FROM THE GIVEN TABLE

The dataset contains samples of the obesity, inactivity, and diabetes rates of people living in each county of the United States for the year 2018.

The dataset also includes Federal Information Processing Standards (FIPS) codes alongside each of the variables % obesity, % inactivity, and % diabetes. Using these data, we need to explore the CDC 2018 diabetes, inactivity, and obesity data.