PROJECT UPDATE
A regression model can be applied to the data given in a number of different ways, as can be shown from our analysis of the data.
details of the data
Obesity, diabetes, and inactivity rates for 2018 were included in the report, which was broken down by county and state. There was a different FIPDS number for each county in the state. Additionally, there were a number of additional criteria—such as economic and health-related ones—included on the website, but because there are so little data points available for them compared to the other categories, it is much worse to choose/include them.
the number of samples in each factor wasn’t the same in the data set provided, and there were only 354 samples with all three sets of values. There are numerous ways to handle this.
Duplicate points can be removed in a variety of ways, including scaling smaller groups up or down to ensure that all features have the same number of sample points. however, by duplicating a greater number of values, we will create a structure for a model that is fed inaccurate information and may perform poorly.
choose a more manageable number of points; pick the 354 or so instances that have all three attributes listed…. however they assert that the more samples, the better the model may be trained, but obtaining data in such a short period of time would be an impossible task. Since there is less data available to the machine, we have chosen the second alternative even though it might not produce accurate predictions. At least we won’t be feeding the model with made-up data.
There are various techniques to extract those typical 354-ish data points with all three covered… Excel was used to complete it.
process:-
We already knew what the FIPDS codes were for those who had all three values, which made this task a little bit easier. We simply copied all the common FIPDS codes to a different column, took the original FIPDS column and the new common fipds column, marked them for duplicates, gave those that are common a color code, and filtered them based on that. We repeated this process for all three sheets to create the data that will be used in Python to perform the analysis.
Numerous scatter plots of the data were created to get a general concept of how the regression line might appear.
diabetes-2018 cdc (1)
the link to the sheet that we will use is above