POLICE SHOOTINGS BASED ON AGE AND RACE DATA
As the name of this update implies, we conducted a statistical study of the ages of people from various racial backgrounds who were killed by police, taking both age and race into account across the entire dataset. The racial categories are Asian (A), Black (B), Hispanic (H), Native American (N), Other (O), and White (W). Using Mathematica, we computed statistical metrics for the ages in each racial category, including the median, mean, standard deviation, variance, skewness, and kurtosis.
The results:

Asian victims (A):
Median age: 35 years
Mean age: 35.96 years
Standard deviation: 11.59; Variance: 134.38

Black victims (B):
Median age: 31 years
Mean age: 32.93 years
Standard deviation: 11.39; Variance: 129.70

Hispanic victims (H):
Median age: 32 years
Mean age: 33.59 years
Standard deviation: 10.74; Variance: 115.42

Native American victims (N):
Median age: 32 years
Mean age: 32.65 years
Standard deviation: 8.99; Variance: 80.90

Other victims (O):
Median age: 31 years
Mean age: 33.47 years
Standard deviation: 11.80; Variance: 139.15

White victims (W):
Median age: 38 years
Mean age: 40.13 years
Standard deviation: 13.16; Variance: 173.24
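For reference, here is a minimal Wolfram Language sketch of how these summary statistics can be computed; the association agesByRace is an assumed structure (race code mapped to a list of ages), not necessarily the exact one we used in class.

    (* assumed input: agesByRace = <|"A" -> {35, 42, ...}, "B" -> {...}, ...|> *)
    summarize[ages_List] := <|
        "Median" -> Median[ages], "Mean" -> N[Mean[ages]],
        "StandardDeviation" -> N[StandardDeviation[ages]],
        "Variance" -> N[Variance[ages]],
        "Skewness" -> N[Skewness[ages]], "Kurtosis" -> N[Kurtosis[ages]]|>;
    summarize /@ agesByRace  (* one set of statistics per racial category *)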
This was only a portion of what we covered in class; the next report will include more on hypothesis testing and related topics.
LAST UPDATE EXTENSION
k-means Clustering: This technique divides the dataset into k groups, with each data point assigned to the group whose mean is closest.
In our experiments, when the number of clusters k is set to 2, k-means clustering cleanly divides the data from the lemniscate (infinity shape).
The approach still produces a respectable clustering result when k is increased to 4, breaking the dataset into more focused, smaller clusters.
k-medoids Clustering:
Similar to k-means, but medoids are used in place of means. A medoid is the most centrally located data point in a cluster.
For k = 2, k-medoids clustering likewise offers a distinct separation using the lemniscate dataset.
However, the clusters are generated around the medoid, the most representative point of each cluster.
For k = 4, k-medoids divides the data much like k-means, but the clusters are generated around the most central data points rather than around the means.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
It creates clusters on the basis of dense regions of data points. As seen in the clusters produced in the given examples, DBSCAN is less sensitive to outliers than k-means and k-medoids.
DBSCAN found four clusters in the lemniscate example, most likely indicating dense regions divided by less dense or noisy regions.
Comparative Observations:
The report provides visuals to show that k-means and k-medoids are sensitive to the choice of k and will always divide the data into k clusters, regardless of whether the natural cluster count is higher or lower than k.
Since DBSCAN doesn’t need a set number of clusters, it can discover any number of clusters depending on data density, which in some circumstances might lead to a more logical clustering.
The visual results show that, if the right k is selected, k-means and k-medoids can successfully identify clusters in geometric designs with distinct separations (such as the lemniscate).
The benefit of DBSCAN, however, lies in its capacity to deal with noise and locate clusters without a specified k.
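To make the comparison concrete, here is a minimal Wolfram Language sketch that samples points from a filled lemniscate and runs all three methods; the region definition and the choice of k values are illustrative assumptions, not the exact setup from class.

    (* sample 200 points from a filled lemniscate: (x^2 + y^2)^2 <= 2 (x^2 - y^2) *)
    lemniscate = ImplicitRegion[(x^2 + y^2)^2 <= 2 (x^2 - y^2), {x, y}];
    pts = RandomPoint[lemniscate, 200];
    kmeans2   = FindClusters[pts, 2, Method -> "KMeans"];
    kmedoids4 = FindClusters[pts, 4, Method -> "KMedoids"];
    dbscan    = FindClusters[pts, Method -> "DBSCAN"];   (* no k required *)
    ListPlot[#, AspectRatio -> Automatic] & /@ {kmeans2, kmedoids4, dbscan}

Because DBSCAN infers the number of groups from density alone, the last call may return more or fewer clusters than the k-based methods on the same sample.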
Later on, we'll implement DBSCAN in Python and upload it in an upcoming release.
COMPARISON OF K-MEANS AND DBSCAN
In this lesson, we compared the performance of the clustering algorithms k-means, k-medoids, and DBSCAN on geometric data sets. The framework was three examples, each with a visualization and Mathematica code showing the clustering results.
Example 1: A lemniscate filled with 200 well-spaced random points.
DBSCAN takes these points and finds 4 clusters.
The clusters are displayed using the k-means approach with k=2 and k=4.
Similarly, k=2 and k=4 are used to illustrate the k-medoids approach.
Example 2: A union of a circle and an annulus filled with 400 randomly placed points is used to repeat the experiment.
Within this data set, DBSCAN detects two clusters.
Using k=2 and k=4, the k-means and k-medoids approaches are once more used.
Example 3: In this example, the area of a square less its maximal inscribed circle is filled with 400 random points.
Four clusters are found in this scenario by DBSCAN.
Using k=2 and k=4, the k-means and k-medoids approaches are illustrated.
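For anyone who wants to reproduce these datasets, here is a hedged Wolfram Language sketch of how such regions can be built and sampled; the specific radii and side lengths are assumptions for illustration.

    (* Example 2: union of a disk and an annulus, 400 random points *)
    region2 = RegionUnion[Disk[{0, 0}, 1], Annulus[{0, 0}, {2, 3}]];
    pts2 = RandomPoint[region2, 400];
    (* Example 3: square minus its maximal inscribed disk, 400 random points *)
    region3 = RegionDifference[Rectangle[{-2, -2}, {2, 2}], Disk[{0, 0}, 2]];
    pts3 = RandomPoint[region3, 400];
    FindClusters[pts2, Method -> "DBSCAN"]   (* separates the two dense regions *)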
A comparison study and a brief description of each technique will follow in the next release.
COHEN'S D
Cohen’s d is a measure of effect size used to indicate the standardized difference between two means. In the context of the age distribution for people killed by police, Cohen’s d was used to assess the effect size of the age difference between black and white individuals who were shot by police.
To calculate Cohen’s d, the difference between the two means (in this case, the mean ages of white and black individuals killed by police) is divided by the pooled standard deviation of the two groups. The pooled standard deviation is a weighted average of the standard deviations of the two groups, adjusted for their sample sizes.
For the data in question, the calculated Cohen's d was 0.577485, which is interpreted as a medium effect size under Cohen's guidelines. This indicates that the average age difference of 7.3 years between White and Black people killed by police is a medium-sized effect: the difference is statistically significant, and its magnitude is neither small nor large but somewhere in between.
Here is a sketch of the kind of code used to calculate the Cohen's d value:
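This is a minimal Wolfram Language reconstruction, assuming two age lists whiteAges and blackAges; it follows the pooled-standard-deviation formula described above rather than reproducing the exact snippet from class.

    cohensD[x_List, y_List] := Module[{n1 = Length[x], n2 = Length[y], s1, s2, sPooled},
        s1 = StandardDeviation[x]; s2 = StandardDeviation[y];
        (* pooled SD: weighted average of the two variances, adjusted for sample size *)
        sPooled = Sqrt[((n1 - 1) s1^2 + (n2 - 1) s2^2)/(n1 + n2 - 2)];
        (Mean[x] - Mean[y])/sPooled];
    cohensD[whiteAges, blackAges]   (* ~0.577 for our dataset *)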
PEOPLE KILLED BY POLICE
We discussed the age distribution of those killed by police in today’s class.
We used the provided dataset’s age and race components for this purpose, primarily concentrating on comparisons between Black and White people as well as the general age distribution.
We conducted the study once more using Mathematica (a Python equivalent will be published in the upcoming updates).
These are the results we obtained for the total age distribution:
Minimum age: 6 years
Maximum age: 91 years
Mean age: 37.1 years
Median age: 35 years
Standard deviation: 13.0 years
The distribution is slightly right-skewed (skewness 0.73), and a kurtosis near 3 indicates a roughly normal shape, with dispersion but no prominent peaks or heavy tails.
Age Breakdown by Race:
Black individuals
Minimum age: 13, maximum age: 88, mean age: 32.7 years, median age: 31 years, standard deviation: 11.4 years.
The age distribution of Black people killed by police has a kurtosis of 3.9 and a right skewness of about 1, indicating a somewhat heavy tail and a peaked shape.
White individuals
Minimum age: 6, maximum age: 91, mean age: 40 years, median age: 38 years, standard deviation: 13.3 years.
With a kurtosis of 2.86 and a moderate rightward skewness (0.53), this distribution shows no long, heavy tails and little peakedness.
Comparing Black and White Individuals:
The mean age difference was found to be roughly 7.3 years (on average, White people killed by police are statistically substantially older than Black people).
The statistical significance of this age difference was verified by a Monte Carlo simulation, which showed that there is an extremely small probability that the observed age difference occurred by chance.
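Here is a minimal Wolfram Language sketch of the kind of Monte Carlo (permutation) test described above; blackAges and whiteAges are assumed lists of ages, and the 10,000 resamples are an illustrative choice, not the exact setup from class.

    observed = Mean[whiteAges] - Mean[blackAges];
    pooledAges = Join[whiteAges, blackAges];
    n = Length[whiteAges];
    (* mean differences after randomly relabeling who is in which group *)
    diffs = Table[
        With[{s = RandomSample[pooledAges]}, Mean[s[[;; n]]] - Mean[s[[n + 1 ;;]]]],
        {10000}];
    pValue = N[Count[diffs, d_ /; Abs[d] >= Abs[observed]]/Length[diffs]]

A p-value near zero here corresponds to the extremely small probability of the observed difference arising by chance.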
Using Cohen's d to measure effect size, the value of 0.58 indicated a medium-sized effect from the age difference.
Additional details concerning Cohen's d will follow in the upcoming post.
LOCATION OF POLICE SHOOTINGS
In today's class, we worked with and examined data on the locations of police shootings. We used the given dataset, which has about 7,000 entries; its longitude and latitude columns give us the necessary information.
Several Mathematica functions were used in the research, including GeoPosition, which turned latitude/longitude pairs into geographic position objects, and GeoListPlot, which produced a geographic map of every site.
Additionally, we showed how to visualize the event density by generating geographic histograms using the GeoHistogram and GeoSmoothHistogram functions.
In addition, we used GeoDistance to determine the separation between locations and showed how clustering techniques can be used to examine the shootings' spatial distribution. We applied Mathematica's FindClusters function and investigated the DBSCAN clustering technique, which yielded four distinct clusters for California.
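A minimal Wolfram Language sketch of this workflow, assuming coords is a list of {lat, lon} pairs from the dataset (the variable names, including the California subset caCoords, are assumptions):

    locs = GeoPosition /@ coords;            (* {lat, lon} pairs -> position objects *)
    GeoListPlot[locs]                        (* map of every incident *)
    GeoHistogram[locs]                       (* binned incident density *)
    GeoSmoothHistogram[locs]                 (* smoothed incident density *)
    GeoDistance[locs[[1]], locs[[2]]]        (* distance between two incidents *)
    FindClusters[caCoords, Method -> "DBSCAN"]   (* spatial clusters, e.g. for California *)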
I’m going to execute this in Python and post it in my upcoming updates.
UPDATE
A decision tree is a tree structure that resembles a flowchart, with internal nodes representing features, branches representing decision rules, and leaf nodes representing the algorithm's outcome. It is a flexible supervised machine-learning approach that can be applied to both regression and classification problems, and it is among the most powerful algorithms. Random Forest builds on it, training trees on various subsets of the training data.
It is a supervised learning algorithm that can be applied to tasks involving regression and classification. It creates a flowchart-like tree structure in which each internal node signifies an attribute test, each branch denotes a test result, and each leaf (terminal) node carries a class label. The training data is recursively split into subsets based on attribute values until a stopping criterion, such as the maximum depth of the tree or the minimum number of samples needed to split a node, is satisfied.
During training, the decision tree method uses a metric such as entropy or Gini impurity, which gauges the degree of randomness or impurity in the subsets, to determine which attribute is best for splitting the data. The objective is to identify the attribute that maximizes the information gain, or the decrease in impurity, after the split.
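As a quick illustration, here is a minimal Wolfram Language sketch of training a decision tree classifier; the toy dataset is an assumption for demonstration, not our police-shootings data.

    (* toy training data: feature vectors -> class labels *)
    training = {{1.0, 2.0} -> "A", {1.5, 1.8} -> "A", {5.0, 8.0} -> "B", {6.0, 9.0} -> "B"};
    tree = Classify[training, Method -> "DecisionTree"];
    tree[{1.2, 1.9}]   (* classifies the new point, here as "A" *)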
We will provide updates in the future on where and how this method applies to our data set.
UPDATE
I'll discuss these topics in the upcoming updates before moving on to the classwork.
Clustering –
Clustering becomes more complex when dealing with a dataset that contains different types of data, often known as heterogeneous data, like the one we are working with right now. In these datasets you might come across a combination of categorical, numerical, and possibly even textual data, and efficiently finding significant patterns across these varied formats requires sophisticated methodologies and algorithms. We can therefore either apply clustering to related subsets of the features or convert them all into a single data type and work with that. For this purpose, I've plotted DBSCAN using just the latitude and longitude data.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering technique belonging to the class of density-based clustering methods. It is especially helpful for finding arbitrarily shaped clusters in datasets of varying density, and it defines clusters as regions with a higher density of data points separated by regions with a lower density. Because it does not require the number of clusters to be specified in advance, it is appropriate for datasets where the cluster count is unknown. Its two primary parameters are "eps" (epsilon), the maximum distance between two samples for one to be considered in the neighborhood of the other, and "min_samples", the minimum number of data points needed to form a dense region (core point).
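Since the Python implementation will come later, here is a hedged Wolfram Language sketch of the same idea; I am assuming FindClusters' DBSCAN suboptions "NeighborhoodRadius" and "NeighborsNumber" as the counterparts of eps and min_samples, and the coords variable and parameter values are illustrative.

    (* DBSCAN on the latitude/longitude columns; parameter values are illustrative *)
    clusters = FindClusters[coords,
        Method -> {"DBSCAN", "NeighborhoodRadius" -> 0.5, "NeighborsNumber" -> 10}];
    ListPlot[Map[Reverse, clusters, {2}], AspectRatio -> 1]   (* plot as {lon, lat} *)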
(Implementation will follow the subjects that we shall cover in the upcoming updates.)