Week 5- Friday

Dataset Insights 

The dataset in this project records incidents in which people were shot and killed by law enforcement officials; in other words, it is a collection of fatal police shooting records.
The dataset is split into two distinct files: “/v2/fatal-police-shootings-data.csv,” which includes detailed information about the shooting incidents and the victims involved, and “/v2/fatal-police-shootings-agencies.csv,” which includes details about the police agencies that have been connected to at least one fatal police shooting since 2015.

These two CSV files were combined using a common identifier, “agency_ids,” to produce a larger dataset for analysis. As part of preprocessing and quality control, any rows with null or “NaN” values were removed from the dataset. This was done to ensure the accuracy and completeness of the data used for analysis.
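
As a rough sketch, the merge-and-clean step could look like the following in pandas (the file paths, the join columns “agency_ids”/“id”, and the join type are assumptions, not the exact code used in the project):

    import pandas as pd

    # Load the two Washington Post files described above (paths are illustrative).
    shootings = pd.read_csv("v2/fatal-police-shootings-data.csv")
    agencies = pd.read_csv("v2/fatal-police-shootings-agencies.csv")

    # Join the incident records to the agency records on the shared identifier.
    # The exact column names ("agency_ids" on one side, "id" on the other) are
    # assumptions about the v2 schema.
    merged = shootings.merge(agencies, left_on="agency_ids", right_on="id",
                             how="inner", suffixes=("", "_agency"))

    # Drop any rows containing null/NaN values, as in the preprocessing step.
    merged = merged.dropna()
    print(merged.shape)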

To gain a better understanding of the dataset and its characteristics, I prepared a histogram. Histograms are particularly useful for illustrating how data is distributed across different categories or bins, making it easier to observe patterns and trends.

The histogram showed that the majority of incidents involved people who identified as white, followed by cases involving people who identified as Black, Hispanic, Native American, and Asian.
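
A minimal sketch of how such a histogram over categories could be drawn with Seaborn (the “race” column name and its coding are assumptions about the Washington Post schema):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Count how many incidents fall into each racial category and plot the counts.
    shootings = pd.read_csv("v2/fatal-police-shootings-data.csv")
    sns.countplot(data=shootings, x="race",
                  order=shootings["race"].value_counts().index)
    plt.xlabel("Race of the person shot")
    plt.ylabel("Number of incidents")
    plt.title("Fatal police shootings by race")
    plt.tight_layout()
    plt.show()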

Week 5- Wednesday

The second project focuses on analyzing data about fatal police shootings in the US obtained from the Washington Post’s repository. I used simple commands like “describe()” and “info()” to start a basic investigation of the data and its properties. The dataset currently has 8,770 entries spanning January 2, 2015 to October 7, 2023. I am still in the process of becoming familiar with the dataset, exploring its potential analytical uses, and learning more about it.
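
A minimal sketch of this first look with pandas (the file path and the “date” column name are assumptions):

    import pandas as pd

    # Load the incident-level file and take a first look at its structure.
    df = pd.read_csv("v2/fatal-police-shootings-data.csv")

    df.info()              # column names, dtypes, and non-null counts
    print(df.describe())   # summary statistics for the numeric columns

    # Check the date span of the records.
    df["date"] = pd.to_datetime(df["date"])
    print(df["date"].min(), df["date"].max())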

Week 4 – Monday

Today, I had an overview of the p-value, which can be described as the probability, under the null hypothesis, of obtaining a result as extreme as or more extreme than the observed data.

In simpler terms, it tells you whether the outcomes derived from your data are statistically significant or simply the product of a chance event. I haven’t applied it in my project yet, but it may come in handy once I have a hypothesis ready.

Week 3- Wednesday

Monte Carlo Random Sampling

Monte Carlo simulation is a computational technique that uses random numbers to simulate complex scenarios and solve problems that would otherwise be hard to tackle. In situations where conventional analytical approaches are impractical, it uses many iterations of random sampling to estimate outcomes or probabilities.

Monte Carlo simulation is applied in various fields, including finance, engineering, physics, economics, and more. Common applications include risk assessment, option pricing, project management, and evaluating complex systems.
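
As a toy illustration of the idea, unrelated to the course data, here is a sketch that estimates π by random sampling:

    import random

    # Monte Carlo estimate of pi: sample random points in the unit square and
    # count the fraction that falls inside the quarter circle of radius 1.
    def estimate_pi(n_samples: int = 1_000_000) -> float:
        inside = 0
        for _ in range(n_samples):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                inside += 1
        return 4 * inside / n_samples

    print(estimate_pi())  # approaches 3.14159... as n_samples grows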

Week 3- Monday

Resampling Method- Cross Validation

In the field of machine learning, cross-validation is a reliable and popular technique whose main goal is to evaluate, and ultimately improve, a predictive model’s accuracy.
Cross-validation’s main principle is to split the dataset into several subsets, or “folds.” Each fold stands for a distinct division of the data.

The process of cross-validation involves building several models, each trained on a different combination of these folds while one fold is held out for validation. This iterative procedure ensures that every piece of data is used both for training and for validation.

This strategy is especially helpful when working with small datasets. In these situations, it enables effective model training and a detailed evaluation of the model’s performance, leading to more trustworthy estimates.

Although it is possible to split the data into two equal groups for training and validation, this approach may not be as effective. It is often preferable to use more sophisticated techniques, such as k-fold cross-validation, where “k” stands for the number of folds, enabling a more thorough evaluation of model performance.

For instance, in a 10-fold cross-validation on a dataset like the CDC diabetes dataset, which contains 3,100 cases, each fold would contain about 310 examples. This approach makes sure that each subset is a representative slice of the complete dataset, improving the rigor and usefulness of the evaluation.
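
A minimal scikit-learn sketch of 10-fold cross-validation at this scale, using synthetic stand-in data rather than the real CDC diabetes features:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in data with 3,100 rows, mirroring the size mentioned above
    # (the real features and target are not reproduced here).
    X, y = make_regression(n_samples=3100, n_features=3, noise=10, random_state=0)

    # 10-fold cross-validation: each fold holds out about 310 examples.
    scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
    print(scores.mean(), scores.std())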

Ultimately, cross-validation’s main objective is to assess a model’s generalizability. It gives a comprehensive view of how well the model works on unseen data and may point the way to model improvements.

Week 2- Friday

I computed correlations using the provided data. Using the Python Seaborn module, I visualized the distributions of the diabetes, obesity, and inactivity data.
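
A rough sketch of how those correlations and distribution plots could be produced (the file name and column names are assumptions, not the exact ones in the project data):

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Hypothetical county-level frame with the three variables of interest.
    cdc = pd.read_csv("cdc_county_data.csv")
    cols = ["diabetes", "obesity", "inactivity"]

    # Pairwise Pearson correlations between the three variables.
    print(cdc[cols].corr())

    # Distribution of each variable, with a kernel density overlay.
    for col in cols:
        sns.histplot(cdc[col], kde=True)
        plt.title(f"Distribution of {col}")
        plt.show()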

I’ve also read about kurtosis, which describes how peaked a distribution is and how much of its mass lies in the tails, as we covered in the lecture.

While talking with my teammates, we found more information on the CDC website about the socioeconomic status of various counties. We also found data about food surplus. Based on socioeconomic factors, food surplus, transportation, population, etc., we can therefore categorize counties as Urban or Rural.

I’ll use Scikit-Learn and linear regression to investigate this, and I will work to include further findings.

Week 2 – Wednesday

Kurtosis

From what I understand, kurtosis is a statistical measure that helps us understand the shape of a probability distribution, in particular the “peak” and “tails” of a data set. When we plot roughly normal data, the graph forms a bell-shaped curve.

Types of Kurtosis

1. Mesokurtic : Kurtosis = 3.0

A mesokurtic distribution has kurtosis close to that of the normal distribution, i.e. an excess kurtosis close to zero.

2. Leptokurtic : Kurtosis > 3.0

A leptokurtic distribution has positive excess kurtosis, meaning heavier tails than the normal distribution.

3. Platykurtic : Kurtosis < 3.0

A platykurtic distribution has negative excess kurtosis, meaning lighter tails than the normal distribution.

If a data set has a high kurtosis, that means the distribution has a very sharp peak and heavy tails. In practice, this refers to the data having more extreme values, either exceptionally high or exceptionally low, compared to a normal distribution.

On the other hand, low kurtosis indicates a smoother, broader distribution with light tails. In this case, the data have fewer extreme values and are more concentrated around the mean.
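
A quick sketch of how these cases can be checked numerically with SciPy; note that scipy.stats.kurtosis reports excess kurtosis (kurtosis minus 3) by default:

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    normal = rng.normal(size=100_000)            # mesokurtic: excess kurtosis near 0
    heavy = rng.standard_t(df=5, size=100_000)   # leptokurtic: heavy tails, positive
    flat = rng.uniform(-1, 1, size=100_000)      # platykurtic: light tails, negative

    # Print the excess kurtosis of each sample.
    for name, sample in [("normal", normal), ("t(5)", heavy), ("uniform", flat)]:
        print(name, kurtosis(sample))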

Week 2- Monday

P-value:

During class, the professor explained the concept of the p-value, which can be summarized as the probability, under the null (no-effect) hypothesis, of obtaining a result equal to or more extreme than what was actually observed.

The “p” in p-value stands for probability, and it measures how likely it is that any observed difference between groups is due to chance. In simple terms, it tells you whether the outcome from your data is statistically significant or simply due to a random event.

Conceptual working steps of p-value:

  • Formulating a null hypothesis (H0)
  • Collecting and analyzing data
  • Calculating the p-value 
  • Comparing the p-value to a significance level (α)

– If p-value ≤ α: You reject the null hypothesis. There is evidence to support your alternative hypothesis.

– If p-value > α: You fail to reject the null hypothesis. There is not enough evidence to support your alternative hypothesis.
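
A minimal sketch of these steps on made-up data, using a one-sample t-test from SciPy (the hypothesized mean and the simulated sample are purely illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical example: test whether a sample mean differs from 100.
    sample = rng.normal(loc=103, scale=15, size=50)
    t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

    # Compare the p-value to the significance level α.
    alpha = 0.05
    if p_value <= alpha:
        print(f"p = {p_value:.4f} <= {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.4f} > {alpha}: fail to reject the null hypothesis")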

Linear Regression and Multiple Regression:

If linear regression is used to predict one dependent variable from one independent variable, it is called SIMPLE LINEAR REGRESSION. The formula is Y = a + bX, where Y is the dependent variable, X is the independent variable, b is the slope, and a is the intercept.

If linear regression is used to predict one dependent variable from two or more independent variables, it is called MULTIPLE LINEAR REGRESSION. The formula is Y = a + b1X1 + b2X2 + … + bnXn, where Y is the dependent variable, X1 through Xn are the independent variables, a is the intercept, and b1, b2, …, bn are the slopes.
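
A minimal scikit-learn sketch mapping these symbols to code, with made-up numbers purely for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Multiple linear regression with two independent variables:
    # Y = a + b1*X1 + b2*X2, fitted by ordinary least squares.
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
    y = np.array([6.1, 5.9, 11.2, 10.8, 14.9])

    model = LinearRegression().fit(X, y)
    print("intercept a:", model.intercept_)
    print("slopes b1, b2:", model.coef_)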