Week 3- Wednesday

Monte Carlo Random Sampling

Monte Carlo simulation is a computational technique that uses random numbers to simulate complex scenarios and solve complex problems. Where conventional mathematical approaches are impractical, it runs many iterations of simulation and random sampling to estimate outcomes or probabilities.
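A minimal sketch of the idea in Python, using a toy problem that is my own choice rather than something from the lecture: estimating the probability that two fair dice sum to at least 10. The exact answer is 6/36, so the estimate can be checked against it. The function name `estimate_probability` and the fixed seed are assumptions made for illustration.

```python
import random

def estimate_probability(num_trials, seed=0):
    """Estimate P(sum of two fair dice >= 10) by repeated random sampling."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total >= 10:
            hits += 1
    # The fraction of simulated trials that hit approximates the probability.
    return hits / num_trials

if __name__ == "__main__":
    exact = 6 / 36
    for n in (100, 10_000, 100_000):
        estimate = estimate_probability(n)
        print(f"trials={n:>7}  estimate={estimate:.4f}  exact={exact:.4f}")
```

With more trials the estimate should settle closer to the exact value, which is the basic trade-off in Monte Carlo methods: more samples cost more time but reduce the random error.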

Monte Carlo simulation is applied in various fields, including finance, engineering, physics, economics, and more. Common applications include risk assessment, option pricing, project management, and evaluating complex systems.

Week 3- Monday

Resampling Method- Cross Validation

In machine learning, cross-validation is a reliable and popular technique whose main goal is to estimate, and ultimately help improve, a predictive model’s accuracy.
Cross-validation’s main principle is to split the dataset into various subsets, or “folds.” Each fold stands for a unique division of the data.

The process of cross-validation involves building several models, each trained on a different combination of these folds while one fold is held out for validation. This iterative procedure ensures that every data point is used for both training and validation.

When working with small datasets, this strategy is quite helpful. In these situations, it enables effective model training and a detailed evaluation of the model’s performance, leading to forecasts that are more accurate.

Although it is feasible to divide the data into two equal groups for training and validation, this approach might not be as successful, because the model trains on only half the data and the evaluation depends on one particular split. It is frequently preferred to use more sophisticated techniques, such as k-fold cross-validation. Here, “k” stands for the number of folds, enabling a more thorough evaluation of model performance.

For instance, a 10-fold cross-validation on a dataset like the CDC diabetes dataset, which contains 3100 cases, would put about 310 examples in each fold. Because every case is used for validation exactly once, the evaluation draws on the complete dataset, improving its rigor and usefulness.
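The fold arithmetic can be sketched in plain Python. This is only an illustration of how the indices are split, not the code used in class; the helper name `k_fold_indices` is hypothetical, and in practice the rows would usually be shuffled first (scikit-learn’s `KFold` offers this through its `shuffle` option).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

if __name__ == "__main__":
    # 3100 cases and 10 folds, matching the example in the text.
    for fold, (train, validation) in enumerate(k_fold_indices(3100, 10), start=1):
        print(f"fold {fold}: train={len(train)} validation={len(validation)}")
```

Each iteration would train a model on the `train` indices and score it on the `validation` indices; averaging the ten scores gives the cross-validated estimate.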

Ultimately, cross-validation’s main objective is to assess how well a model generalizes. It gives a comprehensive insight into how well the model works with unseen data and may point to model improvements.

Week 2- Friday

I computed correlations on the provided data. Using the Python Seaborn module, I visualized the distributions of the diabetes, obesity, and inactivity data.

I’ve also read about kurtosis, which describes how peaked a distribution is and how much of it lies in the tails, as we covered in the lecture.

While talking with my teammates, we found more information on the CDC website about the socioeconomic status of various counties. We also found information about food surplus. Based on socioeconomic factors, food surplus, transportation, population, etc., we can therefore categorize counties as Urban or Rural.

I’ll use Scikit-Learn’s linear regression model to try to address this, and I will work to include further findings.

Week 2- Wednesday

Kurtosis

From what I understand, kurtosis is a statistical measure that helps us understand the shape of a probability distribution, specifically the “peak” and “tails” of a data set. When we plot data that is roughly normally distributed, it forms a bell-shaped curve, and kurtosis describes how the peak and tails differ from that shape.

Types of Kurtosis

1. Mesokurtic : Kurtosis = 3.0

A mesokurtic distribution is one where the excess kurtosis (kurtosis minus 3) is close to zero, as in a normal distribution.

2. Leptokurtic : Kurtosis > 3.0

A leptokurtic distribution has positive excess kurtosis.

3. Platykurtic : Kurtosis < 3.0

A platykurtic distribution has negative excess kurtosis.

If a data set has a high kurtosis, that means the distribution has a very sharp peak and heavy tails. In practice, this refers to the data having more extreme values, either exceptionally high or exceptionally low, compared to a normal distribution.

On the opposite side, low kurtosis indicates a smoother, broader distribution with light tails. In this case, the data have fewer extreme values and are more concentrated around the mean.
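To make the three types concrete, here is a small sketch that computes kurtosis directly from its definition (fourth central moment divided by the squared variance) and reports both the raw value, compared against 3, and the excess value (kurtosis minus 3), compared against 0. The three simulated samples are my own choice for illustration: in theory a normal distribution has kurtosis 3, a uniform distribution 1.8, and a Laplace distribution 6, so the printed values should land near those.

```python
import random

def kurtosis(values):
    """Population kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2)

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n = 100_000
    samples = {
        "normal (mesokurtic)": [rng.gauss(0, 1) for _ in range(n)],
        "uniform (platykurtic)": [rng.uniform(-1, 1) for _ in range(n)],
        # The difference of two exponential draws follows a Laplace distribution.
        "Laplace (leptokurtic)": [rng.expovariate(1) - rng.expovariate(1) for _ in range(n)],
    }
    for name, data in samples.items():
        k = kurtosis(data)
        print(f"{name}: kurtosis={k:.2f}, excess={k - 3:.2f}")
```

Libraries differ on which convention they report by default, so it is worth checking whether a given function returns kurtosis or excess kurtosis before comparing against 3 or 0.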

Week 2- Monday

P-value:

During class, the professor explained the concept of the p-value, which can be summarized as the probability, assuming no effect (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.

The “p” in p-value stands for probability. It measures how likely it is that a difference at least as large as the one observed between groups would arise by chance alone if the null hypothesis were true. In simple terms, it helps you judge whether the outcome in your data is statistically significant or could be due to a chance event.

Conceptual working steps of p-value:

  • Formulating a null hypothesis (H0)
  • Collecting and analyzing data
  • Calculating the p-value 
  • Comparing the p-value to a significance level (α)

- If p-value ≤ α: You reject the null hypothesis. There is evidence to support your alternative hypothesis.

- If p-value > α: You fail to reject the null hypothesis. There is not enough evidence to support your alternative hypothesis.
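The steps above can be sketched with a permutation test, which is one simple way to get a p-value and not necessarily the method used in class. The null hypothesis is that group membership has no effect on the mean, so shuffling the group labels should produce differences as large as the observed one fairly often. The function name `permutation_p_value`, the two small groups, and the significance level of 0.05 are all made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, num_permutations=10_000, seed=0):
    """Two-sided p-value for a difference in means under H0 (no group effect)."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # under H0 the group labels are interchangeable
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding in the sums.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_permutations

if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5]
    group_b = [4.2, 4.8, 4.4, 4.1, 4.9, 4.6]
    alpha = 0.05  # significance level chosen before looking at the data
    p = permutation_p_value(group_a, group_b)
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"p-value={p:.4f}, alpha={alpha}: {decision}")
```

The returned value is the fraction of shuffles that gave a difference at least as extreme as the observed one, which matches the definition of the p-value given earlier.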

Linear Regression and Multiple Regression:

When linear regression is used to predict one dependent variable from a single independent variable, it is called SIMPLE LINEAR REGRESSION. The formula is Y = a + bX, where Y is the dependent variable, X is the independent variable, b is the slope, and a is the intercept.
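As a rough illustration of how a and b can be found, the sketch below uses the standard least-squares formulas: the slope is the covariance of X and Y divided by the variance of X, and the intercept is chosen so the line passes through the means. The function name and the toy data (generated exactly from Y = 2 + 3X) are assumptions for the example.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) for Y = a + b X using ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations over sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b

if __name__ == "__main__":
    # Toy data generated from Y = 2 + 3X, for illustration only.
    x = [1, 2, 3, 4, 5]
    y = [5, 8, 11, 14, 17]
    a, b = fit_simple_linear_regression(x, y)
    print(f"intercept a={a:.2f}, slope b={b:.2f}")
```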

When linear regression is used to predict one dependent variable from two or more independent variables, it is called MULTIPLE LINEAR REGRESSION. The formula is Y = a + b1X1 + b2X2 + … + bnXn, where Y is the dependent variable, X1, X2, etc. are the independent variables, a is the intercept, and b1, b2, etc. are the slopes.
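For more than one independent variable, the coefficients can be found by solving the normal equations. The sketch below does this in plain Python with Gaussian elimination so that it has no dependencies; it is meant only to show the mechanics, and for real work a library such as Scikit-Learn’s `LinearRegression` would be the usual choice. The helper names and the toy data (generated exactly from Y = 1 + 2·X1 + 3·X2) are assumptions for the example.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified.
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def fit_multiple_linear_regression(rows, y):
    """Return [a, b1, ..., bn] for Y = a + b1 X1 + ... + bn Xn (least squares)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the intercept
    p = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

if __name__ == "__main__":
    # Toy data generated from Y = 1 + 2*X1 + 3*X2, for illustration only.
    rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
    y = [9, 8, 19, 18, 26]
    a, b1, b2 = fit_multiple_linear_regression(rows, y)
    print(f"a={a:.2f}, b1={b1:.2f}, b2={b2:.2f}")
```

This approach breaks down when the independent variables are perfectly collinear, because the normal equations then have no unique solution.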