Oct 11th

Project 2 involves the thorough examination of two distinct datasets: “fatal-police-shootings-data” and “fatal-police-shootings-agencies,” each serving specific analytical purposes.

The first dataset, “fatal-police-shootings-data,” encompasses 8770 rows and 19 columns, covering the time span from January 2, 2015, to October 7, 2023. It’s important to note the presence of missing values in critical columns such as threat category, flee status, and location information. Despite these data gaps, this dataset offers a wealth of information regarding fatal police shootings, including details on threat levels, types of weapons involved, demographic information, and more.

Conversely, the “fatal-police-shootings-agencies” dataset comprises 3322 rows and six columns. Like the first dataset, it also contains missing data points, particularly in the “oricodes” column. This dataset describes the law enforcement agencies themselves, including their names, IDs, types, and locations, along with the fatal shooting incidents in which each agency’s officers were involved.
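Here is a minimal sketch of how I might survey those gaps with pandas, assuming the two files are local CSVs named after the datasets (the exact filenames and column layouts are assumptions):

```python
import pandas as pd

# Load both datasets (filenames assumed to match the dataset names)
shootings = pd.read_csv("fatal-police-shootings-data.csv")
agencies = pd.read_csv("fatal-police-shootings-agencies.csv")

# Basic shape check: expect roughly 8770 x 19 and 3322 x 6
print(shootings.shape, agencies.shape)

# Count missing values per column to locate the gaps noted above
print(shootings.isna().sum().sort_values(ascending=False))
print(agencies.isna().sum())
```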

To make the most of these datasets, it’s crucial to consider the context and formulate specific questions aligned with the analytical objectives. Doing so will allow for a more meaningful exploration of the law enforcement organizations, the intricate relationships between variables, and the incidents of fatal police shootings. These datasets offer a valuable opportunity to investigate and gain deeper insights into these topics.

Oct 9th

Various factors influence the significance of variables in data analysis and machine learning. The importance of a variable is closely linked to its specific role in a particular context. Some variables exert substantial influence, while others play more minor roles. Identifying the most relevant variables often requires a combination of domain knowledge and techniques like correlation analysis.

Collinearity, where variables are interrelated, can make it challenging to discern their true importance. Therefore, it’s crucial to carefully select variables to ensure a clearer interpretation of models. Exploratory data analysis is vital for gaining a deeper understanding of variable relationships and significance.
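As a small illustration of spotting collinearity during exploratory analysis, a pairwise correlation matrix on made-up data flags predictor pairs that move together:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 * 0.9 + rng.normal(scale=0.3, size=n)  # deliberately collinear with x1
x3 = rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# High off-diagonal values (e.g. x1 vs x2) signal collinearity worth addressing
print(df.corr().round(2))
```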

Different machine learning models either explicitly indicate feature importance or assign varying weights to them. Expertise in the relevant field can uncover critical variables that may not be immediately evident from the data alone. Managing outliers is essential to prevent them from distorting assessments of variable importance.
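For example, tree ensembles in scikit-learn report importances directly; here is a sketch on synthetic data where the true influence of each feature is known:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
# Target depends strongly on feature 0, weakly on feature 1, not at all on feature 2
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.feature_importances_)  # impurity-based importance per feature
```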

The way variables are processed, whether through encoding or normalization methods, can also impact their perceived significance and overall model performance. In some models, a variable’s importance may depend on its interactions with other variables. Ultimately, the most important variables are those that align with the primary goal of the model, whether it’s understanding causality or enhancing prediction accuracy.
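A hedged sketch of that kind of preprocessing in scikit-learn, standardizing numeric columns and one-hot encoding a categorical one (all column names here are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with invented columns, just to show the mechanics
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [30_000, 90_000, 52_000, 75_000],
    "region": ["north", "south", "south", "west"],
    "target": [1.2, 3.4, 2.1, 2.9],
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),  # put numeric features on one scale
    ("cat", OneHotEncoder(), ["region"]),          # encode categories as indicator columns
])
model = Pipeline([("pre", pre), ("reg", LinearRegression())])
model.fit(df[["age", "income", "region"]], df["target"])
```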

Oct 3rd

Last week, I employed the R-squared metric for cross-validation, which helps estimate how much of the variation in the dependent variable can be predicted from the predictors. Today, I delved into analyzing my models using various scoring measures and took some time to understand their distinctions. Notably, when no scoring metric is passed to cross_val_score, it falls back to the estimator’s own score method, which for regressors is R-squared; passing scoring="neg_mean_squared_error" makes it return the negative Mean Squared Error (MSE) for each fold, with the sign flipped so that higher scores are always better. It’s important to note that MSE is highly sensitive to outliers.

Additionally, I familiarized myself with the Mean Absolute Error (MAE) measure, which is more appropriate when all errors should be treated with equal importance and weight. This metric provides a different perspective on model performance compared to MSE and is particularly useful in scenarios where outliers can significantly impact results.
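Here is a minimal sketch comparing these scorers on synthetic data; note that scikit-learn negates the error metrics so that higher is always better:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
model = LinearRegression()

for scoring in ["r2", "neg_mean_squared_error", "neg_mean_absolute_error"]:
    scores = cross_val_score(model, X, y, cv=5, scoring=scoring)
    # Error metrics come back negated, so higher is uniformly better
    print(f"{scoring}: mean = {scores.mean():.3f}")
```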

We discussed the project report; here is what has happened so far.

We’ve completed the visualization aspect of our research. These visualizations depict the connections between the predictor variables and the outcome variable using the test data. Here, obesity and inactivity serve as the predictor variables, while diabetes is the outcome variable. We’ve created plots to represent the relationships between obesity and diabetes, as well as inactivity and diabetes. Furthermore, I’ve determined the R^2 and mean squared error (MSE) values. An R^2 value of 0.395 suggests that approximately 39.5% of the variability in the diabetes data can be explained by the predictor variables. R^2 values can range between 0 and 1, with values closer to 1 implying a better model fit. Meanwhile, the reported score of -0.400063 is the negated MSE that scikit-learn returns (errors are sign-flipped so that higher scores are better), so the MSE itself is about 0.400; it measures the average squared difference between the estimated and observed values. A lower MSE typically indicates a better model, although its scale depends on the units of the outcome variable.
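A sketch of how those test-set numbers could be reproduced, using stand-in data and hypothetical column names (OBESITY, INACTIVITY, DIABETES) in place of the real file:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in for the real county-level data; column names are hypothetical
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"OBESITY": rng.normal(20, 3, n), "INACTIVITY": rng.normal(17, 2, n)})
df["DIABETES"] = 0.2 * df["OBESITY"] + 0.15 * df["INACTIVITY"] + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    df[["OBESITY", "INACTIVITY"]], df["DIABETES"], random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("R^2:", r2_score(y_test, pred))            # share of variance explained on test data
print("MSE:", mean_squared_error(y_test, pred))  # average squared prediction error
```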

 

Sept 27th

We discussed cross-validation and k-fold cross-validation.

In k-fold cross-validation, the dataset is divided into ‘k’ subsets. For each iteration, the model is trained on ‘k-1’ subsets and tested on the remaining one. This is repeated ‘k’ times, each time with a different subset as the test set. For example, in a 5-fold cross-validation, the data is partitioned into 5 parts, training on 4 and testing on the fifth. This cycle is done 5 times. The model’s performance is then averaged across all rounds, giving a more comprehensive assessment of its generalization ability.
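A minimal sketch of that loop with scikit-learn’s KFold on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=2, noise=5.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on 4 folds, test on the held-out fifth
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print("per-fold R^2:", np.round(scores, 3))
print("mean R^2:", np.mean(scores).round(3))  # averaged performance across rounds
```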

Sept 25th

My opinion on today’s class.

Resampling methods involve creating new samples from the original data, which are invaluable for inference, model evaluation, and estimating statistics. One such method is bootstrapping, where we generate multiple samples by randomly selecting data points with replacement – often used to estimate population metrics or assess the uncertainty of a statistic, especially with limited data. Cross-validation, another resampling strategy, is common in machine learning. It divides the dataset into subsets, using them iteratively for training and testing, helping gauge model generalization and identify issues like overfitting. Estimating prediction error is crucial to understanding how our model may perform with new, unseen data, and several methods can be employed based on our data and goals.
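To make the bootstrap concrete, here is a small sketch that resamples made-up data with replacement to estimate the uncertainty of a sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=40)  # small sample, as in the limited-data case

# Draw many resamples with replacement and record each resample's mean
boot_means = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(5000)]

# A percentile interval approximates the uncertainty of the sample mean
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {data.mean():.2f}, 95% bootstrap CI: ({low:.2f}, {high:.2f})")
```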

Sept 22nd

P-value:

The p-value, short for probability value, is a statistical measure that helps evaluate the significance of a particular finding in a statistical analysis. It quantifies the evidence against a null hypothesis, which often assumes that there is no effect or connection in the data being examined. A low p-value, typically below 0.05, indicates statistical significance and provides strong evidence against the null hypothesis. Conversely, a high p-value indicates weak evidence against the null hypothesis, meaning the result is not statistically significant.
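As a small illustration, a two-sample t-test with SciPy on invented samples returns a p-value that is read exactly this way:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.8, scale=1.0, size=30)  # shifted mean, so a real effect exists

# Null hypothesis: the two groups share the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 -> reject the null here
```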

R-squared:

In the context of regression analysis, the R-squared statistic is employed to assess how well a model fits the given data. It quantifies the proportion of variance in the dependent variable, the variable being predicted, that can be attributed to the independent variables or predictor variables within the model. Higher R-squared values indicate a better fit, and they range from 0 to 1. An R-squared value of 1 signifies that the model perfectly explains all the variance in the data, while a value of 0 indicates that the model cannot account for any variation in the data. R-squared is used to measure the goodness of fit of a model to observed data, although it may not always accurately indicate the model’s predictive ability for future data.
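Concretely, R-squared is one minus the ratio of residual variance to total variance; this sketch checks scikit-learn’s r2_score against that formula on a toy example:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)            # unexplained (residual) variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
print(1 - ss_res / ss_tot)                         # manual formula
print(r2_score(y_true, y_pred))                    # same value from scikit-learn
```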