1. General Statistics#
1.1. Introduction#
Statistics forms the backbone of data analysis, offering tools and models to make sense of data. It encompasses collecting, analyzing, interpreting, presenting, and organizing data. Two primary branches of statistics are Descriptive Statistics and Inferential Statistics. In this notebook, we will explore various statistical concepts and techniques, elucidating their importance and application in real-world scenarios.
1.2. Descriptive Statistics#
1.2.1. Theory#
Descriptive Statistics involves summarizing and organizing data so it can be easily understood. It provides simple summaries about the sample and its measures. These statistics use graphical and numerical tools to describe characteristics of data.
Key measures in Descriptive Statistics include:
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Variance, Standard Deviation
Data Visualization Tools: Histograms, Bar Charts, Box Plots
These tools and measures allow us to understand the central values and variability in a dataset.
1.2.2. Examples#
Let’s create Python examples to calculate these descriptive statistics using a sample dataset.
import numpy as np
import matplotlib.pyplot as plt
# Sample data
data = [12, 15, 12, 10, 12, 18, 14, 16, 12, 19, 16, 22, 13, 15, 22, 24, 10, 10]
# Measures of Central Tendency
mean = np.mean(data)
median = np.median(data)
mode = max(set(data), key=data.count) # Simplest mode calculation
# Measures of Dispersion
range_ = np.ptp(data)
variance = np.var(data)
std_dev = np.std(data)
print(f'Mean: {mean}, Median: {median}, Mode: {mode}')
print(f'Range: {range_}, Variance: {variance}, Standard Deviation: {std_dev}')
# Data Visualization
plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram of Sample Data')
plt.xlabel('Data Value')
plt.ylabel('Frequency')
plt.show()
1.3. Inferential Statistics#
1.3.1. Theory#
Inferential Statistics allows us to make inferences and predictions about a population based on a sample of data. This branch of statistics includes estimation of parameters and hypothesis testing. It extends beyond the immediate data alone by using it to reach conclusions about populations from sample data.
Key concepts:
Population and Sample: A population includes all elements from a set of data. A sample is a subset of the population.
Confidence Intervals: Range of values that estimate a parameter with a certain level of confidence.
Probability Distributions: Normal Distribution, t-Distribution, etc.
1.3.2. Examples#
Below is a Python example that uses a sample to infer characteristics about a population.
import scipy.stats as stats
# Sample Data
sample_data = np.random.normal(loc=50, scale=5, size=30)
# Confidence Interval for the mean
confidence_level = 0.95
degrees_freedom = len(sample_data) - 1
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)
confidence_interval = stats.t.interval(confidence_level, degrees_freedom, sample_mean, sample_std/np.sqrt(len(sample_data)))
print(f'Confidence Interval: {confidence_interval}')
1.4. Hypothesis Testing#
1.4.1. Theory#
Hypothesis Testing is a statistical method that uses sample data to evaluate a hypothesis about a population parameter. It assesses two competing statements: the null hypothesis ((H_0)) and the alternative hypothesis ((H_1)).
Key steps involved in hypothesis testing:
Formulate the null and alternative hypotheses.
Choose the significance level ((\alpha)).
Calculate the test statistic.
Determine the p-value or critical values.
Make a decision: Reject or Do Not Reject (H_0).
1.4.2. Examples#
A Python implementation of a hypothesis test can be considered here using a t-test.
from scipy.stats import ttest_1samp
# Hypothesis Test for sample mean
population_mean = 50
t_statistic, p_value = ttest_1samp(sample_data, population_mean)
print(f'T-statistic: {t_statistic}, P-value: {p_value}')
1.5. Analysis of Variance (ANOVA)#
1.5.1. Theory#
ANOVA is a statistical method used to compare the means of three or more samples to understand if at least one sample mean differs significantly. It is an extension of the t-test for multiple groups.
Key components include:
Within-group variance: Variability within each group.
Between-group variance: Variability between the groups.
Reject (H_0) if the between-group variance is significantly larger than the within-group variance.
1.5.2. Examples#
Let’s examine an example using ANOVA in Python.
from scipy.stats import f_oneway
# Sample Data
group1 = [22, 25, 20, 23, 21]
group2 = [25, 30, 28, 27, 26]
group3 = [19, 16, 18, 20, 17]
anova_result = f_oneway(group1, group2, group3)
print(f'ANOVA result: F-statistic = {anova_result.statistic}, P-value = {anova_result.pvalue}')
1.6. Correlation and Regression#
1.6.1. Theory#
Correlation and Regression are statistical methods used to describe the nature and strength of relationships between variables.
Correlation: Measures the degree to which two variables move in relation to each other. Pearson�s correlation coefficient ((r)) is commonly used.
Regression: Helps in understanding the relationship between a dependent variable and one or more independent variables. Linear regression is the simplest form.
1.6.2. Examples#
The following Python code demonstrates correlation and linear regression analysis.
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
# Sample Data
data_x = np.random.rand(100)
data_y = 2 * data_x + np.random.normal(size=100)
# Correlation
correlation_coefficient = np.corrcoef(data_x, data_y)[0, 1]
print(f'Correlation Coefficient: {correlation_coefficient}')
# Regression
df = pd.DataFrame({'x': data_x, 'y': data_y})
sns.regplot(x='x', y='y', data=df)
plt.title('Regression Plot')
model = LinearRegression()
model.fit(df[['x']], df['y'])
print(f'Slope: {model.coef_[0]}, Intercept: {model.intercept_}')
In conclusion, the above sections form the key foundations of statistics, equipping us with the tools to analyze and draw conclusions from data effectively. Practical examples employing Python support the theoretical foundations, enhancing understanding and application in diverse fields.