
# [ Statistical Analysis with SciPy ]

Importing and Setup

● Import SciPy stats module: from scipy import stats
● Import NumPy for array operations: import numpy as np
● Set random seed for reproducibility: np.random.seed(42)

Descriptive Statistics

● Mean: np.mean(data)
● Median: np.median(data)
● Mode: stats.mode(data)
● Variance: np.var(data)
● Standard deviation: np.std(data)
● Range: np.ptp(data)
● Interquartile range: stats.iqr(data)
● Skewness: stats.skew(data)
● Kurtosis: stats.kurtosis(data)
● Coefficient of variation: stats.variation(data)
● Geometric mean: stats.gmean(data)
● Harmonic mean: stats.hmean(data)
● Trimmed mean: stats.trim_mean(data, 0.1)
● Percentile: np.percentile(data, 75)
● Quantile: np.quantile(data, [0.25, 0.5, 0.75])
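
A minimal sketch pulling several of the calls above together on one synthetic sample (the array data below is made up for illustration):

    import numpy as np
    from scipy import stats

    np.random.seed(42)
    data = np.random.normal(loc=50, scale=10, size=200)  # synthetic sample

    print(np.mean(data), np.median(data), np.std(data, ddof=1))
    print(stats.skew(data), stats.kurtosis(data))
    print(stats.iqr(data), np.percentile(data, 75))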

Probability Distributions

● Normal distribution PDF: stats.norm.pdf(x, loc=0, scale=1)
● Normal distribution CDF: stats.norm.cdf(x, loc=0, scale=1)
● Normal distribution inverse CDF: stats.norm.ppf(q, loc=0, scale=1)
● Generate normal random numbers: stats.norm.rvs(loc=0, scale=1, size=1000)
● Uniform distribution PDF: stats.uniform.pdf(x, loc=0, scale=1)
● Uniform distribution CDF: stats.uniform.cdf(x, loc=0, scale=1)
● Generate uniform random numbers: stats.uniform.rvs(loc=0, scale=1, size=1000)
● Exponential distribution PDF: stats.expon.pdf(x, scale=1)
● Exponential distribution CDF: stats.expon.cdf(x, scale=1)
● Generate exponential random numbers: stats.expon.rvs(scale=1, size=1000)
● Poisson distribution PMF: stats.poisson.pmf(k, mu=1)
● Poisson distribution CDF: stats.poisson.cdf(k, mu=1)
● Generate Poisson random numbers: stats.poisson.rvs(mu=1, size=1000)
● Binomial distribution PMF: stats.binom.pmf(k, n, p)
● Binomial distribution CDF: stats.binom.cdf(k, n, p)
● Generate binomial random numbers: stats.binom.rvs(n, p, size=1000)
● Chi-square distribution PDF: stats.chi2.pdf(x, df)
● Chi-square distribution CDF: stats.chi2.cdf(x, df)
● Generate chi-square random numbers: stats.chi2.rvs(df, size=1000)
● Student's t-distribution PDF: stats.t.pdf(x, df)
● Student's t-distribution CDF: stats.t.cdf(x, df)
● Generate Student's t random numbers: stats.t.rvs(df, size=1000)
● F-distribution PDF: stats.f.pdf(x, dfn, dfd)
● F-distribution CDF: stats.f.cdf(x, dfn, dfd)
● Generate F random numbers: stats.f.rvs(dfn, dfd, size=1000)
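
A short hedged illustration of the distribution API pattern above (pdf/cdf/ppf/rvs share the same loc/scale conventions); the sample is synthetic:

    import numpy as np
    from scipy import stats

    x = np.linspace(-3, 3, 7)
    print(stats.norm.pdf(x))                          # density at x
    print(stats.norm.cdf(1.96))                       # ~0.975
    print(stats.norm.ppf(0.975))                      # ~1.96 (inverse CDF)
    sample = stats.t.rvs(df=5, size=1000, random_state=42)
    df_hat, loc_hat, scale_hat = stats.t.fit(sample)  # fit parameters to data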

Hypothesis Testing

● One-sample t-test: stats.ttest_1samp(data, popmean)
● Independent two-sample t-test: stats.ttest_ind(data1, data2)
● Paired t-test: stats.ttest_rel(data1, data2)
● One-way ANOVA: stats.f_oneway(data1, data2, data3)
● Two-way ANOVA (not in SciPy; via statsmodels): from statsmodels.formula.api import ols; from statsmodels.stats.anova import anova_lm; anova_lm(ols('y ~ C(factor1) * C(factor2)', data=data).fit())
● Chi-square goodness of fit test: stats.chisquare(observed, expected)
● Chi-square test of independence: stats.chi2_contingency(contingency_table)
● Shapiro-Wilk test for normality: stats.shapiro(data)
● Anderson-Darling test for normality: stats.anderson(data)
● Kolmogorov-Smirnov test: stats.kstest(data, 'norm')
● Mann-Whitney U test: stats.mannwhitneyu(data1, data2)
● Wilcoxon signed-rank test: stats.wilcoxon(data1, data2)
● Kruskal-Wallis H-test: stats.kruskal(data1, data2, data3)
● Friedman test: stats.friedmanchisquare(data1, data2, data3)
● Levene's test for equality of variances: stats.levene(data1, data2)
● Bartlett's test for equality of variances: stats.bartlett(data1, data2)
● Fligner-Killeen test for equality of variances: stats.fligner(data1, data2)
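
A minimal hedged example of the testing pattern above (synthetic samples; each test returns a statistic and a p-value):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    data1 = rng.normal(10.0, 2.0, size=50)
    data2 = rng.normal(11.0, 2.0, size=50)

    t_stat, p_value = stats.ttest_ind(data1, data2)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # reject H0 of equal means if p < alpha
    print(stats.levene(data1, data2))                # check the equal-variance assumption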

Correlation and Regression

● Pearson correlation coefficient: stats.pearsonr(x, y)
● Spearman rank correlation: stats.spearmanr(x, y)
● Kendall's tau: stats.kendalltau(x, y)
● Simple linear regression: stats.linregress(x, y)
● Multiple linear regression (linregress handles only one predictor): import statsmodels.api as sm; sm.OLS(y, sm.add_constant(X)).fit()
● Polynomial regression: np.polyfit(x, y, deg=2)
● R-squared (coefficient of determination): 1 - (np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2))
● Adjusted R-squared (n observations, k predictors): 1 - ((1 - r_squared) * (n - 1) / (n - k - 1))
● F-statistic for the overall regression (same n and k): (r_squared / k) / ((1 - r_squared) / (n - k - 1))
● Durbin-Watson statistic: from statsmodels.stats.stattools import durbin_watson; durbin_watson(residuals)
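
A short sketch of simple linear regression with stats.linregress (synthetic data with an assumed linear relation):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = np.arange(50, dtype=float)
    y = 2.0 * x + 1.0 + rng.normal(0, 3, size=50)

    result = stats.linregress(x, y)
    print(result.slope, result.intercept, result.rvalue**2)  # fit and R-squared
    y_pred = result.slope * x + result.intercept              # fitted values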

Non-parametric Methods

● Kernel density estimation: stats.gaussian_kde(data)
● Bootstrap confidence interval for a statistic: stats.bootstrap((data,), np.mean, n_resamples=1000)
● Jackknife resampling (no SciPy built-in): leave-one-out estimates, e.g. np.array([np.mean(np.delete(data, i)) for i in range(len(data))])
● Permutation test: stats.permutation_test((data1, data2), lambda a, b: np.mean(a) - np.mean(b), n_resamples=1000)
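
A hedged sketch of a bootstrap confidence interval for the mean (requires SciPy >= 1.7; the data is synthetic):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.exponential(scale=2.0, size=200)

    res = stats.bootstrap((data,), np.mean, n_resamples=1000,
                          confidence_level=0.95, random_state=rng)
    print(res.confidence_interval)   # low/high bounds for the mean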

Multivariate Analysis

● Principal Component Analysis: from sklearn.decomposition import PCA; PCA().fit_transform(X)
● Canonical correlation analysis: from sklearn.cross_decomposition import CCA; CCA().fit(X, Y).transform(X, Y)
● MANOVA: from statsmodels.multivariate.manova import MANOVA; MANOVA.from_formula('y1 + y2 ~ group', data=data).mv_test()
● Hotelling's T-squared test (no SciPy built-in): compute from the group means and pooled covariance, or use e.g. pingouin.multivariate_ttest(X1, X2)

Time Series Analysis

● Autocorrelation: from statsmodels.tsa.stattools import acf; acf(data)
● Partial autocorrelation: from statsmodels.tsa.stattools import pacf; pacf(data)
● Augmented Dickey-Fuller test: from statsmodels.tsa.stattools import adfuller; adfuller(data)
● KPSS test: from statsmodels.tsa.stattools import kpss; kpss(data)
● Granger causality test: from statsmodels.tsa.stattools import grangercausalitytests; grangercausalitytests(data, maxlag=5)

Bayesian Statistics

● Bayesian credible intervals for mean, variance, and std: stats.bayes_mvs(data, alpha=0.95) (note: SciPy has no Bayes factor function)
● Bayesian Information Criterion (BIC): k * np.log(n) - 2 * log_likelihood (no stats.bic; fitted statsmodels results expose .bic)
● Akaike Information Criterion (AIC): 2 * k - 2 * log_likelihood (no stats.aic; fitted statsmodels results expose .aic)

Sampling and Experimental Design

● Simple random sample: np.random.choice(population, size=n, replace=False)
● Stratified sample: from sklearn.model_selection import StratifiedShuffleSplit; StratifiedShuffleSplit(n_splits=1, test_size=0.3).split(X, y)
● Cluster sample: from sklearn.cluster import KMeans; KMeans(n_clusters=k).fit_predict(X)
● Systematic sample: population[::k]
● Latin square design (no SciPy built-in): a cyclic n x n Latin square is (np.arange(n)[:, None] + np.arange(n)) % n; for Latin hypercube sampling see stats.qmc.LatinHypercube(d)

Power Analysis

● Power of t-test (via statsmodels, not SciPy): from statsmodels.stats.power import TTestIndPower; TTestIndPower().solve_power(effect_size=0.5, nobs1=100, alpha=0.05, ratio=1.0, alternative='two-sided')
● Power of one-way ANOVA: from statsmodels.stats.power import FTestAnovaPower; FTestAnovaPower().solve_power(effect_size=0.25, nobs=30, alpha=0.05, k_groups=3)
● Sample size calculation for t-test: TTestIndPower().solve_power(effect_size=0.5, power=0.8, alpha=0.05, ratio=1.0, alternative='two-sided')
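
A hedged sketch using statsmodels' power classes (SciPy itself has no power-analysis helpers); the effect size and alpha are illustrative:

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    power = analysis.solve_power(effect_size=0.5, nobs1=100, alpha=0.05,
                                 ratio=1.0, alternative='two-sided')
    n_needed = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05,
                                    ratio=1.0, alternative='two-sided')
    print(f"power at n=100 per group: {power:.3f}")
    print(f"n per group for 80% power: {n_needed:.1f}")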

Reliability Analysis

● Cronbach's alpha (no SciPy/statsmodels built-in): alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), or use pingouin.cronbach_alpha(data=df); see the sketch after this list
● Intraclass correlation coefficient: from pingouin import intraclass_corr; intraclass_corr(data=df, targets='subject', raters='rater', ratings='score')
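
A minimal hedged sketch of Cronbach's alpha computed directly with NumPy (rows are respondents, columns are items; the matrix is made up):

    import numpy as np

    items = np.array([[4, 5, 4], [3, 4, 3], [5, 5, 4],
                      [2, 3, 2], [4, 4, 5]], dtype=float)  # respondents x items

    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()        # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)          # variance of total scores
    alpha = k / (k - 1) * (1 - item_vars / total_var)
    print(round(alpha, 3))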

Effect Size Calculations

● Cohen's d: (np.mean(group1) - np.mean(group2)) / np.sqrt((np.std(group1, ddof=1)**2 + np.std(group2, ddof=1)**2) / 2)
● Eta-squared: ss_effect / (ss_effect + ss_error)
● Odds ratio (2x2 table with cells a, b / c, d): (a * d) / (b * c)
● Risk ratio (same 2x2 layout): (a / (a + b)) / (c / (c + d))
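
A short hedged example computing Cohen's d for two synthetic groups with the pooled-SD formula above:

    import numpy as np

    rng = np.random.default_rng(7)
    group1 = rng.normal(10.0, 2.0, size=40)
    group2 = rng.normal(11.0, 2.0, size=40)

    pooled_sd = np.sqrt((np.std(group1, ddof=1)**2 + np.std(group2, ddof=1)**2) / 2)
    cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_sd
    print(round(cohens_d, 3))   # roughly -0.5 for these simulated means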

Data Transformation

● Z-score normalization: stats.zscore(data)
● Min-max scaling: (data - np.min(data)) / (np.max(data) - np.min(data))
● Box-Cox transformation (requires strictly positive data; returns transformed values and lambda): stats.boxcox(data)
● Yeo-Johnson transformation: stats.yeojohnson(data)
● Logarithmic transformation: np.log1p(data)

Outlier Detection

● Z-score method: np.abs(stats.zscore(data)) > 3
● Interquartile range (IQR) method: (data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)
● Modified Z-score method: np.abs(0.6745 * (data - np.median(data)) / stats.median_abs_deviation(data)) > 3.5
● Grubbs' test (no SciPy built-in): available in third-party packages, e.g. the outlier_utils package's smirnov_grubbs module
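
A hedged sketch of the IQR rule on a synthetic sample, defining Q1, Q3, and IQR explicitly:

    import numpy as np

    rng = np.random.default_rng(3)
    data = np.concatenate([rng.normal(0, 1, 100), [8.0, -9.0]])  # two planted outliers

    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1                                    # same value as stats.iqr(data)
    outliers = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    print(data[outliers])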

Confidence Intervals

● Normal distribution CI: stats.norm.interval(0.95, loc=np.mean(data), scale=stats.sem(data))
● T-distribution CI: stats.t.interval(0.95, df=len(data)-1, loc=np.mean(data), scale=stats.sem(data))
● Binomial proportion CI (via statsmodels): from statsmodels.stats.proportion import proportion_confint; proportion_confint(count=successes, nobs=n_trials, alpha=0.05)
● Poisson CI: stats.poisson.interval(0.95, mu=np.mean(data))
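
A minimal hedged example of a 95% t-based confidence interval for a mean (synthetic data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    data = rng.normal(100.0, 15.0, size=60)

    ci_low, ci_high = stats.t.interval(0.95, df=len(data) - 1,
                                       loc=np.mean(data), scale=stats.sem(data))
    print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")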

Survival Analysis

● Kaplan-Meier estimator: from lifelines import KaplanMeierFitter; KaplanMeierFitter().fit(durations, event_observed)
● Cox proportional hazards model: from lifelines import CoxPHFitter; CoxPHFitter().fit(df, duration_col='T', event_col='E')
● Log-rank test: from lifelines.statistics import logrank_test; logrank_test(durations_1, durations_2, event_observed_1, event_observed_2)

Spatial Statistics

● Moran's I: from pysal.explore import esda; esda.Moran(y, w).I
● Geary's C: from pysal.explore import esda; esda.Geary(y, w).C
● Getis-Ord G: from pysal.explore import esda; esda.G(y, w).G

Multivariate Normality Tests

● Mardia's test (no SciPy/statsmodels built-in): implement from multivariate skewness and kurtosis
● Henze-Zirkler test: from pingouin import multivariate_normality; multivariate_normality(data, alpha=0.05)

Robust Statistics

● Median absolute deviation: stats.median_abs_deviation(data)
● Huber's M-estimator: from statsmodels.robust import scale; scale.huber(data)
● Theil-Sen estimator: from scipy.stats import theilslopes; theilslopes(y, x)

Factor Analysis

● Exploratory Factor Analysis: from factor_analyzer import FactorAnalyzer; FactorAnalyzer().fit(data)
● Confirmatory Factor Analysis (not in statsmodels): available in third-party packages, e.g. factor_analyzer's ConfirmatoryFactorAnalyzer or semopy

Cluster Analysis

● K-means clustering: from sklearn.cluster import KMeans; KMeans(n_clusters=k).fit(X)
● Hierarchical clustering: from scipy.cluster.hierarchy import linkage; linkage(X, method='ward')
● DBSCAN clustering: from sklearn.cluster import DBSCAN; DBSCAN().fit(X)

Time Series Decomposition

● Seasonal decomposition: from statsmodels.tsa.seasonal import seasonal_decompose; seasonal_decompose(data, model='additive') (pass period= for plain arrays without a datetime index)

Statistical Process Control

● Control chart (X-bar chart): no SciPy/statsmodels built-in; compute the center line as the grand mean and control limits as mean ± 3 * sigma / np.sqrt(n) per subgroup (see the sketch below)
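
A hedged sketch of X-bar control limits computed directly with NumPy (subgroup data is synthetic; a simplified overall sigma estimate is used for the 3-sigma limits):

    import numpy as np

    rng = np.random.default_rng(11)
    subgroups = rng.normal(50.0, 2.0, size=(25, 5))   # 25 subgroups of size 5

    xbar = subgroups.mean(axis=1)                     # subgroup means
    grand_mean = xbar.mean()
    sigma = subgroups.std(ddof=1)                     # simplified estimate of process sigma
    n = subgroups.shape[1]

    ucl = grand_mean + 3 * sigma / np.sqrt(n)         # upper control limit
    lcl = grand_mean - 3 * sigma / np.sqrt(n)         # lower control limit
    out_of_control = (xbar > ucl) | (xbar < lcl)
    print(grand_mean, lcl, ucl, np.where(out_of_control)[0])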

Meta-Analysis

● Fixed effects meta-analysis: from statsmodels.stats.meta_analysis import combine_effects; combine_effects(effects, variances).summary_frame()
● Random effects meta-analysis: combine_effects(effects, variances, method_re='dl').summary_frame() (the summary frame reports both fixed- and random-effects estimates)

Structural Equation Modeling

● Path analysis (statsmodels has no SEM module): via the third-party semopy package, e.g. from semopy import Model; m = Model('y ~ x1 + x2'); m.fit(data)

Item Response Theory

● 1PL (Rasch) model: P(correct) = 1 / (1 + np.exp(-(ability - difficulty))); estimation is not in SciPy (third-party IRT packages such as girth or py-irt provide fitting routines)
● 2PL model: P(correct) = 1 / (1 + np.exp(-discrimination * (ability - difficulty)))

Multilevel Modeling

● Random intercept model: from statsmodels.regression.mixed_linear_model import MixedLM; MixedLM.from_formula('y ~ x', groups='group', data=data).fit()

Statistical Quality Control

● Capability analysis (Cpk, no statsmodels built-in): min(usl - np.mean(data), np.mean(data) - lsl) / (3 * np.std(data, ddof=1))
● Process capability index (Cp): (usl - lsl) / (6 * np.std(data, ddof=1))
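
A minimal hedged sketch of Cp and Cpk computed with NumPy (the data and specification limits lsl/usl are illustrative; assumes an approximately normal process):

    import numpy as np

    rng = np.random.default_rng(8)
    data = rng.normal(10.0, 0.2, size=500)   # measured characteristic
    lsl, usl = 9.0, 11.0                     # specification limits (made up)

    sigma = np.std(data, ddof=1)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min(usl - np.mean(data), np.mean(data) - lsl) / (3 * sigma)
    print(round(cp, 2), round(cpk, 2))
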
Nonlinear Regression

● Curve fitting: from scipy.optimize import curve_fit; curve_fit(lambda x, a, b: a * np.exp(b * x), x_data, y_data)

Statistical Tests for Circular Data

● Rayleigh test for circular uniformity (scipy.stats.rayleigh is the Rayleigh distribution, not this test): available in circular-statistics packages, e.g. pingouin.circ_rayleigh(angles)
● Watson's U2 test: no SciPy or astropy implementation; provided by specialized circular-statistics packages

Extreme Value Analysis

● Generalized extreme value distribution fit: from scipy.stats import genextreme; genextreme.fit(data)
● Peak over threshold analysis: from scipy.stats import genpareto; genpareto.fit(data[data > threshold])

Functional Data Analysis

● Functional principal component analysis: FPCA from the scikit-fda package, e.g. from skfda.preprocessing.dim_reduction import FPCA; FPCA().fit_transform(data) (module path varies across skfda versions)

Statistical Learning Theory

● Support Vector Machine: from sklearn.svm import SVC; SVC().fit(X, y)
● Cross-validation: from sklearn.model_selection import cross_val_score; cross_val_score(model, X, y, cv=5)

Copulas

● Gaussian copula: from copulas.multivariate import GaussianMultivariate; GaussianMultivariate().fit(data) (third-party copulas package; scipy.stats.multivariate_normal.cdf gives only the multivariate normal CDF after transforming marginals to normal scores)
● Clayton copula: from copulas.bivariate import Clayton; Clayton().fit(data) (also from the copulas package)

Stochastic Processes

● Brownian motion simulation: np.cumsum(np.random.normal(0, 1, size=1000))
● Ornstein-Uhlenbeck process: odeint only integrates the deterministic drift dy/dt = theta * (mu - y); the stochastic process is usually simulated with an Euler-Maruyama loop, x[t+1] = x[t] + theta * (mu - x[t]) * dt + sigma * np.sqrt(dt) * noise (see the sketch below)
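
A hedged sketch of Euler-Maruyama simulation of an Ornstein-Uhlenbeck process (theta, mu, sigma, dt, and the starting value are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(42)
    theta, mu, sigma = 1.5, 0.0, 0.3     # mean-reversion speed, long-run mean, volatility
    dt, n_steps = 0.01, 1000

    x = np.empty(n_steps)
    x[0] = 1.0                           # starting value
    for t in range(n_steps - 1):
        drift = theta * (mu - x[t]) * dt
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal()
        x[t + 1] = x[t] + drift + diffusion
    print(x[-1], x.mean())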

Causal Inference

● Propensity score estimation: from sklearn.linear_model import LogisticRegression; LogisticRegression().fit(X, treatment).predict_proba(X)[:, 1] (matching on the scores is a separate step)
● Difference-in-differences estimation: np.mean(post_treatment - pre_treatment) - np.mean(post_control - pre_control)

Spatial Point Pattern Analysis

● Ripley's K function: from astropy.stats import RipleysKEstimator; RipleysKEstimator(area=area).evaluate(data, radii)

Statistical Network Analysis

● Erdős-Rényi random graph model: from networkx.generators.random_graphs import erdos_renyi_graph; erdos_renyi_graph(n, p)

By: Waleed Mousa
