
python-Copy1

The document outlines a data analysis process using Python, specifically focusing on a dataset from an Excel file named 'heat_data.xlsx'. It includes steps for checking missing values, visualizing data through histograms and boxplots, calculating descriptive statistics, and building a linear regression model to analyze the relationship between variables. Additionally, it performs a t-test and ANOVA to assess differences in thermal properties between two variables, Y1 and Y2.

Uploaded by

Ky Phong Hoang

May 6, 2022

[1]: import pandas as pd
import numpy as np

[2]: # Part (a): load the dataset
df = pd.read_excel("heat_data.xlsx")

[3]: df.head()

[3]: X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7.0 2 0.0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7.0 3 0.0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7.0 4 0.0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7.0 5 0.0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7.0 2 0.0 0 20.84 28.28

[4]: # Part (b): check for missing values
df.isna().sum()

[4]: X1 0
X2 0
X3 0
X4 0
X5 0
X6 0
X7 0
X8 0
Y1 0
Y2 0
dtype: int64

[5]: df.isnull().sum()

[5]: X1 0
X2 0
X3 0
X4 0
X5 0
X6 0
X7 0
X8 0
Y1 0
Y2 0
dtype: int64
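
In pandas, `isnull` is an alias of `isna`, so cells [4] and [5] report the same counts. A minimal sketch of summarising missing values as both counts and shares follows; the toy frame and the helper name `missing_report` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for heat_data.xlsx
toy = pd.DataFrame({"X1": [0.98, np.nan, 0.90], "Y1": [15.55, 15.55, np.nan]})


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for each column."""
    n_missing = frame.isna().sum()
    return pd.DataFrame({"n_missing": n_missing, "share": n_missing / len(frame)})


report = missing_report(toy)
print(report)
```

A table like this makes it easy to see at a glance whether any column needs imputation or removal before modelling.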

[6]: # No missing values. Identify outliers with a boxplot; the columns that appear to have outliers are X2, X3, X4
df.boxplot(figsize=(15, 10))

# plt.show()

[6]: <matplotlib.axes._subplots.AxesSubplot at 0x26e52a36af0>
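
Reading outliers off a shared-axis boxplot can be misleading when columns sit on very different scales. The quartiles printed by `df.describe()` in this notebook suggest that every column's minimum and maximum fall inside the usual 1.5 × IQR fences, so the visual reading is worth checking numerically. A minimal sketch of that check on made-up data follows; the helper name `iqr_outlier_counts` and the toy frame are assumptions.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()


# Hypothetical toy data: column "b" contains one extreme value
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 60]})
counts = iqr_outlier_counts(toy)
print(counts)
```

Calling `iqr_outlier_counts(df)` on the real data would give a per-column count to set against the boxplot.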

[7]: import matplotlib.pyplot as plt
import seaborn as sb

[8]: # Part (c): visualize the distribution of each column with histograms
df.hist(layout=(5, 4), figsize=(15, 10))

# plt.show()

[8]: (output: 5×4 array of matplotlib AxesSubplot objects; histogram figure not reproduced)

[9]: # Part (c): descriptive statistics
df.describe()

[9]: X1 X2 X3 X4 X5 X6 \
count 768.000000 768.000000 768.000000 768.000000 768.00000 768.000000
mean 0.764167 671.708333 318.500000 176.604167 5.25000 3.500000
std 0.105777 88.086116 43.626481 45.165950 1.75114 1.118763
min 0.620000 514.500000 245.000000 110.250000 3.50000 2.000000
25% 0.682500 606.375000 294.000000 140.875000 3.50000 2.750000
50% 0.750000 673.750000 318.500000 183.750000 5.25000 3.500000
75% 0.830000 741.125000 343.000000 220.500000 7.00000 4.250000
max 0.980000 808.500000 416.500000 220.500000 7.00000 5.000000

X7 X8 Y1 Y2
count 768.000000 768.00000 768.000000 768.000000
mean 0.234375 2.81250 22.307195 24.587760
std 0.133221 1.55096 10.090204 9.513306
min 0.000000 0.00000 6.010000 10.900000
25% 0.100000 1.75000 12.992500 15.620000
50% 0.250000 3.00000 18.950000 22.080000
75% 0.400000 4.00000 31.667500 33.132500
max 0.400000 5.00000 43.100000 48.030000

[10]: # Part (d): compute correlations. X6 and X8 show almost no correlation with Y1 and Y2, so they are left out of the model
corr = df.corr()

# Mask the upper triangle (diagonal included) so each pair is shown once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
(corr
.style
.background_gradient(cmap='coolwarm', axis=None, vmin=-1, vmax=1)
.highlight_null(color='#f1f1f1') # Color NaNs grey (pandas >= 1.5; older releases used null_color=)
.format(precision=2)) # replaces Styler.set_precision, which was removed in pandas 2.0

[10]: <pandas.io.formats.style.Styler at 0x26e5580faf0>
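
The decision to drop X6 and X8 was read off the styled heatmap. A minimal sketch of making the same selection programmatically is shown here on made-up data; the helper name `weak_features`, the 0.1 threshold, and the toy columns are all assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "signal" determines the target, "noise" is unrelated to it
signal = np.arange(8.0)
noise = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0])
toy = pd.DataFrame({"signal": signal, "noise": noise, "Y1": 2.0 * signal + 1.0})


def weak_features(frame, targets, threshold=0.1):
    """List features whose strongest absolute correlation with any target is below threshold."""
    corr = frame.corr()
    features = [c for c in frame.columns if c not in targets]
    strongest = corr.loc[features, targets].abs().max(axis=1)
    return strongest[strongest < threshold].index.tolist()


weak = weak_features(toy, targets=["Y1"])
print(weak)
```

On the real data the call would presumably be `weak_features(df, targets=["Y1", "Y2"])`. Pearson correlation only captures linear association, so a low value does not by itself prove a feature is irrelevant.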

[11]: X = df[["X1","X2","X3","X4","X5","X7"]]
y = df['Y1']

[12]: # Part (d): build a linear model for heat absorption (Y1)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

[12]: LinearRegression()

[13]: # Linear regression equation: intercept and coefficients
lm.intercept_, lm.coef_

[13]: (81.1821953125,
array([-6.24378443e+01, -1.38242706e+12, 1.38242706e+12, 2.76485412e+12,
4.18270683e+00, 2.04058407e+01]))
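
Coefficients on the order of 1e12 with opposite signs are a typical symptom of perfectly collinear predictors. In the rows printed by `df.head()`, X2 equals X3 + 2 × X4 (for example 514.5 = 294.0 + 2 × 110.25), and the column means from `df.describe()` satisfy the same relation. If it holds for every row, the design matrix is rank-deficient and the individual coefficients for X2, X3 and X4 are not uniquely determined. The following is a minimal sketch, on made-up data with that same structure, of detecting the problem and refitting without the redundant column; the toy values and true coefficients are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical toy predictors built so that X2 = X3 + 2 * X4 exactly
x3 = rng.integers(245, 417, size=100).astype(float)
x4 = rng.integers(110, 221, size=100).astype(float)
toy = pd.DataFrame({"X2": x3 + 2 * x4, "X3": x3, "X4": x4})
y = 0.05 * toy["X3"] - 0.1 * toy["X4"] + rng.normal(scale=0.1, size=100)

# A rank below the number of columns signals an exact linear dependency
rank = np.linalg.matrix_rank(toy.to_numpy())
print(rank, "of", toy.shape[1], "columns are linearly independent")

# Dropping the redundant column makes the least-squares solution unique
reduced = toy.drop(columns="X2")
lm_reduced = LinearRegression().fit(reduced, y)
print(dict(zip(reduced.columns, lm_reduced.coef_)))
```

In the notebook's setting, one option would be to refit with X2 removed from `X`; predictions stay the same, but the remaining coefficients become interpretable.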

[14]: # Part (e): two-sample t-test comparing Y1 and Y2
import scipy.stats as stats

[15]: stats.ttest_ind(a=df['Y1'],b=df['Y2'])

[15]: Ttest_indResult(statistic=-4.557390897279331, pvalue=5.589851371275724e-06)

[16]: # p-value ≈ 5.6e-06 (about 0.00056%) < 5%, so the null hypothesis of equal means is rejected: heat absorption (Y1) and heat release (Y2) differ significantly
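
`ttest_ind` treats Y1 and Y2 as independent samples, yet both are measured on the same 768 rows. A paired test on the row-wise differences is a common alternative in that situation. Whether it is the right choice here depends on the study design, so the following is only a sketch on made-up numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical paired measurements: y2 tracks y1 with a small offset
y1 = rng.normal(loc=22.0, scale=10.0, size=50)
y2 = y1 + 2.0 + rng.normal(scale=1.0, size=50)

independent = stats.ttest_ind(y1, y2)  # ignores the pairing between rows
paired = stats.ttest_rel(y1, y2)       # tests whether the mean of y1 - y2 is zero
one_sample = stats.ttest_1samp(y1 - y2, 0.0)  # same test, written on the differences

print(independent.pvalue, paired.pvalue)
```

When the two columns are strongly correlated row by row, the paired test removes the shared between-row variation and is usually far more sensitive than the independent one.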

[17]: # Part (f): one-way ANOVA comparing Y1 and Y2
import statsmodels.api as sm
from statsmodels.formula.api import ols

[18]: df_melt=pd.melt(df.reset_index(), id_vars=['index'], value_vars=['Y1','Y2'])

[19]: model = ols('value ~ C(variable)', data=df_melt).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

[19]:                sum_sq      df          F    PR(>F)
C(variable)     1997.175243     1.0  20.769812  0.000006
Residual      147505.757543  1534.0        NaN       NaN
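
With only two groups, one-way ANOVA and the pooled-variance two-sample t-test are the same test, with F = t². The notebook's numbers agree: (-4.5574)² ≈ 20.77, and both p-values are about 5.6e-06. A minimal sketch of that equivalence on made-up samples, using `scipy.stats.f_oneway` in place of the statsmodels formula API:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical samples standing in for Y1 and Y2
a = rng.normal(loc=22.0, scale=10.0, size=100)
b = rng.normal(loc=25.0, scale=10.0, size=100)

t_res = stats.ttest_ind(a, b)  # pooled variance (equal_var=True is the default)
f_res = stats.f_oneway(a, b)   # one-way ANOVA with two groups

print(t_res.statistic ** 2, f_res.statistic)
print(t_res.pvalue, f_res.pvalue)
```

Part (f) therefore confirms part (e) rather than adding independent evidence. ANOVA only gives something new once there are three or more groups.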
