Report_Group 5D_The Professor Proposes
Analyzing Diamond Value: A Case Study on Market Comparison and
Decision-Making in Diamond Purchasing
(Quantitative Techniques – II)
Submitted To:
Dr. Pritha Guha
Submitted By:
Group Name: 5D
Krishna Kumar Swaika B24187
Kunal Nitin Gupta B24188
Debanshu Poddar BL24012
Mahi Sachdeva BL24013
Sairam Dabbiru BL24015
Souvik Dey FB24004
INTRODUCTION
(A) PROBLEM STATEMENT
In the case study "The Professor Proposes," a professor who has never shopped for diamonds sets out to buy an engagement ring and becomes immersed in the complex diamond selection process. He underestimates the difficulty of the task and quickly finds himself overwhelmed by the considerations involved: the "Four Cs" (cut, colour, carat, and clarity) along with polish, symmetry, and certification standards. Equipped with price information from diamond wholesalers, the professor wants to assess whether the price quoted for a particular diamond is fair.
(B) OBJECTIVES AND METHODOLOGY
This report applies exploratory data analysis (EDA) and regression-based statistical methods to study the factors that drive diamond prices. All analysis was carried out in R. The data cover diamonds described by colour, cut, carat weight, clarity, polish, symmetry, and certification. We begin with EDA to understand how the categorical and numerical predictors relate to one another and, individually, to price; the results are visualised as histograms, box plots, and a heat map of the correlation matrix. We then apply multiple regression analysis as well as a logit test to build prediction models, and evaluate their performance using R-squared values, residual errors, and accuracy metrics. We compare the models using the Akaike Information Criterion (AIC) and assess the relevance of each parameter through an ANOVA test. Finally, the best-fitting equation is used to calculate a predicted price, giving a verdict on whether the quoted price aligns with market standards. Through this approach, we aim to demonstrate the effectiveness of statistical modelling in decoding complex pricing structures and providing actionable insights.
(C) MOTIVATION
The diamond industry is an intricate market in which numerous factors interact to
determine pricing. This complexity provides an excellent opportunity for the application of statistical methods
aimed at analysing and interpreting the relationships among such attributes as carat weight, colour, clarity, and
cut with their influence on the valuation of diamonds. The practical significance of this study lies in its ability
to bridge theoretical knowledge with real-world applications. Besides the luxury and emotional value,
diamonds symbolize a significant part of the global economy. Thus, this analysis will clarify consumer
behaviour and pricing strategies. Conducting this study will improve our analytical skills, contribute to data-
driven insights, and create a better appreciation of how statistical tools can be used to decode the dynamics of
a high-value industry.
This research is not just about finding the right diamond for Professor Davis; it's an exploration of the decision-
making process in a complex and often opaque market. It underscores the significance of informed choices
backed by data and analysis, aiming to demystify the process of valuing and purchasing a diamond, and
providing a template for others facing similar decisions in the diamond market.
DATA DESCRIPTION
The dataset contains 440 observations, each described by seven input variables and one response variable. The response variable is diamond price. The seven inputs comprise one numerical variable, carat, and six categorical variables: colour, cut, clarity, polish, symmetry, and certification. The data come from three wholesalers; because the wholesaler identity does not appear informative for this analysis, we ignore it.
Each attribute captures a unique characteristic of the diamond:
Attribute: Description
Colour: Rated from colourless to yellow; very rare hues are more valuable. While largely a matter of personal taste, colour makes little difference in price.
Cut: The proportions of the diamond, which significantly affect its brilliance and reflective quality. A good cut adds value and a poor cut decreases it.
Clarity: Indicates the extent to which the diamond contains inclusions or imperfections. Higher clarity grades mean fewer defects and, therefore, higher values.
Certification: Ensures quality standards and provides a grading assessment by recognized laboratories such as GIA, AGS, and EGL.
Carat: A numerical measure of size. Large diamonds are rare, so their price is high.
Polish: The quality and smoothness of the diamond's surface. Graded from poor to ideal, polish determines how well the diamond reflects light and so affects its overall appearance.
Symmetry: The alignment and proportions of the diamond's facets. Good symmetry gives the best reflection of light and the visual balance needed for brilliance.
Price: The monetary value assigned to each diamond, determined by the interplay of the above factors and reflecting its market worth.
For regression analysis, the categorical variables are numerically encoded, and the fit of the model is evaluated
by checking the significance and impact of each variable on the price.
Below is the table with all the specifications of the diamond ring being considered: price, carat weight, cut,
colour, clarity, polish, symmetry, and certification, which are used as the basis for judging its market value.
Attribute Value
Price $3,100
Carat Weight 0.9
Cut Very Good
Colour J
Clarity SI2
Polish Good
Symmetry Very Good
Certification GIA
EXPLORATORY DATA ANALYSIS
The Pareto chart shows that SI1 and SI2 clarity grades dominate, contributing to over 50% cumulatively, while higher grades like VVS2 and VVS1 are rare, representing a small portion of the dataset.

Colours I, J, and H dominate the dataset, making a very high contribution to the cumulative percentage. Less common colours, such as K and L, have a low percentage, meaning they appear less frequently in the market.

The majority of the dataset is dominated by diamonds in the 0–0.5 and 1–2 carat bins. The smaller frequencies for diamonds in the 0.5–1 carat range reflect their relatively lower prevalence.

The grades of "X" and "V" appear dominant, while the grades "F" and "G" appear to be relatively rare.

The GIA Certification is the most common grading service. EGL is next in line, while IGI, AGS, and DOW represent only a few.
Price vs Count
The histogram has a bimodal distribution, with two peaks. One is around
the low price, near 1000, suggesting many items in that range of low
cost. The second peak, near 3000, suggests a focus on the higher end.
There is a pretty big gap between these two groups with very little in the
mid-range of 1000 to 2000. This may indicate that there are not many
options at this price range or perhaps a strategy that hits two markets:
affordable and premium. The concentration of low-priced items may
indicate higher demand or supply at that range. Research into the mid-
price segment could fill this gap and open up growth.
The scatter plot shows how diamond prices are distributed across cut classes. Higher-quality cuts, such as Premium and Ideal, show a wider price spread and generally higher prices than lower-quality cuts such as Fair and Good. This suggests that cut can strongly influence price, consistent with its effect on a diamond's brilliance and market value. The Very Good category overlaps partially with Premium and shows a moderate range of prices. The analysis thus highlights cut quality as one of the critical determinants of diamond pricing and reinforces its significance in valuation models.
The scatter plot shows the price distribution of diamonds across colour categories from D to J. Diamonds closer to the colourless end, such as D, tend to have a wider price range than those in the more tinted categories, such as I and J. This reflects the market's preference for diamonds that appear closer to colourless, which are considered more desirable. However, there is a large overlap in price distributions between the categories, indicating that variables such as cut, clarity, and carat interact with colour significantly. The analysis shows that colour is an important attribute affecting diamond pricing, but it is only meaningful for valuation when considered along with other variables.
The scatter plot shows the price distribution of the diamonds across certification categories, including GIA, AGS, IGI, EGL, and HRD. Diamonds certified by GIA and AGS, which are regarded as more stringent and reputable, tend to have higher average prices. Diamonds certified by EGL and HRD, on the other hand, spread over a broader and often lower price range, which may reflect less stringent grading standards. The overlap between certification categories indicates that, although certification influences diamond prices heavily, factors such as carat, cut, and clarity are also important. The analysis emphasizes the significance of certification in shaping consumer trust and perceived value in the diamond market.
Scatter Plot
The scatter plot shows the relationship between Carat and Price, with a colour gradient indicating Carat values. The two variables are positively correlated, as the red trend line shows: higher Carat values are associated with higher prices. Data points cluster at lower Carat and Price values and are more spread out at higher values. This suggests that higher-carat diamonds are rarer and therefore costlier, while lower-carat diamonds are more readily available and hence cheaper. Overall, price rises roughly linearly with carat weight.
Correlation Matrix
The Cramér's V matrix illustrates the strength of association between the categorical variables in the diamond dataset. Cramér's V ranges from 0 to 1; here a value above 0.3 is treated as a strong association, and the closer the value is to 1, the stronger the relationship. From the matrix, Clarity and Cut are strongly associated, suggesting that a diamond's clarity grade is related to the classification of its cut. Likewise, Cut is strongly associated with Certification and Symmetry. Colour, along with Certification, shows comparatively low associations, indicating a relatively independent influence on a diamond's features. This analysis helps in prioritising variables according to their significance and interrelation when building predictive models or analysing market trends.
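As an illustration of how a single Cramér's V value is obtained, the following is a minimal Python sketch (the report itself used R). It computes the statistic from a contingency table of counts using the standard definition, the square root of chi-square divided by n times (min(rows, columns) − 1). The two example tables are hypothetical and are not taken from the diamond dataset; the sketch also assumes every row and column has at least one observation.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical counts for illustration only (NOT the diamond data)
independent = [[20, 30], [40, 60]]   # rows are proportional: no association
perfect = [[50, 0], [0, 50]]         # each row maps to exactly one column

v_independent = cramers_v(independent)
v_perfect = cramers_v(perfect)
print(v_independent, v_perfect)
```

A table with proportional rows gives a value of 0 and a perfectly aligned table gives 1, which brackets the 0.3 threshold used above.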
The table below shows the summary of Diamond features using descriptive statistics min, median, mean, and
max of both carat and price, along with the frequency distributions on categorical features such as colour,
clarity, cut, certification, polish, and symmetry.
Statistic Carat Colour Clarity Cut Certification Polish Symmetry Price
Min. 0.0900 I: 79 SI1: 116 F: 59 AGS: 12 F: 5 F: 21 160
REGRESSION MODEL
We will now conduct the regression modelling to analyse the relationship between the price of diamonds
(dependent variable) and various independent variables, including Carat, Colour, Clarity, Cut, Certification,
Polish, Symmetry, and Wholesaler. We employ two approaches: (a) a multiple regression model without interaction, and (b) a multiple regression model with interactions between predictors to capture the combined effects of variables. To identify the best-performing model under each approach, we first select the set of predictors that gives the best (lowest) AIC.
The resulting optimal models are then analysed in greater detail. They reveal the significance and relative importance of each variable, along with the predictive accuracy of the models. This approach not only quantifies the factors influencing diamond pricing but also enables the prediction of a fair price for the diamond selected by Professor Davis.
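To make the AIC-based comparison concrete, here is a small Python sketch (the report's own analysis was done in R). It assumes the least-squares form of AIC that is commonly used in stepwise selection, n·ln(SSE/n) + 2k, which is AIC up to an additive constant; the report does not state which exact form was used, so this is an assumption. The numbers in the comparison are hypothetical.

```python
import math

def aic_least_squares(sse, n, n_params):
    """AIC (up to an additive constant) for a Gaussian linear model:
    n * ln(SSE / n) + 2 * n_params. Lower is better."""
    return n * math.log(sse / n) + 2 * n_params

# Hypothetical example: adding one parameter reduces SSE only slightly
base = aic_least_squares(sse=1000.0, n=100, n_params=3)
bigger = aic_least_squares(sse=990.0, n=100, n_params=4)

# The larger model is preferred only if its AIC is lower
prefer_bigger = bigger < base
print(round(base, 2), round(bigger, 2), prefer_bigger)
```

In this hypothetical case the small drop in SSE does not offset the penalty for the extra parameter, so the smaller model is kept. This is the trade-off that AIC-based selection applies when deciding which predictors to retain.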
Model Building
Multiple linear regression is a statistical technique that models the linear relationship between multiple
independent (explanatory) variables and a dependent (response) variable. This approach allows for a more
comprehensive analysis of how various factors collectively influence an outcome, making it well-suited for
complex scenarios like diamond pricing where multiple attributes impact the final value.
ANOVA table –
Source | DF | Sum of Squares (SS) | Mean Square (MS) | F-Statistic (F)
Regression | DF_Regression = 25 | SSR = 589036599 (component terms: 519687583, 10605862, 53866928, 5592684, 836542) | MSR = 589036599 / 25 = 23561463.96 | F_obs = MSR / MSE = 23561463.96 / 39169.37 ≈ 601.78
Residuals | DF_Residuals = 414 | SSE = 16215922 | MSE = 16215922 / 414 ≈ 39169.37 | -
Total | DF_Total = 439 | SST = SSR + SSE = 589036599 + 16215922 = 605252521 | - | -
The ANOVA shows that the regression model is highly significant, with an F-statistic of about 601.78 and a corresponding p-value < 2.2e-16: the predictors included in the model explain a significant portion of the variability in diamond prices. The Sum of Squares for Regression (SSR) is 589036599, while the Sum of Squares for Residuals (SSE) is only 16215922, so the predictors explain most of the variance and little remains unexplained. The Mean Square Regression (MSR), 23561463.96, is far larger than the Mean Square Error (MSE), 39169.37. The high F-statistic indicates that the variance explained by the predictors is much larger than the variance attributable to random error, which affirms the overall effectiveness of the regression model.
Therefore, the regression model captures the relationship between the predictors (Carat, Colour, Clarity, Certification, and Symmetry) and diamond prices well, making it a useful tool for predicting price variability based on these factors.
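The quantities in the ANOVA table can be recomputed from the reported sums of squares and degrees of freedom. The short Python sketch below does this (the report used R). Recomputing in this way gives values very close to, but not exactly equal to, the reported F-statistic and Adjusted R-squared, presumably because the reported figures were taken from the fitted model output.

```python
# Reported sums of squares and degrees of freedom for the model without interactions
ssr, sse = 589036599, 16215922
df_reg, df_res = 25, 414

sst = ssr + sse                      # total sum of squares
msr = ssr / df_reg                   # mean square regression
mse = sse / df_res                   # mean square error
f_obs = msr / mse                    # overall F-statistic (about 601.5)
r2 = ssr / sst                       # share of variance explained
adj_r2 = 1 - mse / (sst / (df_reg + df_res))   # adjusted for number of predictors

print(f"SST = {sst}, MSR = {msr:.2f}, MSE = {mse:.2f}")
print(f"F = {f_obs:.2f}, R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```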
The Residual Analysis is as below -
The scatter plot shows the relationship between the Actual
Prices of diamonds and their Predicted Prices from the
regression model. Most data points lie closely along a
linear trend, indicating that the model is very good at
predicting diamond prices. The clustering of points at
higher price ranges suggests that the model is particularly
good for more expensive diamonds. However, small
deviations from the line indicate residual variance, thus
indicating where improvement is possible, such as refining
the variable selection or incorporating more interaction
terms. Overall, the model does a good job of capturing the
pricing patterns and provides reliable predictions, so it can
be used for diamond valuation analysis.
The Residuals vs. Fitted Values plot is used to assess the assumptions of linear regression. Residuals scattered around the horizontal line at zero show how far the observed values lie from the fitted values. There is some clustering and a mild pattern in the spread, particularly for fitted values at both extremes, but nothing strongly systematic, so the model appears to meet the linearity assumption reasonably well. However, the slight funnel shape may imply heteroscedasticity, with residual variance increasing with the fitted values. Overall, the model performs well, but further refinement may be needed to make the residual spread more uniform.
The plot indicates a scattering of residuals around the zero point, indicating a good capture of the relationship between carat and price. However, slight clustering may indicate non-linearity or unexplained variance requiring further model refinement.

The plot suggests that most residuals cluster around zero, indicating a relatively good fit. However, some outliers and slight patterns suggest minor deviations from the assumptions of homoscedasticity and linearity.

The plot reveals that the residuals are largely clustered around zero, indicating a good fit. However, a slight pattern and nonuniform spread may indicate some potential heteroscedasticity, which could necessitate further analysis or model adjustments.
The Boxplot of Standardized Residuals presents the distribution of residuals. This distribution is centered at zero, indicating that, in general, the model predictions agree with the observed values. Most of the residuals lie close to the central box; a few fall beyond ±2, and some beyond ±4, which may indicate small deviations from the model assumptions. The spread is symmetrical around the median, supporting the assumption that the residuals are normally distributed. Overall, the fit is good, although addressing outliers would strengthen it even more.

The Normal Q-Q Plot tests the normality of residuals by comparing their distribution to a theoretical normal distribution. Most of the points lie very close to the diagonal line, indicating that the residuals are approximately normally distributed. However, slight deviations at the tails indicate the presence of outliers or non-normality in the extreme values. This alignment supports the validity of the normality assumption for most of the data. Overall, the fit of the model is quite good, but some minor deviations call for further analysis.
The plot of Standardized Residuals vs. Fitted Values
after Log Transformation shows residuals spread
around zero with a much more uniform spread than in
the untransformed model. This means that log
transformation has helped reduce heteroscedasticity
since the residuals now vary more uniformly across the
fitted values. However, slight deviations and a few
outliers are still there, especially at the higher end of
fitted values. Overall, the log transformation has
improved the model's fit and addressed variance
inconsistencies, enhancing its predictive reliability.
BREUSCH-GODFREY TEST -
The Breusch-Godfrey test checks for autocorrelation in the residuals. The larger the test statistic, the stronger the evidence that autocorrelation exists.
LM test = 25.433, df = 1, p-value = 4.581e-07
The p-value is very small, so we reject the null hypothesis of no autocorrelation.
This means there is significant autocorrelation in the residuals of our model, which violates the assumption that the residuals are independent.
BREUSCH-PAGAN TEST -
The Breusch-Pagan test checks for heteroscedasticity (non-constant variance of residuals). The larger the test statistic, the stronger the evidence for heteroscedasticity.
BP = 108.43, df = 25, p-value = 2.3e-12
The p-value is extremely small, so we reject the null hypothesis of constant variance. This indicates that there is significant heteroscedasticity in the residuals of our model, meaning that the variance of the residuals is not constant across the values of the independent variables.
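Both tests refer their statistic to a chi-square distribution with the stated degrees of freedom. As a check on the reported p-values, the following standard-library Python sketch (not the R code used in the report) computes the upper-tail chi-square probability for integer degrees of freedom using the standard recurrence on the regularized incomplete gamma function. The function name chi2_sf is chosen here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1, x > 0."""
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-half), 2              # closed form for df = 2
    while k < df:
        # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1))
        k += 2
    return q

p_bg = chi2_sf(25.433, 1)    # Breusch-Godfrey: LM = 25.433, df = 1
p_bp = chi2_sf(108.43, 25)   # Breusch-Pagan: BP = 108.43, df = 25
print(f"Breusch-Godfrey p = {p_bg:.3e}, Breusch-Pagan p = {p_bp:.1e}")
```

Both values fall far below the 5% significance level, in line with the conclusions stated above.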
FITTED REGRESSION EQUATION -
Price = Intercept + (Carat Coefficient × Carat) + Colour Coefficient + Clarity Coefficient + Certification Coefficient + Symmetry Coefficient
Coefficients Used:
o Intercept: −1259.285
o Carat (0.9): 4207.182 × 0.9 = 3786.464
o Colour (J): −630.739
o Clarity (SI2): 741.915
o Certification (GIA): −56.466
o Symmetry (Very Good - V): 219.537
Price = −1259.285 + 3786.464 − 630.739 + 741.915 − 56.466 + 219.537
Price = 2801.426
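The calculation above can be reproduced with a few lines of Python (a sketch for illustration; the report's analysis was done in R). The coefficients are the ones listed in the report; the dictionary keys and variable names are introduced here only for readability. The sketch also expresses the gap between the quoted price of 3100 and the prediction as a percentage of the quote.

```python
# Coefficients reported for the model without interactions
coefficients = {
    "intercept": -1259.285,
    "carat_slope": 4207.182,        # multiplied by carat weight
    "colour_J": -630.739,
    "clarity_SI2": 741.915,
    "certification_GIA": -56.466,
    "symmetry_V": 219.537,
}

carat = 0.9
quoted_price = 3100.0

# Dummy-variable terms add their coefficient once; carat enters via its slope
predicted = (
    coefficients["intercept"]
    + coefficients["carat_slope"] * carat
    + coefficients["colour_J"]
    + coefficients["clarity_SI2"]
    + coefficients["certification_GIA"]
    + coefficients["symmetry_V"]
)

gap = quoted_price - predicted
gap_pct = 100 * gap / quoted_price
print(f"Predicted: {predicted:.3f}, quote exceeds prediction by {gap:.2f} ({gap_pct:.1f}%)")
```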
CONCLUDING OBSERVATION -
For the given characteristics of the diamond, our model predicts a price of 2801.426, compared with the quoted price of 3100; the quote is therefore about 299 (roughly 9.6%) above the model's estimate. The prediction comes from the refined model, in which the predictors were selected using AIC rather than retaining every available variable. That it lands reasonably close to the quote illustrates the practical applicability of the model refinement process.
The multiple regression model without interaction terms provides insight into the factors that drive diamond prices, with a very good fit (Adjusted R-squared of 0.9717). Carat, Colour, and Clarity emerged as the most important predictors, confirming their crucial role in explaining price variation.
The model balances simplicity against explanatory power, giving robust predictions without unnecessary complexity. It provides a good basis for further refinement and exploration to better capture the relationships between predictors.
Model 2: Regression Model with Interaction Terms –
The regression analysis focused on a subset of key predictors, namely Carat, Colour, Clarity, Certification,
and Symmetry, selected using the Akaike Information Criterion. This approach balanced simplicity and
predictive power, avoiding overfitting while maintaining explanatory accuracy. The model had an Adjusted
R-squared of 0.9717, indicating high explanatory power for diamond price variability.
Although the model offers these insights, it assumes that the predictors act additively. An interaction analysis is needed to examine potential synergies or combined effects between variables; this may capture the more complex dynamics that influence diamond prices and so improve both interpretability and predictive accuracy.
The interaction model was developed to capture the more complex relationships that influence diamond prices. All possible interactions among the key predictors (Carat, Colour, Clarity, Certification, and Symmetry) were considered initially. However, this comprehensiveness introduced many insignificant terms and made the model considerably more complex.
A refined interaction model was therefore built by systematically eliminating irrelevant terms, so that only the meaningful interactions and main effects are retained. The refinement is performed by a refine function that automatically removes predictors with high p-values, that is, terms for which the null hypothesis of a zero coefficient cannot be rejected. The refined model balances complexity against interpretability while retaining the critical interactions that explain much of the price variation, and so serves as a sound basis for further analysis and decision-making.
Because the full model output is large, only the terms relevant to the professor's diamond, namely those involving Carat, Colour (J), Clarity (SI2), Certification (GIA), and Symmetry (Very Good - V), are included in the prediction. Interaction terms for other levels contribute zero for this diamond and need not be considered.
The regression equation includes the following interaction terms:
Interaction | Estimate | Std. Error | t-value | p-value
Carat:ColourJ | -894.90 | 232.20 | -3.854 | 0.000141 ***
Carat:ClaritySI2 | 920.26 | 156.95 | 5.863 | 1.15e-08 ***
Regression Equation:
Price = β0 + β1(Carat) + β2(ColourJ) + β3(ClaritySI2) + β4(CertificationGIA) + β5(SymmetryV) + β6(Carat:ColourJ) + β7(Carat:ClaritySI2) + ε
Where,
o β0 is the intercept representing the baseline price when all predictors are zero.
o β1 is the effect of Carat on Price, holding Colour, Clarity, Certification, and Symmetry constant.
o β2 is the effect of Colour on Price, holding Carat, Clarity, Certification, and Symmetry constant.
o β3 is the effect of Clarity on Price, holding Carat, Colour, Certification, and Symmetry constant.
o β4 is the effect of Certification on Price, holding Carat, Colour, Clarity, and Symmetry constant.
o β5 is the effect of Symmetry on Price, holding Carat, Colour, Clarity, and Certification constant.
o β6 is the effect of the interaction between Carat and Colour.
o β7 is the effect of the interaction between Carat and Clarity.
This model accounts for both the main effects and the combined effects of the interaction terms to better
capture the complexity of the relationship between predictors and the diamond's price.
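For the professor's diamond (colour J, clarity SI2), both interaction dummies equal 1, so the carat slope in the equation becomes β1 + β6 + β7. The Python sketch below (illustrative only; the report used R) checks the reported t-values as estimate divided by standard error and computes the net contribution of the two interaction terms at 0.9 carat. The main-effect coefficients of the interaction model are not listed in the report, so a full prediction is not reproduced here.

```python
# Reported interaction estimates and standard errors: (estimate, std_error)
interactions = {
    "Carat:ColourJ": (-894.90, 232.20),
    "Carat:ClaritySI2": (920.26, 156.95),
}

# t-value = estimate / standard error
t_values = {name: est / se for name, (est, se) in interactions.items()}

carat = 0.9
# With both dummies equal to 1, the interactions shift the carat slope by beta6 + beta7
slope_shift = sum(est for est, _ in interactions.values())
net_contribution = slope_shift * carat

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
print(f"Slope shift = {slope_shift:.2f}, contribution at {carat} carat = {net_contribution:.3f}")
```

The two interaction effects nearly cancel for this particular diamond: the colour J penalty per carat is offset by the SI2 term, leaving a small net shift in the carat slope.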
Hypothesis Testing for Significance of All Predictors –
We now test, at the 5% level of significance, whether the predictors are significant.
Null hypothesis H0: β1 = β2 = β3 = β4 = β5 = β6 = β7 = 0
Alternative hypothesis H1: at least one βi is not equal to zero
ANOVA Testing –
Source of Variation | Degrees of Freedom | Sum of Squares (SS) | Mean Square (MS) | F-statistic
Carat (Factor A) | 1 | 519687583 | SSA/Df = 519687583/1 = 519687583 | MSA/MSE = 519687583/14106 = 36840.75
Colour (Factor B) | 8 | 10605862 | SSB/Df = 10605862/8 = 1325733 | MSB/MSE = 1325733/14106 = 93.98
Clarity (Factor C) | 8 | 53866928 | SSC/Df = 53866928/8 = 6733366 | MSC/MSE = 6733366/14106 = 477.33
Carat:Colour Interaction | 8 | 2173451 | SSAB/Df = 2173451/8 = 271681 | MSAB/MSE = 271681/14106 = 19.26
Carat:Clarity Interaction | 7 | 3846658 | SSAC/Df = 3846658/7 = 549523 | MSAC/MSE = 549523/14106 = 38.96
Residuals/Errors | 314 | 4429387 | SSE/Df = 4429387/314 = 14106 | N/A
Total | 346 | 586764048 | N/A | N/A
Based on the F-statistics and their respective p-values:
➢ Carat, Colour, and Clarity have highly significant associations with the Price variable.
➢ The Carat:Colour and Carat:Clarity interactions are also significant, indicating that the effect of Carat on Price depends on the levels of Colour and Clarity.
We therefore reject the null hypothesis, which indicates the significance of the predictors and interactions.
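The mean squares and F-statistics in the ANOVA table follow directly from the sums of squares and degrees of freedom. The Python sketch below recomputes them from the reported figures (a sketch for illustration; the report used R). It uses the unrounded mean square error, which is why the results match the table even though the table shows MSE rounded to 14106.

```python
# Reported (degrees of freedom, sum of squares) for each term of the interaction model
terms = {
    "Carat": (1, 519687583),
    "Colour": (8, 10605862),
    "Clarity": (8, 53866928),
    "Carat:Colour": (8, 2173451),
    "Carat:Clarity": (7, 3846658),
}
df_res, sse = 314, 4429387

mse = sse / df_res   # mean square error, about 14106.33

# F-statistic for each term: its mean square divided by the mean square error
f_stats = {name: (ss / df) / mse for name, (df, ss) in terms.items()}

for name, f in f_stats.items():
    print(f"{name}: F = {f:.2f}")
```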
The Residual Analysis is as below -
The plot of Actual vs. Predicted Prices shows the points lying close to the diagonal line. This indicates that, for most data points, the model predicts diamond prices fairly accurately from the independent variables. In the Residuals vs. Fitted Values plot, the residuals are scattered around zero, which suggests no systematic bias in the model. However, slight clustering and uneven spread, especially at higher fitted values, might indicate heteroscedasticity or model limitations. In general, the model works well but needs further refinement to deal with minor residual patterns.
The Residuals vs. Carat plot shows residuals scattered around zero, indicating that the model captures the relationship between carat and price reasonably well. However, slight clustering suggests potential non-linearity or variance inconsistency, which may require further investigation or model adjustment.

The Standardized Residuals vs. Fitted Values plot reveals residuals that scatter around zero. This indicates that the model is reasonably good. However, slight clustering and spread at higher fitted values suggest a possibility of heteroscedasticity that may call for adjustments to improve consistency in predictions.

The Standardized Residuals vs. Carat plot shows residuals distributed around zero, suggesting the model captures the relationship between carat and price adequately. However, some clustering and variability in residuals indicate potential non-linearity or heteroscedasticity, which may need further refinement in the model.
FUTURE SCOPE
The analysis carried out gives good support, yet there are several areas that would require improvement and
further research.
• Refining interaction analysis: This study considered a limited set of interaction terms. Examining additional interactions between variables such as clarity, colour, and symmetry may reveal more intricate relationships that determine the price of diamonds. This could enhance predictions and provide greater insights into market behaviour.
• Advanced Modelling Techniques: The use of nonlinear models, such as polynomial regression or
machine learning approaches like decision trees and random forests, may help better capture the non-
linear relationships in the data. Such techniques could improve the generalization ability of the model
across different scenarios.
• Market Segmentation Studies: Expanding the analysis to understand the variation of diamond prices
with respect to different customer segments, geographic locations, and economic conditions would
give a macro view. This would help in bringing the pricing strategy in line with the market dynamics.
• Heteroscedasticity Corrections: The Breusch-Pagan test revealed heteroscedasticity in some models.
Weighted least squares or log or Box-Cox transformation could be used to correct the problem for
better estimation of parameters.