Report_Group 5D_The Professor Proposes

CASE – THE PROFESSOR PROPOSES
Analyzing Diamond Value: A Case Study on Market Comparison and Decision-Making in Diamond Purchasing
(Quantitative Techniques – II)
Submitted To:
Dr. Pritha Guha
Submitted By:
Group Name: 5D
Krishna Kumar Swaika B24187
Kunal Nitin Gupta B24188
Debanshu Poddar BL24012
Mahi Sachdeva BL24013
Sairam Dabbiru BL24015
Souvik Dey FB24004
INTRODUCTION
(A) PROBLEM STATEMENT
In "The Professor Proposes," a case study, the professor, having never gone diamond shopping, becomes
immersed in the complex diamond selection process in order to buy an engagement ring. He underestimates
the difficulty of the task and very quickly finds himself overwhelmed by all the considerations involved: the
"Four Cs" (cut, colour, carat, and clarity), polish, symmetry, and certification standards. Equipped with
information from diamond wholesalers, the professor wants to assess whether the price quoted for a particular
diamond is fair.
(B) OBJECTIVES AND METHODOLOGY
This report applies exploratory data analysis (EDA) and regression-based statistical methods to study the
factors that drive diamond prices. All analysis was carried out in R. The data cover a varied set of diamonds
described by colour, cut, carat weight, clarity, polish, symmetry, and certification. We begin with EDA to
understand how the categorical and numerical predictors relate to one another and, individually, to price,
using histograms, box plots, and a heat map of the correlation matrix. We then apply multiple regression
analysis as well as a logit test to build prediction models, and evaluate their performance using R-squared
values, residual errors, and accuracy metrics. We compare the models using the Akaike Information Criterion
(AIC) and check the relevance of each predictor through an ANOVA test. Finally, the best-fitting equation is
used to calculate the predicted price, giving a verdict on whether the quoted price aligns with market
standards. Through this approach, we aim to demonstrate the effectiveness of statistical modelling in decoding
complex pricing structures and providing actionable insights.
(C) MOTIVATION
The diamond industry is an intricate and fascinating market in which numerous factors interact to
determine pricing. This complexity provides an excellent opportunity for the application of statistical methods
aimed at analysing and interpreting the relationships among such attributes as carat weight, colour, clarity, and
cut with their influence on the valuation of diamonds. The practical significance of this study lies in its ability
to bridge theoretical knowledge with real-world applications. Besides the luxury and emotional value,
diamonds symbolize a significant part of the global economy. Thus, this analysis will clarify consumer
behaviour and pricing strategies. Conducting this study will improve our analytical skills, contribute to data-
driven insights, and create a better appreciation of how statistical tools can be used to decode the dynamics of
a high-value industry.
This research is not just about finding the right diamond for Professor Davis; it's an exploration of the decision-
making process in a complex and often opaque market. It underscores the significance of informed choices
backed by data and analysis, aiming to demystify the process of valuing and purchasing a diamond, and
providing a template for others facing similar decisions in the diamond market.
DATA DESCRIPTION
The dataset contains 440 observations described by seven input variables and one response variable. The
response variable is diamond price. The seven input variables comprise one numerical variable, carat, and six
categorical variables: colour, cut, clarity, polish, symmetry, and certification. The data come from three
wholesalers, but the wholesaler variable does not appear informative for this case, so we ignore it.
Each attribute captures a unique characteristic of the diamond:
Attribute      Description
Colour         Rated from colourless to yellow; very rare hues are more valuable. While largely a
               matter of personal taste, colour makes little difference in price.
Cut            The proportions of the diamond, which significantly affect its brilliance and
               reflective quality. A good cut adds value and a poor cut decreases it.
Clarity        Indicates the extent to which the diamond contains inclusions or imperfections.
               Higher clarity grades mean fewer defects and, therefore, higher values.
Certification  Ensures quality standards and provides a grading assessment by recognized
               laboratories like GIA, AGS, and EGL.
Carat          The weight of the diamond, as a number, which measures its size. Large diamonds
               are rare, so their price is high.
Polish         The quality and smoothness of the diamond's surface. Graded from poor to ideal,
               polish refers to how well the diamond can reflect light, affecting its overall
               appearance.
Symmetry       The alignment and proportions of the diamond's facets. Good symmetry gives the
               best reflection of light and visual balance for brilliance.
Price          The monetary value assigned to each diamond, determined by the interplay of the
               above factors and reflecting its market worth.
For regression analysis, the categorical variables are numerically encoded, and the fit of the model is evaluated
by checking the significance and impact of each variable on the price.
Below is the table with all the specifications of the diamond ring being considered: price, carat weight, cut,
colour, clarity, polish, symmetry, and certification, which are used as the basis for judging its market value.
Attribute Value
Price $3,100
Carat Weight 0.9
Cut Very Good
Colour J
Clarity SI2
Polish Good
Symmetry Very Good
Certification GIA
EXPLORATORY DATA ANALYSIS
The Pareto chart shows that SI1 and SI2 clarity grades dominate, contributing to over 50% cumulatively,
while higher grades like VVS2 and VVS1 are rare, representing a small portion of the dataset.

Colours I, J, and H dominate the dataset, making a very high contribution to the cumulative percentage.
Less common colours, such as K and L, have a low percentage, meaning they appear less frequently in the
market.

The majority of the dataset is dominated by diamonds in the 0–0.5 and 1–2 carat bins. The smaller
frequencies for diamonds in the 0.5–1 carat range reflect their relatively lower prevalence.
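The cumulative percentages behind a Pareto chart can be sketched as follows. This is only an illustration: the clarity counts are the ones listed in the report's summary table, the `Other` bucket is an assumption (the remainder needed to reach 440 diamonds), and `pareto_table` is a hypothetical helper name.

```python
# Clarity counts as listed in the report's summary table. "Other" is an
# assumed bucket holding whatever is needed to reach the 440 diamonds.
counts = {"SI1": 116, "SI2": 110, "I1": 82, "VS2": 41, "VS1": 30, "I2": 28}
total = 440
counts["Other"] = total - sum(counts.values())


def pareto_table(counts):
    """Return (label, count, cumulative %) rows sorted by descending count."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    n = sum(counts.values())
    rows, running = [], 0
    for label, count in ordered:
        running += count
        rows.append((label, count, 100.0 * running / n))
    return rows


rows = pareto_table(counts)
for label, count, cum_pct in rows:
    print(f"{label:6s} {count:4d} {cum_pct:6.1f}%")
```

Under these counts, SI1 and SI2 together account for 226 of 440 diamonds, which is consistent with the "over 50%" statement above.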
For cut, the grades "X" and "V" appear dominant, while the grades "F" and "G" are relatively rare.

GIA is the most common certification service. EGL is next in line, while IGI, AGS, and DOW account for
only a few diamonds.

For polish and symmetry, the "V" (Very Good) and "G" (Good) grades are the most dominant for both
attributes and account for most of the data. Other grades such as "X" (Excellent), "F" (Fair), and
"I" (Ideal) are much less frequent in the dataset.
Price vs Count
The histogram of price is bimodal. One peak sits at the low end, near 1000, indicating many diamonds in
the low-cost range; the second peak, near 3000, indicates a concentration at the higher end. Few diamonds
fall in the mid-range of 1000 to 2000. This gap may indicate that few options exist at that price range, or
that sellers target two markets: affordable and premium. The concentration of low-priced items may
reflect higher demand or supply in that range. Research into the mid-price segment could fill this gap
and open up growth.
The scatter plot shows the distribution of diamond prices by cut grade. Higher-quality cuts, such as
Premium and Ideal, show a wider spread of prices and generally higher prices than lower-quality cuts
such as Fair and Good. This suggests that cut can drive price substantially, consistent with its effect
on a diamond's brilliance and market value. The Very Good category overlaps partially with Premium and
shows a moderate range of prices. The analysis thus highlights cut quality as one of the critical
determinants of diamond pricing and reinforces its significance in valuation models.
The scatter plot shows the price distribution of diamonds across colour categories from D to J.
Diamonds closer to the colourless end, such as D, tend to have a wider price range than those in the
more tinted categories, such as I and J. This reflects the market's preference for diamonds that appear
closer to colourless, which are considered more desirable. However, there is a large overlap in price
distributions between the categories, indicating that variables such as cut, clarity, and carat interact
with colour significantly. The analysis shows that colour is an important attribute affecting diamond
pricing, but it only makes sense when considered along with other variables for valuation.
The scatter plot shows the price distribution of the diamonds across certification categories,
including GIA, AGS, IGI, EGL, and HRD. Diamonds certified by GIA and AGS, which are regarded as more
stringent and reputable, tend to have higher average prices. On the other hand, diamonds certified by
EGL and HRD tend to spread over a broader and often lower price range, which may reflect lower grading
standards. The overlap between certification categories indicates that although certification
influences diamond prices, factors such as carat, cut, and clarity also matter. The analysis emphasizes
the significance of certification in shaping consumer trust and perceived value in the diamond market.
The scatter plot shows diamond prices against symmetry, graded in five categories: Excellent, Very
Good, Good, Fair, and Poor. Diamonds graded Excellent or Very Good for symmetry command higher
prices because of the premium attached to superior craftsmanship and aesthetics, whereas those
with Fair or Poor symmetry show wider, lower price distributions and are considered to have lower
market value because of their imperfections. The decline in price as symmetry quality decreases
underlines symmetry as a significant determinant of diamond price. Nevertheless, overlaps between
classes suggest that factors such as carat and clarity interact with symmetry.
The scatter plot shows the interaction between diamond prices and
polish levels categorized as Excellent, Very Good, Good, Fair, and
Poor. Diamonds have higher price ranges when polished to Excellent
and Very Good since the market favours superior polish quality, which
makes a diamond more attractive and brilliant in appearance. On the
other hand, diamonds with Fair and Poor polish have lower price
distributions because of their reduced attractiveness and perceived
imperfections. However, there is some overlap in price ranges across
polish categories, indicating that while polish is important, it interacts
with other factors such as carat and clarity to influence diamond prices.
This again emphasizes the multifactorial nature of diamond valuation.
The 3D scatter plot visualizes the relationship between
Price, Carat, and Clarity of diamonds. It demonstrates
that higher carat weights and better clarity grades, such
as VVS1 and VVS2, tend to correspond to higher
prices, reflecting their superior quality and rarity.
Diamonds with lower clarity grades like SI1 and SI2
generally fall in the lower price range, though carat
weight also influences pricing significantly. The
clustering within clarity levels indicates that while
clarity plays a crucial role, it interacts with carat to
determine pricing. This visualization underscores the
multifaceted dynamics of diamond valuation, where
both intrinsic quality and size jointly drive market
value.
The 3D scatter plot depicts the relationship between
Price, Carat, and Cut of diamonds. Higher-priced
diamonds are typically associated with superior cut
grades like Ideal and Premium, which indicate their
quality and beauty. Poorer cut grades like Fair and
Good tend to cluster in the lower price range
regardless of carat size. In addition, carat weight affects
price across all cuts: larger stones command a higher
price even in the lower cut grades. This visualization
thus highlights how size interacts with cut quality in
valuation, and the role of craftsmanship in
establishing market value.
The 3D scatter plot explores the relationship between
Price, Carat, and Certification. Diamonds certified by
IGI and HRD show a wider range of prices, suggesting
higher market recognition and valuation consistency.
The AGS and EGL certifications are concentrated in the
lower price ranges, which indicates a lesser influence
on premium pricing. Diamonds with higher carat weights
and reputable certifications such as GIA occupy the
upper end of the price spectrum. This visualization
emphasizes the importance of certification quality in
increasing the value of diamonds and their
marketability, especially when accompanied by bigger
carat sizes.
Box plot
The box plot indicates the normalized values of two variables:
Carat and Price. Here, Carat has a wide range, and the median
is close to 0.5, with a very large interquartile range. It also has
a near-zero minimum value, and it goes up to 1. The parameter
Price, however, has a higher median, near 0.75, with a tighter
IQR and data points that are relatively closer to each other. The
range of Price overall also goes from zero to one but is less
variable compared to Carat. This indicates that though Carat
values are more spread out, Price values are more consistent.
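The text does not say which normalization was used for the box plot. Since both variables run from 0 to 1, a minimal sketch assuming min-max scaling is shown below, applied to the minimum, median, and maximum from the summary statistics table; `min_max` is a hypothetical helper, and a different scaling would give different values.

```python
def min_max(value, lo, hi):
    """Rescale value to the [0, 1] range defined by lo and hi."""
    return (value - lo) / (hi - lo)


# Minimum, median, and maximum taken from the report's summary table.
carat = {"min": 0.09, "median": 0.81, "max": 1.58}
price = {"min": 160, "median": 2169, "max": 3145}

# Assumption: the box plot used min-max scaling (not stated in the report).
carat_median_scaled = min_max(carat["median"], carat["min"], carat["max"])
price_median_scaled = min_max(price["median"], price["min"], price["max"])

print(f"scaled carat median: {carat_median_scaled:.3f}")
print(f"scaled price median: {price_median_scaled:.3f}")
```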
Scatter Plot
The scatter plot shows the relationship between Carat and Price,
with colour gradient indicating varying Carat values. The two
variables are positively correlated, and this is reflected in the red
trend line. For higher Carat values, the price is also expected to be
higher. There are clusters of data points, especially for lower Carat
and Price values, but they are more spread out for the higher
values. This means that higher-carat diamonds are very rare and,
therefore, also costlier, while diamonds of lower-carat ratings are
more readily available and hence, cheaper. The general linear trend
shows a related rise in price with the rise in carat weight.
Correlation Matrix
The Cramér's V correlation matrix illustrates the strength of association between the categorical variables in
the diamond dataset. Cramér's V ranges from 0 to 1; we treat values above 0.3 as indicating a strong
association, and the closer the value is to 1, the stronger the relationship. From the matrix, Clarity and Cut
show a high association, suggesting that a diamond's clarity grade is related to the classification of its cut.
Likewise, Cut is strongly associated with Certification and Symmetry, suggesting that cut grade is linked to
certification and to symmetry grading. Colour and Certification show comparatively low associations,
indicating that they vary relatively independently of the other features of a diamond. Because Cramér's V
measures the strength of association and not its direction, these values do not by themselves show which
variable influences which. Such analysis helps in prioritizing variables according to their significance and
relationships when building predictive models or analyzing market trends.
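For reference, Cramér's V for a pair of categorical variables can be computed from the chi-square statistic of their contingency table as V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The sketch below uses made-up toy grades, not the report's data, and `cramers_v` is a hypothetical helper name; it omits the bias correction that some implementations apply.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row = Counter(xs)              # row marginals
    col = Counter(ys)              # column marginals
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data (illustrative only): identical grades give perfect association,
# a balanced crossed design gives none.
toy_cut = ["X", "X", "V", "V", "G", "G"]
v_perfect = cramers_v(toy_cut, toy_cut)
v_none = cramers_v(["a", "a", "b", "b"], ["p", "q", "p", "q"])
print(v_perfect, v_none)
```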
The table below shows the summary of Diamond features using descriptive statistics min, median, mean, and
max of both carat and price, along with the frequency distributions on categorical features such as colour,
clarity, cut, certification, polish, and symmetry.
Statistic     Carat   Colour  Clarity   Cut     Certification  Polish  Symmetry  Price
Min.          0.0900  I: 79   SI1: 116  F: 59   AGS: 12        F: 5    F: 21     160
1st Quartile  0.3000  J: 72   SI2: 110  G: 49   DOW: 1         G: 165  G: 157    520
Median        0.8100  H: 71   I1: 82    I: 86   EGL: 119       I: 5    I: 5      2169
Mean          0.6693  F: 58   VS2: 41   V: 97   GIA: 265       v: 1    V: 206    1717
3rd Quartile  1.0100  E: 54   VS1: 30   X: 149  IGI: 43        V: 203  X: 51     3012
Max.          1.5800  G: 43   I2: 28    -       -              X: 61   -         3145
REGRESSION MODEL
We now conduct regression modelling to analyse the relationship between the price of diamonds
(dependent variable) and the independent variables, including Carat, Colour, Clarity, Cut, Certification,
Polish, Symmetry, and Wholesaler. We employ two approaches: (a) a multiple regression model without
interaction, and (b) a multiple regression model with interaction between predictors to capture the combined
effects of variables. To identify the best-performing models, we first select, for each approach, the set of
predictors that gives the best (lowest) AIC.
We will obtain optimal models that we will proceed to analyse in greater detail. These models will reveal the
significance and relative importance of each variable, along with the predictive accuracy of the models. This
rigorous approach not only will quantify the factors influencing diamond pricing but also enable the prediction
of a fair price for the diamond selected by Professor Davis.
Model Building
Multiple linear regression is a statistical technique that models the linear relationship between multiple
independent (explanatory) variables and a dependent (response) variable. This approach allows for a more
comprehensive analysis of how various factors collectively influence an outcome, making it well-suited for
complex scenarios like diamond pricing where multiple attributes impact the final value.
The modelling steps and assumptions are –

❖ Elimination of Variables Caused by Multicollinearity: Variables that had a high correlation with each
other were removed to prevent multicollinearity. This was done using the Variance Inflation Factor
(VIF) and stepwise regression methods.
❖ Removal of Unimportant Predictors: Variables that were discovered to have a very small influence on
the dependent variable, which is price, through exploratory data analysis were dropped to reduce the
model without losing explanatory ability.
❖ Linearity Assumption: The relationship between the independent variables and the dependent variable,
which is price, is assumed to be linear for proper regression modelling.
❖ Normality of Residuals: The error terms, or residuals, are assumed to be normally distributed to ensure
that the inferential statistics are valid.
❖ Homoscedasticity: Homoscedasticity assumes that the variation in residuals is constant at every level
of the independent variable, which implies that the same reliability exists in the whole range of
predictions made by the model.
❖ Independent Observations: It is assumed that the observations in the given dataset are independent of
each other, with no dependence or clustering which may affect the model's performance.
❖ Verification of Colour Impact: Even though colour is included in the case as a subjective preference, a
model is developed to test its statistical significance in determining price.
Model 1: Regression Model with No Interaction Terms –
The regression equation is -
$$\text{Price} = \beta_0 + \beta_1\,\text{carat} + \beta_2\,\text{colour} + \beta_3\,\text{clarity} + \beta_4\,\text{cut} + \beta_5\,\text{certification} + \beta_6\,\text{polish} + \beta_7\,\text{symmetry} + \epsilon$$
Where,
β0 is the intercept, indicating the baseline price when all independent variables are at their reference level.
β1 is the contribution of the carat weight to the price.
β2 is the effect of the diamond's colour grade on the price.
β3 is the effect of clarity on the price.
β4 is the effect of the diamond's cut grade on the price.
β5 is the impact of certification (e.g., GIA, EGL) on the price.
β6 is the effect of the polish grade on the price.
β7 is the effect of symmetry on the price.
The final model for diamond price prediction was chosen via stepwise regression, using the Akaike
Information Criterion to balance model complexity and fit. Carat, Colour, Clarity, Certification, and Symmetry,
with their respective levels, are included in this model; no interaction terms are used. The AIC value for the
final model is 5929.147, the best value among the candidate models considered in the stepwise search, which
suggests that this set of predictors gives a good trade-off between fit and complexity.
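As a consistency check, the reported AIC can be reproduced from figures given elsewhere in the report. The sketch below assumes the Gaussian log-likelihood form of AIC for a linear model, with n = 440 observations, the residual sum of squares from the ANOVA output (16215922), and 26 regression coefficients (intercept plus 25 degrees of freedom) plus one error-variance parameter. `gaussian_aic` is a hypothetical helper name.

```python
import math


def gaussian_aic(rss, n, n_coefficients):
    """AIC of a linear model with Gaussian errors.

    The parameter count is the number of regression coefficients plus one
    for the error variance (an assumption about the convention used).
    """
    k = n_coefficients + 1
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return 2 * k - 2 * log_lik


# Values taken from the report: residual SS and residual df from the ANOVA
# output; 26 coefficients = intercept + 25 regression degrees of freedom.
aic = gaussian_aic(rss=16215922, n=440, n_coefficients=26)
print(round(aic, 3))
```

Under these assumptions the result agrees with the reported 5929.147 to within rounding of the inputs.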
This model explains about 97% of the variation in diamond prices, with a Multiple R-squared of 0.9733 and
an Adjusted R-squared of 0.9717. The overall significance of the model is confirmed by the F-statistic of
603.1 with a p-value of < 2.2e-16. Among the predictors, Carat has the strongest influence, with a coefficient
of 4207.182, indicating a significant positive impact on price. Among the categorical levels, clarity SI1 raises
the price, while colour L and DOW certification lower it. Symmetry also contributes meaningfully to price
changes; levels such as symmetry G and symmetry X influence the price significantly.
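VIF values were determined as GVIF^(1/(2*Df)), where GVIF is the Generalised VIF. The GVIF is reported for
predictors with more than one degree of freedom (multi-level categorical variables), while the ordinary VIF
applies to single-coefficient predictors, so the GVIF is adjusted for the degrees of freedom to make the values
comparable across predictors. Because this adjusted value is on the scale of the square root of a VIF, it
should be squared before comparing it with the usual VIF cut-off of 5. Even after squaring, the largest value
(Carat, about 4.24) is below 5, so none of the variables needs to be removed.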
Variable       GVIF       Df  Adjusted VIF = GVIF^(1/(2*Df))
Carat          4.238971   1   2.058876
Colour         3.828497   8   1.087525
Clarity        8.185235   8   1.140419
Certification  11.149906  4   1.351789
Symmetry       3.382803   4   1.164554
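The adjustment in the table can be reproduced directly from the GVIF and Df columns. The sketch below is only a check of that arithmetic; `adjusted_gvif` is a hypothetical helper name, and squaring the adjusted value to compare against a VIF cut-off of 5 is one common rule of thumb and not a fixed standard.

```python
# (GVIF, Df) pairs copied from the table above.
gvif_table = {
    "Carat": (4.238971, 1),
    "Colour": (3.828497, 8),
    "Clarity": (8.185235, 8),
    "Certification": (11.149906, 4),
    "Symmetry": (3.382803, 4),
}


def adjusted_gvif(gvif, df):
    """Return GVIF^(1/(2*Df)), which makes terms with different Df comparable."""
    return gvif ** (1.0 / (2 * df))


adjusted = {name: adjusted_gvif(g, df) for name, (g, df) in gvif_table.items()}
for name, value in adjusted.items():
    # The squared value is on the same scale as an ordinary VIF.
    print(f"{name:14s} adjusted={value:.6f} squared={value ** 2:.3f}")
```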
The model exhibits good predictive capability without being overly complex. Visual residual diagnostics
suggest that linearity, normality, and homoscedasticity are not grossly violated, so the model is retained. It
is the best-fitting and most interpretable of the models considered for predicting diamond prices from the
data analysed, being both statistically robust and practically useful.
The new equation is
$$\text{Price} = \beta_0 + \beta_1\,\text{carat} + \beta_2\,\text{colour} + \beta_3\,\text{clarity} + \beta_4\,\text{certification} + \beta_5\,\text{symmetry} + \epsilon$$
Hypothesis Testing for Significance of All Predictors –
Now, we will do hypothesis testing at a 5% level of significance to check the significance of all the predictors.
Null hypothesis: $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$
Alternate hypothesis: $H_1$: at least one $\beta_i \neq 0$
ANOVA Testing -
Predictor Df Sum Sq Mean Sq F value Pr(>F)
Carat 1 519687583 519687583 13267.8647 < 2.2e-16
Colour 8 10605862 1325733 33.8466 < 2.2e-16
Clarity 8 53866928 6733366 171.9060 < 2.2e-16
Certification 4 5592684 1398171 35.6960 < 2.2e-16
Symmetry 4 836542 209136 5.3393 0.0003353
Residuals 414 16215922 39169 - -
ANOVA table –
Source      DF   Sum of Squares (SS)  Mean Square (MS)  F-Statistic (F)
Regression  25   590589599            23623583.96       603.12
Residuals   414  16215922             39168.89          -
Total       439  606805521            -                 -
where
SSR = 519687583 + 10605862 + 53866928 + 5592684 + 836542 = 590589599
MSR = SSR / DFRegression = 590589599 / 25 = 23623583.96
SSE = 16215922
MSE = SSE / DFResiduals = 16215922 / 414 = 39168.89
Fobs = MSR / MSE = 23623583.96 / 39168.89 ≈ 603.12
SST = SSR + SSE = 590589599 + 16215922 = 606805521
The ANOVA shows that the regression model is highly significant, with an F-statistic of about 603.1 (matching
the value in the model summary) and a corresponding p-value < 2.2e-16, so the predictors included in the
model explain a significant portion of the variability in diamond prices. The Sum of Squares for Regression
(SSR) is 590589599, against a Sum of Squares for Residuals (SSE) of only 16215922, meaning that little
variance is left unexplained. The Mean Square Regression (MSR), 23623583.96, is far larger than the Mean
Square Error (MSE), 39168.89. The high F-statistic indicates that the variance explained by the predictors is
much larger than the variance attributable to random error; hence, the overall effectiveness and reliability
of the regression model are affirmed.
Therefore, the regression model well portrays the relationship between the predictors (Carat, Colour, Clarity,
Certification, and Symmetry) and the diamond prices, making it a robust tool for predicting price variability
based on such factors.
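The overall F-statistic and the R-squared values can be recomputed from the per-predictor rows of the R ANOVA output. The sketch below does only that arithmetic, using the sums of squares and degrees of freedom exactly as printed in that output; the variable names are illustrative.

```python
# (Sum Sq, Df) per predictor, copied from the R ANOVA output in the report.
anova_rows = {
    "Carat": (519687583, 1),
    "Colour": (10605862, 8),
    "Clarity": (53866928, 8),
    "Certification": (5592684, 4),
    "Symmetry": (836542, 4),
}
sse, df_resid = 16215922, 414

ssr = sum(ss for ss, _ in anova_rows.values())     # regression sum of squares
df_reg = sum(df for _, df in anova_rows.values())  # regression degrees of freedom
msr = ssr / df_reg
mse = sse / df_resid
f_stat = msr / mse

sst = ssr + sse
r_squared = ssr / sst
adj_r_squared = 1 - (1 - r_squared) * (df_reg + df_resid) / df_resid

print(f"SSR={ssr} MSR={msr:.2f} MSE={mse:.2f} F={f_stat:.2f}")
print(f"R^2={r_squared:.4f} adjusted R^2={adj_r_squared:.4f}")
```

The results agree with the F-statistic of 603.1, the Multiple R-squared of 0.9733, and the Adjusted R-squared of 0.9717 quoted from the model summary.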
The Residual Analysis is as below -
The scatter plot shows the relationship between the Actual
Prices of diamonds and their Predicted Prices from the
regression model. Most data points lie closely along a
linear trend, indicating that the model is very good at
predicting diamond prices. The clustering of points at
higher price ranges suggests that the model is particularly
good for more expensive diamonds. However, small
deviations from the line indicate residual variance, thus
indicating where improvement is possible, such as refining
the variable selection or incorporating more interaction
terms. Overall, the model does a good job of capturing the
pricing patterns and provides reliable predictions, so it can
be used for diamond valuation analysis.
The Residuals vs. Fitted Values plot is used to assess
the assumptions of linear regression. The residuals are
scattered around the horizontal line at zero, which
shows how far apart the observed and fitted values
are. There is some clustering and a pattern, especially
for the fitted values at both extremes, but it is not
strongly systematic, so the model appears to meet the
assumption of linearity reasonably well.
However, the slight funnelling pattern might imply
heteroscedasticity, where residual variance
increases with the fitted values. In total, the model
performs well, but further refinements might be
necessary to improve the uniformity of the residual
distribution.
The plot indicates a scattering of residuals around the zero point, indicating a good capture of the
relationship between carat and price. However, slight clustering may indicate non-linearity or
unexplained variance requiring further model refinement.

The plot suggests that most residuals cluster around zero, indicating a relatively good fit. However,
some outliers and slight patterns suggest minor deviations from the assumptions of homoscedasticity and
linearity.

The plot reveals that the residuals are largely clustered around zero, indicating a good fit of the
model. However, a slight pattern and nonuniform spread may indicate some potential heteroscedasticity,
which could necessitate further analysis or model adjustments.
The Boxplot of Standardized Residuals presents the distribution of residuals. This distribution is
centered at zero, indicating that, in general, the model predictions agree with the observed values.
The bulk of the residuals lie close to the box; a few exceed ±2 and reach about ±4, which may indicate
small deviations from the model assumptions. The spread is symmetrical around the median, supporting the
assumption that the residuals are normally distributed. Overall, the fit is good, although addressing
outliers would strengthen it even more.

The Normal Q-Q Plot tests the normality of residuals by comparing their distribution to a theoretical
normal distribution. Most of the points lie very close to the diagonal line, indicating that the
residuals are approximately normally distributed. However, slight deviations at the tails indicate the
presence of outliers or non-normality in the extreme values. This alignment supports the validity of the
normality assumption for most of the data. Overall, the fit of the model is quite good, but some minor
deviations call for further analysis.
The plot of Standardized Residuals vs. Fitted Values
after Log Transformation shows residuals spread
around zero with a much more uniform spread than in
the untransformed model. This means that log
transformation has helped reduce heteroscedasticity
since the residuals now vary more uniformly across the
fitted values. However, slight deviations and a few
outliers are still there, especially at the higher end of
fitted values. Overall, the log transformation has
improved the model's fit and addressed variance
inconsistencies, enhancing its predictive reliability.
The plots indicate that a few observations (e.g., 118, 202, 204, 378) exert high influence and leverage
on the regression model and thus should be examined more closely or even removed to stabilize and
improve accuracy.
The Histogram of Residuals is symmetric around zero,
supporting normality. The peak near zero indicates
good predictions for most observations, though slight
skewness and tails suggest some outliers. Overall, the
residuals align well with normality assumptions.
BREUSCH-GODFREY TEST-
The Breusch-Godfrey test checks for autocorrelation in the residuals. Its null hypothesis is that the
residuals are not autocorrelated; the larger the LM statistic, the stronger the evidence of autocorrelation.
LM test = 25.433, df = 1, p-value = 4.581e-07
The p-value is far below the 5% significance level, so we reject the null hypothesis.
This means there is significant autocorrelation in the residuals of our model, which violates the assumption
that the residuals are independent.
BREUSCH-PAGAN TEST –
The Breusch-Pagan test checks for heteroscedasticity (non-constant variance of residuals). Its null hypothesis
is that the residual variance is constant; a larger BP statistic gives stronger evidence of heteroscedasticity.
BP = 108.43, df = 25, p-value = 2.3e-12. The p-value is extremely small, so we reject the null
hypothesis. This indicates that there is significant heteroscedasticity in the residuals of our model, meaning
that the variance of the residuals is not constant across the values of the independent variables.
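Both test statistics are referred to a chi-square distribution, so the reported p-values can be checked from the statistic and its degrees of freedom. The sketch below is a standard-library-only illustration that uses the closed-form series for the chi-square upper tail with odd degrees of freedom (both tests here have odd df); `chi2_sf_odd_df` is a hypothetical helper name, and a statistics library would normally be used for this.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd, positive df")
    chi = math.sqrt(x)
    # Two-sided standard normal tail: 2 * (1 - Phi(chi)).
    tail = math.erfc(chi / math.sqrt(2.0))
    # Standard normal density evaluated at chi.
    pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    # Sum of chi^(2r-1) / (1*3*...*(2r-1)) for r = 1 .. (df-1)/2.
    series, term = 0.0, chi
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    return tail + 2.0 * pdf * series


# Statistics and degrees of freedom as reported above.
p_breusch_godfrey = chi2_sf_odd_df(25.433, 1)
p_breusch_pagan = chi2_sf_odd_df(108.43, 25)
print(f"{p_breusch_godfrey:.3e} {p_breusch_pagan:.3e}")
```

Both values are far below 0.05, in line with the rejection of the two null hypotheses.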
FITTED REGRESSION EQUATION -
$$\text{Price} = \text{Intercept} + (\text{Carat Coefficient} \times \text{Carat}) + \text{Colour Coefficient} + \text{Clarity Coefficient} + \text{Certification Coefficient} + \text{Symmetry Coefficient}$$
Coefficients Used:
o Intercept: −1259.285
o Carat (0.9): 4207.182×0.9=3786.464
o Colour (J): −630.739
o Clarity (SI2): 741.915
o Certification (GIA): −56.466
o Symmetry (Very Good - V): 219.537
$$\text{Price} = -1259.285 + 3786.464 - 630.739 + 741.915 - 56.466 + 219.537 = 2801.426$$
CONCLUDING OBSERVATION -
For the given characteristics of the diamond, our model predicts a price of 2801.426, compared to the quoted
price of 3100. The quote is therefore about 299 (roughly 10.7%) above the model's estimate. This reasonably
close alignment suggests that the refined model, optimized using AIC, gives usable predictions, and it
illustrates the practical applicability of the model refinement process in achieving predictions close to
real-world values.
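The prediction for the professor's diamond can be reproduced from the coefficients listed above. The sketch below is an illustration of that arithmetic only: the dictionary keys and the `predict_price` helper are hypothetical names, and the level effects are the ones quoted in the report for colour J, clarity SI2, GIA certification, and Very Good symmetry.

```python
# Coefficients quoted in the report for the no-interaction model.
coefficients = {
    "intercept": -1259.285,
    "carat": 4207.182,           # slope per carat
    "colour_J": -630.739,
    "clarity_SI2": 741.915,
    "certification_GIA": -56.466,
    "symmetry_V": 219.537,
}


def predict_price(carat, levels):
    """Intercept + carat slope * carat + the effects of the selected levels."""
    price = coefficients["intercept"] + coefficients["carat"] * carat
    for level in levels:
        price += coefficients[level]
    return price


levels = ["colour_J", "clarity_SI2", "certification_GIA", "symmetry_V"]
predicted = predict_price(0.9, levels)
quoted = 3100
gap = quoted - predicted
print(f"predicted={predicted:.3f} gap={gap:.3f} ({100 * gap / predicted:.1f}% above)")
```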
The multiple regression model without interaction terms provided insight into the factors that drive diamond
prices with a very good fit of 0.9717 Adjusted R-squared. Carat, Colour, and Clarity emerged as the most
important predictors, thus confirming their crucial role in explaining price variation.
This balanced model strikes a delicate trade-off between simplicity and explanatory power, ensuring robust
predictions without unnecessary complexity. This certainly provides a good basis for further refinement and
explorations to better understand and capture the intricate relationships between predictors.
Model 2: Regression Model with Interaction Terms –
The regression analysis focused on a subset of key predictors, namely Carat, Colour, Clarity, Certification,
and Symmetry, selected using the Akaike Information Criterion. This approach balanced simplicity and
predictive power, avoiding overfitting while maintaining explanatory accuracy. The model had an Adjusted
R-squared of 0.9717, indicating high explanatory power for diamond price variability.
Although the model offers such insights, it makes the assumption of additive relationships among predictors.
Interaction analysis will be crucial to look at potential synergies or a combined effect between variables; this
might better capture more complex dynamics influencing the diamond prices, thereby enhancing the
interpretability and accuracy in prediction.
The interaction model was developed to precisely capture complex interactions that influence the prices of
diamonds. The possible interactions of all key predictors - Carat, Colour, Clarity, Certification, and Symmetry
- are considered. However, the added comprehensiveness made several insignificant terms pop up and
complicated the model further.
For this reason, a refined interaction model was built by systematically eliminating irrelevant terms, so that
only the meaningful interactions and main effects are retained. The refined model balances complexity against
interpretability while keeping the critical interactions that explain much of the price variation, and it
serves as a sound basis for further analysis and decision-making. The refinement is done by running a refine
function that automatically removes predictors with high p-values, i.e. predictors for which the null
hypothesis of a zero coefficient is not rejected.
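The report does not show its refine function, so the following is only a minimal Python sketch of one plausible reading of it: backward elimination, where the least significant term is dropped and the model refitted until every remaining p-value is at or below the threshold. The names backward_eliminate and toy_fit are hypothetical, and toy_fit is a stand-in for a real model fit that simply returns fixed p-values. The two interaction p-values are the ones reported for this model; the others are made up for illustration.

```python
def backward_eliminate(terms, fit_pvalues, alpha=0.05):
    """Drop the least significant term, refit, and repeat until all p-values <= alpha."""
    kept = list(terms)
    while kept:
        pvals = fit_pvalues(kept)                  # refit on the remaining terms
        worst = max(kept, key=lambda t: pvals[t])  # term with the largest p-value
        if pvals[worst] <= alpha:
            break                                  # everything left is significant
        kept.remove(worst)
    return kept


# Hypothetical stand-in for a model fit: it returns fixed p-values per term.
ILLUSTRATIVE_PVALUES = {
    "Carat": 1e-10,                # made-up value
    "Carat:ColourJ": 0.000141,     # reported in the coefficient table
    "Carat:ClaritySI2": 1.15e-08,  # reported in the coefficient table
    "Colour:Symmetry": 0.62,       # made-up insignificant term
    "Clarity:Symmetry": 0.31,      # made-up insignificant term
}


def toy_fit(terms):
    return {t: ILLUSTRATIVE_PVALUES[t] for t in terms}


refined_terms = backward_eliminate(ILLUSTRATIVE_PVALUES, toy_fit)
print(refined_terms)
```

In a real analysis the p-values change after each refit, which is why the loop calls the fit function again on every pass instead of filtering once.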
Because the model output is very large, only the terms relevant to the professor's diamond, namely those
involving Carat, Colour (J), Clarity (SI2), Certification (GIA), and Symmetry (Very Good - V), are included in
the prediction. The dummy variables for all other levels are zero for this diamond, so the corresponding
main-effect and interaction terms contribute nothing and need not be considered.
The relevant interaction terms in the regression equation are:
Interaction          Estimate    Std. Error    t-value    p-value
Carat:ColourJ         -894.90        232.20     -3.854    0.000141 **
Carat:ClaritySI2       920.26        156.95      5.863    1.15e-08 ***
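As a quick consistency check on the table above, each t-value should equal the estimate divided by its standard error. The short Python sketch below recomputes them from the reported figures; the variable names are illustrative only.

```python
# (estimate, standard error) as reported in the coefficient table
rows = {
    "Carat:ColourJ": (-894.90, 232.20),
    "Carat:ClaritySI2": (920.26, 156.95),
}

# t-value = estimate / standard error
t_values = {name: est / se for name, (est, se) in rows.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```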
Regression Equation:
𝑷𝒓𝒊𝒄𝒆 = 𝜷𝟎 + 𝜷𝟏 (𝑪𝒂𝒓𝒂𝒕) + 𝜷𝟐 (𝑪𝒐𝒍𝒐𝒖𝒓𝑱) + 𝜷𝟑 (𝑪𝒍𝒂𝒓𝒊𝒕𝒚𝑺𝑰𝟐) + 𝜷𝟒 (𝑪𝒆𝒓𝒕𝒊𝒇𝒊𝒄𝒂𝒕𝒊𝒐𝒏𝑮𝑰𝑨)
+ 𝜷𝟓 (𝑺𝒚𝒎𝒎𝒆𝒕𝒓𝒚𝑽) + 𝜷𝟔 (𝑪𝒂𝒓𝒂𝒕: 𝑪𝒐𝒍𝒐𝒖𝒓𝑱) + 𝜷𝟕(𝑪𝒂𝒓𝒂𝒕: 𝑪𝒍𝒂𝒓𝒊𝒕𝒚𝑺𝑰𝟐) + 𝝐
Where,
o 𝛽0 is the intercept representing the baseline price when all predictors are zero.
o 𝛽1 is the effect of Carat on Price, holding Colour, Clarity, Certification, and Symmetry constant.
o 𝛽2 is the effect of Colour on Price, holding Carat, Clarity, Certification, and Symmetry constant.
o 𝛽3 is the effect of Clarity on Price, holding Carat, Colour, Certification, and Symmetry constant.
o 𝛽4 is the effect of Certification on Price, holding Carat, Colour, Clarity, and Symmetry constant.
o 𝛽5 is the effect of Symmetry on Price, holding Carat, Colour, Clarity, Certification constant.
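o 𝛽6 is the interaction between Carat and Colour: the change in the per-carat effect on Price for a Colour J
diamond relative to the reference colour.
o 𝛽7 is the interaction between Carat and Clarity: the change in the per-carat effect on Price for a Clarity
SI2 diamond relative to the reference clarity.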
This model accounts for both the main effects and the combined effects of the interaction terms to better
capture the complexity of the relationship between predictors and the diamond's price.
Hypothesis Testing for Significance of All Predictors –
Now, we will do hypothesis testing at a 5% level of significance to check the significance of all the
predictors.
𝑵𝒖𝒍𝒍 𝑯𝒚𝒑𝒐𝒕𝒉𝒆𝒔𝒊𝒔 𝒊𝒔 𝒅𝒆𝒇𝒊𝒏𝒆𝒅 𝒂𝒔 → 𝑯𝟎 ∶ 𝜷𝟏 = 𝜷𝟐 = 𝜷𝟑 = 𝜷𝟒 = 𝜷𝟓 = 𝜷𝟔 = 𝜷𝟕 = 𝟎
𝑨𝒍𝒕𝒆𝒓𝒏𝒂𝒕𝒆 𝑯𝒚𝒑𝒐𝒕𝒉𝒆𝒔𝒊𝒔 𝒊𝒔 𝒅𝒆𝒇𝒊𝒏𝒆𝒅 𝒂𝒔 → 𝑯𝟏 ∶ 𝑨𝒏𝒚 𝒐𝒏𝒆 𝜷𝒊 𝒏𝒐𝒕 𝒆𝒒𝒖𝒂𝒍 𝒕𝒐 𝒛𝒆𝒓𝒐
ANOVA Testing –
Source of Variation          Df    Sum of Squares (SS)    Mean Square (MS = SS/Df)    F-statistic (MS/MSE)
Carat (Factor A)              1    519687583              519687583                   36840.75
Colour (Factor B)             8    10605862               1325733                     93.98
Clarity (Factor C)            8    53866928               6733366                     477.33
Carat:Colour Interaction      8    2173451                271681                      19.26
Carat:Clarity Interaction     7    3846658                549523                      38.96
Residuals/Errors            314    4429387                14106 (MSE)                 N/A
Total                       346    586764048              N/A                         N/A
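The mean squares and F-statistics in the ANOVA table follow directly from the sums of squares and degrees of freedom. The Python sketch below reproduces that arithmetic from the reported values (MS = SS/Df, F = MS/MSE); the variable names are illustrative.

```python
# (source, degrees of freedom, sum of squares) as reported in the ANOVA table
anova_rows = [
    ("Carat", 1, 519687583),
    ("Colour", 8, 10605862),
    ("Clarity", 8, 53866928),
    ("Carat:Colour", 8, 2173451),
    ("Carat:Clarity", 7, 3846658),
]
sse, df_resid = 4429387, 314

mse = sse / df_resid  # mean square error of the residuals

f_stats = {}
for source, df, ss in anova_rows:
    ms = ss / df               # mean square for this source
    f_stats[source] = ms / mse # F-statistic against the residual mean square
    print(f"{source:14s} MS = {ms:14.1f}  F = {f_stats[source]:10.2f}")
```

The unrounded MSE is used here, which is why the results agree with the table to two decimal places.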
Based on the F-statistics and their respective p-values:
➢ Carat, Colour, and Clarity have highly significant associations with the Price variable.
➢ The Carat:Colour and Carat:Clarity interactions are also significant, indicating that the effect of Carat
on Price depends on the levels of Colour and Clarity.
We therefore reject the null hypothesis and conclude that the predictors and interactions tested above are
significant.
The Residual Analysis is as below -
In the plot of Actual vs. Predicted Prices, the points lie close to the diagonal line, which shows that for
most data points the model predicts diamond prices fairly accurately from the independent variables. In the
Residuals vs. Fitted Values plot, the residuals are centred around zero, which suggests no systematic bias in
the model. However, slight clustering and uneven spread, especially at higher fitted values, might indicate
heteroscedasticity or model limitations. In general, the model works well but needs further refinement to deal
with these minor residual patterns.
The Residuals vs. Carat plot shows residuals scattered around zero, indicating that the model captures the
relationship between carat and price reasonably well. However, slight clustering suggests potential
non-linearity or variance inconsistency, which may require further investigation or model adjustment.
The Standardized Residuals vs. Fitted Values plot reveals residuals that scatter around zero. This indicates
that the model is reasonably good. However, slight clustering and spread at higher fitted values suggest a
possibility of heteroscedasticity that may call for adjustments to improve consistency in predictions.
The Standardized Residuals vs. Carat plot shows residuals distributed around zero, suggesting the model
captures the relationship between carat and price adequately. However, some clustering and variability in
residuals indicate potential non-linearity or heteroscedasticity, which may need further refinement in the
model.
The Histogram of Standardized Residuals shows that the residuals are nearly normally distributed, since it is
approximately symmetric around zero. Approximate normality of the residuals is a key assumption behind many of
the significance tests used in regression analysis. Most residuals fall within ±2 standard deviations,
indicating a good model fit for most observations. However, the slight tails on each side indicate the
presence of a few outliers. Overall, the residuals align well with the normal distribution, supporting the
model's adequacy while highlighting a need to investigate potential outliers.
The Q-Q Plot of Standardized Residuals is used to check the normality of the residuals by comparing their
distribution with that expected under a theoretical normal distribution. Most points lie close to the diagonal
line, indicating that the residuals are approximately normally distributed. However, slight deviations are
observed at the tails, suggesting some outliers or non-normality at the extremes. Overall, the central part of
the plot fits well, so the residuals can be treated as approximately normal apart from the deviations at both
tails, which supports the model's validity.
BREUSCH-PAGAN TEST –
The Breusch-Pagan test checks for heteroscedasticity (non-constant variance of the residuals). A larger BP
test statistic gives stronger evidence of heteroscedasticity.
data: refined_model
BP = 192.05, df = 125, p-value = 0.0001091
Since the p-value is below 0.05, the Breusch-Pagan test indicates that the model exhibits heteroscedasticity.
However, compared with the earlier model, the refined model has more degrees of freedom and a slightly less
significant p-value, which suggests that the refinement reduced the degree of heteroscedasticity and improved
the handling of variance inconsistency.
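The reported p-value can be sanity-checked from the BP statistic and its degrees of freedom, since the statistic is compared against a chi-square distribution. The Python sketch below uses the Wilson-Hilferty normal approximation to the chi-square upper tail so that it needs only the standard library. This is an approximation chosen for illustration, not the exact computation R performs, and the function name is hypothetical.

```python
import math


def chi2_upper_tail_wh(x, k):
    """Approximate P(chi-square with k df > x) via the Wilson-Hilferty approximation."""
    z = ((x / k) ** (1 / 3) - (1 - 2 / (9 * k))) / math.sqrt(2 / (9 * k))
    # Upper tail of the standard normal distribution at z
    return 0.5 * math.erfc(z / math.sqrt(2))


bp_statistic, bp_df = 192.05, 125  # values reported by the Breusch-Pagan test
p_approx = chi2_upper_tail_wh(bp_statistic, bp_df)
reject_homoscedasticity = p_approx < 0.05

print(f"approximate p-value = {p_approx:.6f}")
print("reject constant variance at 5%:", reject_homoscedasticity)
```

The approximate value lands close to the reported 0.0001091, and well below the 5% threshold either way.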
Because the Breusch-Pagan test detected heteroscedasticity (a low p-value and a large test statistic), a
Box-Cox transformation was applied to help stabilize the variance and improve the model fit. The Standardized
Residuals vs. Fitted Values plot after the transformation is shown below. After the transformation, the
residuals are more symmetric about zero and show much weaker patterns and clustering than in the untransformed
model. While some variability remains at higher fitted values, the spread is more uniform, which suggests an
improvement in homoscedasticity. Overall, the Box-Cox transformation has brought the model closer to satisfying
its assumptions, though further refinement may still be necessary.
A log transformation was also applied to correct for heteroscedasticity and reduce model misfit. The resulting
Standardized Residuals vs. Fitted Values plot shows residuals spread more evenly around the zero line, with
less variance inconsistency than in the original model. Some small clusters and patterns remain, particularly
at higher fitted values, but the overall spread is better controlled. This implies that the log transformation
has improved the model's ability to satisfy the assumption of homoscedasticity. Further refinement may help to
address the remaining patterns and improve predictive performance.
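For reference, the two transformations discussed above are closely related: the Box-Cox transform with parameter lambda reduces to the natural log when lambda is zero. The Python sketch below shows the transform and its inverse, which is needed to convert a prediction made on the transformed scale back to a price. The lambda value and the prices are made-up illustrations, not values from the report.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value; lam = 0 gives the natural log."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive value")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)


illustrative_prices = [500.0, 1000.0, 2000.0, 4000.0]  # made-up values
illustrative_lambda = 0.5                               # made-up value

for price in illustrative_prices:
    z = box_cox(price, illustrative_lambda)
    back = inverse_box_cox(z, illustrative_lambda)
    print(f"price={price:7.1f}  log={box_cox(price, 0):.3f}  box-cox={z:.3f}  back={back:.1f}")
```

Both transforms compress large prices more than small ones, which is why they tend to even out residual spread at higher fitted values.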
The Cook's distance plot identifies influential points, i.e. observations that change the fitted regression
noticeably. Observations 204, 154, and 295 have high Cook's distance values, meaning that they may strongly
influence the regression coefficients. These points should therefore be investigated to determine whether they
are valid observations or outliers. Addressing them is likely to improve the robustness and goodness of fit of
the model.
The Residuals vs. Leverage plot identifies observations 204 and 295 as having higher leverage and larger
standardized residuals, which indicates that they have a significant influence on the regression model. These
observations lie close to the Cook's distance boundary and could therefore have undue influence on the fitted
model. Such influential points should be inspected closely for validity, or adjusted for, to improve model
stability.
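To make the idea of Cook's distance concrete, the Python sketch below computes it for a simple one-predictor regression on a small made-up data set. It is not the report's model or data. The last point is deliberately placed far from the others so that it has high leverage. The formula used is D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2, with p = 2 estimated coefficients.

```python
def cooks_distances(x, y):
    """Cook's distance for each observation in a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2                                          # intercept and slope
    s2 = sum(e * e for e in residuals) / (n - p)   # residual variance estimate
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    return [
        (e * e / (p * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


# Made-up data: the last observation is far from the rest in x and off the trend.
x_demo = [1, 2, 3, 4, 5, 10]
y_demo = [1, 2, 3, 4, 5, 20]

distances = cooks_distances(x_demo, y_demo)
most_influential = distances.index(max(distances))
print("Cook's distances:", [round(d, 3) for d in distances])
print("most influential observation index:", most_influential)
```

A point can have a modest residual and still dominate the fit if its leverage is high, which is why the Residuals vs. Leverage plot is read alongside the Cook's distance plot.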
FITTED REGRESSION EQUATION –

𝑷𝒓𝒊𝒄𝒆 = 𝑰𝒏𝒕𝒆𝒓𝒄𝒆𝒑𝒕 + (𝜷𝟏 ⋅ 𝑪𝒂𝒓𝒂𝒕) + (𝜷𝟐 ⋅ 𝑪𝒐𝒍𝒐𝒖𝒓𝑱) + (𝜷𝟑 ⋅ 𝑪𝒍𝒂𝒓𝒊𝒕𝒚𝑺𝑰𝟐)
+ (𝜷𝟒 ⋅ 𝑪𝒆𝒓𝒕𝒊𝒇𝒊𝒄𝒂𝒕𝒊𝒐𝒏𝑮𝑰𝑨) + (𝜷𝟓 ⋅ 𝑺𝒚𝒎𝒎𝒆𝒕𝒓𝒚𝑽) + (𝜷𝟔 ⋅ 𝑪𝒂𝒓𝒂𝒕: 𝑪𝒐𝒍𝒐𝒖𝒓𝑱)
+ (𝜷𝟕 ⋅ 𝑪𝒂𝒓𝒂𝒕: 𝑪𝒍𝒂𝒓𝒊𝒕𝒚𝑺𝑰𝟐)
Coefficients Used:
Intercept: 𝛽0 = 916.96
Carat (0.9): 𝛽1 ⋅ 0.9 = 1838.11 ⋅ 0.9 = 1654.299
Colour (J): 𝛽2 = 208.00
Clarity (SI2): 𝛽3 = −104.26
Certification (GIA): 𝛽4 = −913.08
Symmetry (Very Good - V): 𝛽5 = −703.31
Interaction (Carat:ColourJ): 𝛽6 ⋅ 0.9 = −894.90 ⋅ 0.9 = −805.41
Interaction (Carat:ClaritySI2): 𝛽7 ⋅ 0.9 = 920.26 ⋅ 0.9 = 828.234
𝑃𝑟𝑖𝑐𝑒 = 916.96 + 1654.299 + 208.00 − 104.26 − 913.08 − 703.31 − 805.41 + 828.234
𝑷𝒓𝒆𝒅𝒊𝒄𝒕𝒆𝒅 𝑷𝒓𝒊𝒄𝒆 = 𝟏𝟎𝟖𝟏.𝟒𝟑𝟑
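The calculation above can be reproduced with a few lines of Python. The sketch below plugs the reported coefficients and the 0.9 carat weight into the fitted equation. The dummy variables for Colour J, Clarity SI2, Certification GIA, and Symmetry V all equal 1 for this diamond, so their coefficients enter directly. The dictionary keys are illustrative labels.

```python
# Coefficients as reported for the refined interaction model
coef = {
    "Intercept": 916.96,
    "Carat": 1838.11,
    "ColourJ": 208.00,
    "ClaritySI2": -104.26,
    "CertificationGIA": -913.08,
    "SymmetryV": -703.31,
    "Carat:ColourJ": -894.90,
    "Carat:ClaritySI2": 920.26,
}

carat = 0.9  # carat weight of the professor's diamond

predicted_price = (
    coef["Intercept"]
    + coef["Carat"] * carat
    + coef["ColourJ"]
    + coef["ClaritySI2"]
    + coef["CertificationGIA"]
    + coef["SymmetryV"]
    + coef["Carat:ColourJ"] * carat      # interaction: carat x Colour J dummy (= 1)
    + coef["Carat:ClaritySI2"] * carat   # interaction: carat x Clarity SI2 dummy (= 1)
)

print(f"Predicted price = {predicted_price:.3f}")
```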
CONCLUDING OBSERVATION –
For the professor's diamond, the interaction model predicts a price of 1081.433. By including terms such as
Carat:ColourJ and Carat:ClaritySI2, the model captures relationships between variables that the additive model
cannot, and so improves our understanding of the variability in pricing. The multiple regression model with
interaction terms provided deeper insight into how carat, colour, and clarity interact to influence diamond
prices. By integrating these relationships, the model provides a useful framework for further refinement and
analysis, and it underlines the practical importance of capturing interactions when making predictions and
decisions.
FINAL VERDICT –
From the comparison of both the interaction and non-interaction regression models used for predicting the
price of a diamond, it is concluded that:
❖ Prediction Outcomes: The non-interaction model estimated a price of around 2801, which is close to the
quoted price of 3100. The interaction model predicted a price of 1081, which is far from the quoted
price.
❖ Performance of the Models: The non-interaction model has an Adjusted R-squared of 0.9717, indicating
strong explanatory power and a good fit. The interaction model, despite incorporating more complex
relationships between variables, exhibited heteroscedasticity, as detected by the Breusch-Pagan test
(p < 0.05). This implies that the variance of the residuals depends on the predictors, which reduces
the reliability of its predictions.
❖ Key Determinants of Variability: Carat, Colour, and Clarity are the core variables and are the most
influential predictors in both models. Interaction terms add richness but also complexity and
variability that do not mirror real-world price setting.
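The gap between each prediction and the quoted price can be quantified with a short Python sketch, using the rounded figures given in the comparison above. The variable names are illustrative.

```python
quoted_price = 3100
predictions = {"non-interaction": 2801, "interaction": 1081}

gaps = {}
pct_gaps = {}
for model, predicted in predictions.items():
    gaps[model] = quoted_price - predicted                # shortfall against the quote
    pct_gaps[model] = 100 * gaps[model] / quoted_price    # shortfall as % of the quote
    print(f"{model}: gap = {gaps[model]}, {pct_gaps[model]:.1f}% below the quoted price")
```

On these figures the non-interaction model falls short of the quote by roughly a tenth, while the interaction model falls short by roughly two thirds.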
Rationale for Different Predictions → The large difference between the two predictions is probably due to the
interaction terms, which, though statistically significant, may not account for external factors that drive
market prices, such as brand reputation, marketing, or changes in demand. Neither model is therefore simply
wrong; they serve different purposes. The non-interaction model is simple and robust, while the interaction
model explores deeper relationships at the cost of some predictive power.
Conclusion → The non-interaction model is suitable for general predictions that need to be robust and close to
the quoted price. The interaction model is better suited to exploratory analysis of the intricate relationships
among variables. To bring predictions closer to quoted prices, future versions of the model should integrate
brand reputation, market trends, and consumer preferences, so that statistical predictions better reflect real
pricing dynamics.
OUR LEARNINGS
This case study provided valuable insights into diamond pricing through the application of multiple regression
techniques:
❖ Importance of Data-Driven Insights: The statistical modelling showed that diamond prices are driven
mainly by carat weight, clarity, and colour.
❖ Practical Application of Interaction Models: Including interaction terms showed how predictors such as
carat and clarity jointly affect price variation. This illustrates why capturing multi-dimensional
relationships matters for complex real-world problems.
❖ Model Evaluation and Validation: The analysis highlights the need for proper validation measures such
as VIF, residual diagnostics, and Breusch-Pagan tests to ensure the validity of models. Detecting
heteroscedasticity and multicollinearity guided refinements that improved the models.
❖ Challenges in Pricing Models: A predicted price does not necessarily match the actual price. This is a
reminder that pricing is inherently complex to model and that models must be refined continuously
against real data.
❖ Simplicity Balanced with Accuracy: Stepwise selection using AIC showed how model simplicity can be
balanced against explanatory power, leading to models that are both interpretable and predictive.
These insights not only offer specific practical knowledge about diamond valuation but also reflect the use of
statistics in the overall areas of business decision-making.
FUTURE SCOPE
The analysis gives a good foundation, yet several areas call for improvement and further research.
• Refining Interaction Analysis: Building on the interaction terms considered in this study, further
interactions between variables such as clarity, colour, and symmetry may reveal more intricate
relationships that determine the price of diamonds. This could enhance predictions and provide greater
insight into market behaviour.
• Advanced Modelling Techniques: The use of nonlinear models, such as polynomial regression or
machine learning approaches like decision trees and random forests, may help better capture the non-
linear relationships in the data. Such techniques could improve the generalization ability of the model
across different scenarios.
• Market Segmentation Studies: Expanding the analysis to understand the variation of diamond prices
with respect to different customer segments, geographic locations, and economic conditions would
give a macro view. This would help in bringing the pricing strategy in line with the market dynamics.
• Heteroscedasticity Corrections: The Breusch-Pagan test revealed heteroscedasticity in some models.
Weighted least squares, or a log or Box-Cox transformation, could be used to correct the problem and
give better parameter estimates.