MAT240 - Module 4 Project
Report: Housing Price Prediction Model for D. M. Pan National Real Estate Company
Aiden Tran
Professor Patel
02/02/2025
Median Housing Price Model for D. M. Pan National Real Estate Company 2
This report investigates whether the square footage of a property can act as a reliable
indicator of its listing price, specifically for homes sold in 2019, to help drive effective and
justifiable decision making for the D. M. Pan National Real Estate Company. To explore this
potential relationship, we will use linear regression as our primary analytical method. This
approach is well suited to examining the relationship between two continuous
variables: square footage (the predictor variable, x) and listing price (the response
variable, y). Additionally, a scatterplot and histograms will be created to visualize the data and
evaluate the linearity of the relationship between these variables. The listing price,
being dependent on square footage, is the primary reason why linear regression is an
appropriate choice for this analysis.
The results of this study aim to provide valuable insights to D. M. Pan National Real
Estate Company, enabling them to use square footage as a meaningful benchmark for setting
To ensure we created a representative sample of properties from the Real Estate Data
Note: The entirety of the Real Estate Data was used as the initial population for this sampling
process. A new column was added to the dataset in Microsoft Excel, adjacent
to the existing data and spanning every row. The RAND() function was used to generate a
random number for each row in the dataset, so that each row was assigned its own
random value.
Using the Data > Sort feature in Microsoft Excel, the data was sorted by the random
After sorting, the rows from the randomized data were selected to serve as the sample for this
analysis. This method of sampling is representative of the general population because it uses a
randomization process that gives each home in the dataset an equal chance of being included
in the sample. As a result, the sample can be expected to reflect the variability present in the
overall dataset without systematic bias. By using a random selection method, the sample provides a
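The randomize-and-sort procedure described above can be sketched in Python. This is only an illustration of the idea, not the workflow actually used (that was done in Excel); the function name `draw_random_sample`, the sample size of 50, and the toy rows are assumptions made for the example.

```python
import random

def draw_random_sample(rows, sample_size, seed=None):
    """Mimic the RAND()-then-sort approach: attach a random key to each row,
    sort by that key, and keep the first `sample_size` rows."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]   # one random number per row
    keyed.sort(key=lambda pair: pair[0])            # sort by the random column
    return [row for _, row in keyed[:sample_size]]

# Hypothetical population rows: (listing price in dollars, square feet)
population = [(250_000 + 1_000 * i, 1_200 + 10 * i) for i in range(200)]
sample = draw_random_sample(population, sample_size=50, seed=1)
print(len(sample))
```

Because every row receives its random key independently of its price or size, each row has the same chance of landing in the first 50 positions after sorting.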
Predictor Variable: The predictor variable in this study is square footage, as it is the
Response Variable: The response variable is listing price, which is the dependent variable.
[Figure: scatterplot of listing price ($0 to $400,000) against square feet (0 to 4,000)]
Data Analysis
[Figure: histograms of square feet and listing price; vertical axis: Frequency]
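Summary Statistics
                     Random Sample                   National
                     Listing Price   Square Feet     Listing Price   Square Feet
Mean                 $342,581        2,053           $342,365        2,111
Median               $311,000        1,993           $318,000        1,881
Standard Deviation   $96,890         638.82          $125,914        921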
The data above show clear patterns in center, spread, shape, and unusual
characteristics that bear on the relationship between
square footage and listing price. For square footage, the mean is 2,053 square feet, and the
median is slightly lower at 1,993 square feet. This indicates a slightly right-skewed distribution
in which larger homes pull the mean upward. The spread, measured by a standard deviation of
638.82 square feet, suggests moderate variability, with most homes falling
within about 639 square feet of the average size. The histogram supports this, with a clear
clustering of homes between 1,126 and 2,807 square feet, while larger homes above 3,368
square feet appear as outliers. There are also gaps in the data, notably in the range of 2,807 to
3,088 square feet. Even so, the scatterplot shows only a small number of
properties greater than 3,500 square feet; however, our trendline/regression equation seems
For listing prices, the mean is $342,581 and the median is $311,000, again indicating a
slight right skew that can primarily be attributed to a few high-priced homes. The standard
deviation of $96,890 reflects substantial price variability across the dataset. The histogram
reveals a concentration of home prices between $207,200 and $403,250, with prices above
$468,000 forming a long right tail due to outliers. These higher-priced homes, much like the
larger homes in the square footage data, represent only a small portion of the dataset but
noticeably skew the overall distribution. Taken together, square footage and listing price show
a positive correlation, as expected in real estate markets. However, the presence of outliers
and gaps indicates that other factors may also influence the variability in house sizes and prices.
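The center, spread, and skew comparisons discussed above can be reproduced with Python's standard library. The sketch below is illustrative only: the `describe` helper and the seven square-footage values are hypothetical, and comparing the mean with the median is just a rough hint about skew direction, not a formal test.

```python
import statistics

def describe(values):
    """Return the mean, median, sample standard deviation, and a rough skew hint."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if mean > median:
        skew_hint = "right"   # a few large values pull the mean above the median
    elif mean < median:
        skew_hint = "left"
    else:
        skew_hint = "none"
    return {"mean": mean, "median": median, "stdev": stdev, "skew_hint": skew_hint}

# Hypothetical square footages, not the report's sample
square_feet = [1_200, 1_500, 1_800, 2_000, 2_100, 2_400, 3_600]
summary = describe(square_feet)
print(summary["median"], summary["skew_hint"])
```

Here the single 3,600 square foot home lifts the mean above the median, which is the same pattern the report describes for both variables.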
interpreting these results. As for outliers, we were able to identify a few specific data
points visually. We would like to emphasize that all statistics reported thus far are based on
the entire dataset except for one data point ($675,500 and 6,498 sq ft), which was excluded
because of its impact on representation; specifically, there were no other data points with a
square footage greater than even 4,000 square feet. It is essential for the D. M. Pan National
Real Estate Company to understand the concentration of the data and the high level of variation that comes
When comparing our sample data with the national data, the two are closely aligned in
central tendency but differ in spread and
variability. The mean listing price in the sample is $342,581, slightly above the national
mean of $342,365, while the median listing price is $311,000, lower than the national
median of $318,000. This indicates the sample may include a higher proportion of lower-priced
homes relative to the national market. For square footage, the sample mean is 2,053 square
feet, slightly below the national mean of 2,111 square feet. However, the sample’s median
square footage of 1,993 is higher than the national median of 1,881, indicating a focus on
mid-sized properties as opposed to larger properties. In addition, the spread in the sample is
narrower than in the national data for both listing price and square footage. The sample’s standard
deviation for listing price is $96,890, compared to the national figure of $125,914, indicating
less variability in home prices within the sample than in the national data.
Additionally, the sample’s standard deviation for square footage is 638 square feet, which is
considerably lower than the national figure of 921 square feet; this indicates that the sample may
Both the sample and national datasets indicate a right-skewed distribution, with
higher-priced and larger homes forming the upper tails of the distribution. However, the narrower
spread in the sample may limit the extent of these tails compared to the national data.
Additionally, as mentioned before, the sample contains noticeable gaps in square footage
(between ~2,800 and 3,100 sq ft) that are not necessarily observable in the national data;
this may indicate underrepresentation of certain home sizes. While the close alignment of
mean values suggests that the sample reflects the central trends of the national housing
market, its narrower spread and slightly different medians indicate that it may not fully capture
the market’s diversity, particularly at the extremes (very small and very large properties).
With this in mind, the sample should be understood as broadly representative of the national
market but less so for high-priced or unusually large or small homes. This information will help
the D. M. Pan National Real Estate Company assess how well this sample data reflects the
national housing market, ideally enabling them to make informed decisions about pricing
strategies, which market(s) they want to target, and resource allocation to best align with
Regression Model
[Figure: Square Footage vs Listing Price scatterplot with fitted trendline y = 109.54x + 117702, R² = 0.5216; horizontal axis: Square Feet, vertical axis: Listing Price]
The scatterplot shows a positive linear relationship between square footage
and listing price, indicating that larger homes tend to have higher prices. I would consider the
strength of the association moderate, with an R^2 value of 0.52, meaning about 52% of the
variation in listing price can be explained by square footage. Finally, the form is linear, as
indicated by the regression line, y = 109.54x + 117,702, where listing price increases by
$109.54 for each additional square foot.
To ensure transparency, I will discuss potential outliers and how they can
affect our data. Before we begin, I would like to note that I
have already excluded one specific data point prior to calculating our statistics; this was essential to
manner. With that said, I will outline all outliers identified using the IQR (interquartile
range) method. Properties with listing prices above the upper bound of $552,200
include $585,600 (3,637 sq ft), $599,300 (3,636 sq ft), and the data point that has already been
removed, $675,500 (6,498 sq ft). For square footage, the upper bound is
3,304 sq ft, and the outliers are 3,637, 3,636, 3,583, 6,498, and 3,608 square feet.
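The IQR rule used above flags any value beyond 1.5 interquartile ranges from the quartiles. A minimal sketch follows, assuming the "inclusive" quartile convention from Python's `statistics.quantiles`; other quartile conventions give slightly different fences, so this will not necessarily reproduce the report's bounds. The helper names and the data values are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical square footages with one very large home
square_feet = [1_500, 1_700, 1_900, 2_000, 2_100, 2_300, 2_500, 6_498]
print(iqr_bounds(square_feet))
print(find_outliers(square_feet))
```

The same two functions can be applied to listing prices; only the input list changes.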
Now, let us discuss removing additional data points and examine how that
could affect our findings and, more importantly, our correlation coefficient. When examining
outliers, it is important to consider the intended use case. For example, are we focusing on
homes within 1,500 to 2,500 square feet? Are we aligning more with luxury markets? Depending
on the use case, different points may reasonably be treated as
outliers, leading to different findings. Our current model has
an R of 0.722 and an R-squared value of 0.52. Since
our square footage outliers were those greater than 3,304 square feet, let us temporarily
remove those data points and examine the effect. After removing all data points
above 3,304 square feet, we compute an R value of 0.36
and an R-squared value of 0.132. Clearly, removing these data points weakens the model:
far more of the variation in listing price is left unexplained by
square footage. It is therefore advisable not to remove any additional data points,
as doing so would skew our data heavily and misrepresent the overall national market.
That said, we did remove one specific data point because its
square footage and listing price were far from the norm, and it is
important to understand how this may affect decision making when examining a market
with premium or luxury properties, whose prices do not typically correlate as heavily with square
footage.
When calculating the correlation coefficient (r), we used Microsoft Excel’s “=CORREL
(range1, range2)” function, which measures the linear relationship between two variables: in
our case, listing price and square footage. Specifically, we used “=CORREL (D3:D52, F3:F52)”,
where D3:D52 represents listing price and F3:F52 represents square footage. The resulting r
value quantifies the strength and direction of the linear association between the two variables.
The computed value is 0.72, a moderate-to-strong positive correlation
that confirms the upward trend and indicates a significant, though not perfect, linear
association.
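For readers who want to see what a function such as CORREL computes, the Pearson correlation coefficient can be written out directly. This is a minimal sketch; the function name `pearson_r` and the five data points are hypothetical and are not taken from the report's spreadsheet.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, not the report's data
square_feet = [1_200, 1_600, 2_000, 2_400, 2_800]
listing_price = [250_000, 270_000, 330_000, 340_000, 410_000]
r = pearson_r(square_feet, listing_price)
print(round(r, 3))
```

A value near +1 means the points lie close to an upward-sloping line, a value near -1 a downward-sloping one, and a value near 0 little linear association.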
The regression equation is y = 109.54x + 117,702, where “y” is the predicted listing price and
“x” is the square footage. The slope (109.54) means that
for each additional square foot, the predicted price increases by $109.54; this provides
a simple baseline for understanding how a single square foot
relates to listing price. Additionally, the intercept (117,702) represents the estimated
base price when square footage is zero, likely representing costs that cannot be explained by
square footage itself, e.g., land value. Admittedly, this may not make sense literally, since a
property of 0 square feet would likely not have a value assigned to it. However, if we are
accounting for external factors such as the value of land, then this intercept of $117,702
would make sense.
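A slope and intercept like these are typically obtained by an ordinary least-squares fit. The sketch below assumes that method and uses three made-up points that lie exactly on a line, so the recovered slope and intercept are easy to check by hand; `fit_line` is a hypothetical helper, not a function used in the report.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                       # price change per extra square foot
    intercept = mean_y - slope * mean_x     # predicted price at zero square feet
    return slope, intercept

# Hypothetical points on the line y = 100x + 120,000
square_feet = [1_000, 2_000, 3_000]
listing_price = [220_000, 320_000, 420_000]
slope, intercept = fit_line(square_feet, listing_price)
print(slope, intercept)
```

Because the intercept is computed as the mean price minus the slope times the mean size, it is an extrapolation to zero square feet rather than an observed value, which is why it needs the cautious reading given above.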
To calculate R-squared, we simply square our R (correlation coefficient),
which gives an R-squared value of 0.52. An R-squared value of 0.52 suggests a
moderate relationship between square footage and listing price. While the model explains
just over half of the price variation, nearly 48% of the variability is due to other factors that
cannot be captured by square footage alone. This regression equation is
moderately strong, but additional variables could improve its accuracy. To provide additional
context, I will show the computation used to estimate the listing price for a 1,500
square foot property. We substitute 1,500 square feet for x in our
regression equation: y = 109.54(1,500) + 117,702, which gives a predicted
listing price of $282,012. Examples such as this one could provide additional insight to the D. M.
Pan National Real Estate Company into a reasonable listing price based on square footage
alone.
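The worked prediction and the R-squared calculation above can be checked with a few lines of Python. The slope, intercept, and r value are taken from the report; the function and constant names are hypothetical.

```python
SLOPE = 109.54        # dollars per additional square foot (from the report)
INTERCEPT = 117_702   # predicted price at zero square feet (from the report)
R = 0.722             # correlation coefficient (from the report)

def predict_listing_price(square_feet):
    """Apply the regression equation y = 109.54x + 117,702."""
    return SLOPE * square_feet + INTERCEPT

predicted = predict_listing_price(1_500)
r_squared = R ** 2

print(round(predicted))        # predicted listing price for 1,500 sq ft
print(round(r_squared, 2))     # share of price variation explained
```

Rounding is applied only for display, since floating-point arithmetic can leave a tiny fractional remainder in the raw result.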
Conclusions
This analysis examines whether a home’s square footage can serve as a
reliable predictor of its listing price. Using a linear regression model, we found a
moderate correlation between these two factors, with ~52% of price variation explained by
square footage.
Pricing Model: The regression equation y = 109.54x + 117,702 suggests that for every
additional square foot, the predicted listing price increases by $109.54.
Predictive Example: For a 1,500 square-foot home, the estimated listing price would be
$282,012.
Market Variance: While square footage alone is a moderately strong indicator of price, ~48% of
pricing variation stems from other factors, which may include land value, amenities, etc.
Data Trends: The above provided data and visuals such as the histograms show a slightly
averages; this is essential when considering listing very large or smaller homes.
Outlier Consideration: Some extreme values (e.g., homes above 3,500 sq ft) were
reviewed, but removing them significantly weakened the model’s predictive strength.
The results were mostly in line with expectations, but there were a few notable findings that
Expected Findings:
Square Footage Correlates with Price: As anticipated, larger homes have higher
listing prices, and the regression equation confirmed this positive relationship.
Moderate Strength of Model: R^2= 0.52; it was expected that square footage
would not be the sole factor in determining price. Although true, it still
Unexpected Findings:
Impact of Outliers: The presence of a few large and highly priced homes skews
the data. This reinforces the need to potentially consider luxury markets
predictive ability.
properties from across the nation; ~48% of pricing variation remains unexplained
by this model alone, which indicates that additional factors such as location, amenities,
the national data suggests a more concentrated mid-sized home market which
may limit its ability to reflect pricing trends for unusually large or small
properties.
Overall, the findings confirmed the realistic expectation that square footage is a key factor
in predicting listing prices, as the relationship is positive and linear; however, it is not the
only factor in predicting home prices. This reinforces the potential
need for multifactor pricing strategies that account for additional variables beyond size alone;
for example, comparing “comps,” or comparable prices within a specific market, will better
essential to examine potential options to improve results or address different problems. Several
possible changes include:
lot size, and renovations may help further explain the 48% of pricing variability
Segment the Market: Separate models for starter, mid-range, and
luxury homes could provide more accurate pricing predictions for
specific use cases.
Addressing Outliers: Although we did not explore this avenue, instead of simply
removing datapoints such as those with higher prices, we could potentially apply
additional skew.
learning techniques which could better capture complex pricing trends that a
trends and findings that vary from market to market. Specifically, geographical
To encourage further investigation, this analysis raises
the question: how do additional factors such as location, home age, and amenities impact
listing prices compared to square footage alone? Answering this question would help determine the
relative importance of various property features and whether incorporating these