
MAT240 - Module 4 Project

This report analyzes the relationship between square footage and listing prices of homes sold in 2019 for D. M. Pan National Real Estate Company using linear regression. The findings indicate a moderate correlation, with approximately 52% of the variation in listing prices explained by square footage. The analysis also highlights the importance of considering additional variables and the potential impact of outliers on the results.


Median Housing Price Model for D. M. Pan National Real Estate Company

Report: Housing Price Prediction Model for D. M. Pan National Real Estate Company

Aiden Tran

Department of Mathematics, SNHU

MAT 240: Applied Statistics

Professor Patel

02/02/2025

This report investigates whether the square footage of a property is a reliable indicator of its listing price, focusing on homes sold in 2019, to support effective and justifiable decision making at the D. M. Pan National Real Estate Company. To explore this relationship, we use simple linear regression as the primary analytical method. This approach is well suited to examining the relationship between two continuous variables: square footage (the predictor variable, x) and listing price (the response variable, y). A scatterplot and histograms are also used to visualize the data and to check whether the relationship between the two variables is approximately linear. Linear regression is an appropriate modeling technique here because listing price is expected to depend on square footage in a roughly linear way.

The results of this study aim to provide valuable insights to D. M. Pan National Real

Estate Company, enabling them to use square footage as a meaningful benchmark for setting

listing prices effectively.

To ensure we created a representative sample of properties from the Real Estate Data provided, the following steps were taken:

Note: The entirety of the Real Estate Data was used as the initial population for this sampling process.

• A new column was added to the dataset in Microsoft Excel, adjacent to the existing data and spanning every row.

• The RAND() function was used to generate a random number for each row in the dataset, so that each row was assigned its own random value.

• Using the Data > Sort feature in Microsoft Excel, the data was sorted by the random number column in ascending order (smallest to largest).

• This process randomized the order of the dataset, supporting an unbiased selection.

After sorting, rows from the randomized data were selected to serve as the sample for this analysis. This method of sampling is representative of the general population because the randomization gives each home in the dataset an equal chance of being included in the sample. As a result, the sample should reflect the variability present in the overall dataset without systematic bias, and it provides a reliable basis for modeling and analysis.
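The sampling steps above can also be sketched outside of Excel. The following is a minimal Python illustration, not the procedure actually used for this report: the function name `random_sample`, the population of 1,000 row identifiers, the sample size of 50, and the choice to take the first rows after sorting are all assumptions made for the example.

```python
import random

def random_sample(rows, sample_size, seed=None):
    """Mimic the RAND()-and-sort approach on a list of rows (illustrative only)."""
    rng = random.Random(seed)
    # Pair every row with a random key, like the added RAND() column.
    keyed = [(rng.random(), row) for row in rows]
    # Sort ascending by the random key, like Data > Sort on that column.
    keyed.sort(key=lambda pair: pair[0])
    # Assumption: the sample is the first `sample_size` rows after sorting.
    return [row for _, row in keyed[:sample_size]]

# Hypothetical population: row identifiers standing in for homes.
population = list(range(1, 1001))
sample = random_sample(population, sample_size=50, seed=240)
print(len(sample), "rows sampled,", len(set(sample)), "distinct")
```

Fixing the seed makes the draw repeatable, which Excel's RAND() does not do on its own because it recalculates whenever the sheet changes.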

• Predictor Variable: The predictor variable in this study is square footage, as it is the independent variable used to estimate housing prices.

• Response Variable: The response variable is listing price, which is the dependent variable being predicted based on square footage.


[Figure: Scatterplot, "Square Footage vs Listing Price" (x-axis: Square Feet, 0 to 4,000; y-axis: Listing Price, $0 to $700,000)]

Data Analysis

[Figure: Histogram for Square Feet (x-axis: Square Feet; y-axis: Frequency)]

[Figure: Histogram for Listing Price (frequency counts from 0 to 18 over listing price bins beginning at $207,200, $272,550, $337,900, $403,250, $468,600, and $533,950)]

Summary Statistics

            Random Sample                   National
            Listing Price    Square Feet    Listing Price    Square Feet
Mean        $342,581.63      2,053          $342,365         2,111
Median      $311,000.00      1,993          $318,000         1,881
Std Dev     $96,890.87       638.819        $125,914         921

The data above show clear patterns in center, spread, shape, and unusual characteristics that matter for the relationship between square footage and listing price. For square footage, the mean is 2,053 square feet and the median is slightly lower at 1,993 square feet. This indicates a slightly right-skewed distribution in which larger homes pull the mean upward. The standard deviation of 638.82 square feet suggests moderate spread, with the majority of homes falling within about 638 square feet of the average size. The histogram supports this, showing a clear cluster of homes between 1,126 and 2,807 square feet, while homes above 3,368 square feet appear as outliers. There are also gaps in the data, notably in the range of 2,807 to 3,088 square feet. The scatterplot shows only a small number of properties larger than 3,500 square feet; even so, the trend line (regression equation) appears to represent their listing prices fairly accurately.

For listing prices, the mean is $342,581 and the median is $311,000, again indicating a slight right skew that can be attributed primarily to a few high-priced homes. The standard deviation of $96,890 reflects substantial price variability across the dataset. The histogram reveals a concentration of home prices between $207,200 and $403,250, with prices above $468,600 forming a long right tail due to outliers. These higher-priced homes, much like the larger homes in the square footage data, represent only a small portion of the dataset but noticeably skew the overall distribution. Square footage and listing price are positively correlated, as expected in real estate markets. However, the presence of outliers and gaps indicates that other factors may also influence the variability in house sizes and prices, which underscores the importance of considering additional variables when interpreting these results. As for outliers, we identified a few specific data points visually. We emphasize that all statistics reported so far were computed on the entire dataset except for one data point ($675,500 and 6,498 sq ft), which was excluded because of its impact on representativeness: no other data point had a square footage greater than 4,000 square feet. It is essential for the D. M. Pan National Real Estate Company to understand where the data are concentrated and the high level of variation in the real estate market across the nation.

The sample and national data are closely aligned in central tendency but differ in spread and variability. The mean listing price in the sample is $342,581, slightly above the national mean of $342,365, while the sample median of $311,000 is lower than the national median of $318,000. This indicates the sample may include a higher proportion of lower-priced homes relative to the national market. For square footage, the sample mean is 2,053 square feet, slightly below the national mean of 2,111 square feet. However, the sample’s median square footage of 1,993 is higher than the national median of 1,881, indicating a concentration of mid-sized properties rather than larger ones. In addition, the spread in the sample is narrower than in the national data for both listing price and square footage. The sample’s standard deviation for listing price is $96,890, compared to the national figure of $125,914, indicating less variability in home prices within the sample than nationally. Likewise, the sample’s standard deviation for square footage is 638 square feet, well below the national figure of 921 square feet; this indicates that the sample may underrepresent both very small and very large properties.

Both the sample and national datasets indicate a right-skewed distribution, with higher-priced and larger homes forming the upper tails. However, the narrower spread in the sample may limit the extent of these tails when compared to the national data. Additionally, as mentioned before, the sample contains noticeable gaps in square footage (between roughly 2,800 and 3,100 sq ft). These gaps are not necessarily present in the national data, and they may indicate underrepresentation of certain home sizes in the sample. While the close alignment of mean values suggests that the sample reflects the central trends of the national housing market, its narrower spread and slightly different medians indicate that it may not fully capture the market’s diversity, particularly at the extremes (very small and very large properties). The sample is therefore broadly representative of the national market, but less so for high-priced or unusually large or small homes. This information will help the D. M. Pan National Real Estate Company assess how well the sample data reflect the national housing market, enabling informed decisions about pricing strategies, target markets, and resource allocation that align with market dynamics in a specific region.



Regression Model

[Figure: Scatterplot, "Square Footage vs Listing Price", with regression line y = 109.54x + 117,702 and R² = 0.5216 (x-axis: Square Feet, 0 to 7,000; y-axis: Listing Price, $0 to $1,000,000)]

The scatterplot above shows a positive linear relationship between square footage and listing price, indicating that larger homes tend to have higher listing prices. The strength of the association is moderate, with an R^2 value of 0.52, meaning about 52% of the variation in listing price can be explained by square footage. Finally, the form is linear, as indicated by the regression line, y = 109.54x + 117,702, under which the predicted listing price increases by approximately $109.54 for every additional square foot.

To ensure transparency, I will discuss potential outliers and how they can affect our results. First, note that one specific data point was already excluded before the statistics were calculated; this was essential to keeping the findings representative when comparing to the national market. With that said, the outliers below were identified using the IQR (interquartile range) method. Properties with listing prices above the upper bound of $552,200 are $585,600 (3,637 sq ft), $599,300 (3,636 sq ft), and the data point that has already been removed, $675,500 (6,498 sq ft). For square footage, the upper bound is 3,304 sq ft, and the outliers are 3,583, 3,608, 3,636, 3,637, and 6,498 square feet.
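For readers who want to reproduce this kind of check, the sketch below shows one common form of the IQR rule. It assumes the conventional fences of Q1 - 1.5 × IQR and Q3 + 1.5 × IQR and inclusive quartiles; the report does not state which multiplier or quartile method was used, and the square footages in the example are hypothetical, not the report's data.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences using the k * IQR convention."""
    # method="inclusive" interpolates quartiles across the full data range.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def upper_outliers(values, k=1.5):
    """List the values that fall above the upper fence, in ascending order."""
    _, upper = iqr_bounds(values, k)
    return sorted(v for v in values if v > upper)

# Hypothetical square footages, for illustration only.
sizes = [1200, 1500, 1800, 2000, 2200, 2500, 2800, 6498]
print(iqr_bounds(sizes))
print(upper_outliers(sizes))
```

Different quartile conventions give slightly different fences, so bounds computed this way may not match a spreadsheet exactly.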

Now, let us consider removing additional data points and examine how that would affect our findings and, more importantly, our correlation coefficient. When examining outliers, it is important to consider the intended use case. For example, are we focusing on homes between 1,500 and 2,500 square feet? Are we aligning more with luxury markets? Depending on the use case, different data points may reasonably be treated as outliers, leading to different findings. As a baseline, the current model has an R of 0.722 and an R-squared value of 0.52. Since the square footage outliers are the homes larger than 3,304 square feet, let us temporarily remove those data points and examine the effect. After removing all data points above 3,304 square feet, the recalculated R value is 0.36 and the R-squared value is 0.132. Clearly, removing these data points weakens the relationship: a much larger share of the variation in listing price is then left unexplained by square footage. For this reason, it is advisable not to remove any additional data points, as doing so would substantially weaken the model and misrepresent the overall national market. It is still important to remember that one data point was removed because its square footage and listing price differed so greatly from the rest of the data, and to understand how this may affect decision making in markets with premium or luxury properties, whose prices do not typically correlate as strongly with square footage.

To calculate the correlation coefficient (r), we used Microsoft Excel’s “=CORREL(range1, range2)” function, which measures the linear relationship between two variables: in our case, listing price and square footage. Specifically, we used “=CORREL(D3:D52, F3:F52)”, where D3:D52 contains listing price and F3:F52 contains square footage. The resulting r value quantifies the strength and direction of the linear association between the two variables. The computed value is 0.72, a moderate-to-strong positive correlation that confirms the upward trend and indicates a substantial, though not perfect, linear association.
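As a rough illustration of what a Pearson correlation such as CORREL computes, the sketch below implements the standard formula directly. The function name `pearson_r` and the five data pairs are hypothetical and are not taken from the report's worksheet.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (square feet, listing price) pairs, for illustration only.
square_feet = [1200, 1500, 1800, 2100, 2400]
listing_price = [250000, 270000, 330000, 320000, 400000]
r_example = pearson_r(square_feet, listing_price)
print(round(r_example, 3))
```

A value near +1 indicates a strong upward linear trend, a value near 0 indicates little linear association, and a value near -1 indicates a strong downward trend.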

Line of Best Fit

The line of best fit is y = 109.54x + 117,702, where “y” is the predicted listing price and “x” is the square footage. The slope (109.54) means that for each additional square foot, the predicted listing price increases by $109.54, which provides a simple baseline for understanding how a single square foot relates to listing price. The intercept (117,702) represents the estimated base price when square footage is zero, likely reflecting costs that square footage alone cannot explain, such as land value. Admittedly, the intercept has limited practical meaning on its own, since a property of 0 square feet would not normally be listed and lies outside the range of the data. However, if we account for external factors such as the value of land, an intercept of $117,702 is plausible.
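A slope and intercept of this kind come from an ordinary least-squares fit, which is what a linear chart trendline computes. The sketch below shows that calculation under stated assumptions: the function name `fit_line` and the data pairs are hypothetical, so the printed coefficients will not match the report's 109.54 and 117,702.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical (square feet, listing price) pairs, for illustration only.
square_feet = [1200, 1500, 1800, 2100, 2400]
listing_price = [250000, 270000, 330000, 320000, 400000]
slope, intercept = fit_line(square_feet, listing_price)
print(round(slope, 2), round(intercept, 2))
```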

R-squared is calculated by squaring R (the correlation coefficient), which gives an R-squared value of 0.52. An R-squared value of 0.52 suggests a moderate relationship between square footage and listing price. While the model explains just over half of the price variation, roughly 48% of the variability is due to other factors that cannot be examined by looking at square footage and listing price alone. The regression equation is moderately strong, but additional variables could improve its accuracy. To provide additional context, here is the computation used to estimate the listing price for a 1,500 square foot property. Substituting 1,500 square feet for x in the regression equation gives y = 109.54(1,500) + 117,702, which equates to a predicted listing price of $282,012. Examples such as this one could give the D. M. Pan National Real Estate Company additional insight into a reasonable listing price based on square footage alone.
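The arithmetic above can be checked with a few lines of Python. This is only a sketch that reuses the slope, intercept, and R value reported in this document; the function name `predict_listing_price` is introduced here for illustration.

```python
# Coefficients reported for the regression line y = 109.54x + 117,702.
SLOPE = 109.54
INTERCEPT = 117_702

def predict_listing_price(square_feet):
    """Predicted listing price in dollars for a given square footage."""
    return SLOPE * square_feet + INTERCEPT

r = 0.722  # correlation coefficient reported in this document

print(round(predict_listing_price(1500)))  # 282012
print(round(r ** 2, 2))                    # 0.52
```

Predictions for square footages far outside the sampled range should be treated with more caution than this single example suggests.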

Conclusions

This analysis examined whether a home’s square footage can serve as a reliable predictor of its listing price. Using a linear regression model, we found a moderate correlation between these two factors, with ~52% of the variation in price explained by square footage alone.

Key insights include:

• Pricing Model: The regression equation y = 109.54x + 117,702 suggests that for every additional square foot, the predicted listing price increases by $109.54.

• Predictive Example: For a 1,500 square-foot home, the estimated listing price would be $282,012; the same equation can be applied to other square footages.

• Market Variance: While square footage alone is a moderately strong indicator of price, ~48% of the variation in price stems from other factors, which may include land value, amenities, etc.

• Data Trends: The data and visuals above, such as the histograms, show a slightly right-skewed distribution, meaning a few high-priced or larger homes influence the overall averages; this is essential to keep in mind when listing very large or very small homes.

• Outlier Consideration: Some extreme values (i.e., homes above 3,304 sq ft) were reviewed, but removing them significantly weakened the model’s predictive strength.

The results were mostly in line with expectations, but a few notable findings stood out:

Expected Findings:

• Square Footage Correlates with Price: As anticipated, larger homes tend to have higher listing prices, and the regression equation confirmed this positive relationship.

• Moderate Strength of Model: With R^2 = 0.52, the model confirmed the expectation that square footage would not be the sole factor in determining price. Even so, it accounted for a sizable portion of the variability, at ~52%.

Unexpected Findings:

• Impact of Outliers: The presence of a few large and highly priced homes skews the data. Because removing these outliers significantly weakened the model’s predictive ability, luxury markets may need to be considered separately.

• Variance in Pricing: Although some unexplained variation is expected given that the properties come from across the nation, ~48% of the variation in price remains unexplained by this model alone. This indicates that additional factors such as location, amenities, geography, etc. also affect listing prices.

• Sample vs National Data Findings: The sample’s narrower spread compared to the national data suggests a more concentrated mid-sized home market, which may limit its ability to reflect pricing trends for unusually large or small properties.

Overall, the findings confirmed the realistic expectation that square footage is a key predictor of listing prices, as the relationship is positive and linear; however, it is not the only factor in predicting home prices. This reinforces the potential need for multifactor pricing strategies that account for additional variables beyond size alone. For example, comparing “comps,” or comparable prices within a specific market, will better gauge realistic variance in a specific region or market.

To guard against misrepresenting specific markets or areas, it is essential to examine options for improving the results or addressing different problems. Several possible changes include:

• Adding More Variables: Incorporating factors such as location, bedrooms, lot size, and renovations may help further explain the 48% of pricing variability that could not be captured by square footage.

• Segmenting the Market: Separate models for starter, mid-range, and luxury homes would potentially provide more accurate pricing predictions for specific use cases.

• Refining the Sample: A more representative dataset tailored to a specific market or region would significantly improve findings for that market, as these data points would be better aligned and could potentially “price in” comps.

• Addressing Outliers: Although we did not explore this avenue, instead of simply removing data points such as those with higher prices, we could apply reasonable transformations, such as a logarithmic adjustment, to reduce skew.

• Considering Different Models: With the evolution of data analytics, it may be beneficial to explore avenues such as polynomial regression or even machine learning techniques, which could better capture complex pricing trends that a simple linear model could not account for.

• Focusing on Regional Markets: When considering an investment in property, it is highly suggested to focus on regional markets to examine trends and findings that vary from market to market, specifically geographical differences, and to improve pricing accuracy based on regional comps.

To encourage further investigation, this analysis raises a question: how do additional factors such as location, home age, and amenities impact listing prices compared to square footage alone? Answering this question would help determine the relative importance of various property features and whether incorporating these factors can improve price predictions or instead complicate the findings.
