The objective of this notebook is to create a time-lapse visualization of the dynamic relationship between median home value, as measured by regional ZHVI, and regional median household income from 1979 to 2017. ZHVI is the Zillow Home Value Index, the median of all estimated home values for a given region within a given quarter. The dataset used in this notebook contains ZHVI values for 916 metropolitan regions within the US, sampled quarterly from March 1979 to June 2017. For each quarter, regional ZHVI values are plotted against regional median household incomes, and the plots are chronologically strung together into the frames of a resulting gif.
The ratio of ZHVI to median income provides a measure of housing affordability in a given region: the higher the ratio, the more expensive the region is for its median earner. For example, a region with a ZHVI of $600,000 and a median household income of $60,000 has a ratio of 10, twice that of a region with the same income but a ZHVI of $300,000. The time-lapse scatter plot therefore gives a good sense of how housing affordability shifted between 1979 and 2017. As can be seen from the increased slope of the regression line, it is in general more expensive to live in higher-income areas in 2017 than it was in 1979. Further, just prior to the bursting of the subprime mortgage bubble in 2007, housing prices relative to median incomes were at all-time highs for many regions within the US.
import numpy as np
import pandas as pd
import imageio
import datetime
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
ZHVI
: dataframe containing quarterly Zillow Home Value Index (ZHVI) values by region, spanning 1979-03 to 2017-06. ZHVI is a metric of home values created by Zillow; in a nutshell, it is the median of all estimated home values for a given region in a given quarter. Data was downloaded from Zillow.
ZHVI = pd.read_csv("../input/median-housing-price-us/Affordability_ChainedZHVI_2017Q2.csv")
ZHVI.head()
RegionID | RegionName | SizeRank | 1979-03 | 1979-06 | 1979-09 | 1979-12 | 1980-03 | 1980-06 | 1980-09 | ... | 2015-03 | 2015-06 | 2015-09 | 2015-12 | 2016-03 | 2016-06 | 2016-09 | 2016-12 | 2017-03 | 2017-06 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 102001 | United States | 0 | 47420.325963 | 48967.092287 | 50031.142946 | 50986.193294 | 51904.910205 | 52704.245822 | 54225.059691 | ... | 174100.0 | 176600.0 | 179500.0 | 181600.0 | 183800.0 | 186600.0 | 190300.0 | 194200.0 | 197200.0 | 200400.0 |
1 | 394913 | New York, NY | 1 | 52563.259032 | 54137.586472 | 55438.816295 | 56434.819370 | 58715.987702 | 59792.313605 | 62330.514988 | ... | 368000.0 | 370900.0 | 374800.0 | 379300.0 | 382100.0 | 386600.0 | 396500.0 | 403400.0 | 412600.0 | 422300.0 |
2 | 753899 | Los Angeles-Long Beach-Anaheim, CA | 2 | 73394.779918 | 77180.816449 | 80406.587484 | 83785.158199 | 86875.107296 | 90746.032538 | 93632.248727 | ... | 531300.0 | 539500.0 | 549200.0 | 557600.0 | 566500.0 | 574800.0 | 581400.0 | 594500.0 | 602300.0 | 609800.0 |
3 | 394463 | Chicago, IL | 3 | 60177.913481 | 61589.446205 | 61680.095096 | 62495.935111 | 62029.740817 | 62884.430356 | 60281.512213 | ... | 188500.0 | 191300.0 | 193000.0 | 194400.0 | 195800.0 | 198400.0 | 201500.0 | 205200.0 | 209300.0 | 211200.0 |
4 | 394514 | Dallas-Fort Worth, TX | 4 | 62602.241939 | 65943.236024 | 68777.723717 | 71310.255676 | 73647.977485 | 74660.990269 | 76901.307002 | ... | 162100.0 | 168300.0 | 175200.0 | 180800.0 | 185300.0 | 190900.0 | 197500.0 | 201600.0 | 206000.0 | 211000.0 |
5 rows × 157 columns
Income
: dataframe containing median household income by region, sampled quarterly over a period spanning from 1979-03 to 2017-06. Data was downloaded from Zillow.
Income = pd.read_csv("../input/median-housing-price-us/Affordability_Income_2017Q2.csv")
Income.head()
RegionID | RegionName | SizeRank | 1979-03 | 1979-06 | 1979-09 | 1979-12 | 1980-03 | 1980-06 | 1980-09 | ... | 2015-03 | 2015-06 | 2015-09 | 2015-12 | 2016-03 | 2016-06 | 2016-09 | 2016-12 | 2017-03 | 2017-06 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 102001 | United States | 0 | 16347.78 | 16670.31 | 17005.00 | 17340.92 | 17731.96 | 18018.92 | 18291.96 | ... | 55101.89 | 55571.91 | 55982.59 | 56443.61 | 56781.037105 | 57118.464209 | 57455.891314 | 57793.318418 | 58130.745523 | 58468.172627 |
1 | 394913 | New York, NY | 1 | 17537.06 | 17882.32 | 18246.90 | 18628.91 | 19096.50 | 19499.66 | 19871.00 | ... | 68178.78 | 68566.98 | 68916.19 | 69309.90 | 69886.398450 | 70462.896900 | 71039.395351 | 71615.893801 | 72192.392251 | 72768.890701 |
2 | 753899 | Los Angeles-Long Beach-Anaheim, CA | 2 | 17915.63 | 18279.35 | 18733.76 | 19230.16 | 19819.58 | 20321.11 | 20702.53 | ... | 61865.99 | 62303.60 | 62759.76 | 63246.79 | 63708.637741 | 64170.485482 | 64632.333224 | 65094.180965 | 65556.028706 | 66017.876447 |
3 | 394463 | Chicago, IL | 3 | 20060.12 | 20455.89 | 20900.13 | 21277.76 | 21702.38 | 21852.68 | 21946.23 | ... | 62631.77 | 62890.67 | 63234.29 | 63856.01 | 64306.116564 | 64756.223128 | 65206.329692 | 65656.436256 | 66106.542820 | 66556.649384 |
4 | 394514 | Dallas-Fort Worth, TX | 4 | 18151.53 | 18456.96 | 18887.64 | 19463.93 | 20092.82 | 20605.59 | 20967.85 | ... | 61323.91 | 61664.37 | 61856.66 | 61729.10 | 62101.688038 | 62474.276075 | 62846.864113 | 63219.452151 | 63592.040188 | 63964.628226 |
5 rows × 157 columns
CPI
: dataframe containing the Consumer Price Index (CPI), published monthly, for a period spanning from 1960-01 to 2019-07. Data was downloaded from Federal Reserve Economic Data (FRED). CPI measures changes in the price level of a weighted-average market basket of consumer goods and services purchased by households.
CPI = pd.read_csv("../input/consumer-price-index-usa-all-items/USACPIALLMINMEI.csv")
CPI.head()
DATE | USACPIALLMINMEI | |
---|---|---|
0 | 1960-01-01 | 12.361982 |
1 | 1960-02-01 | 12.404174 |
2 | 1960-03-01 | 12.404174 |
3 | 1960-04-01 | 12.446365 |
4 | 1960-05-01 | 12.446365 |
5 rows × 2 columns
The dataframes are melted and merged into a single dataframe. CPI is used to adjust historical income and ZHVI values to 2019 dollars.
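Concretely, the adjustment used below is adjusted_value = nominal_value × (average 2019 CPI / CPI at the time of observation). As a worked check against the merged dataframe shown further down: using the 2019 average CPI of about 107.8 implied by the data, the 1979-03 US median income of $16,347.78 becomes 16,347.78 × 107.8 / 29.449 ≈ $59,840, in line with the Income_A value of $59,849 in the merged dataframe.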
# melt ZHVI from wide (one column per quarter) to long, creating Date and ZHVI columns
ZHVI = pd.melt(ZHVI, id_vars=['RegionID', 'RegionName', 'SizeRank'], value_name='ZHVI', var_name='Date')
# melt Income to create Date and Income columns
Income = pd.melt(Income, id_vars=['RegionID', 'RegionName', 'SizeRank'], value_name='Income', var_name='Date')
# merge ZHVI and Income dataframes on the ['RegionID', 'RegionName', 'SizeRank', 'Date'] columns
ZHVI = ZHVI.merge(Income, how='outer', on=['RegionID', 'RegionName', 'SizeRank', 'Date'])
# rename CPI columns
CPI.columns = ['Date', 'CPI']
# change CPI Date column values (e.g. '1960-01-01') from string to datetime objects
CPI.Date = pd.to_datetime(CPI.Date, format="%Y-%m-%d")
# change ZHVI Date column values (e.g. '1979-03') from string to datetime objects
ZHVI.Date = pd.to_datetime(ZHVI.Date, format="%Y-%m")
# merge ZHVI and CPI on the Date column, keeping only dates present in both
ZHVI = ZHVI.merge(CPI, how='inner', on=['Date'])
# set CPI index to Date
CPI.set_index('Date', inplace=True)
# calculate the average CPI for 2019, used as the base of the inflation adjustment
base = CPI.loc['2019', 'CPI'].mean()
# adjust all income values to average 2019 dollar values
ZHVI['Income_A'] = ZHVI.Income * base / ZHVI.CPI
# adjust all ZHVI values to average 2019 dollar values
ZHVI['ZHVI_A'] = ZHVI.ZHVI * base / ZHVI.CPI
# set ZHVI index to Date
ZHVI.set_index('Date', inplace=True)
ZHVI.head()
RegionID | RegionName | SizeRank | ZHVI | Income | CPI | Income_A | ZHVI_A | |
---|---|---|---|---|---|---|---|---|
Date | ||||||||
1979-03-01 | 102001 | United States | 0 | 47420.325963 | 16347.78 | 29.449364 | 59849.205547 | 173605.763941 |
1979-03-01 | 394913 | New York, NY | 1 | 52563.259032 | 17537.06 | 29.449364 | 64203.158387 | 192434.036547 |
1979-03-01 | 753899 | Los Angeles-Long Beach-Anaheim, CA | 2 | 73394.779918 | 17915.63 | 29.449364 | 65589.102763 | 268698.212808 |
1979-03-01 | 394463 | Chicago, IL | 3 | 60177.913481 | 20060.12 | 29.449364 | 73440.078419 | 220311.278553 |
1979-03-01 | 394514 | Dallas-Fort Worth, TX | 4 | 62602.241939 | 18151.53 | 29.449364 | 66452.732417 | 229186.742509 |
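As a quick sanity check on the melt-and-merge steps (a hypothetical check, not part of the original notebook), every region should now appear at most once per date:
# hypothetical sanity check: no (RegionID, Date) pair should be duplicated after the merge
assert not ZHVI.reset_index().duplicated(subset=['RegionID', 'Date']).any()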
Adjusted ZHVI is plotted against adjusted median income for each quarter from March 1979 to June 2017, with the US as a whole plotted in red. A line of best fit is calculated and plotted for each quarter. Each plot is converted into an image array, and the arrays become the frames of the resulting gif.
# function to calculate regression line parameters which will be used in the resulting plots
def reg(X, y):
    regr = LinearRegression()
    regr.fit(X, y)
    a = regr.coef_        # slope
    b = regr.intercept_   # intercept
    r = regr.score(X, y)  # R-squared of the fit
    return a, b, r
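As a quick smoke test (a hypothetical example with made-up numbers, not part of the original analysis), reg should recover the slope, intercept, and R-squared of a perfectly linear toy dataset:
# hypothetical smoke test for reg() on synthetic, perfectly linear data
X_toy = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # single feature column
y_toy = 2 * X_toy + 1                                   # exactly y = 2x + 1
a, b, r = reg(X_toy, y_toy)
print(a.squeeze(), b.squeeze(), r)                      # ~2.0, ~1.0, 1.0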
# function for creating a plot which will become a single frame of the resulting gif
def plot_gif(date, data):
    # create subplot
    fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(10, 10), squeeze=True)
    # select all regions for the given quarter
    quarter = data.loc[date]
    # set x to median income by region and y to ZHVI by region, adjusted to 2019 dollars
    x = quarter['Income_A']
    y = quarter['ZHVI_A']
    # set US_x to median US income and US_y to total US ZHVI for the given quarter
    US_x = quarter.loc[quarter.RegionName == 'United States'].Income_A
    US_y = quarter.loc[quarter.RegionName == 'United States'].ZHVI_A
    # calculate regression line parameters for x and y
    a, b, r = reg(x.values.reshape(-1, 1), y.values.reshape(-1, 1))
    a = a.squeeze()
    b = b.squeeze()
    # set plot style (renamed to 'seaborn-v0_8-whitegrid' in matplotlib >= 3.6)
    plt.style.use('seaborn-whitegrid')
    # plot x and y
    plt.scatter(x, y, label='Adjusted ZHVI by Locality (Top 100)')
    # plot US_x and US_y in red
    plt.scatter(US_x, US_y, c='r', label='Adjusted ZHVI Total US')
    # set y axis limits
    ax1.set_ylim(0, 1000000)
    # set x axis limits
    ax1.set_xlim(20000, 100000)
    # set axis labels and plot title
    ax1.set_ylabel('ZHVI Adjusted to 2019 Dollars', fontsize='large')
    ax1.set_xlabel('Median Income Adjusted to 2019 Dollars', fontsize='large')
    ax1.set_title('Regional ZHVI vs. Median Income: ' + date)
    # create regression line equation string
    reg_equation = 'y= %fx + %f \nR-squared: %f' % (a, b, r)
    # annotate plot with regression line equation and R-squared value
    plt.text(x=21000, y=940000, s=reg_equation)
    # plot regression line across the visible x range
    plt.plot([20000, 100000], [a * 20000 + b, a * 100000 + b])
    # format ticks to include a $ sign
    formatter = ticker.FormatStrFormatter('$%d')
    ax1.yaxis.set_major_formatter(formatter)
    ax1.xaxis.set_major_formatter(formatter)
    ax1.legend(loc='upper right')
    # draw the canvas, then return the plot as an image array
    fig.canvas.draw()
    image = np.frombuffer(fig.canvas.tostring_rgb(), dtype='uint8')
    image = image.reshape(fig.canvas.get_width_height()[::-1] + (3,))
    # close the figure so memory is not exhausted while rendering ~150 frames
    plt.close(fig)
    return image
# limit data to the US total plus the 99 largest metro regions (SizeRank < 100)
data = ZHVI.loc[ZHVI.SizeRank < 100].dropna()
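To spot-check a single frame before rendering all of them (a hypothetical preview step, not in the original notebook), the image array returned by plot_gif can be written straight to disk:
# hypothetical preview: render the first quarter and save it as a PNG
frame = plot_gif('1979-03', data)
imageio.imwrite('./preview.png', frame)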
# render one frame per quarter, then write the frames out as a gif at the desired fps
frames = [plot_gif(date, data) for date in pd.unique(data.index.strftime('%Y-%m'))]
imageio.mimsave('./bubble.gif', frames, fps=20)
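Note that recent imageio releases deprecate the fps keyword for GIF output in favor of a per-frame duration in milliseconds; if mimsave rejects fps, the equivalent call (an assumption about the installed version, not part of the original notebook) would be:
# assumes a newer imageio where duration (ms per frame) replaces fps; 50 ms per frame ≈ 20 fps
imageio.mimsave('./bubble.gif', frames, duration=50)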
The visualization shows the dynamic relationship between home values and median incomes for 100 regions within the US. For any given period this relationship is roughly linear, with home values predictably increasing with median income. However, the relationship is constantly changing. For instance, in March 1979 the slope of the regression line was 2.9, meaning that median home values increased by 2.9 dollars for every dollar increase in median income between regions. In June 2017 the slope was 10.1, more than a threefold increase. This implies that it is substantially more expensive to live in higher-income areas in 2017 than it was in 1979. During the height of the housing bubble the slope increased to roughly 12. Perhaps the most salient feature of this dynamic plot is the visualization of the boom-bust cycle of the economy, which appears as undulating expansions and contractions of the data points over time. For instance, we can visually identify three contractions: one in the early 1980s (Early 1980s Recession), one in the early 1990s (Early 1990s Recession), and one in the late 2000s (Financial Crisis of 2007-08). The last of these is especially dramatic, which is not surprising since it corresponds to the bursting of the subprime mortgage bubble.
Another interesting observation is that while some regions change quite a bit over time in terms of median income and home values, others are fairly static. For instance, the point (in red) representing the US as a whole does not move substantially from its starting point in 1979 to its end point in 2017.
For those interested, the IPython notebook and data are downloadable from Kaggle.