Housing Market Visualization (1979-2017)

Objective

The objective of this notebook is to create a time-lapse visualization of the dynamic relationship between median home value, as measured by regional ZHVI, and regional median household income from 1979 to 2017. ZHVI is a home value index created by Zillow which is the median of all estimated home values for a given region within a given quarter. The dataset used in this notebook contains ZHVI values for 916 metropolitan regions within the US sampled quarterly from March of 1979 to June of 2017. For each quarter, regional ZHVI values are plotted against regional median household incomes and chronologically strung together into frame of a resulting gif.

The ZHVI to Median income ratio provides a measure of affordability of the housing market in a given region. The higher the ZHVI to median income ratio the more expensive the region is for its median earner. Therefore, the time-lapse scatter plot gives a good sense of how housing affordability has shifted over time from 1979 to 2017. As can be seen from the increased slope of the regression line, in general it is more expensive to live in higher income areas in 2017 than it was in 1979. Further, just prior to the bursting of the subprime mortgage bubble in 2007 housing prices relative to median incomes were at all time highs for many regions within US.

Imports

          
import numpy as np
import pandas as pd
import imageio
import datetime
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

Read Data

ZHVI : dataframe containing monthly Zillow Housing Value Index (ZHVI) by region spanning from 1979-03 to 2017-06. ZHVI is a metric of home values created by Zillow. In a nutshell, it is the median of all estimated home values for a given region by month. Data was downloaded from Zillow.

          
ZHVI = pd.read_csv("../input/median-housing-price-us/Affordability_ChainedZHVI_2017Q2.csv")
ZHVI.head()

	RegionID	RegionName	SizeRank	1979-03	1979-06	1979-09	1979-12	1980-03	1980-06	1980-09	...	2015-03	2015-06	2015-09	2015-12	2016-03	2016-06	2016-09	2016-12	2017-03	2017-06
0	102001	United States	0	47420.325963	48967.092287	50031.142946	50986.193294	51904.910205	52704.245822	54225.059691	...	174100.0	176600.0	179500.0	181600.0	183800.0	186600.0	190300.0	194200.0	197200.0	200400.0
1	394913	New York, NY	1	52563.259032	54137.586472	55438.816295	56434.819370	58715.987702	59792.313605	62330.514988	...	368000.0	370900.0	374800.0	379300.0	382100.0	386600.0	396500.0	403400.0	412600.0	422300.0
2	753899	Los Angeles-Long Beach-Anaheim, CA	2	73394.779918	77180.816449	80406.587484	83785.158199	86875.107296	90746.032538	93632.248727	...	531300.0	539500.0	549200.0	557600.0	566500.0	574800.0	581400.0	594500.0	602300.0	609800.0
3	394463	Chicago, IL	3	60177.913481	61589.446205	61680.095096	62495.935111	62029.740817	62884.430356	60281.512213	...	188500.0	191300.0	193000.0	194400.0	195800.0	198400.0	201500.0	205200.0	209300.0	211200.0
4	394514	Dallas-Fort Worth, TX	4	62602.241939	65943.236024	68777.723717	71310.255676	73647.977485	74660.990269	76901.307002	...	162100.0	168300.0	175200.0	180800.0	185300.0	190900.0	197500.0	201600.0	206000.0	211000.0

5 rows X 157 columns

Income : dataframe containing median household incomes by region and month for a period spanning from 1979-03 to 2017-06. Data was downloaded from Zillow.

          
Income = pd.read_csv("../input/median-housing-price-us/Affordability_Income_2017Q2.csv")
Income.head()

	RegionID	RegionName	SizeRank	1979-03	1979-06	1979-09	1979-12	1980-03	1980-06	1980-09	...	2015-03	2015-06	2015-09	2015-12	2016-03	2016-06	2016-09	2016-12	2017-03	2017-06
0	102001	United States	0	16347.78	16670.31	17005.00	17340.92	17731.96	18018.92	18291.96	...	55101.89	55571.91	55982.59	56443.61	56781.037105	57118.464209	57455.891314	57793.318418	58130.745523	58468.172627
1	394913	New York, NY	1	17537.06	17882.32	18246.90	18628.91	19096.50	19499.66	19871.00	...	68178.78	68566.98	68916.19	69309.90	69886.398450	70462.896900	71039.395351	71615.893801	72192.392251	72768.890701
2	753899	Los Angeles-Long Beach-Anaheim, CA	2	17915.63	18279.35	18733.76	19230.16	19819.58	20321.11	20702.53	...	61865.99	62303.60	62759.76	63246.79	63708.637741	64170.485482	64632.333224	65094.180965	65556.028706	66017.876447
3	394463	Chicago, IL	3	20060.12	20455.89	20900.13	21277.76	21702.38	21852.68	21946.23	...	62631.77	62890.67	63234.29	63856.01	64306.116564	64756.223128	65206.329692	65656.436256	66106.542820	66556.649384
4	394514	Dallas-Fort Worth, TX	4	18151.53	18456.96	18887.64	19463.93	20092.82	20605.59	20967.85	...	61323.91	61664.37	61856.66	61729.10	62101.688038	62474.276075	62846.864113	63219.452151	63592.040188	63964.628226

5 rows X 157 columns

CPI : dataframe containing Consumer Price Index (CPI) published monthly for a period spanning from 1960-01 to 2019-07. Data was downloaded from the Federal Reserve Economic Data (FRED). CPI measures changes in the price level of a weighted average market basket of consumer goods and services purchased by households.

          
CPI = pd.read_csv("../input/consumer-price-index-usa-all-items/USACPIALLMINMEI.csv")
CPI.head()

	DATE	USACPIALLMINMEI
0	1960-01-01	12.361982
1	1960-02-01	12.404174
2	1960-03-01	12.404174
3	1960-04-01	12.446365
4	1960-05-01	12.446365

5 rows X 2 columns

Data Preprocessing

The dataframes are melted and merged into a single dataframe. CPI is used to adjust historical income and ZHVI values to the 2019 dollar.

          
# melt ZHVI to create Date and ZHVI columns
ZHVI = pd.melt( ZHVI, id_vars=['RegionID', 'RegionName', 'SizeRank'], value_name='ZHVI', var_name = 'Date')
# melt Income to create Date and Income columns
Income = pd.melt( Income, id_vars=['RegionID', 'RegionName', 'SizeRank'], value_name='Income', var_name = 'Date')
#merge ZHVI and Income dataframes on ['RegionID','RegionName','SizeRank','Date'] columns
ZHVI = ZHVI.merge(Income, how='outer', on=['RegionID','RegionName','SizeRank','Date'])
#rename CPI columns
CPI.columns = ['Date', 'CPI']
#change CPI Date column values from string to datetime object
CPI.Date = pd.to_datetime(CPI.Date, format="%Y-%m")
#change ZHVI Date column values from string to datetime object
ZHVI.Date = pd.to_datetime(ZHVI.Date, format="%Y-%m")
#merge ZHVI and CPI on Date columns
ZHVI = ZHVI.merge(CPI, how='inner', on=['Date'])
#set CPI index to Date
CPI.set_index('Date', inplace= True)
#calculate average CPI for 2019
base = np.mean(CPI.loc['2019'])[0]
#adjust all income values to the average 2019 dollar value
ZHVI['Income_A'] = ZHVI.Income*base/ZHVI.CPI
#adjust ZHVI values to the average 2019 dollar value
ZHVI['ZHVI_A'] = ZHVI.ZHVI*base/ZHVI.CPI
#set ZHVI index to Date
ZHVI.set_index('Date', inplace = True)

ZHVI.head()

	RegionID	RegionName	SizeRank	ZHVI	Income	CPI	Income_A	ZHVI_A
Date
1979-03-01	102001	United States	0	47420.325963	16347.78	29.449364	59849.205547	173605.763941
1979-03-01	394913	New York, NY	1	52563.259032	17537.06	29.449364	64203.158387	192434.036547
1979-03-01	753899	Los Angeles-Long Beach-Anaheim, CA	2	73394.779918	17915.63	29.449364	65589.102763	268698.212808
1979-03-01	394463	Chicago, IL	3	60177.913481	20060.12	29.449364	73440.078419	220311.278553
1979-03-01	394514	Dallas-Fort Worth, TX	4	62602.241939	18151.53	29.449364	66452.732417	229186.742509

Data Visualization

Adjusted ZHVI is plotted against adjusted median income for each quarter from March of 1979 to June of 2017. The US as a whole is plotted in red. A line of best fit is calculated and plotted. Plots are provided to a function which converts them into frames of a gif.

          
#function to calculate regression line parameters which will be used in resulting plots
def reg(X,y):
    regr = LinearRegression()
    regr.fit(X,y)
    a= regr.coef_
    b= regr.intercept_
    r= regr.score(X,y)
    return a,b, r

#function for creating a plot which will become a single frame of the resulting gif
def plot_gif(date, data):
    #create subplot
    fig, (ax1) = plt.subplots(nrows =1, ncols =1, figsize=(10, 10), squeeze = True)

    #set x to median income by region and y to ZHVI by region fro the given month adjusted to the value of the 2019 dollar
    x= data[date]['Income_A']
    y= data[date]['ZHVI_A']

    #set US_x to median US income and US_y to total US ZHVI for the given month
    US_x = data[date].loc[data[date].RegionName == 'United States'].Income_A
    US_y = data[date].loc[data[date].RegionName == 'United States'].ZHVI_A

   #calculate regression line parameters for x and y
    a,b,r = reg(x.values.reshape(-1, 1),y.values.reshape(-1, 1))
    a= a.squeeze()
    b= b.squeeze()

    #set plot style
    plt.style.use('seaborn-whitegrid')
    #plot x and y
    plt.scatter(x,y, label = 'Adjusted ZHVI by Locality (Top 100)')
    #plot US_x and US_y in red
    plt.scatter(US_x, US_y, c='r', label = 'Adjusted ZHVI Total US')
    #set y axis limits
    ax1.set_ylim(0,1000000)
    #set x axis limits
    ax1.set_xlim(20000,100000)
    #set axis labels and plot title
    ax1.set_ylabel('ZHVI Adjusted to 2019 Dollars', fontsize = 'large')
    ax1.set_xlabel('Median Income Adjusted to 2019 Dollars', fontsize = 'large')
    ax1.set_title('Regional ZHVI vs. Median Income: '+ date )
    #create regression line equation string
    reg_equation = 'y= %fx + %f \nR-squared: %f' % (a,b,r)
    #annotate plot with regression line equation and r_squared value
    plt.text(x= 21000, y=940000,s=reg_equation)

    #plot regression line
    plt.plot([20000,100000], [a*20000+b, a*100000+b])

    #format ticks to include $ sign
    formatter =ticker.FormatStrFormatter('$%d')
    ax1.yaxis.set_major_formatter(formatter)
    ax1.xaxis.set_major_formatter(formatter)
    ax1.legend(loc='upper right')

    # Used to return the plot as an image array
    fig.canvas.draw()
    # draw the canvas, cache the renderer
    image = np.frombuffer(fig.canvas.tostring_rgb(), dtype='uint8')
    image  = image.reshape(fig.canvas.get_width_height()[::-1] + (3,))

    return image

#limit data to the 100 largest regions within the US
data = ZHVI.loc[ZHVI.SizeRank<100].dropna()

#create gif by passing list of plots and desired fps value into function
imageio.mimsave('./bubble.gif', [plot_gif(date,data) for date in pd.unique(data.index.strftime('%Y-%m'))], fps=20)

Interpretation

The visualization shows the dynamic relationship between home values and median incomes for 100 regions within the US. For any given period this relationship is roughly linear with housing values predictably increasing with increasing median income. However, this relationship is constantly changing. For instance, in March 1979 the slope of the regression line was 2.9 meaning that median home values increased by 2.9 dollars for every dollar increase of median income between regions. In June of 2017 the slope of the regression line was 10.1 a three fold increase in slope. This implies that it is substantially more expensive to live in higher income area in 2017 than it was in 1979. During the height of the housing bubble the slope of the regression line increased to ~ 12. Perhaps the most salient feature of this dynamic plot is the visualization of the boom-bust cylce of the economy which appears as undulating expansions and contractions of the data points over time. For instance, we can visually assess three contractions one in the early 1980s (Early 1980s Recession), one in the early 1990s (Early 1990s Recession) and one in the late 2000s (Financial Crisis of 2007-08). The last of these is especially dramatic, which is not suprising since it corresponds to the bursting of the subprime mortgage bubble.

Another interesting observation is that while some regions change quite a bit over time in terms of median income and home values, other regions are fairly static. For instance, the point (in red) representing the US as a whole, doesn't move substantially from its starting point in 1979 to its end point in 2017.

For those interested, the Ipython notebook and data are dowloadable from Kaggle.