Predicting Property Prices Using Oracle ADW and AutoML: Part 2 — Exploring the Data & Creating a ML Model with AutoML

12 min readJul 6, 2022

This blog is part of a series. To learn more about this demo and to gain access to the other steps, refer to this blog. Now that our data is clean and ready for machine learning, we will use AutoML, a new component of Oracle Machine Learning that provides a no-code browser-based interface that automates the machine learning modeling process and simplifies deployment to just a few clicks, to predict the sale price of a property.

Oracle Machine Learning

To access Oracle Machine Learning, you first need to sign-out of the BROOKLYN schema and log back into the database as the ADMIN user. Do this by clicking on the BROOKLYN user icon in the top-right hand corner of the screen and then click Sign Out.

Next, sign-in to the database as the ADMIN user.

After you are logged-in as ADMIN, click the Oracle Machine Learning tile.

You will be brought to a new webpage. This is where we will access Oracle Machine Learning. Sign-in using your BROOKLYN schema credentials.

You will then enter the OML user interface. In the upper-left hand corner of your screen, click the hamburger icon. Under Project, click Notebooks

Once you are brought to the notebooks page, click the + Create button to create a new notebook. The collaborative notebook environment is based on Apache Zeppelin.

Let’s call this notebook Exploring Brooklyn Property Data. Then, click OK.

Your screen should now look like this.

Let’s use OML4SQL to visualize our data in the database. To visualize our original data, execute the following SQL code in your first box.

%sql
SELECT * FROM BROOKLYN_SALES_MAP

The output should look something like this.

Let’s count the number of rows in our BROOKLYN_COPY table with the following code.

%sql
SELECT COUNT(*) FROM BROOKLYN_COPY;

The output should look something like this.

Let’s count the number of rows in BROOKLYN_VIEW, which if you can recall, is the database view that filters the sale price of properties to be between $400K and $7M.

%sql
SELECT COUNT(*) FROM BROOKLYN_VIEW;

The output should look something like this.

Let’s look at the BUILDING_CLASS_CATEGORY column.

%sql
SELECT COUNT(BUILDING_CLASS_CATEGORY) AS NUMBER_OF_BUILDING_TYPES, BUILDING_CLASS_CATEGORY
FROM BROOKLYN_VIEW
GROUP BY BUILDING_CLASS_CATEGORY
ORDER BY NUMBER_OF_BUILDING_TYPES DESC;

The output should look something like this.

To visualize this data, click the bar graph icon. Then click settings.

Move building_class_category under Groups and NUMBER_OF_BUILDING_TYPES SUM under Values. Then, under xAxis, click Hide. You should now have a histogram that looks like this.

If you hover above a bar, you will be able to view the building class category and its count. The first bar shows that there are 44,672 two-family homes in the BROOKLYN_VIEW.

Lastly, let’s count the number of rows in our final view, BROOKLYN_VIEW_FINAL. This view filters both the price to be between $400K and $7M, as well as the building class category to be only one, two, and three family homes.

%sql
SELECT COUNT(*) FROM BROOKLYN_VIEW_FINAL;

The output should look something like this.

If you wish to access this notebook, you can download it here. Let’s move onto training our machine learning model using AutoML.

AutoML

OML AutoML UI automates four major steps in the machine learning modeling process that are often time consuming and tedious: algorithm selection, data sampling, feature selection, and model tuning.

To access AutoML, click AutoML under Quick Actions on the OML homepage.

Click +Create.

You will now be able to create an experiment. Fill in the boxes with the following parameters:

Name: Brooklyn1.0
Data Source: BROOKLYN.BROOKLYN_VIEW_FINAL
Predict: SALE_PRICE
Prediction Type: Regression

Under Additional Settings, for Model Metric, select R2. Your screen should look like this.

In the upper right-hand corner of your screen, click the Save button to save your experiment. After your experiment saves, click the drop-down carrot next to the Start button and then click Faster Results to start your experiment.

Now, let the experiment run. This should not take long with the faster results selection. Mine took 6 minutes to be exact. Once AutoML is done running, your experiment screen will look like this.

Comparing the R2 scores, the neural network algorithm produced the best model with an R2 of 0.5874. If you scroll down on the page, you will see a feature importance chart.

We are now going to auto generate a notebook that will allow us to interact with our best-performing model using OML4PY — AKA Python. First, under the Leader Board, select the Neural Network. Then, click Create Notebook.

Name the notebook Brooklyn1.0 NN and then click OK.

When prompted to do so, click Open Notebook.

The notebook will look like this.

This notebook is a launching pad for further data analysis. This autogenerated notebook enables us to rebuild the neural network model that AutoML found yielded the highest R2. Run the first two boxes. In the third box, titled Prepare training data, we are going to make some changes. Delete the code that is in this box and add the following.

%python
TRAIN, TEST = build_data.split(ratio = (0.6,0.4))
X_train = TRAIN.drop('SALE_PRICE')
y_train = TRAIN['SALE_PRICE']
X_test = TEST.drop('SALE_PRICE')
y_test = TEST['SALE_PRICE']

Change the name of this box to Split data into randomly selected 60% sample for train and 40% test. In this box, we are splitting the data. In the next box, we need add something to the last line; we need to add model_name = 'nn_brooklyn_nb' after y_train inside the nn_mod.fit() function. It will look like this.

By adding this parameter to this function, this will allow us to later invoke the model using the CLI. Run this box after doing this. This may take a few minutes to run because this code is re-creating the model from scratch. After this block of code finishes, run the next box. This box shows model details. After running this box, it should look like this.

The final three boxes we are going to delete — then we are going to add some more boxes. You can delete the box by clicking the gear icon in the upper right-hand corner of each of the boxes. Then, click Remove.

Insert a new box. Title this box Display model weights. Execute this code to view the model weights.

%python
weights = nn_mod.weights.round(4)
weights = weights.sort_values(by='WEIGHT', ascending=True)
z.show(weights)

Your output should look something like this.

Next, let’s explore the feature importance chart. The OML4Py Explainability module is used to identify the important features that impact a trained model’s predictions. Insert a new box. Title this box Feature Importance. Execute the following code to show the Global Feature importance.

%python
from oml.mlx import GlobalFeatureImportance
import pandas as pd
import numpy as npgfi = GlobalFeatureImportance(mining_function='regression', 
                              score_metric='r2', random_state=0, 
                              parallel=4)
explanation = gfi.explain(nn_mod, build_data, 'SALE_PRICE')
explanation

The output should look like this.

If you want to upload the feature importance chart to your database, you would use this snippet of code.

# Create a Dataframe
data = explanation.to_dict()
data = data['explanation']
my_list = []
for i in data:
    del i['feature_importance_per_iter']
    my_list.append(i)
df = pd.DataFrame(my_list)# Create Feature Importance Table
feature_importance_table = oml.create(df, table = 'FEATURE_IMPORTANCE_TABLE')
z.show(feature_importance_table)

The output would look like this.

If you run this box more than once, you will get an error if you don’t drop the FEATURE_IMPORTANCE_TABLE from your database beforehand, as the table already exists after running this box once.

Next, let’s graph the feature importance chart. Insert a new box. This is quite easy to do and only requires two lines of code.

explanation.to_pyplot_figure()
explanation.show_in_notebook()

The output should look like this.

Insert a new box. Title this box Score data. Execute the following code in this box.

%python
mod_predict = nn_mod.predict(build_data ,supplemental_cols = build_data[:, ['SALE_PRICE']]).pull()
y_true = mod_predict['SALE_PRICE']
y_pred = mod_predict['PREDICTION']
z.show(mod_predict.round(2))

The output should look something like this.

If you wish to view a scatter plot of this data, select the scatter plot icon.

Let’s compute the R2 using sklearn. Add a new box and title it Compute R2 using sklearn. Execute the following code.

%python
import sklearn as skl
metric_score = skl.metrics.r2_score(y_true, y_pred)
print(metric_score.round(4))

We should get an R2 of about 0.664. Your R2 might be slightly different.

Next, let’s compute some more quality statistics. Add a new box and title it Model quality statistics on the predicted values dataframe. Execute the following code.

%python
# The coefficient of determination Rˆ2 is part of the "score" method for OML models
CoefficientOfDeterminationR2 = nn_mod.score(X_test, y_test)
    
# Compute other statistics by hand - no need to pull data locally to Python
# PRED_DF[['PREDICTION', 'SALE_PRICE']
MeanAbsoluteError = (abs(mod_predict['SALE_PRICE'] - mod_predict['PREDICTION']).sum())/mod_predict['SALE_PRICE'].count()
Median = mod_predict['SALE_PRICE'].median()
MeanSquaredError =  ((mod_predict['SALE_PRICE'] - mod_predict['PREDICTION'])**2).sum()/mod_predict['SALE_PRICE'].count()
RootMeanSquaredError =  np.sqrt(MeanSquaredError)
average_sale_price = mod_predict["SALE_PRICE"].mean()print('Average Actual Sale Price: ', "{:,}".format(average_sale_price.round(0)))
print('Median Sale Price',  "{:,}".format(Median.round(0)))
print('R^2: ', "{:,}".format(CoefficientOfDeterminationR2.round(4)))
print('Mean Absolute Error: ', "{:,}".format(MeanAbsoluteError.round(0)))
print('Mean Squared Error: ', "{:,}".format(MeanSquaredError.round(0)))
print('Root Mean Squared Error: ', "{:,}".format(RootMeanSquaredError.round(0)))

The output should look something like this.

Note that this R2 is slightly different than the R2 in the previous step. This is because we are calculating the R2 using the “score” method for OML models, whereas in the previous step, we were using sklearn. It is interesting that the average sale price is about $803K and that the median sale price is $650K. It is also interesting that the mean absolute error is about $168K — that is how much our predicted sale price could be off by, plus or minus.

Next, let’s use matplotlib to graph the predictions versus actuals data. Add a new box and title it Using matplotlib, plot the predicted and actual sale price, log scale. Execute the following code.

%python
import matplotlib.pyplot as plt
plt.style.use('seaborn')
plt.figure(figsize=[9,7])
plt.margins(0.05)PRED_DF = mod_predict
x = PRED_DF[['SALE_PRICE']]
y = PRED_DF[['PREDICTION']]n = len(x)plt.plot(x, y, '.', c='black', alpha=0.8,  ms=7)
plt.plot( [np.min(y),np.max(y)],[np.min(y),np.max(y)],
          c='red', alpha=0.8,  ms=3,
          label='Perfect prediction reference line')
plt.yscale('log')
plt.xscale('log')
plt.legend(frameon=True, facecolor='lightgray')         
plt.xlabel('SALE_PRICE')
plt.ylabel('PREDICTION')
plt.xlim((250000,15000000))
plt.ylim((150000,15000000))
plt.title('The prediction vs. actual value for sale price',
          fontsize=16)
plt.show()

The output should look like this.

Because in this graph, unlike in the previous Zeppelin-generated graph, we are using a log scale, we can more easily see the data in a condensed way. We can interpret this graph as meaning that we should use the model where you have lots of training data, and we should not use use the model for data that is above or below this, where the predictions come in hot. When the sale price is lower, it is over predicting. When the sale price is higher, it is under predicting.

Let’s now plot the residuals using matplotlib. Add a new box and title it Using matplotlib, plot the residuals v. actual, log scale. Execute the following code.

%python
plt.style.use('seaborn')
plt.figure(figsize=[9,7])
plt.margins(0.05)PRED_DF['RESIDUAL'] = PRED_DF['SALE_PRICE'] - PRED_DF['PREDICTION']
x = PRED_DF[['SALE_PRICE']]
y = PRED_DF[['RESIDUAL']]
plt.plot(x, y, '.', c='black', alpha=0.8,  ms=7)plt.plot( [np.min(x),np.max(x)],[0,0],
          c='red', alpha=0.8,  ms=3,
          label='Perfect prediction (Zero residuals)')plt.legend(frameon=True, facecolor='lightgray',loc='upper left')         
plt.xlabel('SALE_PRICE')
plt.ylabel('RESIDUAL')
plt.xlim((350000,15000000))
plt.title('Sale Price vs. Residuals', fontsize=16)
plt.grid(True)
plt.xscale('log')
plt.show()

The output should look like this.

This graph is telling the same story as the previous graph, however, it more drastically shows the over predictions around the higher sale price. Let’s now create 101 bins and plot the average prediction versus the average actual sale price of a bin against each other. First, create a dataframe table that will have the bins that we need for this graph with the following code.

%python
Nbins = 101
df = PRED_DF
df = df[['SALE_PRICE', 'PREDICTION']].copy()
df['log_SALE_PRICE'] = np.log(df.SALE_PRICE)
df['COUNT'] = 0
df['bin'] = df.log_SALE_PRICE - df.log_SALE_PRICE.min()
df.bin /= df.bin.max()
df.bin *= Nbins
df.bin = df.bin.astype(int)
agger = {'COUNT':'count', 'SALE_PRICE':['mean', 'std'], 'PREDICTION':['mean', 'std']}
df = df.groupby('bin').agg(agger).reset_index()
df['N_records'] = df['COUNT']['count']
df['SALE_PRICE_mean'] = df['SALE_PRICE']['mean']
df['SALE_PRICE_sigma'] = df['SALE_PRICE']['std']/np.sqrt(df['N_records'])
df['SALE_PRICE_pred_mean'] = df['PREDICTION']['mean']
df['SALE_PRICE_pred_sigma'] = df['PREDICTION']['std']/np.sqrt(df['N_records'])
df_test_pred_mean = df
z.show(df_test_pred_mean)

The output should look like this.

Next, let’s graph this using matplotlib with the following code. Title this new box Plot SALE_PRICE vs bin.

%python
df = df_test_pred_mean
idx = (df.N_records > 1)
dfs = df[idx]
plt.figure(figsize=(12,6))
xp = dfs.SALE_PRICE_mean/1.0e6
yp = dfs.SALE_PRICE_mean/1.0e6
err = dfs.SALE_PRICE_sigma/1.0e6
plt.errorbar(xp, yp, err, linestyle='-', marker='o', markersize=5, label='Prediction=Actual')yp = dfs.SALE_PRICE_pred_mean/1.0e6
err = dfs.SALE_PRICE_pred_sigma/1.0e6
plt.errorbar(xp, yp, err, linestyle='-', marker='o', markersize=5, label='OML Neural Network')plt.title('Mean Prediction v. Mean Actual')
plt.xlabel('Sale Price M$')
plt.ylabel('Predicted Sale Price M$')
plt.xscale('log')
plt.yscale('log')
plt.xlim((0.2,15))
plt.ylim((0.2,15))
plt.legend()
plt.show()

The result should look like this.

Next, let’s graph the mean fractional error against the mean sale price with the following code in a new box. Name this box Plot mean fractional error vs SALE_PRICE_mean.

%python
plt.figure(figsize=(10,5))
xp = dfs.SALE_PRICE_mean/1.0e6
yp = (dfs.SALE_PRICE_pred_mean - dfs.SALE_PRICE_mean)/dfs.SALE_PRICE_mean
err = np.sqrt(dfs.SALE_PRICE_sigma**2 + dfs.SALE_PRICE_pred_sigma**2)/dfs.SALE_PRICE_mean
plt.errorbar(xp, yp, err, linestyle='-', marker='o', markersize=5, label='OML Neural Network')
yp = yp*0
plt.plot(xp, yp, linestyle='dashed', label='Prediction=Actual')plt.title('Mean Fractional Error = (Mean_Prediction - Mean_Actual)/Mean_Actual')
plt.ylabel('Fractional Error')
plt.xlabel('Sales Price (M$)')
plt.xscale('log')
plt.xlim((0.2,15))
plt.legend()
plt.show()

The output should look like this.

I think that is enough data analysis for now . If you wish to access this notebook, you can download it here. Let’s now learn how to invoke this model using the CLI in Part 3, also the final component, of this demo.

Predicting Property Prices Using Oracle ADW and AutoML: Part 2 — Exploring the Data & Creating a ML Model with AutoML

Oracle Machine Learning

AutoML

Written by Jared Bach