Thursday, May 28, 2020

Machine learning in Google Sheets with TensorFlow.js and Google Apps Script


This article shows how to set up, train, and predict on spreadsheet data with the deep-learning framework TensorFlow.js. You don't need to call REST APIs or rely on third-party storage or algorithms. All your data stays in your secure Google Sheet.

A Google Spreadsheet with a demo dataset and the full Apps Script is linked at the end of the article.



Intro

Google recently introduced a new JavaScript runtime (the V8 engine) into Google Apps Script. It opens the G Suite platform to new automation use cases. It replaces the old Rhino JavaScript interpreter from Mozilla and lets you include modern JavaScript libraries.

TensorFlow was originally built for Python, but Google later added support for more languages and platforms (Node.js, JavaScript, Swift, ...). Keras is a high-level neural-network API that sits on top of TensorFlow. It suits beginners and helps you build neural networks. TensorFlow.js is a JavaScript-based framework for building neural networks, and its syntax is similar to Keras.

Machine learning as a whole is a very complex topic. It spans many use cases, architecture designs, setups, and small tweaks. My aim is not to give you a step-by-step tutorial covering machine learning, but to inspire you and show you another point of view on the power of Google Sheets with Google Apps Script.

Disclaimer: I used a small hack to include the TensorFlow.js library. I cannot guarantee that the results will be 100% accurate.

Use case

I guess you have plenty of data in your Google Sheets. Imagine that you want to predict the value in the last column based on several numeric columns. This is useful if you want to forecast future values from past values, or if some values are missing and you want to fill the gaps. This scenario is called multivariate regression.


Deploy TensorFlow.js in Google Apps Script

I copied the whole TensorFlow.js library as a single file, tf-js.gs, into the Google Apps Script project.


I had to prepare the TensorFlow.js library before training and predicting. First, the library refers to the global object under the name global. This was the easier part: I only had to define a new variable with one added line of code.
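The original one-liner is not reproduced in this text, so the following is only a minimal sketch of what such an alias could look like. It assumes that globalThis is available in the V8 runtime and that the bundled library only needs the identifier global to resolve to the global object.

```javascript
// Assumption: the bundled library looks up an identifier called `global`.
// Alias it to the global object so those lookups resolve.
var global = globalThis;
```

The alias has to be defined before any library code runs.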



Second, the TensorFlow.js library uses native APIs for measuring time, specifically performance.now() or process.hrtime().

performance.now() is available only in the browser API (e.g. Chrome), and process.hrtime() is available only in the backend API (Node.js). Because Google Apps Script offers neither method, I got the error "Cannot measure time in this environment. You should run tf.js in the browser or in Node.js".

I did not fully reverse-engineer the library, but I think the time measurement is used for yielding the main thread to other tasks. For this reason I set yieldEvery to "never" when calling model.fit() (https://js.tensorflow.org/api/latest/).

If you have more elegant solutions, ping me on Twitter or email.

Data

The Boston Housing Prices dataset is the "hello world" task of the machine learning world. It is a collection of 506 simple real-estate records collected around Boston (Massachusetts) in the late 1970s. Each row includes numeric measurements of a Boston neighborhood (e.g. crime rate, the typical size of homes, how far the region is from the closest highway, whether the area has waterfront property, ...).

These columns are called features (= inputs into the machine learning model).


We want to predict the price of a home from this dataset. There is only one such column, and it is called the target (= output of the machine learning model).

A prepared function downloads the dataset from Google Cloud Storage directly into the Google Sheet.
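The download function itself is not shown in this text. As a rough sketch of the idea, the helper below turns CSV text into the 2D array that a sheet range accepts. The name csvToRows and the placeholder DATASET_URL are assumptions for illustration, and the parser assumes a simple CSV without quoted fields.

```javascript
/**
 * Convert CSV text (one header row, then numeric rows) into a 2D array.
 * Assumes plain comma-separated values without quotes or embedded commas.
 */
function csvToRows(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  const header = lines[0].split(",").map((name) => name.trim());
  const rows = lines.slice(1).map((line) => line.split(",").map(Number));
  return [header].concat(rows);
}

// In Apps Script the result could be written to the sheet roughly like this
// (DATASET_URL is a placeholder, not the address used in the article):
//   const csv = UrlFetchApp.fetch(DATASET_URL).getContentText();
//   const values = csvToRows(csv);
//   SpreadsheetApp.getActiveSheet()
//     .getRange(1, 1, values.length, values[0].length)
//     .setValues(values);
```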

Data preparation

We have to divide the data into a training dataset and a testing dataset. The variable rowSplit defines the row number at which the split happens. In our case, rows 2 to 336 are used as the training dataset and the remaining rows (337 to 507) as the testing dataset. The variables FEATURE_COLUMN_FROM and FEATURE_COLUMN_TO define the feature columns for training, testing, and predicting. In our case, the feature data is loaded from columns 1 to 12.
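The split described above can be sketched on a plain 2D array, which is what a sheet range returns. The function name splitDataset and the assumption that the target sits in the column right after the last feature column are mine. The variable names and the row and column numbers come from the text.

```javascript
const FEATURE_COLUMN_FROM = 1;  // column A
const FEATURE_COLUMN_TO = 12;   // column L

/**
 * `values` holds the sheet rows without the header, so values[0] is sheet row 2.
 * Sheet rows 2..rowSplit become training data, the rest testing data.
 */
function splitDataset(values, rowSplit) {
  const trainCount = rowSplit - 1; // number of rows from sheet row 2 to rowSplit
  const features = (row) => row.slice(FEATURE_COLUMN_FROM - 1, FEATURE_COLUMN_TO);
  const target = (row) => [row[FEATURE_COLUMN_TO]]; // assumed: column after the features
  const train = values.slice(0, trainCount);
  const test = values.slice(trainCount);
  return {
    trainX: train.map(features),
    trainY: train.map(target),
    testX: test.map(features),
    testY: test.map(target),
  };
}
```

With rowSplit set to 336 and 506 data rows, this yields 335 training rows and 171 testing rows.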


As a last step, select a range in the Google Sheet. For example, select rows 7-9 to estimate the values in column M from the values in columns A-L.

TensorFlow.js does not work with the Array data structure directly, but with tensors. Tensors are multi-dimensional data structures. The function createTensor() creates a 2D tensor for us.

Several features (columns) contain values on a different scale (e.g. tax values 187 - 711) than others (e.g. crime rate 0.01 - 88.98). We have to normalize the values, which improves the performance and training stability of the model.
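A minimal sketch of such a helper is below. The signature is an assumption, since the original createTensor() is not shown here. The tf namespace is passed in as a parameter so the sketch does not depend on how the library was loaded.

```javascript
/**
 * Wrap a 2D array (rows x columns) in a 2D tensor.
 * `tf` is the loaded TensorFlow.js namespace.
 */
function createTensor(tf, rows) {
  const numRows = rows.length;
  const numCols = rows[0].length;
  // tf.tensor2d takes the nested array and an explicit [rows, columns] shape.
  return tf.tensor2d(rows, [numRows, numCols]);
}
```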
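The text does not say which normalization is used, so the sketch below assumes z-score standardization: each column is shifted by its mean and divided by its standard deviation. The statistics should come from the training data only and then be reused for the testing and prediction data. The function names are hypothetical.

```javascript
/** Per-column mean and (population) standard deviation of a 2D array. */
function columnStats(rows) {
  const n = rows.length;
  const numCols = rows[0].length;
  const mean = new Array(numCols).fill(0);
  const variance = new Array(numCols).fill(0);
  for (const row of rows) row.forEach((v, j) => { mean[j] += v / n; });
  for (const row of rows) row.forEach((v, j) => { variance[j] += (v - mean[j]) ** 2 / n; });
  return { mean, std: variance.map((v) => Math.sqrt(v)) };
}

/** Apply (value - mean) / std per column; constant columns map to 0. */
function normalize(rows, stats) {
  return rows.map((row) =>
    row.map((v, j) => (stats.std[j] === 0 ? 0 : (v - stats.mean[j]) / stats.std[j]))
  );
}
```

After this step a tax value and a crime-rate value end up on a comparable scale, which is the point of the normalization.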


Building the model and training

As you may already know, deep-learning networks contain several layers of neurons. We need to define the architecture of the neural network layer by layer. The syntax is similar to the Keras syntax mentioned above. Our architecture has two hidden layers, each containing 50 neurons.


Every hidden layer uses a sigmoid activation function.


The last layer contains only one neuron with the default (linear) activation function. It is linear because our example is a regression use case.
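The architecture described above could be written with the TensorFlow.js layers API roughly as follows. The function name buildModel is hypothetical, and tf is passed in as a parameter. The layer sizes and activations follow the text.

```javascript
/**
 * Two hidden layers with 50 sigmoid neurons each, plus one linear output neuron.
 * `numFeatures` is the number of feature columns (12 in this article).
 */
function buildModel(tf, numFeatures) {
  const model = tf.sequential();
  model.add(tf.layers.dense({ inputShape: [numFeatures], units: 50, activation: "sigmoid" }));
  model.add(tf.layers.dense({ units: 50, activation: "sigmoid" }));
  // No activation given, so the output stays linear, as regression needs.
  model.add(tf.layers.dense({ units: 1 }));
  return model;
}
```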

The next step is compilation. In this step, we need to set up
  • the optimizer (stochastic gradient descent), which determines how to search for the best solution, i.e. the neurons' weights
  • the loss function (meanSquaredError), which measures how close the solution is to optimal

Now it is time for training. The method .fit() trains the model on the data over several iterations (= EPOCHS).

Values such as the number of epochs, the number of layers, the number of neurons, and the type of activation function are hyperparameters. Data scientists around the world tune these values and compare the results with previous settings. More information about the settings is in the TensorFlow.js API reference: https://js.tensorflow.org/api/latest/
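Compilation and training can be sketched together as below. The values of EPOCHS and LEARNING_RATE and the function name compileAndFit are assumptions for illustration. In the TensorFlow.js API, yieldEvery is an option of fit(), which is where the sketch sets it to "never".

```javascript
const EPOCHS = 100;          // hypothetical value, tune for your data
const LEARNING_RATE = 0.01;  // hypothetical value, tune for your data

/**
 * Compile the model with SGD and mean squared error, then start training.
 * Returns the Promise produced by model.fit().
 */
function compileAndFit(tf, model, trainX, trainY) {
  model.compile({
    optimizer: tf.train.sgd(LEARNING_RATE),
    loss: "meanSquaredError",
  });
  return model.fit(trainX, trainY, {
    epochs: EPOCHS,
    yieldEvery: "never", // skip the timer-based yielding that Apps Script cannot support
  });
}
```

The returned Promise resolves to a history object, so the caller can await it and read the loss values per epoch.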


Evaluation allows you to check the accuracy of your model. A lower loss value is better. You should compare the train loss with the test loss. A test loss that is much higher than the train loss indicates overfitting, which is not ideal.
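To make the comparison concrete, here is a small sketch in plain JavaScript that computes the mean squared error and the gap between the two losses. A test loss well above the train loss is the usual sign of overfitting. The function names and the tolerance parameter are hypothetical.

```javascript
/** Mean squared error between two equally long arrays of numbers. */
function meanSquaredError(actual, predicted) {
  const sum = actual.reduce((acc, y, i) => acc + (y - predicted[i]) ** 2, 0);
  return sum / actual.length;
}

/** Flag likely overfitting when the test loss exceeds the train loss by more than `tolerance`. */
function compareLosses(trainLoss, testLoss, tolerance) {
  const gap = testLoss - trainLoss;
  return { gap, likelyOverfitting: gap > tolerance };
}
```

What counts as a large gap depends on the scale of the target values, so the tolerance has to be chosen per dataset.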

Prediction

Once we are satisfied with the quality of our model and the loss value is low enough, we can predict future values. We also need to convert the Array of input values into a tensor and normalize it in the same way as the training data.

In our code snippet, the predicted values are saved into cell notes, so you can compare them with the original values.


The main function loads and prepares the data and trains the model.
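Writing the notes could look roughly like this. The helper turns a flat array of predicted numbers into the one-column 2D array of strings that a range expects for notes. The name predictionsToNotes, the label text, and the rounding are assumptions.

```javascript
/**
 * Turn predicted numbers into a one-column 2D array of note strings,
 * one row per selected sheet row.
 */
function predictionsToNotes(predictions, decimals) {
  return predictions.map((value) => ["Predicted: " + value.toFixed(decimals)]);
}

// In Apps Script, with `targetRange` covering the target column of the selected rows:
//   targetRange.setNotes(predictionsToNotes(predicted, 2));
```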



Try it yourself!

If you want to test and play with it, the full dataset and Google Apps Script code are available in this Google Sheet.

1. Create a copy of the spreadsheet
2. Select any rows (the last value will be skipped during training)
3. Open the menu Tools --> Script editor
4. Select the menu Run --> Run function and choose Main
5. The predicted values will be saved as notes


If you like Google Apps Script, you will find like-minded folks in this Google Groups community
https://developers.google.com/apps-script/community