Optimizing App Predictions with AutoML Tables

Mikalai Tsytsarau, PhD,
GCP Professional Data Engineer, DELVE

AutoML Tables automatically builds and deploys powerful machine learning models from tabular data containing feature vectors. It scales model complexity and topology with input data size, applying regression to simpler datasets and more advanced models, such as ensembles and deep learning, to more complex ones. Under the hood, AutoML runs on Google Cloud’s infrastructure, which supports model training, deployment, and serving with low latency and high scalability regardless of workload volume, and with the added benefit that pricing depends only on consumed resources. Moreover, a trained model can be deployed immediately for batch and online prediction using SQL-like or API queries.

AutoML reduces the time needed to build usable machine learning models from months to days (sometimes less); speeding up the development stage saves money and minimizes time-to-market.

The machine learning lifecycle in AutoML consists of three major steps: connecting data and selecting features, training and evaluating a model, and finally deploying the model to serve predictions, as shown schematically in Figure 1. Each step has a dedicated configuration page that presents a few options and parameters. With common data types, however, it is entirely possible to use the default settings and let the system do the bulk of the configuration work.

In the following paragraphs, we will highlight the process of configuring, training and evaluating a model for simple native-App features, and share some important details which can help improve this process and the end result for your App.


Figure 1: AutoML lifecycle (source: https://cloud.google.com/automl-tables/)

Training a Model

The AutoML visual interface is conveniently structured into several tabs that guide you through the major machine learning lifecycle steps. AutoML provides detailed feedback on estimated parameters and selected options throughout each step of the configuration, enabling you to troubleshoot as necessary. The whole feature selection process can even be automatic, with AutoML recognizing a wide range of data types out of the box (e.g., numbers, categorical features, strings, timestamps, and structures). All this helps avoid routine data preparation and model configuration problems and lets analysts focus on feature engineering. An example of features from our sample user dataset, as recognized by AutoML, is shown in Figure 2:


Figure 2: An example of feature recognition in AutoML

As can be seen in the above figure, we selected a categorical column named ‘payer_status’ as the target for prediction, aiming to predict users who will make an in-App purchase. This feature takes one of two possible values, ‘payer’ and ‘non_payer’, making it a classic binary classification problem. Picking other types of targets for prediction is also possible, as well as excluding unnecessary columns, such as user ‘ID’. From this figure, one can also notice that upon evaluating training data, the system automatically determines whether any of the features can be empty. At this point, we can override the system’s choice if we know the properties of our data better and/or expect other values in the future. We can then proceed by creating a training dataset.

An AutoML training dataset must meet a few minimum and maximum requirements to be useful for machine learning; the major ones relate to the size of the data:

– A dataset must have 1,000 to 100,000,000 rows
– The dataset schema must contain between 1 and 1,000 features
– At least 50 rows (instances) should be present for each class
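As a quick sanity check before importing data, these limits can be verified programmatically. Below is a minimal sketch in plain Python; the helper name and the example counts are our own, with the limits taken directly from the list above:

```python
# Hypothetical pre-import sanity check for AutoML Tables dataset limits.
def check_dataset_limits(num_rows, num_features, rows_per_class):
    """Return a list of violated AutoML Tables dataset requirements."""
    problems = []
    if not 1_000 <= num_rows <= 100_000_000:
        problems.append("row count must be between 1,000 and 100,000,000")
    if not 1 <= num_features <= 1_000:
        problems.append("feature count must be between 1 and 1,000")
    # rows_per_class maps each class label to its number of rows
    for label, count in rows_per_class.items():
        if count < 50:
            problems.append(f"class '{label}' has fewer than 50 rows")
    return problems

# Example: a payer/non_payer dataset with too few 'payer' rows
issues = check_dataset_limits(
    num_rows=12_000,
    num_features=25,
    rows_per_class={"payer": 40, "non_payer": 11_960},
)
```

Running such a check against the per-class counts in particular is worthwhile, since a heavily skewed dataset can satisfy the row total while still starving the minority class.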

Usually, 10,000 to 100,000 rows of data are sufficient to start learning and obtain meaningful results for prototyping. However, the more unique training data you push through the production model, the better it will perform, thanks to its adaptive architecture and parameters. If your problem yields less data than the minimum requirements, duplicating training data to meet them is possible but not recommended, since the model won’t reach its potential performance. Below is a list of a few tips that can improve training time and model performance:


Tips for Improved Training

We believe that AutoML facilitates model training and tuning, but it does not eliminate feature engineering, which benefits training no matter how small or big the problem is.

– Use a reasonably smaller sample of your data for exploration, then a complete one for production-ready training
– Use as many features as you have in the first training run
– Gradually remove unused features by consulting the model’s evaluation
– Avoid features that depend on the target class (e.g., features caused by it), as they leak the target
– Use feature-specific data types where applicable
– Include aggregated or “context” data, along with raw features
– Avoid missing values where possible; use ‘null’ values for empty cells
– Curate categorical feature values, limiting their size
– Try including timestamp & weight columns with rows of data
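Two of the tips above — nulling empty cells and curating categorical values — can be applied in a simple preprocessing pass before import. This is a sketch in plain Python; the column contents, threshold, and bucket label are illustrative assumptions, not part of AutoML itself:

```python
from collections import Counter

def clean_categorical(values, min_count=50, other_label="other"):
    """Null out empty cells and collapse rare categories into one bucket.

    Empty strings become None (an explicit null for the empty cell),
    and any category appearing fewer than `min_count` times is replaced
    by `other_label`, keeping the feature's cardinality small.
    """
    counts = Counter(v for v in values if v)
    cleaned = []
    for v in values:
        if not v:
            cleaned.append(None)          # explicit null for empty cells
        elif counts[v] < min_count:
            cleaned.append(other_label)   # curb long-tail categories
        else:
            cleaned.append(v)
    return cleaned

# Example: keep frequent countries, bucket the rare ones
countries = ["US"] * 60 + ["DE"] * 55 + ["XX"] * 3 + ["", ""]
cleaned = clean_categorical(countries, min_count=50)
```

Capping cardinality this way both speeds up training and prevents the model from memorizing values it has seen only a handful of times.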


Tips for Evaluating a Model

After setting up the dataset and creating your model, AutoML provides detailed feedback on the model’s performance. Let’s take a look at example model statistics for our test data, shown in Figure 3:


Figure 3: AutoML model evaluation details (simplified)

In the above figure, the main parameter to look at is called ‘Accuracy’. This is the percentage of samples whose classes the model predicted correctly. However, this statistic doesn’t take the distribution of classes into account and can therefore be misleading if we have unbalanced data containing more samples of one class than another. In such a case, even a ‘dumb’ predictor that assigns all records to the same (prevailing) class can yield a high score. Therefore, we should mainly look at more balanced evaluation metrics, such as:

– AUC (area under the ROC curve)
– F1 score (combining Precision and Recall)

AUC measures the area under a curve that plots the true-positive rate versus the false-positive rate, as demonstrated in Figure 4. The curve shows how many ‘true’ positive results can be predicted if we allow a certain ‘false’ positive error level. For instance, if our model reaches a 100% true-positive rate with 0% errors, then our AUC equals 1.0 and we have a perfect model. Note that this measure characterizes the model as a whole and does not depend on the actual operating point we pick (represented by the blue dot in the image). We will get back to this parameter later on.
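To make the AUC definition concrete, it can be approximated from a handful of ROC points with the trapezoidal rule. A minimal sketch — the sample ROC points below are made up for illustration, not taken from our model:

```python
def auc_trapezoid(roc_points):
    """Approximate the area under an ROC curve given (fpr, tpr) points.

    Points are sorted by false-positive rate; the area is summed
    trapezoid by trapezoid, so a perfect curve yields 1.0.
    """
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A random classifier's diagonal ROC gives AUC = 0.5
print(auc_trapezoid([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))  # 0.5
# A perfect classifier hits TPR = 1.0 already at FPR = 0.0
print(auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0
```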

F1 is another measure of prediction performance; it is the harmonic mean of the model’s Precision and Recall. Since there is a tradeoff between the proportion of correct positive predictions (Precision) and the fraction of all positive samples that are found (Recall), this combined score reflects the model’s discriminatory performance in general, regardless of the relative proportion of class samples.
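The relationship between Precision, Recall, and F1 follows directly from confusion-matrix counts. A small sketch — the example counts are illustrative, not results from our model:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and their harmonic mean (F1).

    tp/fp/fn are true-positive, false-positive, and false-negative
    counts from the confusion matrix.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 80 payers found, 20 false alarms, 20 payers missed
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

Because F1 is a harmonic mean, it is dragged down sharply by whichever of the two components is weaker, which is exactly why it resists the imbalance problem that plain accuracy suffers from.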


Figure 4: AutoML model evaluation ROC curve


Figure 5: Class confusion matrix

Finally, AutoML outputs a class confusion matrix, shown in Figure 5. This matrix indicates how many samples of one class were predicted as another and vice-versa, depending on your selected decision boundary. By default, the model picks a boundary which maximizes our target metric, which is F1 in this example. However, there are some situations where you’ll want to override this choice, for example:

– If you want to include more potential buyers for in-app offers – balance towards higher recall of ‘payer’ class
– If you need more precise examples to train for user acquisition – pick the value which optimizes the precision of ‘payer’ class
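Overriding the default boundary amounts to scanning candidate thresholds over scored predictions and picking the one that optimizes the metric you care about. A hedged sketch of that scan — the scores and labels are made up, and AutoML exposes this choice through its UI slider rather than through code like this:

```python
def best_threshold(scores, labels, metric):
    """Scan candidate thresholds and return the one maximizing `metric`.

    `scores` are predicted 'payer' probabilities, `labels` are 1 for
    payer / 0 for non_payer, and `metric` maps (tp, fp, fn) to a value.
    Ties keep the lowest threshold.
    """
    best_t, best_v = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        v = metric(tp, fp, fn)
        if v > best_v:
            best_t, best_v = t, v
    return best_t

def recall(tp, fp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
# Favoring recall of the 'payer' class pushes the threshold down
t = best_threshold(scores, labels, recall)  # t == 0.2
```

Swapping `recall` for a precision metric moves the chosen boundary up, mirroring the two scenarios in the bullets above.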

Tips for Feature Engineering

Feature engineering has many benefits even in the context of automated learning. First, by applying domain knowledge, we can engineer more meaningful features than available ML methods could produce on their own. Second, engineered features have considerably smaller data sizes, reducing training time and costs. Finally, such features can be reused for regular analytics, such as statistics or user segmentation. However, feature engineering also brings challenges, mainly associated with the time and cost of establishing data pipelines and feature processing:

– Designing good features needs experience & business knowledge
– It requires massive retrieval, aggregation, and analysis of event data
– Data, as well as features, are often evolving with time, requiring updates

When designing a model for user predictions, it’s important to test various sets of features and note their performance. Sometimes it is possible to trade some of the more valuable, but variable, features for those less dependent on user demographics and more on user engagement (e.g., country and language for usage statistics). This way we can design a model that is less dependent on user acquisition sources and train it on readily available and static data. Overall, we recommend testing a variety of features, even unobvious ones, including the following:

– User profile & Geo
– Event data
– App status features

Feature importance provides valuable feedback on how the data affects the resulting prediction. For instance, it can show which features were more or less useful for discriminating target values, as shown in Figure 6. In this case, we found that country, language, and device price had a bigger impact on prediction than a device’s brand or its screen size. In another scenario, when adding App statuses and event features to the mix, we found that they, in turn, became more important than user demographic and geographic features.


Figure 6: Feature importance

Tips for pLTV predictions

– The training dataset should include samples from similar user traffic
– The model must be updated in sync with game mechanics
– In-app offers can interfere with LTV prediction!
– Training and prediction features should be uniform with regard to the prediction target
– It’s beneficial to include “meta-event” features, like event frequencies/delays
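“Meta-event” features like those in the last tip can be derived from a raw event log with a simple aggregation. A sketch under assumed inputs — a list of one user’s event timestamps in seconds; the feature names are our own:

```python
def meta_event_features(timestamps):
    """Derive frequency and delay features from one user's event times.

    Returns the event count and the mean delay (in seconds) between
    consecutive events — two simple "meta-event" features that often
    carry more signal than the raw events themselves.
    """
    ts = sorted(timestamps)
    count = len(ts)
    if count < 2:
        return {"event_count": count, "mean_delay_s": None}
    delays = [b - a for a, b in zip(ts, ts[1:])]
    return {
        "event_count": count,
        "mean_delay_s": sum(delays) / len(delays),
    }

# Example: four sessions, one hour apart on average
features = meta_event_features([0, 3600, 7200, 10800])
```

Computed per user, such aggregates slot directly into the training table as ordinary numeric columns.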


Ready to take your ads to the next level?  
DELVE is your strategic partner for site-side analytics, campaign management, and advanced marketing science. As experts in Google Marketing Platform and Google Cloud Platform, DELVE drives client growth through a data-driven mindset that converts digital inefficiency into hard ROI.
SEE EXAMPLES of our experience and reviews from our clients.
Contact us to learn more about how we help our clients get advertising right.


DELVE Experts
delve.experts@delvepartners.com

