Efficient Feature Engineering of Event Data in BigQuery

Mikalai Tsytsarau, PhD GCP Professional Data Engineer

Most App developers are familiar with Firebase – a powerful platform and Software Development Kit backed by Google, which provides a secure database, analytics, messaging, and everything else needed for app development, all in one seamless package.

Firebase provides convenient user profile and event logging functionality, which facilitates app analytics, user attribution, and churn analysis. A distinct feature of Firebase is its ability to log an unlimited number of events comprised of 500 user-defined kinds, and it also provides free analytics.

Firebase also generates predictions and organizes user segmentation based on an event stream (event occurrences): all with full integration.

However, when we base predictions only on event occurrences, it’s possible to miss the context of each event or user. Ideally, we need to analyze both event occurrences and aggregated (‘meta’) event data (i.e., engineer new features).

Figure 1 shows the differences between Event Analytics and Feature Analytics:

Figure 1: events sequence and aggregation

In-app events arrive in an ordered sequence, analyzed for churn or causality patterns (e.g., using funnels for Event Analytics)

Next, we analyze events of the same kind as a collection with statistical methods (e.g., revealing the distribution of their values for Feature Analytics)

Let’s discuss the benefits feature engineering can bring to the table:

Benefits of Feature Engineering

  • Analysts control features
  • Analysts use domain knowledge to engineer more meaningful features
  • Feature Engineering facilitates better understanding of users
  • Features can be used for regular app analytics too, like segmentation
  • Feature Engineering facilitates more straightforward analytical queries compared to events
  • Feature Engineering uses a considerably smaller size of data

However, there are also some challenges with feature engineering:

Challenges of Feature Engineering

  • Designing good features needs experience & business knowledge
  • Requires massive retrieval, aggregation, and analysis of event data
  • Events params and data can be different for various events
  • Events params and features are often evolving with time

Additionally, the nested storage format of event and user parameters in Firebase makes it challenging to construct efficient aggregation queries in any database and makes data extraction even more complicated. Luckily, Google’s BigQuery can speed-up feature engineering and analytics, which makes the overall process much more manageable.

Understanding BigQuery

BigQuery is Google’s enterprise analytical data warehouse, which runs blazing-fast SQL queries on gigabytes and petabytes of data. Firebase can export complete event data in its original format to BigQuery daily, where it is processed and analyzed on a massive scale. BigQuery can handle huge feature aggregation queries as long as they have efficient joins, and as long as analysts de-normalize and unnest repeated Firebase data. 

Understanding Firebase Data

After setting up a daily export to BigQuery, Firebase creates a new table in specified BigQuery dataset, or in separate ones, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format. This provides user data, such as demographic and campaign data, in a nested structure named user_dim and event data in a structure named event_dim. 

The user_dim-event_dim tuples display as repeated rows in small batches per user, which we can easily unroll. However, Firebase also supports custom app events, which can be associated with up to 25 parameters as key-value pairs. Unpacking and analyzing the parameters is a desired target for app analysts, but it’s also a complicated task.

Let’s take a look at the example row of event data for one user, loaded in a single batch from Firebase, shown in Figure 2:

Figure 2: sample event data from Firebase

In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value pair rows. Additionally, all rows in a batch are associated with a contemporary user profile that has a set of its own parameter key-value records (not shown). This requires writing rather inefficient SQL constructs when you need to unroll and aggregate specific event parameters with the UNNEST function (see Figure 3).

Figure 3: sample aggregate query

Efficient Feature Engineering: Step-By-Step

First, we unpack all user and event properties from repeated rows to serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, to focus on events and profile aggregation.

Then, we extract and store all events and associated profiles from batch records on the same denormalized row structure, which contains a JSON field that stores each property. This row structure creates query event data for users and analyzed features on-demand. It also streams user events and constructs features continually.

BigQuery’s UDF forms the basis for the corresponding transform query. This allows us to define the JavaScript function that transforms repeated key-value rows into JSON. Figure 4 shows this query:

Figure 4: Denormalizing transform query, aggregating repeated rows

After this transformation, we define aggregation queries that construct features for every user in the database and their relevant events – all in one single pass.

Since the source table stores the complete set of data, the massive parallelism of BigQuery is, in this case, extremely efficient. For example, we can execute complex queries, (shown in Figure 5), by creating complete user and event features, readied for Data Science or other types of advanced analytics:

Figure 5: Feature aggregation query example, constructing a feature table in one pass

Stay tuned- in a future blog, we’ll focus on how to extract maximum use from collected features and learn how to apply machine learning that reveals prediction power and other vital insights. 

Are you looking for more information on the Google Cloud Platform and Google Big Query? Check out our blog post here, and learn how we leverage the power of Google Cloud Platform to elevate your digital marketing strategy.


Ready to take your ads to the next level?  
DELVE is your strategic partner for site-side analytics, campaign management, and advanced marketing science. As experts in Google Marketing Platform and Google Cloud Platform, DELVE drives client growth through a data-driven mindset that converts digital inefficiency into hard ROI.
SEE EXAMPLES of our experience and reviews from our clients.
Contact us to learn more about how we help our clients get advertising right.
DELVE Experts
delve.experts@delvepartners.com


Efficient Feature Engineering of Event Data in BigQuery

Mikalai Tsytsarau, PhD GCP Professional Data Engineer Most App developers are familiar with Firebase –…

Efficient Feature Engineering of Event Data in BigQuery

Mikalai Tsytsarau, PhD GCP Professional Data Engineer

Most App developers are familiar with Firebase – a powerful platform and Software Development Kit backed by Google, which provides a secure database, analytics, messaging, and everything else needed for app development, all in one seamless package.

Firebase provides convenient user profile and event logging functionality, which facilitates app analytics, user attribution, and churn analysis. A distinct feature of Firebase is its ability to log an unlimited number of events comprised of 500 user-defined kinds, and it also provides free analytics.

Firebase also generates predictions and organizes user segmentation based on an event stream (event occurrences): all with full integration.

However, when we base predictions only on event occurrences, it’s possible to miss the context of each event or user. Ideally, we need to analyze both event occurrences and aggregated (‘meta’) event data (i.e., engineer new features).

Figure 1 shows the differences between Event Analytics and Feature Analytics:

Figure 1: events sequence and aggregation

In-app events arrive in an ordered sequence, analyzed for churn or causality patterns (e.g., using funnels for Event Analytics)

Next, we analyze events of the same kind as a collection with statistical methods (e.g., revealing the distribution of their values for Feature Analytics)

Let’s discuss the benefits feature engineering can bring to the table:

Benefits of Feature Engineering

  • Analysts control features
  • Analysts use domain knowledge to engineer more meaningful features
  • Feature Engineering facilitates better understanding of users
  • Features can be used for regular app analytics too, like segmentation
  • Feature Engineering facilitates more straightforward analytical queries compared to events
  • Feature Engineering uses a considerably smaller size of data

However, there are also some challenges with feature engineering:

Challenges of Feature Engineering

  • Designing good features needs experience & business knowledge
  • Requires massive retrieval, aggregation, and analysis of event data
  • Events params and data can be different for various events
  • Events params and features are often evolving with time

Additionally, the nested storage format of event and user parameters in Firebase makes it challenging to construct efficient aggregation queries in any database and makes data extraction even more complicated. Luckily, Google’s BigQuery can speed-up feature engineering and analytics, which makes the overall process much more manageable.

Understanding BigQuery

BigQuery is Google’s enterprise analytical data warehouse, which runs blazing-fast SQL queries on gigabytes and petabytes of data. Firebase can export complete event data in its original format to BigQuery daily, where it is processed and analyzed on a massive scale. BigQuery can handle huge feature aggregation queries as long as they have efficient joins, and as long as analysts de-normalize and unnest repeated Firebase data. 

Understanding Firebase Data

After setting up a daily export to BigQuery, Firebase creates a new table in specified BigQuery dataset, or in separate ones, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format. This provides user data, such as demographic and campaign data, in a nested structure named user_dim and event data in a structure named event_dim. 

The user_dim-event_dim tuples display as repeated rows in small batches per user, which we can easily unroll. However, Firebase also supports custom app events, which can be associated with up to 25 parameters as key-value pairs. Unpacking and analyzing the parameters is a desired target for app analysts, but it’s also a complicated task.

Let’s take a look at the example row of event data for one user, loaded in a single batch from Firebase, shown in Figure 2:

Figure 2: sample event data from Firebase

In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value pair rows. Additionally, all rows in a batch are associated with a contemporary user profile that has a set of its own parameter key-value records (not shown). This requires writing rather inefficient SQL constructs when you need to unroll and aggregate specific event parameters with the UNNEST function (see Figure 3).

Figure 3: sample aggregate query

Efficient Feature Engineering: Step-By-Step

First, we unpack all user and event properties from repeated rows to serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, to focus on events and profile aggregation.

Then, we extract and store all events and associated profiles from batch records on the same denormalized row structure, which contains a JSON field that stores each property. This row structure creates query event data for users and analyzed features on-demand. It also streams user events and constructs features continually.

BigQuery’s UDF forms the basis for the corresponding transform query. This allows us to define the JavaScript function that transforms repeated key-value rows into JSON. Figure 4 shows this query:

Figure 4: Denormalizing transform query, aggregating repeated rows

After this transformation, we define aggregation queries that construct features for every user in the database and their relevant events – all in one single pass.

Since the source table stores the complete set of data, the massive parallelism of BigQuery is, in this case, extremely efficient. For example, we can execute complex queries, (shown in Figure 5), by creating complete user and event features, readied for Data Science or other types of advanced analytics:

Figure 5: Feature aggregation query example, constructing a feature table in one pass

Stay tuned- in a future blog, we’ll focus on how to extract maximum use from collected features and learn how to apply machine learning that reveals prediction power and other vital insights. 

Are you looking for more information on the Google Cloud Platform and Google Big Query? Check out our blog post here, and learn how we leverage the power of Google Cloud Platform to elevate your digital marketing strategy.


Ready to take your ads to the next level?  
DELVE is your strategic partner for site-side analytics, campaign management, and advanced marketing science. As experts in Google Marketing Platform and Google Cloud Platform, DELVE drives client growth through a data-driven mindset that converts digital inefficiency into hard ROI.
SEE EXAMPLES of our experience and reviews from our clients.
Contact us to learn more about how we help our clients get advertising right.

Related Posts

How does your company score on Google’s Digital Marketing Maturity Framework and why should your brand care?

In 2017, Boston Consulting Group and Google developed the Digital Marketing Maturity Framework. In a…

Privacy Considerations for Digital Marketing

First the General Data Protection Regulation (GDPR) and now the California Consumer Privacy Act (CCPA)….

Circumventing Suspension in Google Ads

Account suspension issues can be very frustrating but we’re here to help. Most suspensions can…

Efficient Feature Engineering of Event Data in BigQuery

Mikalai Tsytsarau, PhD GCP Professional Data Engineer Most App developers are familiar with Firebase –…