Efficient Feature Engineering of Event Data in BigQuery

Mikalai Tsytsarau, PhD, GCP Professional Data Engineer

Most app developers are familiar with Firebase – a powerful platform and SDK backed by Google, which provides a database, analytics, messaging and everything else needed for app development in one seamless package.

Among other things, Firebase provides convenient user profile and event logging functionality, which can be used to facilitate app analytics, user attribution and churn analysis. A distinctive feature of Firebase is its ability to log an unlimited number of events across up to 500 user-defined event types, with free analytics on top.

Firebase also generates predictions and builds user segments based on the event stream (event occurrences) – functionality that is, as usual, seamlessly integrated.

However, predictions based only on event occurrences can miss the important context of each event or user. Ideally, we want to analyze both event occurrences and aggregated (‘meta’) event data – in other words, engineer new features.

The differences between Event Analytics and Feature Analytics are schematically shown in Figure 1:

Figure 1: events sequence and aggregation

  • In-app events arrive in an ordered sequence and can be analyzed for churn or causality patterns (e.g., using funnels for Event Analytics)
  • Events of the same kind can also be analyzed as a collection with statistical methods (e.g., revealing the distribution of their values for Feature Analytics)

Let’s discuss the benefits feature engineering can bring to the table:

FEATURE ENGINEERING BENEFITS

  • Analysts are in control of features
  • Domain knowledge is used to engineer meaningful/better features
  • Facilitates better understanding of users
  • Features can be used for regular app analytics too, like segmentation
  • Facilitates simpler analytical queries compared to events
  • Considerably smaller size of data

However, there are some challenges to feature engineering, which are summarized below:

FEATURE ENGINEERING CHALLENGES

  • Designing good features requires experience and business knowledge
  • Requires massive retrieval, aggregation and analysis of event data
  • Event params and data can differ between event kinds
  • Event params and features often evolve over time

Moreover, the nested format in which Firebase stores event and user parameters makes it especially difficult to construct efficient aggregation queries in any database, and complicates data extraction even further. Let’s see if Google’s BigQuery can ease this work for us and speed up feature engineering and analytics.

WHAT IS BIGQUERY

BigQuery is Google’s enterprise analytical data warehouse, which can run blazing-fast SQL queries over gigabytes to petabytes of data. Firebase can export complete event data to BigQuery daily in its original format, where it can be processed and analyzed on a massive scale. BigQuery can handle huge feature aggregation queries as long as they have efficient joins, so our task is to de-normalize and unnest the repeated Firebase data into a form that makes such queries efficient and effective.

UNDERSTANDING FIREBASE DATA

After you set up a daily export to BigQuery, Firebase creates a new table in the specified BigQuery dataset – or in separate datasets, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format: user data, such as demographics and campaign attribution, arrives in a nested structure named user_dim, and event data in a structure named event_dim. Such user_dim-event_dim tuples appear as repeated rows in small batches per user, which can be easily unrolled. However, Firebase also supports custom app events, each of which can carry up to 25 parameters as key-value pairs. Unpacking and analyzing these parameters is both a prime target for app analysts and a complicated task.

Let’s take a look at an example row of event data for one user, loaded in a single batch from Firebase, shown in Figure 2:

Figure 2: sample event data from Firebase
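The figure itself is not reproduced here, but the shape of the exported data can be glimpsed with a simple query against the daily Firebase export table. A minimal sketch, assuming the app_events_* export schema described above (the project, dataset and date suffix are illustrative placeholders):

```sql
-- Peek at the nested batch structure: one user_dim record per batch,
-- a repeated event_dim array, and a repeated params array per event.
SELECT
  user_dim.app_info.app_instance_id,
  user_dim.geo_info.country,
  event.name,
  event.timestamp_micros,
  ARRAY_LENGTH(event.params) AS param_count
FROM `my_project.my_dataset.app_events_20190101`,
  UNNEST(event_dim) AS event
LIMIT 10;
```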

In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value rows, and, additionally, all rows in a batch are associated with a snapshot of the user profile current at that time, which has its own set of parameter key-value records (not shown). This forces rather inefficient SQL constructs whenever you need to unroll and aggregate certain event parameters with the UNNEST() function (see Figure 3).

Figure 3: sample aggregate query
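As a stand-in for the figure, a typical “unroll and aggregate” query of this kind might look as follows. This is a hedged sketch: the 'purchase' event and its 'purchase_value' parameter are hypothetical names chosen for illustration, as are the project and dataset:

```sql
-- Average value of one event parameter per user: requires a double
-- UNNEST (events, then their params) plus a key filter.
SELECT
  user_dim.app_info.app_instance_id AS user_id,
  AVG(param.value.double_value) AS avg_purchase_value
FROM `my_project.my_dataset.app_events_20190101`,
  UNNEST(event_dim) AS event,
  UNNEST(event.params) AS param
WHERE event.name = 'purchase'
  AND param.key = 'purchase_value'
GROUP BY user_id;
```

Every parameter you aggregate adds another UNNEST-and-filter of this shape, which is what makes such queries hard to scale across many features.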

EFFICIENT FEATURE ENGINEERING

First of all, we unpack all user and event properties from repeated rows into serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, so that we can focus on aggregating events and profiles.

Following this, we extract and store all events and associated profiles from batch records into the same denormalized row structure, which contains a JSON field for the properties of each. Such a row structure allows us to:

  • Query event data for users and analyze features on demand
  • Stream user events and construct features on a continuous basis

The corresponding transform query is based on BigQuery’s UDF functionality, which lets us define a JavaScript function that transforms repeated key-value rows into JSON. This query is shown in Figure 4:

Figure 4: Denormalizing transform query, aggregating repeated rows
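Since the figure is not reproduced here, a minimal sketch of such a transform is given below. The UDF name params_to_json and the table names are illustrative assumptions; the coercion of typed parameter values to strings is one possible simplification, not necessarily the exact approach in the original query:

```sql
-- Temporary JavaScript UDF: collapse repeated key-value rows into one JSON string.
CREATE TEMP FUNCTION params_to_json(params ARRAY<STRUCT<key STRING, value STRING>>)
RETURNS STRING
LANGUAGE js AS """
  var obj = {};
  (params || []).forEach(function(p) { obj[p.key] = p.value; });
  return JSON.stringify(obj);
""";

-- One denormalized row per event, with all params serialized to JSON.
SELECT
  user_dim.app_info.app_instance_id AS user_id,
  event.name AS event_name,
  event.timestamp_micros AS event_time_micros,
  params_to_json(ARRAY(
    SELECT AS STRUCT
      p.key,
      COALESCE(p.value.string_value,
               CAST(p.value.int_value AS STRING),
               CAST(p.value.double_value AS STRING)) AS value
    FROM UNNEST(event.params) AS p
  )) AS event_params_json
FROM `my_project.my_dataset.app_events_20190101`,
  UNNEST(event_dim) AS event;
```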

After this transformation, we can define aggregation queries that construct features for every user in the database and their relevant events – all in a single pass. Since the complete data set is stored in the source table, BigQuery’s massive parallelism can be exploited very efficiently here. For instance, we can execute complex queries, as shown in Figure 5, creating complete user and event features ready for Data Science or other kinds of advanced analytics:

Figure 5: Feature aggregation query example, constructing a feature table in one pass
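In place of the figure, here is a hedged sketch of what such a one-pass feature query could look like over the denormalized table produced above. The table name denormalized_events and the feature definitions (purchase counts, spend, activity span) are illustrative examples, not the exact features from the original query:

```sql
-- One-pass feature table: per-user aggregates over denormalized events,
-- pulling parameter values out of the serialized JSON field.
SELECT
  user_id,
  COUNT(*) AS total_events,
  COUNTIF(event_name = 'purchase') AS purchase_count,
  SUM(SAFE_CAST(
    JSON_EXTRACT_SCALAR(event_params_json, '$.purchase_value') AS FLOAT64
  )) AS total_purchase_value,
  TIMESTAMP_DIFF(
    TIMESTAMP_MICROS(MAX(event_time_micros)),
    TIMESTAMP_MICROS(MIN(event_time_micros)),
    DAY
  ) AS active_span_days
FROM `my_project.my_dataset.denormalized_events`
GROUP BY user_id;
```

Because every feature is just another aggregate expression over the same rows, adding a new feature means adding one line to the SELECT list rather than another UNNEST join.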

Our next article will focus on how to extract maximum value from the collected features by applying machine learning to reveal their predictive power and importance, so stay tuned!

DELVE Experts
delve.experts@delvepartners.com

