16 Oct Efficient Feature Engineering of Event Data in BigQuery
Mikalai Tsytsarau, PhD, GCP Professional Data Engineer
Most app developers are familiar with Firebase – a powerful platform and SDK backed by Google which provides a database, analytics, messaging and everything else needed for app development in one seamless package.
Among its many capabilities, Firebase provides convenient user-profile and event-logging functionality, which can be used to facilitate app analytics, user attribution and churn analysis. A distinctive feature of Firebase is its ability to log an unlimited number of events across up to 500 user-defined event types, with analytics provided for free.
Firebase also generates predictions and performs user segmentation based on the event stream (event occurrences) – functionality that, as usual, is seamlessly integrated.
However, when we base predictions only on event occurrences, we can miss the important context of each event or user. Ideally, we should analyze both event occurrences and aggregated (‘meta’) event data – that is, engineer new features.
The differences between Event Analytics and Feature Analytics are schematically shown in Figure 1:
- In-app events arrive in an ordered sequence and can be analyzed for churn or causality patterns (e.g., using funnels in Event Analytics)
- Events of the same kind can also be analyzed as a collection with statistical methods (e.g., revealing the distribution of their values in Feature Analytics)
Let’s discuss the benefits feature engineering can bring to the table:
FEATURE ENGINEERING BENEFITS
- Analysts are in control of features
- Domain knowledge is used to engineer meaningful/better features
- Facilitates better understanding of users
- Features can be used for regular app analytics too, like segmentation
- Facilitates simpler analytical queries compared to events
- Considerably smaller size of data
However, there are some challenges to feature engineering, which are summarized below:
FEATURE ENGINEERING CHALLENGES
- Designing good features needs experience & business knowledge
- Requires massive retrieval, aggregation and analysis of event data
- Event parameters and data can differ across event types
- Event parameters and features often evolve over time
Moreover, the nested storage format of event and user parameters in Firebase makes it especially difficult to construct efficient aggregation queries in any database, and makes data extraction even more complicated. Let’s see whether Google’s BigQuery can ease this work for us and speed up feature engineering and analytics.
WHAT IS BIGQUERY
BigQuery is Google’s enterprise analytical data warehouse, which can run blazing-fast SQL queries on gigabytes to petabytes of data. Firebase can export complete event data in its original format to BigQuery daily, where it can be processed and analyzed on a massive scale. BigQuery can handle huge feature-aggregation queries as long as they have efficient joins, so we need a way to de-normalize and unnest the repeated Firebase data for an efficient and effective solution.
UNDERSTANDING FIREBASE DATA
After a daily export to BigQuery is set up, Firebase creates a new table in the specified BigQuery dataset – or in separate datasets, if you have Android and iOS versions of your app. As previously mentioned, Firebase exports all data to BigQuery in its native format, which provides user data, such as demographics and campaign information, in a nested structure named user_dim, and event data in a structure named event_dim. Such user_dim – event_dim tuples appear as repeated rows in small batches per user, which can be easily unrolled. However, Firebase also supports custom app events, each of which can carry up to 25 parameters as key-value pairs. Unpacking and analyzing these parameters is both a prime target for app analysts and a complicated task.
Let’s take a look at an example row of event data for one user, loaded in a single batch from Firebase, as shown in Figure 2:
In the above figure, we see that each (sub)row of event data is associated with a set of repeated key-value pair rows; additionally, all such rows in a batch are associated with a contemporary user profile that has its own set of key-value parameter records (not shown). This requires rather inefficient SQL constructs whenever you need to unroll and aggregate certain event parameters with the UNNEST() function (see Figure 3).
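A query in the spirit of this construct – unrolling the nested parameters just to aggregate one of them – could look like the following sketch. Table and field names follow the classic Firebase user_dim/event_dim export schema; the project, dataset, event name and parameter key are illustrative:

```sql
-- Sketch: average a numeric event parameter per user by unrolling
-- the nested Firebase export. Every parameter lookup forces a
-- double UNNEST over the repeated records.
SELECT
  user_dim.app_info.app_instance_id AS user_id,
  AVG(p.value.int_value) AS avg_level          -- hypothetical 'level' parameter
FROM
  `my_project.my_dataset.app_events_20181016`, -- illustrative table name
  UNNEST(event_dim) AS e,
  UNNEST(e.params) AS p
WHERE
  e.name = 'level_up'                          -- hypothetical event name
  AND p.key = 'level'
GROUP BY user_id
```

Note that every additional parameter we want to aggregate requires repeating this unnest-and-filter pattern, which is exactly what makes such queries hard to maintain.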
EFFICIENT FEATURE ENGINEERING
First of all, we unpack all user and event properties from repeated rows into serialized JSON. Next, we detach repeated static user data, such as country and language, into a separate lookup table, so we can focus on event and profile aggregation.
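The unpacking step can be sketched with BigQuery’s built-in TO_JSON_STRING() function, which serializes a repeated record into a single JSON string per event. Table and field names again follow the classic Firebase export schema and are illustrative:

```sql
-- Sketch: serialize each event's repeated parameters into one JSON
-- string, producing a flat row per event instead of nested batches.
SELECT
  user_dim.app_info.app_instance_id AS user_id,
  e.name AS event_name,
  e.timestamp_micros AS event_time,
  TO_JSON_STRING(e.params) AS event_params_json  -- repeated records -> JSON
FROM
  `my_project.my_dataset.app_events_20181016`,   -- illustrative table name
  UNNEST(event_dim) AS e
```

The same approach works for user properties, giving us one denormalized row per event with all of its context attached.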
Following this, we extract all events and associated profiles from the batch records and store them in the same denormalized row structure, which contains a JSON field for the properties of each. Such a row structure allows us to:
- Query event data for users and analyze features on demand
- Stream user events and construct features on a continuous basis
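With this flat structure in place, extracting a single parameter becomes a simple scalar lookup rather than a double UNNEST. The sketch below assumes that, during unpacking, the key-value pairs were re-keyed into a JSON object keyed by parameter name (e.g., {"level": 12}); the table and field names are illustrative:

```sql
-- Sketch: per-event parameter extraction from the denormalized table.
SELECT
  user_id,
  event_name,
  JSON_EXTRACT_SCALAR(event_params_json, '$.level') AS level  -- hypothetical key
FROM `my_project.my_dataset.events_flat`  -- illustrative flat table
WHERE event_name = 'level_up'             -- hypothetical event name
```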
After this transformation, we can define aggregation queries that construct features for every user in the database and their relevant events – all in a single pass. Since the complete data set is stored in the source table, the massive parallelism of BigQuery can be used very efficiently in this case. For instance, we can execute complex queries, as shown in Figure 5, creating complete user and event features ready for data science or other kinds of advanced analytics:
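A single-pass feature query over the flat table might look like the following sketch (feature names, parameter keys and the table name are illustrative; the JSON field is assumed to be keyed by parameter name):

```sql
-- Sketch: one row of engineered features per user, computed in a
-- single scan of the denormalized event table.
SELECT
  user_id,
  COUNT(*) AS total_events,
  COUNTIF(event_name = 'purchase') AS purchase_count,   -- hypothetical event
  AVG(CAST(JSON_EXTRACT_SCALAR(event_params_json, '$.value') AS FLOAT64))
    AS avg_purchase_value,                              -- hypothetical key
  TIMESTAMP_DIFF(MAX(event_time), MIN(event_time), DAY) AS active_span_days
FROM `my_project.my_dataset.events_flat`                -- illustrative table
GROUP BY user_id
```

Because every feature is just another aggregate expression over the same scan, adding a new feature means adding one line to the SELECT list rather than another join or UNNEST.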
Our next article will focus on how to get the most out of the collected features by applying machine learning to reveal their predictive power and importance, so stay tuned!