Google Analytics Premium clients have the option to export clickstream (hit-level) data into Google BigQuery through a native integration.
As I mentioned in the previous post clickstream data empowers analysts to answer much more complex (and valuable) business questions, namely by integration with other data sources (e.g. CRM).
Many technical teams are new to Google BigQuery and how to work with it. On top of that, the structure of how the Google Analytics (GA) Premium data is stored in BigQuery may not be obvious at first glance.
By following the instructions below, you’ll jump-start your knowledge and more quickly arrive at an ETL process that can bring this valuable clickstream data into your own data warehouse.
Why export Google Analytics BigQuery clickstream data?
In one word: data enrichment — but here are three specific points:
1) Business Logic Let’s say you have custom rules to determine a channel (combination of source, medium, campaign, etc) and you need to update the clickstream data with this new field. Google BigQuery doesn’t currently support UPDATE or DELETE operations. For some use cases, this can be problematic.
2) Analysis on Google Analytics BigQuery data can be a little cumbersome since data is scattered across multiple tables and is in a nested structure. While nesting is useful to express hierarchical data (reducing duplication) it can pose its own set of challenges to the average SQL user.
3) Data Integration You may have your own data warehouse already and you would like to analyze the clickstream data using your own infrastructure. Furthermore, you may want to join this data with your other data sources (CRM, Email, Social, etc) to increase context for improved business outcomes.
Below, I provide instructions on how to jump-start your journey towards exporting of clickstream data out of BigQuery and importing into your own data warehouse.
How to export Google Analytics clickstream data from BigQuery
Google Analytics Premium can provide daily exports of clickstream data into Google BigQuery and stores it using a nested JSON format (to avoid duplication and save space). Take a look at the nested schema
To export, the first step is to create a SQL query to select all hits for a given day without nesting.
One caveat though, Google Analytics clickstream data doesn’t have a column with a hit timestamp, you’ll need to create one by adding the hit time (hits.time) converted to seconds and adding the visitStartTime (visitStartTime+ hits.time/1000).
This new column is named hit_timestamp.
Below is a snippet of the SQL query. You can find the complete SQL query on my
GitHub.
Step-by-Step Instructions
Go to the Google Analytics BigQuery project and create a new dataset (in my example, I’ll call this ETL). This dataset will hold your clickstream data exports.
Grab the SQL query.
Edit the SQL query to tailor your Google Analytics implementation (customize the fields for dimensions, enhanced ecommerce, etc.) Don’t add more dimensions/metrics than the ones your current implementation is using.
Paste your query into the query window.
Select ‘Show Option’ and configure a destination table (in my example sessions_20150101). Select the dataset you created in ‘step 1′ and name the table appropriately. While you are testing you can add a ‘LIMIT’ to the SQL query to limit the number of returned rows.
Once the query is finished, you will have your flattened clickstream data in the new table under the new dataset. You can append other tables to this table for analysis or export it to a CSV for storage or analysis in another platform.
Exporting Google Analytics Clickstream Data to Your Own Data Warehouse
Create a bucket in Google Storage under the same project.
Export the table to Google Storage and download it for analysis or load into your data warehouse. The file pattern will divide the file into multiples and compress them using gzip. This is handy for analysis with Hive (Hadoop). The format itself is a CSV.
The exported data is now in Google Storage:
You can now create an ETL job to automate any of the processes described above to keep your data warehouse data up-to-date.