Predictive Analytics
Leverage GA4's built-in predictions and BigQuery ML for advanced forecasting and customer intelligence.
GA4 Predictive Metrics
Available Predictions
GA4 automatically generates predictions when you have sufficient data:
| Metric | Description | Requirement |
|--------|-------------|-------------|
| Purchase probability | Likelihood of purchase in the next 7 days | 1,000+ purchasers and 1,000+ non-purchasers in 28 days |
| Churn probability | Likelihood of no activity in the next 7 days | 1,000+ active and 1,000+ churned users in 28 days |
| Predicted revenue | Expected revenue in the next 28 days | 1,000+ users with revenue in 28 days |
Accessing Predictions
Predictions appear as dimensions/metrics in:
- Exploration reports
- Audience builder
- BigQuery exports
Predictive Audiences
Create audiences based on predictions:
1. Go to Admin → Audiences → New audience
2. Choose a predictive condition:
| Audience | Condition |
|----------|-----------|
| Likely purchasers | Purchase probability > 90th percentile |
| Likely churners | Churn probability > 80th percentile |
| High value | Predicted revenue > $100 |
Use Cases
| Audience | Action |
|----------|--------|
| Likely to purchase | Show limited-time offers |
| Likely to churn | Send retention campaign |
| High predicted value | Premium experience |
| Low purchase probability | Awareness content |
BigQuery ML
What is BigQuery ML?
Machine learning directly in BigQuery using SQL:
- No data export required
- Built-in model types
- Automatic feature engineering
- Scalable predictions
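The workflow is the same for every model type: train with CREATE MODEL ... AS SELECT, check quality with ML.EVALUATE, then score with ML.PREDICT. A minimal sketch of that loop (the table and column names here are placeholders, not from this guide):
-- Minimal BigQuery ML loop (placeholder table/column names)
CREATE OR REPLACE MODEL `project.dataset.example_model`
OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['label']) AS
SELECT feature_1, feature_2, label
FROM `project.dataset.training_table`;

-- Evaluate on the automatically held-out split
SELECT * FROM ML.EVALUATE(MODEL `project.dataset.example_model`);

-- Score new rows
SELECT * FROM ML.PREDICT(MODEL `project.dataset.example_model`,
  TABLE `project.dataset.new_rows`);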
Supported Model Types
| Model Type | Use Case |
|------------|----------|
| LINEAR_REG | Predict continuous values (LTV) |
| LOGISTIC_REG | Binary classification (will churn?) |
| KMEANS | Customer segmentation |
| BOOSTED_TREE_CLASSIFIER | Complex classification |
| ARIMA_PLUS | Time series forecasting |
| MATRIX_FACTORIZATION | Recommendations |
Churn Prediction Model
Step 1: Prepare Training Data
-- Create training dataset
CREATE OR REPLACE TABLE `project.dataset.churn_training` AS
WITH user_features AS (
SELECT
user_pseudo_id,
-- Activity features
COUNT(*) as total_events,
COUNT(DISTINCT event_date) as active_days,
COUNTIF(event_name = 'page_view') as pageviews,
COUNTIF(event_name = 'purchase') as purchases,
-- Engagement features
AVG((SELECT value.int_value FROM UNNEST(event_params)
WHERE key = 'engagement_time_msec')) / 1000 as avg_engagement_sec,
-- Recency, measured to the end of the feature window so it lines up with scoring time
DATE_DIFF(DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY),
  MAX(PARSE_DATE('%Y%m%d', event_date)), DAY) as days_since_last_visit,
-- Session features
COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params)
WHERE key = 'ga_session_id')) as sessions
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
GROUP BY user_pseudo_id
),
active_last_30d AS (
  -- Users seen at least once in the most recent 30 days (i.e. not churned)
SELECT DISTINCT user_pseudo_id
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN
FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
)
SELECT
f.*,
  -- Label: 1 = churned (no activity in the most recent 30 days), 0 = retained
  CASE WHEN a.user_pseudo_id IS NULL THEN 1 ELSE 0 END as churned
FROM user_features f
LEFT JOIN active_last_30d a ON f.user_pseudo_id = a.user_pseudo_id
WHERE f.total_events >= 5 -- Minimum activity threshold
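Before training, it is worth a quick look at the label balance; a heavily skewed split is the reason auto_class_weights is enabled in the next step. A simple, optional check:
-- Optional sanity check: how balanced are the classes?
SELECT
  churned,
  COUNT(*) as users,
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 3) as share
FROM `project.dataset.churn_training`
GROUP BY churned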
Step 2: Train the Model
-- Train logistic regression model
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['churned'],
auto_class_weights = TRUE,
data_split_method = 'AUTO_SPLIT'
) AS
SELECT
total_events,
active_days,
pageviews,
purchases,
avg_engagement_sec,
days_since_last_visit,
sessions,
churned
FROM `project.dataset.churn_training`
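Training runs as a normal query job. Once it finishes, you can inspect per-iteration loss with ML.TRAINING_INFO to confirm the model actually converged:
-- Loss should decrease and then flatten across iterations
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL `project.dataset.churn_model`)
ORDER BY iteration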
Step 3: Evaluate Model
-- Check model performance
SELECT *
FROM ML.EVALUATE(MODEL `project.dataset.churn_model`)
Key metrics:
- precision: % of predicted churners who actually churned
- recall: % of actual churners correctly identified
- accuracy: Overall correctness
- f1_score: Balance of precision and recall
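To see where the errors come from rather than just the summary scores, ML.CONFUSION_MATRIX breaks predictions down by actual vs. predicted class (it also accepts a custom classification threshold if the default doesn't suit your use case):
-- Actual vs. predicted class counts on the evaluation data
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `project.dataset.churn_model`)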
Step 4: Make Predictions
-- Predict churn for current users
SELECT
user_pseudo_id,
  predicted_churned,
  -- Pull the probability for label = 1 explicitly (array order is not guaranteed)
  (SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = 1) as churn_probability
FROM ML.PREDICT(MODEL `project.dataset.churn_model`, (
SELECT
user_pseudo_id,
COUNT(*) as total_events,
COUNT(DISTINCT event_date) as active_days,
-- ... same features as training
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY))
GROUP BY user_pseudo_id
))
WHERE (SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = 1) > 0.7
ORDER BY churn_probability DESC
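To make the scores easy to reuse downstream (for example in the CRM export at the end of this page), one option is to materialize them on a schedule. The sketch below assumes the feature query above has been saved as `project.dataset.current_user_features`; both table names are illustrative.
-- Persist churn scores for activation (table names are illustrative)
CREATE OR REPLACE TABLE `project.dataset.churn_predictions` AS
SELECT
  user_pseudo_id,
  (SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = 1) as churn_probability,
  CURRENT_DATE() as scored_on
FROM ML.PREDICT(MODEL `project.dataset.churn_model`,
  TABLE `project.dataset.current_user_features`)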
Customer LTV Prediction
Training Data
-- LTV training dataset
CREATE OR REPLACE TABLE `project.dataset.ltv_training` AS
SELECT
user_id,
-- Historical features (first 90 days)
SUM(CASE WHEN day_number <= 90 THEN revenue ELSE 0 END) as revenue_90d,
COUNTIF(day_number <= 90 AND event_name = 'purchase') as purchases_90d,
COUNT(DISTINCT CASE WHEN day_number <= 90 THEN event_date END) as active_days_90d,
-- Target: Total lifetime revenue
SUM(revenue) as total_ltv
FROM (
SELECT
user_id,
event_date,
event_name,
(SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value') as revenue,
DATE_DIFF(PARSE_DATE('%Y%m%d', event_date),
MIN(PARSE_DATE('%Y%m%d', event_date)) OVER (PARTITION BY user_id), DAY) as day_number
FROM `project.analytics_XXXXXX.events_*`
WHERE user_id IS NOT NULL
)
GROUP BY user_id
HAVING purchases_90d >= 1
Train LTV Model
-- Linear regression for LTV
CREATE OR REPLACE MODEL `project.dataset.ltv_model`
OPTIONS (
model_type = 'LINEAR_REG',
input_label_cols = ['total_ltv'],
data_split_method = 'AUTO_SPLIT'
) AS
SELECT
revenue_90d,
purchases_90d,
active_days_90d,
total_ltv
FROM `project.dataset.ltv_training`
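As with the churn model, evaluate before trusting the output. For regression models, ML.EVALUATE returns error metrics instead of precision and recall:
-- Lower errors and a higher r2_score indicate a better fit
SELECT mean_absolute_error, mean_squared_error, r2_score
FROM ML.EVALUATE(MODEL `project.dataset.ltv_model`)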
Predict LTV
-- Predict LTV for recent customers
SELECT
user_id,
predicted_total_ltv,
NTILE(10) OVER (ORDER BY predicted_total_ltv DESC) as ltv_decile
FROM ML.PREDICT(MODEL `project.dataset.ltv_model`, (
SELECT
user_id,
    -- Recompute the features exactly as in training
    SUM((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value')) as revenue_90d,
    COUNTIF(event_name = 'purchase') as purchases_90d,
COUNT(DISTINCT event_date) as active_days_90d
FROM `project.analytics_XXXXXX.events_*`
WHERE user_id IS NOT NULL
AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
GROUP BY user_id
))
Customer Segmentation
K-Means Clustering
-- Create customer segments
CREATE OR REPLACE MODEL `project.dataset.customer_segments`
OPTIONS (
model_type = 'KMEANS',
num_clusters = 5,
standardize_features = TRUE
) AS
SELECT
  -- Note: user_pseudo_id is deliberately not selected; every selected column becomes a feature
total_revenue,
purchase_count,
days_since_last_purchase,
avg_order_value,
total_sessions
FROM `project.dataset.customer_features`
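The clustering query reads from a `project.dataset.customer_features` table that isn't defined elsewhere on this page. A minimal sketch of how such a table could be derived from the raw export (the column definitions are illustrative; adapt them to your own revenue setup):
-- Illustrative build of the assumed customer_features table (purchasers only)
CREATE OR REPLACE TABLE `project.dataset.customer_features` AS
SELECT
  user_pseudo_id,
  SUM(IF(event_name = 'purchase',
    (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0)) as total_revenue,
  COUNTIF(event_name = 'purchase') as purchase_count,
  DATE_DIFF(CURRENT_DATE(),
    MAX(IF(event_name = 'purchase', PARSE_DATE('%Y%m%d', event_date), NULL)), DAY) as days_since_last_purchase,
  SAFE_DIVIDE(
    SUM(IF(event_name = 'purchase',
      (SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value'), 0)),
    COUNTIF(event_name = 'purchase')) as avg_order_value,
  COUNT(DISTINCT (SELECT value.int_value FROM UNNEST(event_params)
    WHERE key = 'ga_session_id')) as total_sessions
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY))
GROUP BY user_pseudo_id
HAVING purchase_count >= 1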
Analyze Segments
-- Get cluster characteristics
SELECT
CENTROID_ID as segment,
COUNT(*) as customers,
AVG(total_revenue) as avg_revenue,
AVG(purchase_count) as avg_purchases,
AVG(days_since_last_purchase) as avg_recency
FROM ML.PREDICT(MODEL `project.dataset.customer_segments`,
TABLE `project.dataset.customer_features`)
GROUP BY CENTROID_ID
ORDER BY avg_revenue DESC
Segment Naming
| Cluster | Characteristics | Name |
|---------|-----------------|------|
| 1 | High revenue, frequent, recent | VIP/Champions |
| 2 | High revenue, infrequent | High-Value Sleepers |
| 3 | Medium revenue, recent | Promising |
| 4 | Low revenue, frequent | Budget Loyalists |
| 5 | Low revenue, churning | At-Risk |
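Centroid numbers are assigned arbitrarily at training time, so inspect the characteristics first and only then attach names. One way to apply a mapping like the table above (the CASE branches must match what your own clusters turn out to be):
-- Attach human-readable names to clusters (mapping is illustrative)
SELECT
  user_pseudo_id,
  CENTROID_ID,
  CASE CENTROID_ID
    WHEN 1 THEN 'VIP/Champions'
    WHEN 2 THEN 'High-Value Sleepers'
    WHEN 3 THEN 'Promising'
    WHEN 4 THEN 'Budget Loyalists'
    WHEN 5 THEN 'At-Risk'
  END as segment_name
FROM ML.PREDICT(MODEL `project.dataset.customer_segments`,
  TABLE `project.dataset.customer_features`)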
Revenue Forecasting
Time Series with ARIMA_PLUS
-- Daily revenue forecast
CREATE OR REPLACE MODEL `project.dataset.revenue_forecast`
OPTIONS (
model_type = 'ARIMA_PLUS',
time_series_timestamp_col = 'date',
time_series_data_col = 'revenue',
auto_arima = TRUE,
data_frequency = 'DAILY'
) AS
SELECT
PARSE_DATE('%Y%m%d', event_date) as date,
SUM((SELECT value.double_value FROM UNNEST(event_params) WHERE key = 'value')) as revenue
FROM `project.analytics_XXXXXX.events_*`
WHERE event_name = 'purchase'
AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY))
GROUP BY date
Generate Forecast
-- Forecast next 30 days
SELECT
forecast_timestamp as date,
forecast_value as predicted_revenue,
prediction_interval_lower_bound,
prediction_interval_upper_bound
FROM ML.FORECAST(MODEL `project.dataset.revenue_forecast`,
STRUCT(30 AS horizon, 0.95 AS confidence_level))
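For dashboards (for example the Looker Studio reports covered on the previous page), it helps to stitch history and forecast into a single series. The sketch below assumes the daily aggregation used for training was also saved as `project.dataset.daily_revenue`; that table name is illustrative.
-- Chart-ready series combining actuals and forecast (table name is illustrative)
SELECT date, revenue as actual_revenue, CAST(NULL AS FLOAT64) as predicted_revenue
FROM `project.dataset.daily_revenue`
UNION ALL
SELECT
  EXTRACT(DATE FROM forecast_timestamp) as date,
  CAST(NULL AS FLOAT64) as actual_revenue,
  forecast_value as predicted_revenue
FROM ML.FORECAST(MODEL `project.dataset.revenue_forecast`,
  STRUCT(30 AS horizon, 0.95 AS confidence_level))
ORDER BY date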
Activation & Integration
Export to Google Ads
Use predicted audiences for targeting:
1. Create an audience in GA4 based on predictions
2. Link GA4 to Google Ads
3. Use the audience for bidding and targeting
Export to CRM
-- Export high-value predictions
EXPORT DATA OPTIONS (
uri = 'gs://bucket/high_value_users/*.csv',
format = 'CSV',
overwrite = true
) AS
SELECT
user_id,
predicted_total_ltv,
churn_probability
FROM `project.dataset.predictions`
WHERE predicted_total_ltv > 500
AND churn_probability < 0.3
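The export above reads from a `project.dataset.predictions` table that isn't built elsewhere on this page. One way to assemble it, assuming the churn and LTV scoring queries were materialized as `churn_predictions` and `ltv_predictions` (illustrative names), is to join them through the user_id ↔ user_pseudo_id mapping in the export:
-- Illustrative build of the combined predictions table used by the export above
CREATE OR REPLACE TABLE `project.dataset.predictions` AS
WITH id_map AS (
  -- Link logged-in user_id to device-level user_pseudo_id
  SELECT DISTINCT user_id, user_pseudo_id
  FROM `project.analytics_XXXXXX.events_*`
  WHERE user_id IS NOT NULL
)
SELECT
  l.user_id,
  l.predicted_total_ltv,
  -- A user_id can map to several devices; take the worst-case churn risk
  MAX(c.churn_probability) as churn_probability
FROM `project.dataset.ltv_predictions` l
JOIN id_map m USING (user_id)
JOIN `project.dataset.churn_predictions` c USING (user_pseudo_id)
GROUP BY l.user_id, l.predicted_total_ltv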
Reverse ETL
Use tools like Census, Hightouch, or custom scripts to sync predictions back to:
- CRM (Salesforce, HubSpot)
- Email platforms (Klaviyo, Mailchimp)
- Ad platforms (custom audiences)