M5 - Custom Model Building With SQL in BigQuery ML Slides

This document discusses using BigQuery ML to build custom machine learning models with SQL. It provides an overview of supported model types in BigQuery ML including linear regression, logistic regression, DNN models and XGBoost models. It also demonstrates how to create and train models, evaluate model performance, and use trained models to make predictions. Steps shown include extracting training data from SQL queries, creating and training models, evaluating models on a test set, and using models to batch predict on new data.

Uploaded by Deepak Kapoor
Custom Model Building with SQL in BigQuery ML
Agenda
● BigQuery ML for Quick Model Building
● Supported Models
BigQuery ML is a way to build custom models
● Build a Custom Model: Cloud TPUs, Compute Engine, Cloud Dataproc, Kubernetes Engine, Cloud AI Platform, BigQuery ML
● Build Custom Model (codeless): AutoML
● Call a Pretrained Model: Cloud Translation API, Cloud Vision API, Cloud Speech API, Cloud Video Intelligence API, Data Loss Prevention API, Cloud Speech Synthesis API, Cloud Natural Language API, Dialogflow
Working with BigQuery ML

1 Dataset

2 Create/train
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT

3 Evaluate
FROM
ML.EVALUATE(MODEL
`bqml_tutorial.sample_model`,
TABLE eval_table)

4 Predict/classify
FROM
ML.PREDICT(MODEL
`bqml_tutorial.sample_model`,
table game_to_predict) )
AS predict
Where was this article published?
1 TechCrunch
2 GitHub
3 NY Times
SQL query to extract data
(*no clusters, no indexes, ad hoc query!)
SELECT
  url, title
FROM
  `bigquery-public-data.hacker_news.stories`
WHERE
  LENGTH(title) > 10
  AND LENGTH(url) > 0
LIMIT 10
Use regex to get source + train on words of title

https://console.cloud.google.com/bigquery?sq=711916710713:47df84978c64458ea04b3cb4ae5de878
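The "regex to get source" step could look roughly like the following Python sketch. The helper name `source_from_url`, the regular expression, and the example URLs are assumptions made for illustration; the query behind the link above is not reproduced here.

```python
import re
from urllib.parse import urlparse


def source_from_url(url):
    """Return the domain label just before the final suffix, e.g. 'nytimes'.

    Hypothetical helper. The rule is deliberately naive: for a host such
    as 'bbc.co.uk' it would return 'co', which a real pipeline would
    need to handle.
    """
    host = urlparse(url).netloc.lower()
    # Capture the label that precedes the last dot-separated part of the host.
    match = re.search(r"([^.]+)\.[^.]+$", host)
    return match.group(1) if match else None


print(source_from_url("https://www.nytimes.com/2019/01/01/us/story.html"))
print(source_from_url("https://github.com/user/repo"))
```

The resulting string plays the role of the `source` label that the model is trained to predict.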
Create model
CREATE OR REPLACE MODEL advdata.txtclass
OPTIONS(model_type='logistic_reg',
input_label_cols=['source'])
AS
-- Query to extract training data
WITH extracted AS (
...
)
, ds AS (
SELECT ARRAY_CONCAT(SPLIT(title, " "), ['NULL', 'NULL',
'NULL', 'NULL', 'NULL']) AS words, source FROM extracted
WHERE (source = 'github' OR source = 'nytimes' OR source
= 'techcrunch')
)

SELECT
source,
words[OFFSET(0)] AS word1,
words[OFFSET(1)] AS word2,
words[OFFSET(2)] AS word3,
words[OFFSET(3)] AS word4,
words[OFFSET(4)] AS word5
FROM ds
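The padding trick in the query above (append five 'NULL' strings, then take offsets 0 to 4) guarantees that every title yields exactly five word features, even when the title is shorter than five words. A minimal Python sketch of the same idea follows; the function name `first_five_words` is hypothetical.

```python
def first_five_words(title, pad="NULL", n=5):
    """Split a title on single spaces and return exactly n words.

    Mirrors ARRAY_CONCAT(SPLIT(title, " "), ['NULL', ...]) followed by
    words[OFFSET(0)] .. words[OFFSET(4)]: short titles are padded with
    the literal string 'NULL', long titles are truncated.
    """
    words = title.split(" ") + [pad] * n
    return words[:n]


print(first_five_words("fitbit s fitness"))
print(first_five_words("downloading the android studio project today"))
```

Without the padding, indexing a short title at offset 4 would fail; with it, missing positions become an ordinary categorical value.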
Evaluate model
SELECT * FROM ML.EVALUATE(MODEL advdata.txtclass)

precision  recall  accuracy  f1_score  log_loss  roc_auc
0.783      0.783   0.79      0.783     0.858     0.918

(BQML splits the training data and reports evaluation statistics on the
held-out set)
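As a reminder of how some of these columns relate, the sketch below computes precision and recall from confusion counts and the F1 score from precision and recall. The counts are invented for illustration and the function names are hypothetical; this is not BigQuery ML's implementation.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts, only to exercise the formulas.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r, f1_score(p, r))

# When precision equals recall, F1 equals the same value, which is
# consistent with the 0.783 / 0.783 / 0.783 row in the table.
print(f1_score(0.783, 0.783))
```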
Predict using trained model
SELECT * FROM ML.PREDICT(MODEL advdata.txtclass,(
SELECT 'government' AS word1, 'shutdown' AS word2, 'leaves'
AS word3, 'workers' AS word4, 'reeling' AS word5
UNION ALL SELECT 'unlikely', 'partnership', 'in', 'house',
'gives'
UNION ALL SELECT 'fitbit', 's', 'fitness', 'tracker', 'is'
UNION ALL SELECT 'downloading', 'the', 'android', 'studio',
'project'
))

“Batch prediction”

Row  predicted_source  word1        word2        word3    word4    word5
1    nytimes           government   shutdown     leaves   workers  reeling
2    nytimes           unlikely     partnership  in       house    gives
3    techcrunch        fitbit       s            fitness  tracker  is
4    techcrunch        downloading  the          android  studio   project

https://console.cloud.google.com/bigquery?sq=663413318684:4d854a43ae93416eaeb349e1fc4888cb
Demo: Train a model with BigQuery ML to predict NYC taxi fares
Agenda
● BigQuery ML for Quick Model Building
● Supported Models
Linear Classifier (Logistic regression)
DNN Classifier (alpha)
XGBoost Classifier (alpha)
Linear Regression
DNN Regression (alpha)
XGBoost Regression (alpha)
Train on TF, predict with BigQuery
CREATE OR REPLACE MODEL advdata.txtclass_tf2
OPTIONS (model_type='tensorflow',
         model_path='gs://cloud-training-demos-ml/txtcls/trained_finetune_native/export/exporter/1549825580/*')

SELECT
  input,
  (SELECT AS STRUCT(p, ['github', 'nytimes', 'techcrunch'][ORDINAL(s)])
   prediction FROM
    (SELECT p, ROW_NUMBER() OVER() AS s FROM
      (SELECT * FROM UNNEST(dense_1) AS p))
   ORDER BY p DESC LIMIT 1).*
FROM ML.PREDICT(MODEL advdata.txtclass_tf2,
(
SELECT 'Unlikely Partnership in House Gives Lawmakers Hope for Border Deal' AS input
UNION ALL SELECT "Fitbit\'s newest fitness tracker is just for employees and health insurance members"
UNION ALL SELECT "Show HN: Hello, a CLI tool for managing social media"
))
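The nested subquery above numbers the entries of the `dense_1` output array, sorts them by probability, and keeps the top one together with its label. A rough Python equivalent is sketched below; the label order is taken from the slide, while the probability values and the function name `top_prediction` are made up for illustration.

```python
LABELS = ["github", "nytimes", "techcrunch"]  # order as listed in the query


def top_prediction(dense_1, labels=LABELS):
    """Return (probability, label) for the highest-scoring output unit."""
    best = max(range(len(dense_1)), key=lambda i: dense_1[i])
    return dense_1[best], labels[best]


# Invented model output for one input row.
print(top_prediction([0.1, 0.7, 0.2]))
```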
Recommendation engine (matrix factorization alpha)
create or replace model models.suggested_products_1or2_example
options(model_type='matrix_factorization',
user_col='user_id', item_col='product_id', rating_col='rating',
l2_reg=10)
AS

with purchases AS (
select product_id, user_id from
operations.orders_with_lines, unnest(order_lines)
),

total_purchases as (
select product_id, user_id, count(*) as numtimes
from purchases
group by product_id, user_id
)

select
product_id, user_id,
IF(numtimes < 2, 1, 2) AS rating
FROM total_purchases
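The training query turns purchase history into an implicit rating: a (product, user) pair bought once gets rating 1, and a pair bought two or more times gets rating 2. The Python sketch below mirrors that aggregation; the sample purchases and the function name `implicit_ratings` are hypothetical.

```python
from collections import Counter


def implicit_ratings(purchases):
    """Map (product_id, user_id) pairs to a 1-or-2 rating.

    Mirrors count(*) ... group by product_id, user_id followed by
    IF(numtimes < 2, 1, 2).
    """
    counts = Counter(purchases)
    return {pair: (1 if numtimes < 2 else 2) for pair, numtimes in counts.items()}


# Invented order lines: user u1 bought p1 twice and p2 once.
purchases = [("p1", "u1"), ("p1", "u1"), ("p2", "u1")]
print(implicit_ratings(purchases))
```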
So what do we recommend for a given set of users?
with users AS (
SELECT
user_id, count(*) as num_orders
from operations.orders_with_lines
group by user_id
order by num_orders desc
limit 10
),

products as (
select product_id, count(*) as num_orders
from operations.orders_with_lines, unnest(order_lines)
group by product_id
order by num_orders desc
limit 10
)

SELECT * FROM ML.PREDICT(MODEL models.suggested_products_1or2,


(SELECT user_id, product_id
FROM users, products)
)
So what do we recommend for a given set of users?
Row  predicted_rating    user_id  product_id
1    1.5746015507788755  101797   26209
2    1.8070705987455633  101797   13176
3    1.7171094544245578  101797   27845
4    1.9763373899260837  101797   47209
5    1.8659380090171271  101797   21137
6    1.721610848530093   101797   47766
7    1.9516130703939483  101797   21903
Clustering
CREATE OR REPLACE MODEL
demos_eu.london_station_clusters
OPTIONS(model_type='kmeans', num_clusters=4,
standardize_features = true) AS
WITH hs AS …,
stationstats AS …
SELECT * except(station_name, isweekday)
from stationstats

1. 4 clusters (hardcoded)
2. Standardize features since different dynamic ranges
3. Remove the cluster “id” fields (keep just the attributes)
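Standardizing matters for k-means because distances are otherwise dominated by whichever feature has the largest numeric range. The sketch below shows one common form of standardization (z-scores using the population standard deviation). That choice is an assumption for illustration; the exact scaling BigQuery ML applies is not stated on the slide, and the sample values are invented.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Invented station features with very different ranges.
durations = [900.0, 1500.0, 2400.0]
bikes_count = [12.0, 18.0, 30.0]
print(standardize(durations))
print(standardize(bikes_count))
```

After scaling, both columns contribute on a comparable footing to the distance between a station and a centroid.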
Which cluster?
WITH hs AS ...,
stationstats AS ...

SELECT * except(nearest_centroids_distance)
FROM ML.PREDICT(MODEL
demos_eu.london_station_clusters,
(SELECT * FROM stationstats WHERE
REGEXP_CONTAINS(station_name, 'Kennington')))
Find cluster attributes
WITH T AS (
SELECT
centroid_id,
ARRAY_AGG(STRUCT(numerical_feature AS name, ROUND(feature_value,1)
AS value) ORDER BY centroid_id) AS cluster
FROM ML.CENTROIDS(MODEL demos_eu.london_station_clusters)
GROUP BY centroid_id
)
SELECT
CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
(SELECT value from unnest(cluster) WHERE name = 'duration') AS
duration,
(SELECT value from unnest(cluster) WHERE name = 'num_trips') AS
num_trips,
(SELECT value from unnest(cluster) WHERE name = 'bikes_count') AS
bikes_count,
(SELECT value from unnest(cluster) WHERE name =
'distance_from_city_center') AS distance_from_city_center
FROM T
ORDER BY centroid_id ASC
Visualize attributes in Data Studio ...
Use the transform clause
[Diagram. Training path: Inputs, Preprocessing, Feature creation, Train model, Trained Model, Deploy. Serving path: Clients, Model serving, Prediction. Annotations: "Same"; "Ideally, call with input variables".]
TRANSFORM ensures transformations are automatically applied during ML.PREDICT

Without TRANSFORM:
CREATE OR REPLACE MODEL ch09edu.bicycle_model
OPTIONS(input_label_cols=['duration'],
        model_type='linear_reg')
AS
SELECT
  duration
  , start_station_name
  , CAST(EXTRACT(dayofweek from start_date) AS STRING)
      as dayofweek
  , CAST(EXTRACT(hour from start_date) AS STRING)
      as hourofday
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`

SELECT * FROM ML.PREDICT(MODEL ch09edu.bicycle_model,(
  SELECT 350 AS duration
  , 'Kings Cross' AS start_station_name
  , '3' as dayofweek
  , '18' as hourofday
))

With TRANSFORM:
CREATE OR REPLACE MODEL ch09edu.bicycle_model
OPTIONS(input_label_cols=['duration'],
        model_type='linear_reg')
TRANSFORM(
  SELECT * EXCEPT(start_date)
  , CAST(EXTRACT(dayofweek from start_date) AS STRING)
      as dayofweek
  , CAST(EXTRACT(hour from start_date) AS STRING)
      as hourofday
)
AS
SELECT
  duration, start_station_name, start_date
FROM
  `bigquery-public-data.london_bicycles.cycle_hire`

SELECT * FROM ML.PREDICT(MODEL ch09edu.bicycle_model,(
  SELECT 350 AS duration
  , 'Kings Cross' AS start_station_name
  , CURRENT_TIMESTAMP() as start_date
))
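The idea behind TRANSFORM can be sketched outside SQL as a model object that carries its own preprocessing, so callers pass raw inputs and cannot forget or mis-implement the feature logic. The class `ModelWithTransform`, the `transform` function, and the stand-in predictor below are hypothetical illustrations, not BigQuery ML internals.

```python
from datetime import datetime


def transform(row):
    """Derive dayofweek and hourofday from start_date, then drop start_date."""
    start = row["start_date"]
    out = {k: v for k, v in row.items() if k != "start_date"}
    # BigQuery's DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday).
    out["dayofweek"] = str(start.isoweekday() % 7 + 1)
    out["hourofday"] = str(start.hour)
    return out


class ModelWithTransform:
    """Bundles the preprocessing with a prediction function."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn

    def predict(self, raw_row):
        # The same transform used at training time runs at prediction time.
        return self._predict_fn(transform(raw_row))


raw = {"start_station_name": "Kings Cross",
       "start_date": datetime(2019, 1, 1, 18, 30)}  # a Tuesday evening
print(transform(raw))

# Stand-in predictor that just echoes one engineered feature.
model = ModelWithTransform(lambda features: features["hourofday"])
print(model.predict(raw))
```

The client supplies only `start_date`; the derived `dayofweek` and `hourofday` values are produced inside the model wrapper, which is the behavior the slide attributes to TRANSFORM.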
Reminder: BigQuery ML Cheatsheet
● Label = alias a column as ‘label’ or specify column in OPTIONS using input_label_cols

● Feature = passed through to the model as part of your SQL SELECT statement
SELECT * FROM ML.FEATURE_INFO(MODEL `mydataset.mymodel`)

● Model = an object created in BigQuery that resides in your BigQuery dataset

● Model Types = Linear Regression, Logistic Regression


CREATE OR REPLACE MODEL <dataset>.<name>
OPTIONS(model_type='<type>') AS
<training dataset>

● Training Progress = SELECT * FROM ML.TRAINING_INFO(MODEL `mydataset.mymodel`)

● Inspect Weights = SELECT * FROM ML.WEIGHTS(MODEL `mydataset.mymodel`)

● Evaluation = SELECT * FROM ML.EVALUATE(MODEL `mydataset.mymodel`)

● Prediction = SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, (<query>))


Lab: Predict Bike Trip Duration with a Regression Model in BQML
Objectives

● Query and explore the London bicycles dataset for feature engineering
● Create a linear regression model in BQML
● Evaluate the performance of your machine learning model
● Extract your model weights
Lab: Movie Recommendations in BigQuery ML
Objectives

● Train a recommendation model in BigQuery
● Make product predictions for both single users and batch users
Module Summary
● You can train and evaluate machine learning models directly in BigQuery
