M5 - Custom Model Building With SQL in BigQuery ML Slides

This document discusses using BigQuery ML to build custom machine learning models with SQL. It provides an overview of supported model types in BigQuery ML including linear regression, logistic regression, DNN models and XGBoost models. It also demonstrates how to create and train models, evaluate model performance, and use trained models to make predictions. Steps shown include extracting training data from SQL queries, creating and training models, evaluating models on a test set, and using models to batch predict on new data.

Uploaded by

Deepak Kapoor

Custom Model Building with SQL in BigQuery ML
Agenda
BigQuery ML for Quick Model Building

Supported Models
BigQuery ML is a way to build custom models
Build a Custom Model: Cloud TPUs, Compute Engine, Cloud Dataproc, Kubernetes Engine, Cloud AI Platform, BigQuery ML
Build Custom Model (codeless): AutoML
Call a Pretrained Model: Cloud Translation API, Cloud Vision API, Cloud Speech API, Cloud Video Intelligence API, Data Loss Prevention API, Cloud Speech Synthesis API, Cloud Natural Language API, Dialogflow
Working with BigQuery ML
1 Dataset
2 Create/train
CREATE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT ...
3 Evaluate
... FROM
ML.EVALUATE(MODEL
`bqml_tutorial.sample_model`,
TABLE eval_table)
4 Predict/classify
... FROM
ML.PREDICT(MODEL
`bqml_tutorial.sample_model`,
table game_to_predict) )
AS predict
Where was this article published?

1 Techcrunch

2 GitHub

3 NY Times
SQL query to extract data
SELECT
url, title
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
LENGTH(title) > 10
AND LENGTH(url) > 0
LIMIT 10
* No clusters, no indexes, ad hoc query!
Use regex to get source + train on words of title

https://console.cloud.google.com/bigquery?sq=711916710713:47df84978c64458ea04b3cb4ae5de878
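The slides do not show the regex itself, so the following is only a minimal Python sketch of the idea: pull the host out of the URL and treat the second-level label (for example `nytimes` in `www.nytimes.com`) as the source. The pattern and the function name `extract_source` are assumptions made for illustration, not the query used in the deck.

```python
import re
from typing import Optional

# Hypothetical pattern (not from the slides): capture the host of an http(s) URL.
_HOST_RE = re.compile(r"^https?://([^/:?#]+)", re.IGNORECASE)


def extract_source(url: str) -> Optional[str]:
    """Return the second-level domain label of a URL, or None if there is none."""
    match = _HOST_RE.match(url)
    if match is None:
        return None
    labels = match.group(1).lower().split(".")
    if len(labels) < 2:
        return None
    # 'www.nytimes.com' -> ['www', 'nytimes', 'com'] -> 'nytimes'
    return labels[-2]
```

This simple rule is enough for hosts such as github.com, nytimes.com and techcrunch.com; it would mislabel multi-part suffixes such as `bbc.co.uk`, which is acceptable here because only three sources are kept for training.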
Create model
CREATE OR REPLACE MODEL advdata.txtclass
OPTIONS(model_type='logistic_reg',
input_label_cols=['source'])
AS
-- Query to extract training data
WITH extracted AS (
...
)
, ds AS (
SELECT ARRAY_CONCAT(SPLIT(title, " "), ['NULL', 'NULL',
'NULL', 'NULL', 'NULL']) AS words, source FROM extracted
WHERE (source = 'github' OR source = 'nytimes' OR source
= 'techcrunch')
)

SELECT
source,
words[OFFSET(0)] AS word1,
words[OFFSET(1)] AS word2,
words[OFFSET(2)] AS word3,
words[OFFSET(3)] AS word4,
words[OFFSET(4)] AS word5
FROM ds
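The padding trick in the query above (append five 'NULL' strings so that the first five offsets always exist) can be sketched in plain Python. This only illustrates the feature layout; the function name `title_to_features` is hypothetical.

```python
from typing import Dict, List

PAD = "NULL"      # same filler string the query appends
NUM_WORDS = 5     # word1 .. word5


def title_to_features(title: str) -> Dict[str, str]:
    """Split on single spaces, pad with 'NULL', keep the first five words."""
    words: List[str] = title.split(" ") + [PAD] * NUM_WORDS
    return {f"word{i + 1}": words[i] for i in range(NUM_WORDS)}
```

For a three-word title, `word4` and `word5` come out as the literal string 'NULL', so every training row has the same five feature columns regardless of title length.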
Evaluate model
SELECT * FROM ML.EVALUATE(MODEL advdata.txtclass)

precision  recall  accuracy  f1_score  log_loss  roc_auc
0.783      0.783   0.79      0.783     0.858     0.918

(BQML splits the training data and reports evaluation statistics on the held-out set)
Predict using trained model
SELECT * FROM ML.PREDICT(MODEL advdata.txtclass,(
SELECT 'government' AS word1, 'shutdown' AS word2, 'leaves'
AS word3, 'workers' AS word4, 'reeling' AS word5
UNION ALL SELECT 'unlikely', 'partnership', 'in', 'house',
'gives'
UNION ALL SELECT 'fitbit', 's', 'fitness', 'tracker', 'is'
UNION ALL SELECT 'downloading', 'the', 'android', 'studio',
'project'
))

“Batch prediction”

Row  predicted_source  word1        word2        word3    word4    word5
1    nytimes           government   shutdown     leaves   workers  reeling
2    nytimes           unlikely     partnership  in       house    gives
3    techcrunch        fitbit       s            fitness  tracker  is
4    techcrunch        downloading  the          android  studio   project

https://console.cloud.google.com/bigquery?sq=663413318684:4d854a43ae93416eaeb349e1fc4888cb
Demo: Train a model with BigQuery ML to predict NYC taxi fares
Agenda
BigQuery ML for Quick Model Building

Supported Models
Linear Classifier (Logistic regression)
DNN Classifier (alpha)
XGBoost Classifier (alpha)
Linear Regression
DNN Regression (alpha)
XGBoost Regression (alpha)
Train on TF, predict with BigQuery
CREATE OR REPLACE MODEL advdata.txtclass_tf2
OPTIONS (model_type='tensorflow',
model_path='gs://cloud-training-demos-ml/txtcls/trained_finetune_native/export/exporter/1549825580/*')

SELECT
input,
(SELECT AS STRUCT(p, ['github', 'nytimes', 'techcrunch'][ORDINAL(s)])
prediction FROM
(SELECT p, ROW_NUMBER() OVER() AS s FROM
(SELECT * FROM UNNEST(dense_1) AS p))
ORDER BY p DESC LIMIT 1).*
FROM ML.PREDICT(MODEL advdata.txtclass_tf2,
(
SELECT 'Unlikely Partnership in House Gives Lawmakers Hope for Border Deal' AS input
UNION ALL SELECT "Fitbit\'s newest fitness tracker is just for employees and health insurance members"
UNION ALL SELECT "Show HN: Hello, a CLI tool for managing social media"
))
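The nested subquery above takes the model's output array (`dense_1`), numbers its elements, and keeps the most probable one together with its label. A plain-Python sketch of that selection step, assuming the probabilities arrive in the same order as the label list; the function name `top_prediction` is hypothetical:

```python
from typing import Sequence, Tuple

# Label order taken from the array literal in the query.
LABELS = ["github", "nytimes", "techcrunch"]


def top_prediction(
    probs: Sequence[float], labels: Sequence[str] = LABELS
) -> Tuple[float, str]:
    """Return (probability, label) for the highest-scoring class."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return probs[best], labels[best]
```

The sketch indexes the array explicitly, whereas the query relies on `ROW_NUMBER() OVER()` following array order; in BigQuery, `UNNEST(...) WITH OFFSET` is one way to make that ordering explicit.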
Recommendation engine (matrix factorization alpha)
create or replace model models.suggested_products_1or2_example
options(model_type='matrix_factorization',
user_col='user_id', item_col='product_id', rating_col='rating',
l2_reg=10)
AS

with purchases AS (
select product_id, user_id from
operations.orders_with_lines, unnest(order_lines)
),

total_purchases as (
select product_id, user_id, count(*) as numtimes
from purchases
group by product_id, user_id
)

select
product_id, user_id,
IF(numtimes < 2, 1, 2) AS rating
FROM total_purchases
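The training query turns purchase counts into an implicit rating: 1 if a user bought a product once, 2 if they bought it more than once. A plain-Python sketch of that aggregation, with a hypothetical function name and made-up input pairs:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def implicit_ratings(
    purchases: Iterable[Tuple[int, int]]
) -> List[Tuple[int, int, int]]:
    """Map (product_id, user_id) purchase events to (product_id, user_id, rating)."""
    counts = Counter(purchases)  # same role as GROUP BY product_id, user_id
    # Mirrors IF(numtimes < 2, 1, 2) AS rating
    return sorted((p, u, 1 if n < 2 else 2) for (p, u), n in counts.items())
```

Capping the rating at 2 keeps heavy repeat buyers from dominating the factorization; that reading of the design is an interpretation, not something the slide states.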
So what do we recommend for a given set of users?
with users AS (
SELECT
user_id, count(*) as num_orders
from operations.orders_with_lines
group by user_id
order by num_orders desc
limit 10
),

products as (
select product_id, count(*) as num_orders
from operations.orders_with_lines, unnest(order_lines)
group by product_id
order by num_orders desc
limit 10
)

SELECT * FROM ML.PREDICT(MODEL models.suggested_products_1or2,

(SELECT user_id, product_id
FROM users, products)
)
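`FROM users, products` is a comma (cross) join, so the model is asked to score every combination of the 10 busiest users and the 10 most-ordered products, 100 rows in total. A small Python sketch of how that candidate set is formed; the function name is hypothetical:

```python
from itertools import product
from typing import List, Sequence, Tuple


def candidate_pairs(
    user_ids: Sequence[int], product_ids: Sequence[int]
) -> List[Tuple[int, int]]:
    """Every (user_id, product_id) combination, as a cross join would produce."""
    return list(product(user_ids, product_ids))
```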
So what do we recommend for a given set of users?
Row  predicted_rating    user_id  product_id
1    1.5746015507788755  101797   26209
2    1.8070705987455633  101797   13176
3    1.7171094544245578  101797   27845
4    1.9763373899260837  101797   47209
5    1.8659380090171271  101797   21137
6    1.721610848530093   101797   47766
7    1.9516130703939483  101797   21903

Clustering
CREATE OR REPLACE MODEL
demos_eu.london_station_clusters
OPTIONS(model_type='kmeans', num_clusters=4,
standardize_features = true) AS
WITH hs AS …,
stationstats AS …
SELECT * except(station_name, isweekday)
from stationstats

1. 4 clusters (hardcoded)
2. Standardize features since different dynamic ranges
3. Remove the cluster “id” fields (keep just the attributes)
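Point 2 matters because k-means uses distances, so a feature measured in thousands (such as trip counts) would otherwise swamp one measured in single digits. A minimal Python sketch of z-score standardization for one feature column; using the population standard deviation is an assumption here, since the slides do not say how `standardize_features` computes it:

```python
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Rescale a feature column to zero mean and unit (population) std dev."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population std dev
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]
```

After this step every feature contributes on a comparable scale to the distance between a station and a centroid.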
Which cluster?
WITH hs AS ...,
stationstats AS ...

SELECT * except(nearest_centroids_distance)
FROM ML.PREDICT(MODEL
demos_eu.london_station_clusters,
(SELECT * FROM stationstats WHERE
REGEXP_CONTAINS(station_name, 'Kennington')))
Find cluster attributes
WITH T AS (
SELECT
centroid_id,
ARRAY_AGG(STRUCT(numerical_feature AS name, ROUND(feature_value,1)
AS value) ORDER BY centroid_id) AS cluster
FROM ML.CENTROIDS(MODEL demos_eu.london_station_clusters)
GROUP BY centroid_id
)
SELECT
CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
(SELECT value from unnest(cluster) WHERE name = 'duration') AS
duration,
(SELECT value from unnest(cluster) WHERE name = 'num_trips') AS
num_trips,
(SELECT value from unnest(cluster) WHERE name = 'bikes_count') AS
bikes_count,
(SELECT value from unnest(cluster) WHERE name =
'distance_from_city_center') AS distance_from_city_center
FROM T
ORDER BY centroid_id ASC
Visualize attributes in Data Studio ...
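Use the transform clause
[Diagram: Inputs → Preprocessing → Feature creation → Train model → Trained Model → Deploy → Model serving. The same preprocessing and feature creation must run at serving time. Ideally, clients call the served model with the input variables and receive a prediction.]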
TRANSFORM ensures transformations are automatically applied during ML.PREDICT

Without TRANSFORM (the caller must supply the engineered features):
CREATE OR REPLACE MODEL ch09edu.bicycle_model
OPTIONS(input_label_cols=['duration'],
model_type='linear_reg')
AS
SELECT
duration
, start_station_name
, CAST(EXTRACT(dayofweek from start_date) AS STRING)
as dayofweek
, CAST(EXTRACT(hour from start_date) AS STRING)
as hourofday
FROM
`bigquery-public-data.london_bicycles.cycle_hire`

SELECT * FROM ML.PREDICT(MODEL ch09edu.bicycle_model,(
SELECT
350 AS duration
, 'Kings Cross' AS start_station_name
, '3' as dayofweek
, '18' as hourofday
))

With TRANSFORM (the caller passes the raw columns):
CREATE OR REPLACE MODEL ch09edu.bicycle_model
TRANSFORM(
* EXCEPT(start_date)
, CAST(EXTRACT(dayofweek from start_date) AS STRING)
as dayofweek
, CAST(EXTRACT(hour from start_date) AS STRING)
as hourofday
)
OPTIONS(input_label_cols=['duration'],
model_type='linear_reg')
AS
SELECT
duration, start_station_name, start_date
FROM
`bigquery-public-data.london_bicycles.cycle_hire`

SELECT * FROM ML.PREDICT(MODEL ch09edu.bicycle_model,(
SELECT
350 AS duration
, 'Kings Cross' AS start_station_name
, CURRENT_TIMESTAMP() as start_date
))
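The benefit of TRANSFORM is that the feature engineering travels with the model, so training and prediction cannot drift apart. A toy Python sketch of that idea: the transform is stored on the model object and applied in both `fit` and `predict`. The class `MeanDurationModel` is a hypothetical stand-in (it just averages durations per feature combination) and is not how BigQuery ML trains a linear regression.

```python
from datetime import datetime
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

Row = Dict[str, Any]


def add_time_features(row: Row) -> Row:
    """Mirror the TRANSFORM clause: drop start_date, add dayofweek and hourofday."""
    start: datetime = row["start_date"]
    out = {k: v for k, v in row.items() if k != "start_date"}
    # BigQuery's DAYOFWEEK runs from 1 (Sunday) to 7 (Saturday).
    out["dayofweek"] = str(start.isoweekday() % 7 + 1)
    out["hourofday"] = str(start.hour)
    return out


class MeanDurationModel:
    """Toy model that stores its own transform and applies it everywhere."""

    def __init__(self, transform: Callable[[Row], Row]) -> None:
        self.transform = transform
        self.means: Dict[Tuple[str, str, str], float] = {}

    @staticmethod
    def _key(feats: Row) -> Tuple[str, str, str]:
        return (feats["start_station_name"], feats["dayofweek"], feats["hourofday"])

    def fit(self, rows: Iterable[Row]) -> None:
        sums: Dict[Tuple[str, str, str], Tuple[float, int]] = {}
        for row in rows:
            feats = self.transform(row)  # same transform as in predict
            total, n = sums.get(self._key(feats), (0.0, 0))
            sums[self._key(feats)] = (total + feats["duration"], n + 1)
        self.means = {k: total / n for k, (total, n) in sums.items()}

    def predict(self, row: Row) -> Optional[float]:
        feats = self.transform(row)  # caller passes the raw start_date
        return self.means.get(self._key(feats))
```

The caller of `predict` only needs to supply `start_station_name` and a raw `start_date`, just as the second ML.PREDICT call above passes `CURRENT_TIMESTAMP()` instead of pre-computed `dayofweek` and `hourofday` strings.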
Reminder: BigQuery ML Cheatsheet
● Label = alias a column as ‘label’ or specify column in OPTIONS using input_label_cols

● Feature = passed through to the model as part of your SQL SELECT statement
SELECT * FROM ML.FEATURE_INFO(MODEL `mydataset.mymodel`)

● Model = an object created in BigQuery that resides in your BigQuery dataset

● Model Types = Linear Regression, Logistic Regression

CREATE OR REPLACE MODEL <dataset>.<name>
OPTIONS(model_type='<type>') AS
<training dataset>

● Training Progress = SELECT * FROM ML.TRAINING_INFO(MODEL `mydataset.mymodel`)

● Inspect Weights = SELECT * FROM ML.WEIGHTS(MODEL `mydataset.mymodel`)

● Evaluation = SELECT * FROM ML.EVALUATE(MODEL `mydataset.mymodel`)

● Prediction = SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, (<query>))

Lab
Predict Bike Trip Duration with a Regression Model in BQML
Objectives

● Query and explore the London bicycles dataset for feature engineering
● Create a linear regression model in BQML
● Evaluate the performance of your machine learning model
● Extract your model weights
Lab
Movie Recommendations in BigQuery ML
Objectives

● Train a recommendation model in BigQuery

● Make product predictions for both single users and batch users
Module Summary
● You can train and evaluate machine learning models directly in BigQuery
