
Feat: implement support for bigframes#3620

Merged
tobymao merged 1 commit into SQLMesh:main from z3z1ma:feat/bigframes
Jan 15, 2025

Conversation

@z3z1ma
Contributor

@z3z1ma z3z1ma commented Jan 11, 2025

I PR'ed the upstream library so that it no longer has an upper bound on its SQLGlot pin, which makes it fully compatible with SQLMesh. This PR adds the integration. Instead of bundling the dependency with the bigquery extra, I have temporarily given it its own extra so it is opt-in. Once upstream cuts a release, we can add it with a proper >= version specifier.

This enables many novel uses, such as using Gemini / LLM capabilities at scale more effectively, running ML models (training, predicting, etc.), and otherwise leveraging BigQuery's compute through a dataframe interface that SQLMesh's Python model system wraps intuitively.

https://cloud.google.com/bigquery/docs/use-bigquery-dataframes#pandas-examples

https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb

I have tested this locally. I was able to crunch through 35 GB in 15 seconds with no local computation. It's very nice.

The following example requires no seeds or pre-loaded data, as it leverages BQ's open datasets. You can run it with or without the remote function.

"""Working example to get your feet wet"""
import typing as t
from datetime import datetime

from bigframes.pandas import DataFrame

from sqlmesh import ExecutionContext, model


def get_bucket(num: int) -> str:
    # Treat falsy view counts (None or 0) as missing
    if not num:
        return "NA"
    boundary = 10
    return "at_or_above_10" if num >= boundary else "below_10"


@model(
    "mart.wiki",
    columns={
        "title": "text",
        "views": "int",
        "bucket": "text",
    },
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> DataFrame:
    # Create a remote function to be used in the Bigframe DataFrame
    remote_get_bucket = context.bigframe.remote_function([int], str)(get_bucket)

    # Returns the Bigframe DataFrame handle, no data is computed locally
    df = context.bigframe.read_gbq("bigquery-samples.wikipedia_pageviews.200809h")

    df = (
        # This runs entirely on the BigQuery engine lazily
        df[df.title.str.contains(r"[Gg]oogle")]
        .groupby(["title"], as_index=False)["views"]
        .sum(numeric_only=True)
        .sort_values("views", ascending=False)
    )

    return df.assign(bucket=df["views"].apply(remote_get_bucket))
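
For reference, here is how the bucketing helper behaves when called locally; the remote function registered above applies the same logic row by row inside BigQuery. A small standalone sketch, duplicating the helper so it runs on its own:

```python
def get_bucket(num: int) -> str:
    # Same logic as the model's helper: falsy counts (None or 0) map to "NA"
    if not num:
        return "NA"
    boundary = 10
    return "at_or_above_10" if num >= boundary else "below_10"


print(get_bucket(None))  # NA: missing count
print(get_bucket(3))     # below_10
print(get_bucket(10))    # at_or_above_10
```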

Support was simple to add given the way the query_factory works, and there was prior art in the Snowpark integration to follow. Very straightforward. Good work on that.
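For intuition, the lazy chain in the model body computes the same thing as this eager pandas equivalent. The rows below are hypothetical stand-ins for the BigQuery sample table; only the execution engine differs:

```python
import pandas as pd

# Hypothetical pageview rows standing in for bigquery-samples.wikipedia_pageviews
df = pd.DataFrame(
    {
        "title": ["Google", "google_maps", "Google", "Yahoo"],
        "views": [5, 12, 7, 99],
    }
)

# Same chain as the model body, but executed eagerly in-process
result = (
    df[df.title.str.contains(r"[Gg]oogle")]
    .groupby(["title"], as_index=False)["views"]
    .sum(numeric_only=True)
    .sort_values("views", ascending=False)
)
print(result)
```

The "Yahoo" row is filtered out and the two "Google" rows are summed into one; with bigframes, the identical method chain is compiled to SQL and pushed down to BigQuery instead.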

You can even do interesting things like read from a GCS bucket with the BQ engine:

import bigframes.pandas as bpd

# BigQuery reads the file directly from GCS; nothing is downloaded locally
filepath_or_buffer = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
df_from_gcs = bpd.read_csv(filepath_or_buffer)

@z3z1ma force-pushed the feat/bigframes branch 2 times, most recently from 8f25d05 to 10d89b8 on January 11, 2025 at 08:31
Contributor

@georgesittas left a comment


Awesome, thanks for driving this! LGTM, just one question to make sure the imports are safe. Will let others chime in here too.

fix: fix a few issues

chore: use optional import pattern

WIP
@z3z1ma
Contributor Author

z3z1ma commented Jan 15, 2025

FYI @tobymao this PR is good to merge whenever you all want to; the test failure is completely unrelated, I think.

@georgesittas
Contributor

> FYI @tobymao this PR is good to merge whenever you all want to, the test failure is completely unrelated I think.

Yeah, this test is flaky for some reason, no need to worry about it.

@tobymao tobymao merged commit 579298d into SQLMesh:main Jan 15, 2025
