Conversation
| "To prevent downloading excessive data, it is recommended to limit the data " | ||
| "fetched with methods like .head() or .sample() before proceeding with downloads." |
Might be better to point folks at peek().
| "To prevent downloading excessive data, it is recommended to limit the data " | |
| "fetched with methods like .head() or .sample() before proceeding with downloads." | |
| "To prevent downloading excessive data, it is recommended to use the peek() method, " | |
| "or limit the data with methods like .head() or .sample() before proceeding with downloads." |
    if not value:
        warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
I don't think the setter makes the most sense for this warning. We want to encourage people to use this, not discourage them. I'd rather see it when they call a function that initiates query_and_wait and sampling has been requested.
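The suggestion above can be sketched in plain Python. This is a minimal, hypothetical illustration (the function name and parameters are illustrative, not the bigframes API): the warning fires at execution time, when sampling was actually requested but cannot be honored, rather than inside the option's setter.

```python
import warnings

LARGE_RESULTS_WARNING_MESSAGE = (
    "Sampling is disabled because 'allow_large_results' is set to False."
)


def execute_with_sampling(sampling_requested: bool, allow_large_results: bool) -> None:
    # Hypothetical sketch: warn only when sampling was requested and cannot
    # be honored, at the point where query_and_wait would be initiated.
    if sampling_requested and not allow_large_results:
        warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
    # ... proceed with query_and_wait and the (unsampled) download ...
```

This way users who never ask for sampling are not warned at all, which matches the goal of encouraging, not discouraging, use of the option.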
    UNKNOWN_LOCATION_MESSAGE = "The location '{location}' is set to an unknown value. Did you mean '{possibility}'?"

    LARGE_RESULTS_WARNING_MESSAGE = (
        "Sampling is disabled because 'allow_large_results' is set to False. "
Isn't the "head" style sampling still supported, just not sampling by number of bytes?
For now the "head" sampling is size_based like others, user cannot decide how many rows they want to download.
    # Since we cannot acquire the table size without a query_job,
    # we skip the sampling.
    fraction = 2
Here seems like a good place for the warning.
Updated. It just feels a little odd that the warning is shown after people have already started downloading without sampling.
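A minimal sketch of the discussion above, with hypothetical names (not the actual bigframes code): when the table size is unknown, return a fraction greater than 1 as a "skip sampling" sentinel, matching the `fraction = 2` in the quoted hunk, and emit the warning before the download starts so the message is still actionable.

```python
import warnings


def sampling_fraction(table_size_bytes, max_download_bytes):
    # Hypothetical sketch: without a query_job the table size is unknown,
    # so warn *before* the download begins and return a fraction > 1,
    # which downstream code treats as "no sampling applied".
    if table_size_bytes is None:
        warnings.warn(
            "Table size is unknown; downloading without sampling.", UserWarning
        )
        return 2  # fraction > 1 means sampling is skipped
    return min(1.0, max_download_bytes / table_size_bytes)
```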
bigframes/session/executor.py (outdated)
    assert query_job is not None and query_job.destination is not None
    table = self.bqclient.get_table(query_job.destination)
I'm curious why we don't use the destination argument for this instead of fetching from the job.
Yes, using destination is more straightforward. Updated, thanks.
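A sketch of the change agreed on above, assuming the `google-cloud-bigquery` client's `get_table` call (the helper name is hypothetical): read table metadata from the known destination directly, instead of re-fetching it via `query_job.destination`.

```python
def destination_table_bytes(bqclient, destination):
    # Hypothetical sketch: look up the table by the destination we already
    # hold, rather than going through the finished query job.
    table = bqclient.get_table(destination)  # google.cloud.bigquery Client API
    return table.num_bytes
```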
    job_config = bigquery.QueryJobConfig()
    # Use explicit destination to avoid 10GB limit of temporary table
    if use_explicit_destination:
        destination_table = self.storage_manager.create_temp_table(
I don't think it needs to be a full table, right? We should be able to avoid the tables.create call; _random_table seems more appropriate.
I assume the main reason is that we need to set an expiration time for the temp table.
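The reply above is the crux of the create_temp_table vs. _random_table trade-off: a created table can carry an expiration. A minimal sketch of computing such an expiration timestamp (the function name and 24-hour default are assumptions for illustration):

```python
import datetime


def temp_table_expiration(ttl_hours: float = 24) -> datetime.datetime:
    # Hypothetical sketch: compute an expiration timestamp to attach to the
    # temporary destination table, so orphaned tables are garbage-collected
    # automatically even if cleanup never runs.
    return datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(
        hours=ttl_hours
    )
```

With only a random table name and no tables.create call, there is no object to attach this expiration to, which is the trade-off being weighed here.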
    # Direct call to_pandas uses global default setting (allow_large_results=True),
    # table has 'bqdf' prefix.
    scalars_df_index.to_pandas()
    assert scalars_df_index._query_job.destination.table_id.startswith("bqdf")
Seems like it would still be useful to have such a test. Why are we removing this part?
It was removed because it's not the main thing we test here; added it back.
tests/system/small/test_dataframe.py (outdated)
    # The metrics won't be updated when we call query_and_wait.
    assert session._metrics.execution_count == execution_count
Seems like we could still track the number of queries we execute, right? Shouldn't need a job object for that.
Updated the metric to count the execution even when query_job is None.
Tests updated.
        api_timeout=timeout,
    )
    if metrics is not None:
        metrics.count_job_stats()
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query response includes totalBytesProcessed. Let's create a follow-up issue to include those metrics, too.
Filed internal issue 400961399
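For the follow-up suggested above, a minimal sketch of extracting that metric (the helper name is hypothetical): in the jobs.query REST response, totalBytesProcessed is an int64 encoded as a JSON string, so it needs conversion before being recorded.

```python
def total_bytes_processed(response: dict):
    # Hypothetical sketch: jobs.query REST responses encode
    # "totalBytesProcessed" as a string; convert it for metrics tracking,
    # returning None when the field is absent (e.g. dry runs in flight).
    value = response.get("totalBytesProcessed")
    return int(value) if value is not None else None
```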