Conversation
| "To prevent downloading excessive data, it is recommended to limit the data " | ||
| "fetched with methods like .head() or .sample() before proceeding with downloads." |
Might be better to point folks at peek().
| "To prevent downloading excessive data, it is recommended to limit the data " | |
| "fetched with methods like .head() or .sample() before proceeding with downloads." | |
| "To prevent downloading excessive data, it is recommended to use the peek() method, " | |
| "or limit the data with methods like .head() or .sample() before proceeding with downloads." |
    if not value:
        warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
I don't think the setter makes the most sense for this warning. We want to encourage people to use this, not discourage them. I'd rather see it when they call a function that initiates query_and_wait and sampling has been requested.
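The suggestion above can be sketched in plain Python. This is a minimal, hypothetical illustration (the function name and parameters are illustrative, not the bigframes API): the warning fires at execution time, when sampling was actually requested but cannot be honored, rather than inside the option's setter.

```python
import warnings

LARGE_RESULTS_WARNING_MESSAGE = (
    "Sampling is disabled because 'allow_large_results' is set to False."
)


def execute_with_sampling(sampling_requested: bool, allow_large_results: bool) -> None:
    # Hypothetical sketch: warn only when sampling was requested and cannot
    # be honored, at the point where query_and_wait would be initiated.
    if sampling_requested and not allow_large_results:
        warnings.warn(LARGE_RESULTS_WARNING_MESSAGE, UserWarning)
    # ... proceed with query_and_wait and the (unsampled) download ...
```

This way users who never ask for sampling are not warned at all, which matches the goal of encouraging, not discouraging, use of the option.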
    UNKNOWN_LOCATION_MESSAGE = "The location '{location}' is set to an unknown value. Did you mean '{possibility}'?"

    LARGE_RESULTS_WARNING_MESSAGE = (
        "Sampling is disabled because 'allow_large_results' is set to False. "
Isn't the "head" style sampling still supported, just not sampling by number of bytes?
For now the "head" sampling is size_based like others, user cannot decide how many rows they want to download.
    # Since we cannot acquire the table size without a query_job,
    # we skip the sampling.
    fraction = 2
Here seems like a good place for the warning.
Updated. It just feels a little odd that the warning is shown after people have already started downloading without sampling.
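A minimal sketch of the discussion above, with hypothetical names (not the actual bigframes code): when the table size is unknown, return a fraction greater than 1 as a "skip sampling" sentinel, matching the `fraction = 2` in the quoted hunk, and emit the warning before the download starts so the message is still actionable.

```python
import warnings


def sampling_fraction(table_size_bytes, max_download_bytes):
    # Hypothetical sketch: without a query_job the table size is unknown,
    # so warn *before* the download begins and return a fraction > 1,
    # which downstream code treats as "no sampling applied".
    if table_size_bytes is None:
        warnings.warn(
            "Table size is unknown; downloading without sampling.", UserWarning
        )
        return 2  # fraction > 1 means sampling is skipped
    return min(1.0, max_download_bytes / table_size_bytes)
```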
bigframes/session/executor.py (outdated)
    assert query_job is not None and query_job.destination is not None
    table = self.bqclient.get_table(query_job.destination)
I'm curious why we don't use the destination argument for this instead of fetching from the job.
Yes, using destination is more straightforward. Updated, thanks.
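A sketch of the change agreed on above, assuming the `google-cloud-bigquery` client's `get_table` call (the helper name is hypothetical): read table metadata from the known destination directly, instead of re-fetching it via `query_job.destination`.

```python
def destination_table_bytes(bqclient, destination):
    # Hypothetical sketch: look up the table by the destination we already
    # hold, rather than going through the finished query job.
    table = bqclient.get_table(destination)  # google.cloud.bigquery Client API
    return table.num_bytes
```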
    job_config = bigquery.QueryJobConfig()
    # Use explicit destination to avoid 10GB limit of temporary table
    if use_explicit_destination:
        destination_table = self.storage_manager.create_temp_table(
I don't think it needs to be a full table, right? We should be able to avoid the tables.create call; _random_table seems more appropriate.
I assume the main reason is that we need to set an expiration time for the temp table.
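The reply above is the crux of the create_temp_table vs. _random_table trade-off: a created table can carry an expiration. A minimal sketch of computing such an expiration timestamp (the function name and 24-hour default are assumptions for illustration):

```python
import datetime


def temp_table_expiration(ttl_hours: float = 24) -> datetime.datetime:
    # Hypothetical sketch: compute an expiration timestamp to attach to the
    # temporary destination table, so orphaned tables are garbage-collected
    # automatically even if cleanup never runs.
    return datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(
        hours=ttl_hours
    )
```

With only a random table name and no tables.create call, there is no object to attach this expiration to, which is the trade-off being weighed here.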
    # Direct call to_pandas uses global default setting (allow_large_results=True),
    # table has 'bqdf' prefix.
    scalars_df_index.to_pandas()
    assert scalars_df_index._query_job.destination.table_id.startswith("bqdf")
Seems like it would still be useful to have such a test. Why are we removing this part?
It was removed because it's not the main thing we test here; added it back.
tests/system/small/test_dataframe.py (outdated)
    # The metrics won't be updated when we call query_and_wait.
    assert session._metrics.execution_count == execution_count
Seems like we could still track the number of queries we execute, right? Shouldn't need a job object for that.
Updated the metric to count the execution even when query_job is None.
Tests updated.
        api_timeout=timeout,
    )
    if metrics is not None:
        metrics.count_job_stats()
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query response includes totalBytesProcessed. Let's create a follow-up issue to include those metrics, too.
Filed internal issue 400961399
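For the follow-up suggested above, a minimal sketch of extracting that metric (the helper name is hypothetical): in the jobs.query REST response, totalBytesProcessed is an int64 encoded as a JSON string, so it needs conversion before being recorded.

```python
def total_bytes_processed(response: dict):
    # Hypothetical sketch: jobs.query REST responses encode
    # "totalBytesProcessed" as a string; convert it for metrics tracking,
    # returning None when the field is absent (e.g. dry runs in flight).
    value = response.get("totalBytesProcessed")
    return int(value) if value is not None else None
```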