Backporting FileBasedSink changes from Beam#482
Backporting FileBasedSink changes from Beam#482jkff wants to merge 1 commit intoGoogleCloudPlatform:masterfrom
Conversation
| LOG.debug("{} temporary files matched {}", matches.size(), pattern); | ||
| LOG.debug("Removing {} files.", matches.size()); | ||
| fileOperations.remove(matches); | ||
| protected final void removeTemporaryFiles(List<String> knownFiles, PipelineOptions options) |
There was a problem hiding this comment.
not backwards-compatible. Please be on-guard for this.
There was a problem hiding this comment.
The rest of the current change is technically also not backward-compatible (e.g. the set of protected member variables has changed) for someone who was overriding FileBasedSink/FileBasedWriteOperation/... in a highly non-trivial way (e.g. trying to construct filenames manually using the protected members).
Note that, of the file-based sinks I've seen, none would be broken by this change, because they don't peek into these protected variables.
However, as for those that potentially do, I don't think there's a way to make the current change backward-compatible for them. There's a choice:
- Make the current change, with a (I think) very small chance of breaking somebody, but in return, fix deletion of files and data loss/duplication issues.
- Abandon the current change and say that the problem is only fixed in the Beam SDK
- Do something much weaker but backward-compatible instead: e.g., simply add code for removing all known files, but do not place temp files in a subdirectory. As an unfortunate side effect, this will cause a divergence in code and behavior between Dataflow and Beam SDKs.
Which would you prefer? Are there other options?
There was a problem hiding this comment.
The Beam and Dataflow SDKs have diverged on several points already and are expected to keep diverging. Maintaining consistency between the two is not a goal, providing the best experience with Beam SDK is, maintaining backwards compatibility in Dataflow SDK is.
|
OK, I'm closing this PR in favor of #484. |
Original PRs:
apache/beam#1050
apache/beam#1278
Changes simply copied over. No modifications.
R: @lukecwik