Utility transforms for converting a large number of files from one format to another using Apache Beam running on Google Cloud Dataflow.
The transformations supported by this utility are:
- CSV to Avro
- Avro to CSV
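Conceptually, each transform is a Beam pipeline that reads records in one format and writes them in the other. The sketch below illustrates only the per-record idea behind the CSV-to-Avro direction, using the standard library alone; the class and field names are illustrative and not the sample's API, and the real pipeline builds Avro `GenericRecord`s from an `.avsc` schema inside a Beam `DoFn`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: pair each CSV column with the corresponding
// schema field name, in order. The actual CsvToAvro sample does this with
// an Avro Schema and GenericRecord inside a Beam DoFn.
public class CsvToRecordSketch {
    public static Map<String, String> toRecord(String[] fieldNames, String csvLine) {
        String[] values = csvLine.split(",", -1); // -1 keeps trailing empty columns
        if (values.length != fieldNames.length) {
            throw new IllegalArgumentException("Column count does not match schema");
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            record.put(fieldNames[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        String[] fields = {"id", "name", "price"};
        System.out.println(toRecord(fields, "1,widget,9.99"));
        // prints {id=1, name=widget, price=9.99}
    }
}
```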
Setup instructions assume you have an active Google Cloud project with an associated billing account. The following instructions will help you prepare your development environment.
- Install the Cloud SDK.
- Set up the Cloud SDK:

      gcloud init

- Select your Google Cloud project if it is not already selected:

      gcloud config set project [PROJECT_ID]

- Clone the repository:

      git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git

- Navigate to the sample code directory:

      cd dataflow/transforms
The examples are configured for Cloud Dataflow, which runs on Google Compute Engine.
The Compute Engine default service account requires the storage.objects.create,
storage.objects.get, and storage.objects.list permissions to read and write
objects in your Google Cloud Storage bucket; grant them through the bucket's IAM policy.
Learn more about Cloud Storage IAM Roles and Bucket-level IAM.
The following steps are optional if:
- the project you use to run these Dataflow transformations also owns the buckets used to read and write objects, or
- the bucket you're reading data from is public, e.g., allUsers are granted roles/storage.objectViewer.
- Get the Compute Engine default service account using the following gcloud command:

      gcloud compute project-info describe

  The default service account is listed next to defaultServiceAccount: in the command's output.
- Grant the roles/storage.objectViewer role on the bucket so a Dataflow job can get and list its objects:

      gsutil iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectViewer gs://[BUCKET_NAME]

  - Replace [COMPUTE_DEFAULT_SERVICE_ACCOUNT] with the Compute Engine default service account.
  - Replace [BUCKET_NAME] with the bucket you use to read your input data.
- Grant the roles/storage.objectCreator role on the bucket so a Dataflow job can create output objects:

      gsutil iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectCreator gs://[BUCKET_NAME]

  - Replace [COMPUTE_DEFAULT_SERVICE_ACCOUNT] with the Compute Engine default service account.
  - Replace [BUCKET_NAME] with the bucket you use to write your output data.
- If the same bucket contains both input and output data, grant the roles/storage.objectAdmin role to the default service account instead:

      gsutil iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectAdmin gs://[BUCKET_NAME]

  - Replace [COMPUTE_DEFAULT_SERVICE_ACCOUNT] with the Compute Engine default service account.
  - Replace [BUCKET_NAME] with the bucket you use to read and write your input and output data.
To transform Avro-formatted files to CSV, use the following command:

    # Example
    mvn compile exec:java -Dexec.mainClass=com.example.AvroToCsv \
        -Dexec.args="--avroSchema=gs://bucket/schema.avsc --inputFile=gs://bucket/*.avro --output=gs://bucket/output --runner=Dataflow"

A full description of the options is available with the following command:

    mvn compile exec:java -Dexec.mainClass=com.example.AvroToCsv -Dexec.args="--help=com.example.SampleOptions"
To transform CSV-formatted files without a header to Avro, use the following command:

    # Example
    mvn compile exec:java -Dexec.mainClass=com.example.CsvToAvro \
        -Dexec.args="--avroSchema=gs://bucket/schema.avsc --inputFile=gs://bucket/*.csv --output=gs://bucket/output --runner=Dataflow"

A full description of the options is available with the following command:

    mvn compile exec:java -Dexec.mainClass=com.example.CsvToAvro -Dexec.args="--help=com.example.SampleOptions"

The example does not support CSV files that contain a header row.
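If your CSV files do contain a header row, one possible workaround (not part of this sample) is to drop lines matching the known header before converting the records. A minimal stdlib-only sketch of that idea, with an illustrative class name; in a Beam pipeline the same filter could be expressed with `Filter.by(...)`:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative workaround (not part of the sample): remove lines that
// match the known header text before handing records to the converter.
public class HeaderFilterSketch {
    public static List<String> dropHeader(List<String> lines, String header) {
        return lines.stream()
                .filter(line -> !line.equals(header))
                .collect(Collectors.toList());
    }
}
```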
Tests can be run locally using the DirectRunner:

    mvn verify