Data Format Transformations using Cloud Dataflow and Apache Beam

Utility transforms to transform from one file format to another for a large number of files using Apache Beam running on Google Cloud Dataflow.

The transformations supported by this utility are:

  • CSV to Avro
  • Avro to CSV

Setup

These setup instructions assume you have an active Google Cloud project with an associated billing account. The following steps prepare your development environment.

  1. Install Cloud SDK.

  2. Set up Cloud SDK

    gcloud init
    
  3. Select your Google Cloud Project if not already selected

    gcloud config set project [PROJECT_ID]
    
  4. Clone repository

    git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
    
  5. Navigate to the sample code directory

    cd java-docs-samples/dataflow/transforms
    

Grant required permissions

The examples are configured for Cloud Dataflow, which runs on Google Compute Engine. To read and write objects in your Google Cloud Storage bucket, the Compute Engine default service account requires the permissions storage.objects.get, storage.objects.list, and storage.objects.create, granted through the bucket's IAM policy.

Learn more about Cloud Storage IAM Roles and Bucket-level IAM.

The following steps are optional if:

  • The project you use to run these Dataflow transformations also owns the buckets used to read and write objects.
  • The bucket you're reading data from is public, e.g., allUsers is granted roles/storage.objectViewer.

  1. Get the Compute Engine default service account using the following gcloud command:

    gcloud compute project-info describe
    

    The default service account is listed next to defaultServiceAccount: in the output of the command.

  2. Grant the roles/storage.objectViewer role on the bucket so that a Dataflow job can get and list objects:

    gsutil iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectViewer gs://[BUCKET_NAME]
    
    • Replace [COMPUTE_DEFAULT_SERVICE_ACCOUNT] with the Compute Engine default service account.
    • Replace [BUCKET_NAME] with the bucket you use to read your input data.
  3. Grant the roles/storage.objectCreator role on the bucket so that a Dataflow job can create output objects:

    gsutil iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectCreator gs://[BUCKET_NAME]
    
    • Replace [COMPUTE_DEFAULT_SERVICE_ACCOUNT] with the Compute Engine default service account.
    • Replace [BUCKET_NAME] with the bucket you use to write your output data.
  4. If the bucket contains both input and output data, grant the roles/storage.objectAdmin role to the default service account using gsutil:

    gsutil iam ch serviceAccount:[COMPUTE_DEFAULT_SERVICE_ACCOUNT]:objectAdmin gs://[BUCKET_NAME]
    
    • Replace [COMPUTE_DEFAULT_SERVICE_ACCOUNT] with the Compute Engine default service account.
    • Replace [BUCKET_NAME] with the bucket you use to read and write your input and output data respectively.
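
The steps above can be sketched as a small dry-run helper. This is an illustrative sketch, not part of the sample repository: the function names extract_default_sa and print_grant_commands are hypothetical, and the script only prints the gsutil commands so you can review them before running anything. It assumes the describe output contains a line of the form defaultServiceAccount: followed by the account email.

```shell
#!/usr/bin/env bash
# Sketch only: prints the gsutil commands instead of running them.

# Read the output of `gcloud compute project-info describe` on stdin and
# print the value that follows "defaultServiceAccount:".
extract_default_sa() {
  awk '$1 == "defaultServiceAccount:" { print $2; exit }'
}

# Print the gsutil command(s) for a service account and two buckets.
# A shared bucket gets objectAdmin (step 4); separate buckets get
# objectViewer on the input (step 2) and objectCreator on the output (step 3).
print_grant_commands() {
  local sa="$1" input_bucket="$2" output_bucket="$3"
  if [ "$input_bucket" = "$output_bucket" ]; then
    echo "gsutil iam ch serviceAccount:${sa}:objectAdmin gs://${input_bucket}"
  else
    echo "gsutil iam ch serviceAccount:${sa}:objectViewer gs://${input_bucket}"
    echo "gsutil iam ch serviceAccount:${sa}:objectCreator gs://${output_bucket}"
  fi
}

# Example usage (requires gcloud; the bucket names are placeholders):
#   sa="$(gcloud compute project-info describe | extract_default_sa)"
#   print_grant_commands "$sa" my-input-bucket my-output-bucket
```

Printing the commands rather than executing them keeps the IAM change an explicit, reviewable step.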

Using transformations

Avro to CSV transformation

To transform Avro-formatted files to CSV, use the following command:

# Example

mvn compile exec:java -Dexec.mainClass=com.example.AvroToCsv \
     -Dexec.args="--avroSchema=gs://bucket/schema.avsc --inputFile=gs://bucket/*.avro --output=gs://bucket/output --runner=Dataflow"

A full description of the options is available with the following command:

mvn compile exec:java -Dexec.mainClass=com.example.AvroToCsv -Dexec.args="--help=com.example.SampleOptions"

CSV to Avro transformation

To transform CSV-formatted files without a header to Avro, use the following command:

# Example

mvn compile exec:java -Dexec.mainClass=com.example.CsvToAvro \
     -Dexec.args="--avroSchema=gs://bucket/schema.avsc --inputFile=gs://bucket/*.csv --output=gs://bucket/output --runner=Dataflow"

A full description of the options is available with the following command:

mvn compile exec:java -Dexec.mainClass=com.example.CsvToAvro -Dexec.args="--help=com.example.SampleOptions"
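
Both example commands take the same arguments and differ only in the main class, so the command line can be assembled from a few values. The following is a minimal sketch under that assumption; print_transform_command is a hypothetical helper, not part of the sample, and it only prints the Maven command so it can be inspected first.

```shell
#!/usr/bin/env bash
# Sketch only: prints the Maven command instead of running it.

# $1: main class name (AvroToCsv or CsvToAvro)
# $2: Avro schema location
# $3: input file pattern (quote it so the shell does not expand the *)
# $4: output prefix
print_transform_command() {
  local main_class="$1" schema="$2" input="$3" output="$4"
  printf 'mvn compile exec:java -Dexec.mainClass=com.example.%s -Dexec.args="--avroSchema=%s --inputFile=%s --output=%s --runner=Dataflow"\n' \
    "$main_class" "$schema" "$input" "$output"
}

# Example usage with the placeholder paths from the commands above:
#   print_transform_command CsvToAvro gs://bucket/schema.avsc 'gs://bucket/*.csv' gs://bucket/output
```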

The existing example does not support headers in CSV files.
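
If your CSV files do contain a header row, one possible workaround is to remove it before uploading the files. The following is a minimal sketch, assuming the header occupies exactly the first line of each file; strip_csv_header and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the header is exactly the first line of the file.

# Write a copy of the CSV file $1 to $2 without its first line.
strip_csv_header() {
  tail -n +2 "$1" > "$2"
}

# Example usage (file names are placeholders):
#   strip_csv_header data.csv data_noheader.csv
```

The resulting file can then be copied to the input bucket in place of the original.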

Run Tests

Tests can be run locally using the DirectRunner.

mvn verify