Satyanarayana Gupta Kunda
Data Engineer | AWS | Azure | GCP | Hadoop | Big Data Analytics | Python | Java | ETL |
SQL | Snowflake
Professional Summary
Data Engineer with 9+ years of professional experience in analytical programming using Python, SQL, PySpark, Scala, and R. Proficient in end-to-end Big Data solution design, development, and deployment leveraging the Hadoop and Spark ecosystems, with extensive experience in ETL processes for scalable data processing and analysis. Expertise in designing, deploying, and managing cloud solutions on AWS, Azure, and GCP, with a strong focus on scalability, reliability, and cost optimization.
TECHNICAL SKILLS
• Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
• Hadoop Distribution: Cloudera and Hortonworks
• Programming Languages: Scala, Core Java, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell
Scripting
• Script Languages: JavaScript, jQuery, Python.
• Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake,
NoSQL, HBase, MongoDB
• Cloud Platforms: AWS, Azure, GCP
• Distributed Messaging System: Apache Kafka
• Data Visualization Tools: Tableau, Power BI, SAS, Excel
• Batch Processing: Hive, MapReduce, Pig, Spark
• Operating System: Linux (Ubuntu, Red Hat), Microsoft Windows
• Reporting Tools/ETL Tools: Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI
• GCP Services: BigQuery, Cloud Storage, Dataflow, Dataproc, Pub/Sub, and others.
CERTIFICATIONS
• Certified in Python Programming by HackerRank.
• Certified in Problem Solving by HackerRank.
• Certified in the Coursera course “AI For Everyone” by Andrew Ng.
• Certified in the Coursera course “Creating Database Tables with SQL”.
• AWS Certified Cloud Practitioner.
• Microsoft Certified: Azure Administrator Associate
• Microsoft Certified: Azure Data Engineer Associate
• Currently completing the Coursera course “Machine Learning” by Andrew Ng.
Environment: Hadoop, Spark, Hive, Teradata, Tableau, Linux, Python, Kafka, AWS S3 buckets, AWS Glue, StreamSets, Postgres, AWS EC2, Oracle PL/SQL, development toolkit (JIRA, Bitbucket/Git, ServiceNow, etc.)
Sr Data Engineer
Kaiser Permanente, Atlanta, GA
Feb 2020 – March 2022
Responsibilities:
• Craft highly scalable and resilient cloud architectures that address customer business problems
and accelerate the adoption of AWS services for clients.
• Developed upgrade and downgrade scripts in SQL that filter corrupted records with missing values and identify unique records based on different criteria.
• Implemented Azure Storage components, including storage accounts, Blob storage, and Azure SQL Server.
• Experience building, deploying, and troubleshooting data extraction pipelines for large volumes of records using Azure Data Factory (ADF).
• Developed custom Spark transformations and user-defined functions to handle complex data
processing requirements.
• Worked with cross-functional teams to design and develop Spark-based solutions that meet
business requirements.
• Extensive experience in designing and implementing ETL solutions using tools such as
Informatica, Talend, and Apache NiFi.
• Knowledge of data warehousing concepts, such as data modeling, dimensional modeling, and
ETL architectures.
• Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
• Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.
• Migrate data from traditional database systems to Azure databases.
• Design and implement migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).
• Familiarity with Hadoop security mechanisms, such as Kerberos, Ranger, and Sentry.
• Experience in setting up and managing Hadoop clusters, including configuration, tuning, and
troubleshooting.
• Ability to handle large volumes of data and optimize ETL processes for better performance.
• Set up and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
• Develop conceptual solutions & create proofs-of-concept to demonstrate viability of solutions.
• Implement Copy activity, Custom Azure Data Factory Pipeline Activities.
• Primarily involved in Data Migration using SQL, SQL Azure, Azure storage, and Azure Data
Factory, SSIS, PowerShell.
• Understanding of NoSQL databases, such as HBase and Cassandra, and their integration with
Hadoop.
• Knowledge of data warehousing concepts and their implementation in Hadoop.
• Developed scalable and efficient ETL workflows using Apache Spark to process large volumes of
data from multiple sources.
• Built Spark applications using Spark SQL and data frames for data manipulation, querying, and
analysis.
• Optimized Spark jobs to improve performance and reduce processing time.
• Develop dashboards and visualizations to help business users analyze data and provide data insights to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.
• Developed Python scripts to do file validations in Databricks and automated the process using
ADF.
• Analyzed data where it lives by Mounting Azure Data Lake and Blob to Databricks.
• Used Logic App to take decisional actions based on the workflow.
• Developed custom alerts using Azure Data Factory, SQLDB and Logic App.
• Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting (a minimal sketch follows this list).
• Developed complex SQL queries using stored procedures, common table expressions (CTEs), and temporary tables to support Power BI reports.
• Independently manage development of ETL processes, from development through delivery.
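To illustrate the Databricks pipeline work above, here is a minimal sketch of a notebook-style PySpark validation and load step; the mount path, table name, and column names (claim_id, ingest_ts) are assumptions for illustration, not the actual production objects.

# Minimal sketch of a Databricks-style PySpark validation/ETL step.
# Mount path, table name, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("file_validation_etl").getOrCreate()

# Read raw JSON files previously landed in the mounted data lake path
raw = spark.read.json("/mnt/datalake/raw/claims/")

# Basic validation: drop records missing the key, de-duplicate, stamp ingest time
validated = (
    raw.filter(F.col("claim_id").isNotNull())
       .dropDuplicates(["claim_id"])
       .withColumn("ingest_ts", F.current_timestamp())
)

# Persist the curated output as a table for downstream Power BI reporting
validated.write.mode("overwrite").saveAsTable("curated.claims")

In the actual pipelines this kind of notebook step would be orchestrated and scheduled from Azure Data Factory, which the sketch does not show.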
Environment: Azure SQL, Azure Storage Explorer, Azure Storage, Azure Blob Storage, Azure Backup,
Azure Files, Azure Data Lake Storage, SQL Server Management Studio 2016, Visual Studio 2015, VSTS,
Azure Blob, Spark, Hadoop, Power BI, PowerShell, C# .Net, SSIS, DataGrid, ETL.
Responsibilities:
• Developed upgrade and downgrade scripts in SQL that filter corrupted records with missing values and identify unique records based on different criteria.
• Worked on ingesting data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
• Created YAML files for each data source, including Glue table stack creation.
• Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
• Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS); a minimal sketch of this pattern follows this list.
• Developed and executed a migration strategy to move Data Warehouse from an Oracle platform
to AWS Redshift.
• Migrate data from on-premises to AWS storage buckets.
• Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue
using Python and PySpark.
• Experience in integrating ETL processes with Big Data technologies, such as Hadoop and Spark.
• Familiarity with cloud-based ETL tools, such as AWS Glue and Azure Data Factory.
• Knowledge of data governance principles and their implementation in ETL processes.
• Familiarity with ETL tools, such as Talend and Informatica, and their integration with Hadoop.
• Proficiency in using Hadoop-based data integration tools, such as Flume and Sqoop.
• Experience with Hadoop deployment models, such as on-premises, cloud-based, and hybrid.
• Installed/Configured/Maintained Apache Hadoop clusters for application development based on
the requirements.
• Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
• Developed Spark scripts by using Scala Shell commands as per the requirement.
• Developed Spark code in Python using PySpark and Spark SQL for faster testing and processing of data.
• Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
• Created interactive and highly informative Power BI reports.
• Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job
for daily imports.
• Developed a Spark job which indexes data into Elastic Search from external Hive tables which are
in HDFS.
• Developed Spark programs with Scala and applied principles of functional programming to do
batch processing.
• Involved in the development of real time streaming applications using PySpark, Apache Flink,
Kafka, Hive on distributed Hadoop Cluster.
• Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
• Developed application to clean semi-structured data like JSON/XML into structured files before
ingesting them into HDFS.
• Built real time pipeline for streaming data using Kafka and Spark Streaming.
• Wrote HiveQL as per requirements, processed data in the Spark engine, and stored results in Hive tables.
• Responsible for importing data from Postgres to HDFS and Hive using Sqoop.
• Create Snowflake data models and optimize data workflows to support data processing and
analysis for a financial services client.
• Implemented DevOps practices such as continuous integration and delivery using Jenkins and
Travis CI.
• Worked with business stakeholders to define KPIs and build dashboards using Tableau and QuickSight.
• Developed a framework for automated data ingestion from sources such as relational databases, delimited files, JSON files, and XML files into HDFS, and built Hive/Impala tables on top of them.
• Developed real-time data ingestion application using Flume and Kafka.
• Developed ETL processes using Talend and Apache Nifi to transfer data from legacy databases to
Snowflake.
• Designed, developed, and deployed ETL pipelines using AWS services such as Lambda, Glue, EMR, Step Functions, CloudWatch Events, SNS, Redshift, S3, and IAM.
• Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
• Data Extraction, aggregations and consolidation of Adobe data within AWS Glue using PySpark.
• Optimize Snowflake databases for performance and scalability for an e-commerce client,
resulting in a 30% increase in data processing speed.
• Hive tables were created on HDFS to store the data processed by Apache Spark on the Cloudera
Hadoop Cluster in Parquet format.
• Developed a tool in Scala and Apache Spark to load S3 JSON files into Hive tables in Parquet format.
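As referenced in the Lambda bullet above, the sketch below shows one plausible shape for a trigger-driven function: an S3-event Lambda that starts an AWS Glue job over a newly landed file. The Glue job name, the --source_path argument, and the single-trigger setup are hypothetical, not the exact production configuration.

# Minimal sketch of an S3-triggered Lambda that starts a Glue job run.
# The Glue job name and --source_path argument are hypothetical.
import json
import boto3

glue = boto3.client("glue")

def handler(event, context):
    run_ids = []
    # S3 put events arrive as a list of records; read the bucket/key from each
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Start the Glue job that builds the on-demand table over the new object
        response = glue.start_job_run(
            JobName="build_on_demand_table",  # hypothetical job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])

    return {"statusCode": 200, "body": json.dumps(run_ids)}

The same handler shape applies to SQS or EventBridge triggers; only the event parsing changes.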
Environment: Cloudera, Hive, Impala, Spark, Apache Kafka, Flume, Scala, AWS, EC2, S3, DynamoDB, Auto Scaling, Lambda, NiFi, Power BI, Snowflake, Core Java, shell scripting, SQL, Sqoop, Oozie, Java, PL/SQL, Oracle 12c, SQL Server, HBase
Responsibilities:
• Running Spark SQL operations on JSON, converting the data into a tabular structure with data
frames, and storing and writing the data to Hive and HDFS.
• Developing shell scripts for data ingestion and validation with different parameters, as well as writing custom shell scripts to invoke Spark jobs.
• Tuned performance of Informatica mappings and sessions for improving the process and making
it efficient after eliminating bottlenecks.
• Familiarity with data science tools and frameworks, such as R, Python, and TensorFlow, and their
integration with Hadoop.
• Expertise in designing and implementing ETL workflows using workflow management tools, such
as Apache Airflow and Oozie.
• Familiarity with data visualization tools, such as Tableau and Power BI, and their integration with
ETL processes.
• Knowledge of machine learning algorithms and their implementation in Hadoop.
• Expertise in developing Hadoop-based real-time streaming applications, using tools such as
Kafka and Storm.
• Strong debugging and troubleshooting skills for Hadoop-based applications.
• Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
• Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
• Created a risk-based machine learning model (logistic regression, random forest, SVM, etc.) to predict which customers are more likely to be delinquent based on historical performance data and rank-order them.
• Evaluated model output using the confusion matrix (precision, recall), as well as Teradata resources and utilities (BTEQ, FastLoad, MultiLoad, FastExport, and TPump).
• Developed a monthly report using Python to code the payment results of customers and make
suggestions to the manager.
• Ingested and processed Comcast set-top box clickstream events in real time with Spark 2.x, Spark Streaming, Databricks, Apache Storm, Kafka, and the Apache Ignite in-memory data grid (distributed cache); a minimal streaming sketch follows this list.
• Used various DML and DDL commands for data retrieval and manipulation, such as SELECT, INSERT, UPDATE, subqueries, inner joins, outer joins, UNION, and other advanced SQL.
• Extracted, transformed, and loaded data into the Netezza Data Warehouse from various sources such as Oracle and flat files using Informatica PowerCenter 9.6.1.
• Constructed data pipelines to pull data from SQL Server and Hive, landed the data in AWS S3, and loaded it into Snowflake after transformation.
• Performed data analysis on large relational datasets using optimized diverse SQL queries.
• Developed queries to create, modify, update, and delete data in the Oracle database and to analyze the data.
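A minimal sketch of the real-time clickstream ingestion mentioned above, written against Spark Structured Streaming with a Kafka source; the broker address, topic name, event schema, and output paths are assumptions for illustration only.

# Minimal sketch: consume set-top box clickstream events from Kafka and
# persist them as Parquet. Broker, topic, schema, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("clickstream_ingest").getOrCreate()

# Expected shape of each JSON event (illustrative)
schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", LongType()),
])

# Read the raw Kafka stream and parse the JSON payload in the value column
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "stb-clickstream")
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
         .select("e.*")
)

# Write micro-batches to Parquet with a checkpoint for fault tolerance
query = (
    events.writeStream.format("parquet")
          .option("path", "/data/clickstream/parquet")
          .option("checkpointLocation", "/data/clickstream/_chk")
          .start()
)
query.awaitTermination()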
Environment: Scala 2.12.8, Python 3.7.2, PySpark, Spark 2.4, Spark MLlib, Spark SQL, TensorFlow 1.9, NumPy 1.15.2, Keras 2.2.4, Power BI, Spark Streaming, Hive, Kafka, ORC, Avro, Parquet, HBase, HDFS, Informatica, AWS
Software Developer
First America, Hyderabad, India
Project: Data Analytics
May 2013 – June 2015
Responsibilities:
• Develop, improve, and scale processes, structures, workflows, and best practices for data
management and analytics.
• Experience working with big data ingestion, storage, processing, and analysis.
• Collaborate with Project Management to provide accurate forecasts, reports, and status.
• Work in a fast-paced agile development environment to analyze, create, and evaluate possible
business use cases.
• Experience in designing, developing, and implementing ETL processes for a variety of data
sources, including databases, flat files and web services.
• Strong proficiency in ETL tools such as Informatica, Talend, and SSIS.
• Familiarity with both batch processing and real-time streaming ETL methods.
• Hands-on experience with Pig and Hive for data processing, Sqoop for data ingestion, Oozie for scheduling, and Zookeeper for cluster coordination.
• Worked on the Apache Spark Scala code base, performing actions and transformations on RDDs,
Data Frames, and Datasets using SparkSQL and Spark Streaming Contexts.
• Transferred data between HDFS and relational database systems using Sqoop, along with upkeep and troubleshooting.
• Worked on analyzing Hadoop clusters with various big data analytic tools such as Pig, HBase
database, and Sqoop.
• Worked on NoSQL enterprise development and data loading into HBase with Impala and Sqoop.
• Executed several MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
• Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
• Evaluated the suitability of Hadoop and its ecosystem for the project, and implemented/validated various proof-of-concept (POC) applications before adopting them as part of the big data Hadoop initiative.
• Developed PySpark applications in Python and implemented an Apache PySpark data processing project to handle data from various RDBMS and streaming sources.
• Handled importing of data from various data sources, performed data control checks using PySpark, and loaded data into HDFS (a minimal sketch follows this list).
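A minimal sketch of the PySpark import-and-check pattern described in the last two bullets, assuming a hypothetical MySQL source table and HDFS target path; the JDBC URL, credentials handling, and the specific data control check are illustrative only.

# Minimal sketch: pull a table from an RDBMS over JDBC, run a simple data
# control check, and load the result into HDFS as Parquet.
# JDBC URL, table name, credentials, and paths are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rdbms_to_hdfs").getOrCreate()

# Read the source table over JDBC (the driver jar must be on the classpath)
orders = (
    spark.read.format("jdbc")
         .option("url", "jdbc:mysql://dbhost:3306/sales")
         .option("dbtable", "orders")
         .option("user", "etl_user")
         .option("password", "****")
         .load()
)

# Data control check: abort the load if any primary-key values are missing
null_keys = orders.filter(F.col("order_id").isNull()).count()
if null_keys > 0:
    raise ValueError(f"{null_keys} records are missing order_id; aborting load")

# Land the validated data in HDFS in Parquet format
orders.write.mode("overwrite").parquet("hdfs:///data/landing/orders")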
Environment: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Core Java, Spring, MVC, XML, Spark 1.9, PL/SQL, HDFS, JSON, Hibernate, Bootstrap, jQuery, JSP, JavaScript, AJAX, Oracle 10g/11g, MySQL, SQL Server, Teradata, Cassandra.
EDUCATION
Master of Science in Information Systems Technologies from University of Memphis, Memphis, TN, USA
May 2016
B. Tech in Computer Science from Jawaharlal Nehru Technological University, Hyderabad, India May
2013