PROFESSIONAL SUMMARY
• Around 12 years of professional IT experience in Big Data environments and the Hadoop ecosystem, with good
experience in Spark, SQL, and Java development.
• Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies such as
HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Impala, Oozie, Oozie Coordinator, ZooKeeper, Apache
Cassandra, and HBase.
• Experience using tools such as Sqoop, Flume, Kafka, NiFi, and Pig to ingest structured, semi-structured, and
unstructured data into the cluster.
• Designed both time-driven and data-driven automated workflows using Oozie and used ZooKeeper for cluster
coordination.
• Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell,
gsutil and bq command-line utilities, Dataproc, and Stackdriver.
• Experience with Hadoop clusters using Cloudera CDH and Hortonworks HDP.
• Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and
summarization activities according to requirements.
• Built data pipelines consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and
analyze operational data.
• Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
• Developed Spark jobs and Hive jobs to summarize and transform data.
• Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python.
• Experience in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing and
internal/external tables.
• Working knowledge of GCP and AWS environments, including Spark on AWS, with strong experience in cloud computing
platforms such as AWS.
• Expertise in writing MapReduce jobs in Java and Python to process large sets of structured, semi-structured, and
unstructured data and store the results in HDFS.
• Experience working with Python, UNIX and shell scripting.
• Designed and implemented large-scale distributed solutions on AWS and GCP.
• Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources such as flat files and
databases.
• Good knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2,
and Redshift, and with Microsoft Azure.
• Experience with complete Software Development Life Cycle (SDLC) process which includes Requirement Gathering,
Analysis, Designing, Developing, Testing, Implementing and Documenting.
• Worked with Waterfall and Agile methodologies.
• Good team player with excellent communication skills and a strong attitude toward learning new technologies.
• Hands-on experience with Spark architecture and its integrations such as the Spark SQL, DataFrames, and Datasets APIs.
• Used Spark to improve the performance of existing Hadoop processing, utilizing SparkContext, Spark SQL,
DataFrames, and RDDs.
• Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL and Python.
• Hands-on experience accessing Hive tables from Spark, performing transformations, and creating DataFrames on Hive
tables.
• Used Spark Structured Streaming to perform necessary transformations.
• Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
TECHNICAL SKILLSET
Hadoop: HDFS, MapReduce, Hive, Beeline, Sqoop, Flume, Oozie, Impala, Pig, Kafka, ZooKeeper, NiFi,
Cloudera Manager, Hortonworks
Spark Components: Spark Core, Spark SQL (DataFrames and Datasets), Scala, Python
WORK EXPERIENCE
Responsibilities:
Worked with product owners, designers, QA, and other engineers in an Agile development environment to deliver timely
solutions as per customer requirements.
Transferred data from different data sources into HDFS using Kafka producers, consumers, and
brokers.
Created Airflow scheduling scripts in Python.
Installed and configured Apache Airflow for workflow management and created workflows in Python.
Developed highly optimized Spark applications to perform data cleansing, validation, transformation, and
summarization activities according to requirements.
Developed and deployed Spark and Scala code on a Hadoop cluster running on GCP.
Experience with Cloud Service Providers such as Amazon AWS, Microsoft Azure, and Google GCP
Built pipelines to move hashed and unhashed data from XML files to the data lake.
Built data pipelines consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and
analyze operational data.
Developed Spark jobs and Hive jobs to summarize and transform data.
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python.
Designed and implemented migration strategies for traditional systems on Azure (lift and shift/Azure Migrate and other
third-party tools). Worked on the Azure suite: Azure SQL Database, Azure Data Lake (ADLS), Azure Data Factory (ADF) V2,
Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage,
Azure Search, Azure App Service, and Azure Data Platform services.
Used Oozie for automating the end-to-end data pipelines and Oozie coordinators for scheduling the workflows.
Demonstrable experience designing and implementing complex applications and distributed systems on public cloud
infrastructure (AWS, GCP, Azure, etc.).
Involved in migration from an existing database to the Snowflake cloud database.
Worked with developer teams on NiFi workflows to pick up data from a REST API server, from the data lake, and
from an SFTP server and send it to Kafka.
Involved in creating Hive tables, loading data, and writing Hive queries and views using HiveQL.
Launched a multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated the Dockerized application
from AWS to GCP.
Optimized Hive queries using map-side joins, dynamic partitions, and bucketing.
Applied Hive queries to perform data analysis on HBase using SerDe tables to meet the data requirements of
the downstream applications.
Responsible for executing Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and
query data in HBase.
Implemented MapReduce secondary sorting to get better performance for sorting results in MapReduce programs.
Experience with Google Cloud components, Google container builders, GCP client libraries, and Cloud SDKs.
Extracted and loaded data with Sqoop into the data lake environment (MS Azure), where it was accessed by business
users.
Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data
Lake, Data Factory, Data Lake Analytics, Stream
Loaded and transformed large sets of structured and semi-structured data, including Avro and sequence files.
Migrated all existing jobs to Spark to improve performance and reduce execution time.
Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon
Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
Used Hive join queries to join multiple tables of a source system and load the results into Elasticsearch.
Experience with the ELK Stack in building quick search and visualization capability for data.
Experience with data formats such as JSON, Avro, Parquet, and ORC and compression codecs such as Snappy and bzip.
Strong knowledge in RDBMS concepts, Data Modeling (Facts and Dimensions, Star/Snowflake Schemas), Data
Migration, Data Cleansing and ETL Processes.
Coordinated with the testing team for bug fixes and created documentation for recorded data, agent usage and
release cycle notes.
Environment: Hadoop, Big Data, HDFS, Scala, Python, Oozie, Hive, HBase, NiFi, Impala, Spark, AWS, Linux.
Responsibilities:
Developed a cloud-based EDW and data lake solution that supports data asset management,
data integration, and continuous data analytics discovery workloads.
Developed and implemented real-time data pipelines with Spark Streaming, Kafka, and Cassandra to replace
existing lambda architecture without losing the fault-tolerant capabilities of the existing architecture.
Created a Spark Streaming application to consume real-time data from Kafka sources and applied real-time data
analysis models that can be updated on new data in the stream as it arrives.
Worked on importing and transforming large sets of structured, semi-structured, and unstructured data.
Used Spark Structured Streaming to perform necessary transformations and build a data model that reads data
from Kafka in real time and persists it into HDFS.
Implemented workflows using the Apache Oozie framework to automate tasks. Used ZooKeeper to
coordinate cluster services.
Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure
Data Factory (ADF V1/V2).
Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and
SQL Data Warehouse environment.
Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data
ingestion and transformation in GCP, coordinated tasks among the team,
and created the proposed architecture and specifications along with recommendations.
Created various Hive external tables and staging tables and joined the tables as per the requirements.
Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external
tables. Used map-side joins and parallel execution to optimize Hive queries.
Developed and implemented custom Hive and Spark UDFs for date transformations such as date
formatting and age calculations as per business requirements.
Wrote programs in Spark using Scala and Python for data quality checks.
Wrote transformations and actions on DataFrames and used Spark SQL on DataFrames to access Hive tables in
Spark for faster processing of data.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
Set up alerting and monitoring using Stackdriver in GCP.
Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud
SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
Used Spark optimization techniques such as caching and refreshing tables, broadcast variables,
coalescing and repartitioning, increasing memory overhead limits, tuning parallelism, and modifying Spark default
configuration variables for performance tuning.
Performed various benchmarking steps to optimize the performance of Spark jobs and thus improve the overall
processing.
Processed semi-structured data into the Snowflake database using its full JSON support.
Worked in Agile environment in delivering the agreed user stories within the sprint time.
Environment: Hadoop, HDFS, Hive, Sqoop, Oozie, Spark, Scala, Kafka, Python, Cloudera, Linux.
Responsibilities:
Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks
distribution.
Designed and implemented large-scale distributed solutions on AWS and GCP.
Used Sqoop to load the data from relational databases.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
Worked with CSV, JSON, Avro, and Parquet file formats.
Implemented usage of Amazon EMR for processing Big Data across Hadoop Cluster of virtual servers on Amazon
Elastic Compute Cloud (EC2) and Amazon Simple Storage Service(S3).
Worked on Kafka to collect and load the data on Hadoop file systems.
Used Hive to form an abstraction on top of structured data residing in HDFS and implemented partitions and
buckets on Hive tables.
Developed and implemented real-time data pipelines with Spark Streaming.
Designed and developed data integration programs in a Hadoop environment with the NoSQL data store HBase for data
access and analysis.
Involved in migration from an existing database to the Snowflake cloud database.
Worked with Python to develop analytical jobs using the PySpark API of Spark.
Used the Apache Oozie job scheduler to execute workflows.
Used Ambari to monitor node health and job status and to run analytics jobs on Hadoop clusters.
Experience with PySpark for using Spark libraries from Python scripts for data analysis.
Worked on Tableau to build customized interactive reports, worksheets, and dashboards.
Involved in performance tuning of Spark jobs using caching and by taking full advantage of the cluster
environment.
Environment: Hadoop, Spark, Scala, Python, Kafka, Hive, Sqoop, PySpark, Ambari, Oozie, HBase, Tableau, Jenkins,
Hortonworks.
Responsibilities:
Designed and developed web services using Java/J2EE in a WebLogic environment. Developed web pages
using Java Servlets, JSP, CSS, JavaScript, DHTML, and HTML. Added extensive Struts validation. Wrote Ant
scripts to build and deploy the application.
Involved in the analysis, design, development, and unit testing of business requirements.
Developed business logic in Java/J2EE technology.
Implemented business logic and generated WSDL for those web services using SOAP.
Worked on developing JSP pages.
Implemented Struts Framework.
Developed Business Logic using Java/J2EE.
Modified Stored Procedures in Oracle Database.
Developed the application using Spring Web MVC framework.
Worked with Spring Configuration files to add new content to the website.
Worked on the Spring DAO module and ORM using Hibernate. Used HibernateTemplate and HibernateDaoSupport
for Spring-Hibernate communication.
Configured association mappings such as one-to-one and one-to-many in Hibernate.
Worked with the JavaScript calls that trigger the search when a search key is entered in the
search window.
Worked on analyzing other Search engines to make use of best practices.
Collaborated with the Business team to fix defects.
Worked on XML, XSL and XHTML files.
As part of the team to develop and maintain an advanced search engine, would be able to attain
Environment: Java 1.6, J2EE, Eclipse SDK 3.3.2, Java Spring 3.x, jQuery, Oracle 10g, Hibernate, JPA, JSON, Apache Ivy,
SQL, stored procedures, Shell Scripting, XML
Responsibilities:
Environment: Java, Spring MVC 3, Spring JPA, Hibernate, REST, ETL scripting, Python, JSP, Servlets, Agile, Apache
Tomcat.