Big Data - Resume
SUMMARY:
7+ years of professional IT experience as a software developer, with core expertise in
building Big Data and Hadoop data pipelines.
4+ years of Big Data experience building highly scalable data analytics applications.
Experience working with Hadoop ecosystem components such as HDFS, MapReduce, Hive, HBase,
Sqoop, Oozie, Spark and Kafka.
Strong understanding of distributed systems design, HDFS architecture, and the internal
workings of the MapReduce and Spark processing frameworks.
Solid experience developing Spark applications that perform highly scalable data
transformations using RDDs, DataFrames and Spark SQL.
Good hands-on experience working with various Hadoop distributions, mainly Cloudera
(CDH), Hortonworks (HDP) and Amazon EMR.
Expertise in developing production-ready Spark applications using the Spark Core,
DataFrame, Spark SQL, Spark ML and Spark Streaming APIs.
Strong experience troubleshooting failures in Spark applications and fine-tuning Spark
applications and Hive queries for better performance.
Good experience applying Spark optimization techniques such as broadcast joins,
caching/persisting, right-sizing executors and reducing shuffle stages (a brief sketch
follows this summary).
Worked extensively on Hive for building complex data analytical applications.
Strong experience writing complex MapReduce jobs, including development of custom
InputFormats and RecordReaders.
Sound knowledge of map-side joins, reduce-side joins, shuffle and sort, the distributed
cache, compression techniques, and multiple Hadoop input and output formats.
Good experience working with AWS cloud services such as S3, EMR, Redshift, Athena and the
Glue metastore.
Deep understanding of performance tuning and partitioning for optimizing Spark applications.
Worked on building real time data workflows using Kafka, Spark streaming and HBase.
Extensive knowledge of NoSQL databases such as HBase, Cassandra and MongoDB.
Solid experience working with CSV, text, Avro, Parquet, ORC and JSON data formats.
Extensive experience performing ETL on structured and semi-structured data using Pig Latin
scripts.
Designed and implemented Hive and Pig UDFs in Java for evaluating, filtering, loading and
storing data.
Strong understanding of data modelling and experience with data cleansing, data profiling
and data analysis.
Experience writing test cases in Java using JUnit.
Proficient with IDEs such as Eclipse and NetBeans.
Good knowledge of scalable, secure cloud architectures on Amazon Web Services, including
EC2, CloudFormation, VPC and S3.
Good knowledge of core programming concepts such as algorithms, data structures and
collections.
Good understanding of service-oriented architecture (SOA) and web service technologies such
as XML, XSD, WSDL and SOAP.
Excellent communication and interpersonal skills; flexible and adaptable to new
environments; self-motivated team player and positive thinker who enjoys working in a
multicultural environment.
Analytical, organized and enthusiastic about working in a fast-paced, team-oriented
environment.
Experienced in interacting with business users, understanding their requirements and
providing solutions that meet them.
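The following is a minimal, illustrative Scala sketch of the Spark optimizations referenced above (broadcast joins, caching and shuffle reduction); the bucket paths, table layout and column names are hypothetical and not taken from any specific project.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinOptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-optimization-sketch")
      .getOrCreate()

    // Large fact data and a small lookup table (placeholder paths).
    val transactions = spark.read.parquet("s3://example-bucket/transactions/")
    val customers    = spark.read.parquet("s3://example-bucket/customers/")

    // Broadcast the small side so the large side is joined without a full shuffle.
    val enriched = transactions.join(broadcast(customers), Seq("customer_id"))

    // Cache a DataFrame that several downstream aggregations reuse.
    enriched.cache()
    enriched.groupBy("customer_id").count().show()
    enriched.groupBy("region").count().show()

    spark.stop()
  }
}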
TECHNICAL SKILLS:
Hadoop/Big Data: Spark, Hive, HDFS, MapReduce, Sqoop, Oozie, Kafka, Impala
Programming Languages: Java, Scala, Python
Cloud: AWS (EC2, S3, EMR, RDS, Lambda, Redshift, Athena, Glue Metastore)
Databases: NoSQL (HBase, Cassandra, MongoDB), Teradata, Oracle, DB2, MySQL, Postgres
IDE Tools: Eclipse, IntelliJ
Development Approach: Agile, Waterfall
Version Control: CVS, SVN, Git
Reporting Tools: Tableau, QlikView
PROFESSIONAL EXPERIENCE:
Responsibilities:
Developed Spark applications in Scala using the DataFrame and Spark SQL APIs for faster
data processing.
Developed highly optimized Spark applications to perform data cleansing, validation,
transformation and summarization activities according to requirements (a representative
sketch follows this role).
Built a data pipeline consisting of Spark, Hive, Sqoop and custom-built input adapters to
ingest, transform and analyze operational data.
Developed Spark and Hive jobs to summarize and transform data.
Used Spark for interactive queries, streaming data processing and integration with a
popular NoSQL database for large data volumes.
Involved in converting Hive/SQL queries into Spark transformations using Spark
DataFrames and Scala.
Automated creation and termination of AWS EMR clusters using Amazon Java SDK.
Involved in deploying spark and hive applications in AWS stack.
Imported data from different data sources into S3 using Sqoop and performed
transformations using Hive and Spark.
Exported the analyzed data to Redshift using Spark for visualization and report generation
by the BI team.
Helped DevOps engineers deploy code and debug issues.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for
reporting.
Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
Environment: AWS EMR, S3, Spark, Scala, Hive, Sqoop, ETL, Java, Athena, Glue, Maven, GitHub
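Below is a hedged Scala sketch of the kind of cleansing and summarization job described in this role; the S3 paths, column names and business rules are placeholders, not the actual project code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, trim}

object CleansingJobSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cleansing-job-sketch")
      .getOrCreate()

    // Ingest raw CSV landed in S3 by an upstream adapter (placeholder path).
    val raw = spark.read
      .option("header", "true")
      .csv("s3://example-bucket/raw/orders/")

    // Cleansing and validation: parse dates, trim strings, drop bad records.
    val cleansed = raw
      .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
      .withColumn("status", trim(col("status")))
      .filter(col("order_id").isNotNull)

    // Express the summarization step as Spark SQL, mirroring the original Hive query.
    cleansed.createOrReplaceTempView("orders")
    val summary = spark.sql(
      """SELECT order_date, status, COUNT(*) AS order_count
        |FROM orders
        |GROUP BY order_date, status""".stripMargin)

    // Publish curated output as Parquet for downstream consumers.
    summary.write.mode("overwrite").parquet("s3://example-bucket/curated/order_summary/")

    spark.stop()
  }
}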
Responsibilities:
Involved in creating data ingestion pipelines for collecting clinical trial and health record
data from various external sources like FTP Servers and S3 buckets.
Involved in migrating the existing Teradata data warehouse to an AWS S3-based data lake.
Involved in migrating existing traditional ETL jobs to Spark and Hive jobs on the new
cloud data lake.
Wrote complex Spark applications for performing various de-normalization of the
datasets and creating a unified data analytics layer for downstream teams.
Primarily responsible for fine-tuning long-running Spark applications, writing custom
Spark UDFs and troubleshooting failures.
Involved in building a real-time pipeline using Kafka and Spark Streaming to deliver
event messages from an external REST-based application to a downstream application team
(see the sketch following this role).
Involved in creating Hive scripts for performing ad hoc data analysis required by the
business teams.
Worked extensively on migrating on-premises workloads to the AWS cloud.
Worked on utilizing AWS cloud services like S3, EMR, Redshift, Athena and Glue
Metastore.
Used broadcast variables, efficient joins, caching and other Spark capabilities for data
processing.
Involved in continuous integration and delivery (CI/CD) of the application using Jenkins.
Responsible for debugging and troubleshooting the running applications in production.
Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, MapReduce.
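A minimal Structured Streaming sketch of the Kafka-to-Spark delivery pipeline mentioned in this role; the broker address, topic name and event schema are assumptions for illustration only, and the console sink stands in for the real downstream target.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object EventStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("event-stream-sketch")
      .getOrCreate()

    // Assumed schema of the JSON event messages.
    val eventSchema = new StructType()
      .add("event_id", StringType)
      .add("event_type", StringType)
      .add("event_time", TimestampType)

    // Read raw events from Kafka and parse the JSON payload.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "clinical-events")
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).as("event"))
      .select("event.*")

    // Deliver parsed events; console output is a stand-in for the downstream system.
    val query = events.writeStream
      .format("console")
      .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
      .start()

    query.awaitTermination()
  }
}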
Responsibilities:
Involved in writing Spark applications using Scala to perform various data cleansing,
validation, transformation, and summarization activities according to the requirement.
Loaded data into Spark RDDs and performed in-memory computation to generate output as per
the requirements.
Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze
operational data.
Worked on performance tuning of Spark application to improve performance.
Streamed data in real time using Spark with Kafka; responsible for handling streaming data
from web server console logs.
Worked with different file formats such as text, SequenceFiles, Avro, Parquet, JSON, XML
and flat files using MapReduce programs.
Developed a daily process for incremental import of data from DB2 and Teradata into Hive
tables using Sqoop.
Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in
HDFS.
Analyzed the SQL scripts and designed a solution to implement them using Spark.
Solved performance issues in Hive and Pig scripts by understanding how joins, grouping and
aggregation translate to MapReduce jobs.
Worked with cross-functional consulting teams within the data science and analytics group to
design, develop and execute solutions that derive business insights and solve clients'
operational and strategic problems.
Exported the analyzed data to the relational databases using Sqoop for visualization and to
generate reports for the BI team.
Extensively used HiveQL queries to query data in Hive tables and loaded data into HBase
tables.
Extensively worked with partitioning, dynamic partitioning and bucketing of Hive tables;
designed both managed and external tables and optimized Hive queries (see the sketch
following this role).
Involved in collecting and aggregating large amounts of log data using Flume and staging data
in HDFS for further analysis.
Environment: HDFS, Map Reduce, Sqoop, Hive, Pig, Oozie, HBase, Python, Yarn, Spark, Tableau and
Cloudera Manager.
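An illustrative Scala sketch of loading data into a partitioned, bucketed Hive table from Spark, in the spirit of the Hive work above; the database, table, column names and paths are assumptions.

import org.apache.spark.sql.SparkSession

object HiveLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-load-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Read the day's incremental extract (e.g. landed by Sqoop) from HDFS.
    val daily = spark.read.parquet("/data/landing/claims/2020-01-01/")

    // Write into a Hive table partitioned by load date and bucketed by member id.
    daily.write
      .mode("append")
      .partitionBy("load_date")
      .bucketBy(16, "member_id")
      .sortBy("member_id")
      .format("parquet")
      .saveAsTable("analytics.claims")

    spark.stop()
  }
}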
Responsibilities:
Responsible for building scalable distributed data solutions using Hadoop; worked hands-on
with the ETL process using Pig.
Worked on data analysis in HDFS using MapReduce, Hive and Pig jobs.
Worked on MapReduce programming and MapReduce-HBase integration.
Involved in creating external tables and partitioning and bucketing tables in Hive (see the
sketch following this role).
Ensured adherence to guidelines and standards in the project process.
Facilitated testing across different dimensions.
Wrote and modified stored procedures to load and modify data according to business rule
changes.
Worked in a production support environment.
Extracted the data from Teradata into HDFS using Sqoop.
Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
Developed Hive queries to process the data and generate data cubes for visualization.
Implemented Kerberos security to safeguard the cluster.
Worked on a stand-alone as well as a distributed Hadoop application.
Environment: Apache Hadoop, Cloudera, Pig, Hive, Sqoop, Flume, Java/J2EE, Oracle 11g, Crontab,
JBoss 5.1.0 Application Server, Linux OS, Windows OS, AWS.
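A hedged sketch of creating an external, partitioned Hive table over Sqoop-extracted files and registering a partition, run here through Spark SQL; the schema, table names and HDFS locations are illustrative only.

import org.apache.spark.sql.SparkSession

object ExternalTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // External table leaves the Sqoop-extracted files in place under /data/raw.
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS staging.transactions (
        |  txn_id BIGINT,
        |  account_id BIGINT,
        |  amount DECIMAL(12,2)
        |)
        |PARTITIONED BY (load_date STRING)
        |STORED AS PARQUET
        |LOCATION '/data/raw/transactions'""".stripMargin)

    // Register the partition for the latest extract.
    spark.sql(
      """ALTER TABLE staging.transactions
        |ADD IF NOT EXISTS PARTITION (load_date = '2019-06-01')
        |LOCATION '/data/raw/transactions/2019-06-01'""".stripMargin)

    spark.stop()
  }
}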
Responsibilities:
Performed analysis of the client requirements based on the detailed design documents.
Developed User Interface using JavaScript and HTML.
Implemented MVC architecture by creating Model, View and Controller classes.
Involved in unit testing, debugging and bug fixing of application modules.
Extensively involved in writing the SQL queries to fetch data from database.
Defined Web Services using XML-based Web Services Description Language.
Built Java APIs/services backing user interface screens using Spring MVC.
Integrated other systems through XML.
Worked with Core Java concepts like Collections Framework, multi-threading, memory
management.
Resolved issues with the JVM and multi-threading; connected to the backend database using
JDBC.
Developed data access objects using JDBC and SQL (see the sketch following this role).
Environment: Java, J2EE, JSP, JDBC, EJB, log4j, XML, Apache Tomcat, JUNIT, DB2, SQL Server, CVS.
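A small data-access sketch in the spirit of the JDBC/DAO work above, rendered in Scala (against the same java.sql API) for consistency with the other examples; the connection URL, credentials, table and columns are placeholders.

import java.sql.{Connection, DriverManager, ResultSet}

object CustomerDao {
  // Look up a customer's name by id; a real implementation would use a pooled DataSource.
  def findName(customerId: Long): Option[String] = {
    val conn: Connection =
      DriverManager.getConnection("jdbc:db2://dbhost:50000/APPDB", "app_user", "secret")
    try {
      val stmt = conn.prepareStatement("SELECT name FROM customers WHERE id = ?")
      stmt.setLong(1, customerId)
      val rs: ResultSet = stmt.executeQuery()
      if (rs.next()) Some(rs.getString("name")) else None
    } finally {
      conn.close()
    }
  }
}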