Aakanksha Aundhkar
aakanshaanu2915@gmail.com | 304-787-8057
Professional Summary:
Overall 8+ years of experience in Machine Learning and Data Mining with large datasets of structured and
unstructured data, Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
Experience in coding SQL/PL SQL using Procedures, Triggers, and Packages.
Extensive experience in Text Analytics, developing different Statistical Machine Learning, Data Mining
solutions to various business problems and generating data visualizations using R, Python.
Excellent Knowledge of Relational Database Design, Data Warehouse/OLAP concepts, and
methodologies.
Data-driven and highly analytical, with working knowledge of statistical modeling approaches and
methodologies (Clustering, Regression analysis, Hypothesis testing, Decision trees, Machine learning)
and of business rules in an ever-evolving regulatory environment.
Professional working experience in Machine Learning algorithms such as Linear Regression, Logistic
Regression, Naive Bayes, Decision Trees, K-Means Clustering and Association Rules.
Expertise in transforming business requirements into analytical models, designing algorithms, building
models, developing data mining and reporting solutions that scale across a massive volume of
structured and unstructured data.
Experience with data visualization using tools like ggplot2, Matplotlib, Seaborn, and Tableau, and in
using Tableau to publish and present dashboards and storylines on web and desktop platforms.
Experienced in Python data manipulation for loading and extraction, as well as with Python libraries
such as NumPy, SciPy, and pandas for data analysis and numerical computations.
Well experienced in Normalization, De-Normalization and Standardization techniques for optimal
performance in relational and dimensional database environments.
Experience in multiple software tools and languages to provide data-driven analytical solutions to
decision makers or research teams.
Familiar with predictive models using numeric and classification prediction algorithms like support
vector machines and neural networks, and ensemble methods like bagging, boosting and random forest
to improve the efficiency of the predictive model.
Worked on Text Mining and Sentiment Analysis to extract unstructured data from various social media
platforms like Facebook, Twitter, and Reddit.
Good knowledge of NoSQL databases like MongoDB and HBase.
Developed, maintained, and taught new tools and methodologies related to data science and high-
performance computing.
Extensive hands-on experience and high proficiency with structured, semi-structured, and unstructured
data, using a broad range of data science programming languages and big data tools including R, Python,
Spark, SQL, Scikit-learn, and Hadoop MapReduce.
Experienced in Cluster Analysis, Principal Component Analysis (PCA), Association Rules, and Recommender Systems.
Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design
Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
Adept in statistical programming languages like R and Python including Big Data technologies like
Hadoop, Hive.
Hands on experience with RStudio for doing data pre-processing and building machine learning
algorithms on different datasets.
Worked with and extracted data from various database sources like Oracle, SQL Server, and DB2.
Implemented machine learning algorithms on large datasets to understand hidden patterns and capture
insights.
Predictive Modelling Algorithms: Logistic Regression, Linear Regression, Decision Trees, K-Nearest
Neighbors, Bootstrap Aggregation (Bagging), Naive Bayes Classifier, Random Forests, Boosting, Support
Vector Machines.
Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6,
Ubuntu 13/14, and Cosmos.
Technical Skills:
OLAP/BI/ETL Tools: Business Objects 6.1/XI, MS SQL Server 2008/2005 Analysis Services (MS OLAP, SSAS), Integration Services (SSIS), Reporting Services (SSRS), PerformancePoint Server (PPS), Oracle 9i OLAP, MS Office Web Components (OWC11), DTS, MDX, Crystal Reports 10, Crystal Enterprise 10 (CMC)
Web Technologies: JDBC, HTML5, DHTML, XML, CSS3, Web Services, WSDL
Data Modeling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner
Big Data Technologies: Spark, Pig, Hive, HDFS, MapReduce, Kafka
Databases: SQL Server, MySQL, MS Access, HBase, Teradata, Netezza, MongoDB, Cassandra, SAP HANA; query layers: SQL, Hive, Impala, Pig, Spark SQL, HDFS
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0
Version Control Tools: SVN, GitHub
Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure SQL Data Warehouse
Operating Systems: Windows, Linux, UNIX, Mac OS, Red Hat
Education Details:
Bachelor of Engineering in Computer Science
Professional Experience:
Client: Wal-Mart, Rosemead, CA June 2018 to Present
Data Scientist/Machine Learning Engineer
Responsibilities:
Implemented Data Exploration to analyze patterns and to select features using Python SciPy.
Built Factor Analysis and Cluster Analysis models using Python SciPy to classify customers into different
target groups.
Built predictive models including Support Vector Machine, Random Forests and Naïve Bayes Classifier
using Python Scikit-Learn to predict the personalized product choice for each client.
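A minimal sketch of this kind of classifier comparison in scikit-learn; synthetic data stands in for the client features, which are not shown here:
```python
# Sketch only: synthetic features replace the actual client data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [("SVM", SVC()),
                    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42)),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # held-out accuracy
```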
Using R's dplyr and ggplot2 packages, performed an extensive graphical visualization of overall data,
including customized graphical representation of revenue reports, specific item sales statistics and
visualization.
Designed and implemented cross-validation and statistical tests including Hypothesis Testing, ANOVA,
and autocorrelation to verify the models' significance.
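An illustrative sketch of such significance checks with scipy.stats; the group samples are synthetic:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, 200)   # synthetic metric per group
group_b = rng.normal(10.5, 2.0, 200)
group_c = rng.normal(9.8, 2.0, 200)

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)  # one-way ANOVA
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)          # two-sample t-test
print(f"ANOVA p={p_anova:.4f}, t-test p={p_ttest:.4f}")
```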
Designed an A/B experiment for testing the business performance of the new recommendation system.
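A minimal sketch of evaluating such an A/B experiment with a two-proportion z-test from statsmodels; the conversion counts below are illustrative, not the actual experiment results:
```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [420, 480]    # illustrative: control vs. treatment conversions
exposures = [10000, 10000]  # users exposed to each variant
z_stat, p_value = proportions_ztest(conversions, exposures)
print(f"z={z_stat:.3f}, p={p_value:.4f}")  # reject H0 at p < 0.05
```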
Supported MapReduce Programs running on the cluster.
Evaluated business requirements and prepared detailed specifications that follow project guidelines
required to develop written programs.
Configured Hadoop cluster with Namenode and slaves and formatted HDFS.
Used Oozie workflow engine to run multiple Hive and Pig jobs.
Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using
Hadoop MapReduce and HDFS.
Performed Data Enrichment jobs to deal with missing values, normalize data, and select features
using HiveQL.
Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
Analyzed the partitioned and bucketed data and computed various metrics for reporting.
Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
Worked on loading the data from MySQL to HBase where necessary using Sqoop.
Developed Hive queries for Analysis across different banners.
Extracted data from Twitter using Java and the Twitter API. Parsed JSON-formatted Twitter data and
uploaded it to the database.
Launched Amazon EC2 cloud instances using Amazon Machine Images (Linux/Ubuntu) and configured the
launched instances for specific applications.
Developed Hive queries for analysis, and exported the result set from Hive to MySQL using Sqoop after
processing the data.
Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
Created HBase tables to store data in various formats coming from different portfolios.
Worked on improving performance of existing Pig and Hive Queries.
Created reports and dashboards using D3.js and Tableau 9.x to explain and communicate data
insights, significant features, model scores, and the performance of the new recommendation system to
both technical and business teams.
Utilized SQL, Excel, and several marketing/web analytics tools (Google Analytics, AdWords) to
complete business and marketing analysis and assessment.
Used Git 2.x for version control with Data Engineer team and Data Scientists colleagues.
Used Agile methodology and the SCRUM process for project development.
Environment: R, Python, HDFS, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Metadata, MS Excel, Mainframes,
MS Visio, MapReduce, Rational Rose, SQL, and MongoDB.
Performed Data Profiling to learn about user behavior through various features such as traffic pattern,
location, date, and time.
Extracted the data from hive tables by writing efficient Hive queries.
Performed preliminary data analysis using descriptive statistics and handled anomalies such as removing
duplicates and imputing missing values.
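A minimal pandas sketch of this preliminary step; the toy frame and column names are illustrative:
```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({"amount": [10.0, None, 10.0, 25.0],
                   "segment": ["a", "b", "a", None]})
print(df.describe())                                            # descriptive statistics
df = df.drop_duplicates()                                       # remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())       # impute numeric column
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])   # impute categorical column
```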
Analyzed data and performed data preparation by applying a historical model to the dataset in Azure ML.
Applied various machine learning algorithms and statistical modeling techniques such as decision trees,
text analytics, natural language processing (NLP), supervised and unsupervised learning, regression
models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume,
using the scikit-learn package in Python as well as MATLAB.
Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
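A minimal Airflow DAG sketch of this style of pipeline automation; the DAG id, task names, and schedule are illustrative:
```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="feature_pipeline",          # illustrative DAG id
         start_date=datetime(2021, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    clean = BashOperator(task_id="clean", bash_command="echo clean")
    extract >> clean   # dependency: clean runs only after extract succeeds
```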
Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep
learning frameworks such as Caffe and Neon.
Conducted a hybrid of Hierarchical and K-means Cluster Analysis using IBM SPSS and identified
meaningful segments of customers through a discovery approach.
Developed Spark/Scala, Python, and R code for a regular expression (regex) project in the Hadoop/Hive
environment with Linux/Windows for big data resources. Used the K-Means clustering technique to identify
outliers and to classify unlabeled data.
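A sketch of flagging outliers via distance to K-Means centroids, on synthetic data with an illustrative 99th-percentile cutoff:
```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(500, 4))   # synthetic feature matrix
km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
# Distance of each point to its assigned centroid.
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = X[dists > np.percentile(dists, 99)]       # flag the farthest 1%
print(len(outliers), "candidate outliers")
```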
Evaluated models using cross-validation, the log loss function, and ROC curves, and used AUC for feature
selection; worked with Elastic technologies like Elasticsearch and Kibana.
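An illustrative sketch of these evaluation metrics via scikit-learn's cross-validation scorers, on synthetic data:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")       # ROC AUC per fold
ll = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")   # negated log loss
print(f"AUC: {auc.mean():.3f}, log loss: {-ll.mean():.3f}")
```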
Worked with the NLTK library for NLP data processing and pattern discovery.
Categorized comments from different social networking sites into positive and negative clusters using
Sentiment Analysis and Text Analytics.
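A minimal sentiment-scoring sketch with NLTK's VADER analyzer; the example comment is illustrative:
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()
score = sia.polarity_scores("The delivery was fast and the product is great!")
label = "positive" if score["compound"] >= 0 else "negative"
print(label, score)
```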
Analyzed traffic patterns by calculating autocorrelation at different time lags.
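A short pandas sketch of lagged autocorrelation on a synthetic traffic series; the lags are illustrative:
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
traffic = pd.Series(np.sin(np.arange(200) * 0.26) + rng.normal(0, 0.1, 200))
for lag in (1, 7, 24):   # illustrative lags
    print(f"lag {lag}: autocorrelation = {traffic.autocorr(lag=lag):.3f}")
```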
Ensured that the model had a low false positive rate; performed text classification and sentiment analysis
for unstructured and semi-structured data.
Addressed overfitting by implementing regularization methods like L2 and L1.
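A sketch of L1/L2 regularization with scikit-learn's penalized logistic regression on synthetic data; the C value is illustrative:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)
l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)       # L2: shrinks weights
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)  # L1: sparsifies weights
print("nonzero L1 coefficients:", (l1.coef_ != 0).sum())
```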
Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
Created and designed reports that use gathered metrics to infer and draw logical conclusions about past
and future behavior.
Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM modeling to classify
whether a package will be delivered on time for a new route.
Implemented different models like Logistic Regression, Random Forest and Gradient-Boost Trees to
predict whether a given die will pass or fail the test.
Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from
the Oracle database, and used ETL for data transformation.
Used MLlib, Spark's machine learning library, to build and evaluate different models.
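A minimal PySpark ML sketch of building and evaluating a model; the tiny DataFrame and column names are illustrative:
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 0.0), (2.0, 1.0, 1.0),
                            (3.0, 0.5, 1.0), (0.5, 3.0, 0.0)],
                           ["f1", "f2", "label"])
data = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(data)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(data))
print("AUC:", auc)   # evaluated on the training frame, for illustration only
```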
Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages
in Python.
Developed a MapReduce pipeline for feature extraction using Hive and Pig.
Created Data Quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
Created various types of data visualizations using Python and Tableau.
Communicated results to the operations team to support better decisions.
Collected data needs and requirements by interacting with other departments.
Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, SSRS, PL/SQL, T-SQL, Tableau, MLlib,
regression, cluster analysis, Scala, NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, PySpark, Teradata,
random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.
Responsibilities:
Gathered business requirements, defined and designed the data sourcing, and worked with the data
warehouse architect on the development of logical data models.
Collaborated with Data Engineers to filter data as per the project requirements.
Conducted reverse engineering based on demo reports to understand the undocumented data, redefined
the proper requirements, and negotiated them with the client.
Implemented an automated ticket-routing algorithm using a term affinity matrix, one of the NLP
models.
Implemented various statistical techniques to manipulate the data (missing data imputation, principal
component analysis, and sampling).
Applied different dimensionality reduction techniques such as principal component analysis (PCA) and
t-distributed stochastic neighbor embedding (t-SNE) to the feature matrix.
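A sketch of this reduction step, PCA to compress the features followed by a 2-D t-SNE embedding, on a synthetic feature matrix; component counts and perplexity are illustrative:
```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = make_classification(n_samples=300, n_features=40, random_state=0)
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)  # keep top 10 components
X_2d = TSNE(n_components=2, perplexity=30,
            random_state=0).fit_transform(X_pca)               # 2-D embedding for plotting
print(X_pca.shape, X_2d.shape)   # (300, 10) (300, 2)
```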
Identified outliers and inconsistencies in data by conducting exploratory data analysis (EDA) using
python NumPy and Seaborn to see the insights of data and validate each feature.
Performed univariate and multivariate analysis on the data to identify any underlying pattern in the data
and associations between the variables.
Worked with Market Mix Modeling to strategize the advertisement investments to better balance the
ROI on advertisements.
Worked on several proof-of-concept models using Deep Learning and Neural Networks.
Performed feature engineering including feature intersection generation, feature normalization, and label
encoding with scikit-learn preprocessing.
Developed and implemented predictive models using machine learning algorithms such as linear
regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering,
KNN.
Used clustering techniques like DBSCAN, K-means, K-means++ and Hierarchical clustering for customer
profiling to design insurance plans according to their behavior pattern.
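An illustrative comparison of these clustering techniques in scikit-learn, on synthetic customer features; the cluster counts and DBSCAN parameters are illustrative:
```python
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=400, centers=4, random_state=3)
X = StandardScaler().fit_transform(X)   # scale before distance-based clustering
kmeans_labels = KMeans(n_clusters=4, init="k-means++", n_init=10,
                       random_state=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)   # -1 marks noise points
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)
```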
Used Grid Search to find the best hyper-parameters for the model and the K-fold cross-validation
technique to train the model for best results.
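A sketch of grid search over K-fold cross-validation with scikit-learn; the estimator and parameter grid are illustrative:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
                    cv=5, scoring="roc_auc")   # 5-fold CV over every grid combination
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```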
Worked with Customer Churn models, including random forest regression and lasso regression, along with
pre-processing of the data.
Implemented time-based learning-rate decay and drop-based learning-rate schedules, reducing the
model's training time by 2.8 minutes over 10 epochs.
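A sketch of the two schedules (time-based and drop-based decay) as plain Python functions; the constants are illustrative, and wiring into a framework callback (e.g., a Keras LearningRateScheduler) is assumed rather than shown:
```python
import math

def time_based_decay(epoch, lr0=0.01, decay=0.01):
    # learning rate shrinks smoothly every epoch
    return lr0 / (1.0 + decay * epoch)

def step_decay(epoch, lr0=0.01, drop=0.5, epochs_per_drop=10):
    # learning rate is halved every `epochs_per_drop` epochs
    return lr0 * math.pow(drop, math.floor(epoch / epochs_per_drop))

for epoch in (0, 5, 10, 20):
    print(epoch, round(time_based_decay(epoch), 5), round(step_decay(epoch), 5))
```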
Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (PySpark, MLlib) to develop
a variety of models and algorithms for analytic purposes.
Designed rich data visualizations to model data into human-readable form with Matplotlib.
Environment: MS Excel, Agile, Oracle 11g, SQL Server, SOA, SSIS, SSRS, ETL, UNIX, T-SQL, HP Quality Center 11,
RDM (Reference Data Management).
Responsibilities:
Wrote SQL queries to retrieve information from the databases depending on the requirement.
Evaluated actual and estimated execution plans for queries with performance issues and altered the
queries for better performance.
Created mappings to extract data from SQL Server and to migrate and transform data from
Text/Access/Excel spreadsheets using SQL Server Integration Services (SSIS).
Conducted data cleansing to remove unnecessary columns, eliminate redundant and inconsistent data
with SSIS transformations.
Created SQL Server Integration Services (SSIS) packages to extract data from Excel files, flat files, text
files, and comma-separated values (CSV) files.
Designed new database tables and mapping documents used to guide ETL coding to meet business
information needs.
Wrote complex stored procedures that utilized dynamic SQL for reusability purposes such as index
monitoring and defragmentation.
Created SSIS packages that implemented business logic in the transformation stage by utilizing different
transformations in the SSIS toolbox, such as Aggregate, Merge, and Merge Join.
Performed unit testing of ETL SSIS packages that performed ETL (Extract, Transform and Load) processes
and data cleansing.
Utilized transformations in the SSIS data flow and SSIS tasks in the control flow, such as For Loop
containers and Fuzzy Lookup.
Implemented logging within SSIS packages, including custom logging with the help of the Script Task.
Followed the SDLC in an Agile approach and adhered to the standard processes in the application.
Conducted performance tuning and MS SQL Server development using SQL Profiler, SQL scripts, stored
procedures, triggers, functions, and transaction analysis, with a thorough understanding of indexes and
statistics (Query Optimizer).
Optimized indexes, SQL queries, and stored procedures using Database Tuning Advisor, SQL Server
Profiler, and execution plan.
Designed and implemented complex SSIS packages to migrate data from multiple data sources for data
analysis.
Environment: MS SQL Server 2005/2008/2008R2/2012, SQL Server Integration Services (SSIS), SSRS, SSAS,
Windows 2000/NT/XP, Query Analyzer, MS Office 2003.