DSA – 105 Introduction to
Data Science
Week 4 – Tools and Technologies in Data Science
Ferdin Joe John Joseph, PhD
Faculty of Information Technology
Thai-Nichi Institute of Technology

Week 4
Agenda
• Tools and Technologies in Data Science

Tools and Technologies in 2017

Programming Languages
• Python
• R
• Java
• C++
• Perl
• MATLAB/Octave

Databases
• MySQL
• NoSQL
• Microsoft SQL Server
• Oracle

Data Analytics Tools
• SAS
• Tableau
• IBM SPSS Statistics
• Microsoft Excel
• Statistica
• RapidMiner
• SAP

API
• Scala
• TensorFlow
• Amazon Web Services

Servers and Application Frameworks
• Hadoop
• Spark
• Microsoft Azure
• Jupyter

Tools and Technologies in 2018

Recruiters Requirements 2018

Tools - Summary

R: The Most Popular Language for Data
Science
Once the data scientist has completed the often time-consuming process of “cleaning” and preparing the data
for analysis, R is a popular software package for actually doing the math and visualizing the results. An
open-source statistical modeling language, R has traditionally been popular in the academic community, which
means that many data scientists will be familiar with it.
R has thousands of extension packages that allow statisticians to undertake specialized tasks, including
text analysis, speech analysis, and genomic analysis. The center of a thriving open-source ecosystem,
R has become increasingly popular as programmers have created add-on packages for handling big
datasets and for the parallel processing techniques that have come to dominate statistical modeling today.
• parallel helps R take advantage of parallel processing on both multicore Windows machines and clusters of
POSIX (OS X, Linux, UNIX) machines.
• snow helps divide R calculations across a cluster of computers, which is useful for computationally intensive
processes like simulations or machine learning.
• RHadoop and RHIPE allow programmers to interface with Hadoop from R, which is particularly important for
MapReduce: dividing a computing problem among the nodes of a cluster and then recombining,
or “reducing,” the partial results into a single answer.
R is used in industries like finance, health care, marketing, business, pharmaceutical development, and more.
Industry leaders like Bank of America, Bing, Facebook, and Foursquare use R to analyze their data, make
marketing campaigns more effective, and produce reports.
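
The MapReduce idea described above (split the problem, compute partial results, then "reduce" them into one answer) can be sketched in a few lines. This is only an illustration in plain Python, not RHadoop, RHIPE, or Hadoop itself; the function names and the three sample lines are made up for the example, and the "workers" here simply run one after another.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(records, n_chunks):
    """Divide the records into chunks, one per (hypothetical) worker."""
    return [records[i::n_chunks] for i in range(n_chunks)]

def map_chunk(chunk):
    """Map step: count words using only the lines in this chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right

# Made-up sample data for the sketch.
lines = [
    "R and Python",
    "Python and Java",
    "R and Hadoop",
]

chunks = split_into_chunks(lines, n_chunks=2)
partials = [map_chunk(chunk) for chunk in chunks]  # on a cluster, these would run in parallel
total = reduce(reduce_counts, partials, Counter())

print(total["and"], total["python"], total["r"])  # prints: 3 2 2
```

Each partial count depends only on its own chunk, which is what lets a real cluster hand the chunks to different machines.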

Java & the Java Virtual Machine
Organizations that seek to write custom analytics tools from scratch
increasingly use the venerable language Java, as well as other
languages that run on the Java Virtual Machine (JVM). Java is an
object-oriented language whose syntax borrows heavily from C++, and
because Java code is compiled to bytecode that runs on a
platform-agnostic virtual machine, programs can be compiled once and
run anywhere a JVM is available.
The upside of using the JVM over a language compiled to run directly on
the processor is the reduction in development time. This simpler
development process has been a draw for data analytics, making
JVM-based data mining tools very popular. Also, Hadoop (the popular
open-source, distributed big data storage and analysis software) is
written in Java.

Java & the Java Virtual Machine
Java has rich open-source libraries for data mining, including Mahout and Weka, and the
JVM provides robust memory management and exception handling. Other programming
languages that can be used with the JVM include:
• Scala: This programming language offers performance comparable to Java’s because it compiles to
JVM bytecode. It has also become increasingly popular in data mining because it permits
developers to combine object-oriented programming (OOP) with functional
programming. Users of Scala include The Guardian, LinkedIn, Foursquare, Novell,
Siemens, Twitter, and the Spark project that originated at the UC Berkeley AMPLab.
• Clojure: A dialect of Lisp, a language that dates to the late 1950s and has long been
associated with artificial intelligence research, Clojure is a
primarily (although not 100%) functional language that also runs on the JVM. Clojure
keeps data immutable by default and was designed for running concurrent processes. These features are
important because, in contrast, object-oriented code executing concurrent processes will
sometimes attempt to write to the same variable simultaneously. Keeping data
structures immutable avoids this problem. Clojure has access to Java libraries and offers the
same development efficiencies as Java. Clojure libraries can use the Lisp macro facility to
integrate with Hadoop and SQL. Users of Clojure include Netflix, Zendesk, Citibank,
WalMart Labs, and Spotify.
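
The point about immutable data can be illustrated with a small sketch. It is written in Python rather than Clojure, so it is an analogy only; the Account class, the deposit function, and the values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Account:
    """An immutable record: its fields cannot be reassigned after creation."""
    owner: str
    balance: int

def deposit(account, amount):
    """Return a new Account with the updated balance; the original is untouched."""
    return replace(account, balance=account.balance + amount)

original = Account(owner="alice", balance=100)
updated = deposit(original, 50)

# A direct write to the shared value raises an error instead of silently
# overwriting what another part of the program may be reading.
try:
    original.balance = 0
    mutation_blocked = False
except AttributeError:  # dataclasses.FrozenInstanceError is a subclass of AttributeError
    mutation_blocked = True

print(original.balance, updated.balance, mutation_blocked)  # prints: 100 150 True
```

Because every "change" produces a new value, two concurrent tasks holding the same Account can never overwrite each other's view of it.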

Python: A High-Level Programming Language
with Excellent Data Libraries
Python is a high-level language, meaning that its creators automated
certain housekeeping tasks, such as memory management, in order to make
code easier to write.
Python has robust libraries that support numerical and statistical computing (SciPy and
NumPy), data mining (Orange and Pattern), and visualization
(Matplotlib).
Scikit-learn, a library of machine learning techniques very useful to data
scientists, has attracted developers from Spotify, OkCupid, and
Evernote, but can be challenging to master.

Excel: Powerful Data Analytics on a Smaller
Scale
Excel can accomplish a lot of sophisticated analysis. It’s easy to use
and widely available. While it’s not the best choice for analyzing
truly massive, unstructured datasets (for example, some 30 million
healthcare records distributed via Hadoop across dozens of servers),
it is surprisingly powerful when used for a variety of data analytics
projects at a small scale. These can include clustering, optimization,
and predictive modeling using supervised machine learning or forecasting
techniques.

SAS (Statistical Analysis System): Data Mining
Software Suite
Used for advanced analytics, data management, and social media
analytics, SAS is a robust suite that’s popular for business intelligence
analysis of large and unstructured datasets. In 2015, SAS topped
the Gartner Magic Quadrant list in terms of “ability to execute” in the
category of advanced analytics platforms due to the breadth and
quality of its predictive modeling and data mining techniques. SAS
offers a well-regarded visualization tool and integrates with open-source
tools like R, Hadoop, and Python. It also puts significant effort into keeping
its tools backwards compatible, an important feature when looking at
older historical datasets.

SAS (Statistical Analysis System): Data Mining
Software Suite
Say, for example, a company’s sales records were prepared for use by
SAS in 1998. With backwards compatibility, they can still be read today.
In large organizations, employee turnover over the years puts a
premium on the continuity of tools. So, when a data scientist retires,
you won’t lose the ability to access their work if they preferred older
software that no one new to the position knows how to use.
SAS can be costly, has a complicated licensing structure that some
customers have found to be annoying, and has a steep learning curve.
Although it’s expensive and complicated, it’s a very popular option,
with more than 65,000 customers.

IBM: SPSS Modeler and SPSS Analytics
The Forrester Wave report ranks IBM’s advanced data analytics platform
as the top offering in the advanced analytics category for its breadth of
tools that handle all elements of big data modeling: loading, “cleaning,”
preparing, and then predictive modeling, whether using statistical or
machine learning techniques.
Other makers of highly rated commercial tools for advanced data
analytics include SAP, KNIME, RapidMiner, Oracle, and Alteryx.

IBM: SPSS Modeler and SPSS Analytics
SPSS Modeler and SPSS Statistics were acquired by IBM in 2009, and
have a loyal following among statisticians. These tools integrate with
Hadoop to facilitate file-system computing using big datasets. The
Social Media Analytics product helps data scientists harvest data from
Twitter, Facebook, and other platforms to perform customer sentiment
analysis. Gartner reports that the IBM advanced analytics platform has
lower customer satisfaction ratings than average, largely due to weak
customer support, inadequate documentation, and a challenging
installation process.

SQL vs. NoSQL Databases: Tackling the
“Messiness” of Big Data
Another important distinction in the world of data is SQL databases vs.
NoSQL databases, each of which is well suited to different types of
datasets. Here’s a quick look at what makes them different in the context of
data analysis.
The traditional “relational” database was designed for an era in which data
was far more expensive to collect and to store, and much more carefully
organized. Structured Query Language (SQL) has been the means by which
programmers read and write data in those neatly categorized rows and
columns.
An estimated 5% of the world’s information is structured data; the rest
consists of articles, photos, videos, social media posts, machine-to-machine
communication, product inventory, and technical documents. So data
scientists turned to a different family of data stores, collectively called
“NoSQL,” which do not require a fixed table schema.
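
The contrast between fixed rows and columns and schema-free records can be sketched with Python's standard library. This is a rough illustration under stated assumptions: sqlite3 stands in for a relational database, and plain dictionaries stand in for a document-style NoSQL store (no real NoSQL client is used). The table, field names, and records are invented for the example.

```python
import json
import sqlite3

# Relational style: a fixed schema declared up front, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO patients (id, name, age) VALUES (?, ?, ?)",
    [(1, "Ann", 34), (2, "Bob", 51)],
)
older = conn.execute(
    "SELECT name FROM patients WHERE age > ? ORDER BY name", (40,)
).fetchall()
conn.close()

# Document style: each record describes itself, and fields may differ per record.
documents = [
    {"id": 1, "name": "Ann", "age": 34, "notes": ["allergy: penicillin"]},
    {"id": 2, "name": "Bob", "age": 51, "photos": ["xray_001.png"]},
]
with_photos = [doc["name"] for doc in documents if "photos" in doc]
serialized = json.dumps(documents[1])  # documents are commonly stored as JSON-like text

print(older, with_photos)  # prints: [('Bob',)] ['Bob']
```

Adding the "photos" field to one record required no schema change in the document version, whereas the relational table would need a new column or a second table.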

Databases
MySQL: Open-Source RDBMS
Acquired by Oracle as part of its purchase of Sun Microsystems (announced in 2009 and completed in 2010), MySQL is
a widely used RDBMS (relational database management system) and
one part of the LAMP software stack. This free, open-source database management system is used by web
applications like WordPress, Drupal, Facebook, Twitter, and YouTube.
MongoDB
The most popular NoSQL database system available on the market is the open-source MongoDB, which has
been used by Metlife, The Weather Channel, Bosch, and Expedia. MongoDB has well-regarded customer
service, and the tool is particularly popular with startups.
One of the fastest-growing big data projects used alongside MongoDB is Apache Spark, a distributed computing
framework from the Apache Software Foundation that’s designed to operationalize real-time analytics. Paired up
with MongoDB, Spark allows organizations to put real-time analytics reporting to use.
Other commonly used open-source NoSQL databases include HBase and Cassandra. (MariaDB, sometimes listed
alongside them, is a relational fork of MySQL rather than a NoSQL database.)
Oracle
Oracle has nearly 50% of the traditional relational database market, with products such as Oracle Database and
Oracle TimesTen. The database behemoth has also entered the market for unstructured data storage with
Oracle NoSQL, and the market for open-source SQL databases that compete with its proprietary offerings. While popular
and considered to be top-notch by many, Oracle’s proprietary products are expensive.

Databases
SQL Server DBMS: Enterprise-Level Database Management
Microsoft SQL Server DBMS is a competitive enterprise-level database
management system that includes support for SQL and NoSQL
architectures, in-memory computing, the cloud, and analytics on
transactions. Existing customers are generally impressed with its
performance.
Other strong performers in the market include SAP, IBM, EnterpriseDB,
InterSystems, and MarkLogic.

File System Computing
Hadoop: File System Computing
What is “file system computing”? It’s a way to store and analyze truly massive datasets. For
example, 2 billion data points from sensors on an auto assembly line, stored on a cluster of
servers with each server connected to multiple drives, would be enormous. Because this kind of dataset is too large
to extract from the drives to a place where it can be analyzed, software like Hadoop was created.
Hadoop is an open-source software framework designed to help data scientists manage the unwieldiness of
big data. It eliminates the need to extract data from the storage devices altogether, bringing the analytics to
the data so it can be processed in place. It has increasingly become the industry standard for file system
computing projects involving big data, with prominent users including Facebook, Yahoo, and The New York
Times.
There are other platforms that do file system computing, such as SciDB, but Hadoop has risen to the top
thanks to its built-in MapReduce engine and an ecosystem of projects that extend or build on it, like Hive, Pig, and Spark. Even software
giants like Microsoft and IBM have created their own Hadoop tools, rather than reinventing the wheel.
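
The phrase "bringing the analytics to the data" can be made concrete with a toy sketch. It assumes three hypothetical servers, each holding its own sensor readings; the server names, readings, and function names are invented, and nothing here uses Hadoop's actual API. Each server computes a tiny summary where its data lives, and only those summaries travel to a coordinator.

```python
# Each list stands for the sensor readings stored on one (hypothetical) server.
shards = {
    "server-a": [20.0, 22.0, 21.0],
    "server-b": [30.0, 28.0],
    "server-c": [25.0],
}

def local_summary(readings):
    """Runs where the data is stored; returns only a small (sum, count) pair."""
    return sum(readings), len(readings)

def combine(summaries):
    """Runs on a coordinator; it sees the summaries, never the raw readings."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

summaries = [local_summary(readings) for readings in shards.values()]
mean_reading = combine(summaries)

print(round(mean_reading, 2))  # prints: 24.33
```

However many readings each server holds, the amount of data moved across the network stays at one pair of numbers per server.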

R installation procedure
Follow the procedure at the link below to install R.
https://bit.ly/2MKNB4j
This will help you learn R, along with basic mathematics and statistics, over
the next month. The concepts you have learned so far in Java are enough
to accomplish this.

Next Week…
• Basic Mathematics I