• Numeric data
• Streams of images
• Data generated by social networks
Problems of Large Volume Of Data
1. Expenses in storing and handling huge amounts of data
2. Data is heterogeneous
3. Accessing and processing speed: with a 100 MB/s I/O channel, processing 2 TB of data takes about 6 hours.
   (1 terabyte (TB) equals 1,000 gigabytes (GB) or 1,000,000 megabytes (MB).)
   2 TB = 2,000,000 MB, and 2,000,000 MB / 100 MB per second = 20,000 seconds, or about 5.6 hours.
   As a check, 6 hours = 21,600 seconds, and 100 MB × 21,600 = 2,160,000 MB, slightly more than 2 TB.

Big Data
1. The basic idea behind Big Data is that everything we do leaves a digital trace, or data, which can be analyzed to obtain actionable insights.
2. Big Data covers the extraction, analysis, and management of a large volume of data. It revolves around the data itself: Big Data is a collection of a large amount of data.
3. Almost every industry in the world today makes use of Big Data. Industries like finance, healthcare, banking, and manufacturing have to deal with very large amounts of data.
4. Such amounts of data, which could not be processed earlier because of limitations in computational techniques, can now be processed with advanced tools and methodologies.
5. Some of the tools for Big Data are Apache Hadoop, Spark, Flink, etc.

Characteristics Of Big Data
1. The characteristics of big data are often referred to as the three Vs, with veracity often added as a fourth:
• Volume: How much data is there?
• Variety: How diverse are the different types of data?
• Velocity: At what speed is new data generated?
• Veracity: How accurate is the data?

Data Scientist
Data scientists tackle questions about the future.

Tools used

Data Science
1. Data Science is the study of data. It is about finding patterns in data through in-depth analysis.
2. Data Science is a field that involves working with a huge amount of data and using it to build predictive models.
3. It is about digging into and capturing the data (building the model), analyzing it (validating the model), and utilizing it (deploying the best model).
4. It is an intersection of data and computing, blending the fields of Computer Science, Business, and Statistics.
5. This technology field uses various modeling techniques such as ML algorithms, statistical methods, and mathematical analysis.
6. With Data Science, employees can assist in the decision-making process, which helps the business grow and improves the quality of the product.
7. This is a field of applied mathematics and statistics. It takes a scientific approach to extracting meaningful information and insights from data and predicting future patterns and behaviors.

How Data Science Finds Relationships Between Data

Big Data vs. Data Science
1. Big data is a collection of data sets so large or complex that it becomes difficult to process them using traditional data management techniques such as an RDBMS (relational database management system).
2. Big Data deals with handling and managing huge amounts of data. Prior to Big Data, industries did not possess the tools and resources required to manage such a large volume of data.
3. Data science involves scientifically analyzing massive amounts of data with statistical techniques to extract the knowledge the data contains. It is more quantitative in nature and uses various statistical approaches to identify the patterns within the data.
4. The process of Data Science involves data extraction, data transformation, data analysis, and prediction to gain insights about the data.
5. The relationship between big data and data science is like the relationship between crude oil and an oil refinery.
6. While Big Data is about storing data, Data Science is about analyzing it. Keep in mind, however, that Data Science is an ocean of data operations, one that also includes Big Data. A Data Scientist analyzes data that is quite large and requires a big data platform. Therefore, an ideal data scientist must also know big data tools.
7. The roles of Data Scientists and Big Data specialists also differ.
   A Data Scientist is required to analyze the data, draw insights from it, visualize it, and communicate the results through robust storytelling. A Big Data Specialist, on the other hand, develops, maintains, and administers Big Data clusters that hold voluminous amounts of data.

Benefits of Data Science and Big Data
1. This field is applicable in many industries, including finance, professional services, and information technology. For example, businesses rely on this field to unveil deeper insights that can help them make smarter business decisions, better understand customers, increase security, analyze company finances, and predict future market trends.
2. Commercial companies in almost every industry use data science and big data to gain insights into their customers, processes, staff, and products.
3. Many companies use data science to offer customers a better user experience, as well as to cross-sell, up-sell, and personalize their offerings.
4. A good example of this is Google AdSense, which collects data from internet users so that relevant commercial messages can be matched to the person browsing the internet.

Applications

Categories of Data
The main categories of data are these:
1. Structured
2. Unstructured
3. Natural language
4. Machine-generated
5. Graph-based
6. Audio, video, and images

Structured Data
1. Structured data is data that depends on a data model and resides in a fixed field within a record.
2. It is often easy to store structured data in tables within databases or Excel files (figure 1.1).
3. SQL, or Structured Query Language, is the preferred way to manage and query data that resides in databases.

Example of Structured Data

Unstructured data
1. Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or varying. One example of unstructured data is your regular email (figure 1.2).

Natural language
1. Natural language is a special type of unstructured data; it’s challenging to process because it requires knowledge of specific data science techniques and linguistics. It can take different forms, namely either a spoken language or a sign language.
2. NLP (natural language processing) is used in the following:
– Spam filters, which uncover certain words or phrases that signal a spam message
– Gmail's email classification (primary, social, updates, spam)
– Amazon’s Alexa, which recognizes patterns in speech
– Google, which predicts which popular searches may apply to your query as you start typing

Machine-generated Data
1. Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
2. Examples of machine data are web server logs, call detail records, and network event logs.

Example

Graph-based or Network Data
1. Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
2. Graph structures use nodes, edges, and properties to represent and store graph data.
3. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
4. Graph databases are used to store graph-based data and are queried with specialized query languages such as SPARQL.
– Netflix uses a graph database for its Digital Asset Management because it is a good way to track which movies (assets) each viewer has already watched and which movies they are allowed to watch (access management).

Examples
• Examples of graph-based data can be found on many social media websites (figure 1.4). For instance, on LinkedIn you can see who you know at which company. Your follower list on Twitter is another example of graph-based data.
• The power and sophistication come from multiple, overlapping graphs of the same nodes. For example, imagine the connecting edges here to show “friends” on Facebook.
  Imagine another graph with the same people that connects business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on Netflix. Overlapping the three different-looking graphs makes more interesting questions possible.

Social Network

Audio, image, and video
• Audio, image, and video are data types that pose specific challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums will capture ball and athlete movements to calculate in real time, for example, the path taken by a defender relative to two baselines.

Streaming data
• Streaming data can take almost any of the previous forms, but the data flows into the system when an event happens instead of being loaded into a data store in a batch.
• Although this isn’t really a different type of data, we treat it here as such because you need to adapt your process to deal with this type of information.
• Examples are the “What’s trending” on Twitter (What’s Trending delivers the latest video news for all things), live sporting or music events, and the stock market.
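
The difference between batch and streaming processing described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how Twitter or any real system computes trends: the event list `EVENTS` and the functions `batch_top` and `streaming_top` are hypothetical names invented for this example. The batch version counts everything once the data is already stored; the streaming version updates its counts as each event arrives and can report a result at any moment.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event feed: each event is one hashtag seen in a post.
EVENTS = ["#cricket", "#music", "#cricket", "#stocks", "#cricket", "#music"]


def batch_top(events: List[str], k: int) -> List[Tuple[str, int]]:
    """Batch style: all events are already stored, so count them in one go."""
    return Counter(events).most_common(k)


def streaming_top(events: Iterable[str], k: int) -> Iterator[List[Tuple[str, int]]]:
    """Streaming style: update the counts and emit the current top-k after every event."""
    counts: Counter = Counter()
    for tag in events:
        counts[tag] += 1
        yield counts.most_common(k)


if __name__ == "__main__":
    # Print the running "trending" list as each event flows in.
    for step, top in enumerate(streaming_top(iter(EVENTS), k=2), start=1):
        print(step, top)
```

Once the last event has been seen, the streaming result matches the batch result, but the streaming version never needs the full data set to be loaded before it can answer.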