This document provides an overview of big data and Hadoop. It defines big data as large volumes of structured, semi-structured and unstructured data that is growing exponentially and is too large for traditional databases to handle. It discusses the 4 V's of big data - volume, velocity, variety and veracity. The document then describes Hadoop as an open-source framework for distributed storage and processing of big data across clusters of commodity hardware. It outlines the key components of Hadoop including HDFS, MapReduce, YARN and related modules. The document also discusses challenges of big data, use cases for Hadoop and provides a demo of configuring an HDInsight Hadoop cluster on Azure.
1. Big Data
Hadoop
Tomy Rhymond | Sr. Consultant | HMB Inc. | ttr@hmbnet.com | 614.432.9492
“Torture the data, and it will confess to anything.” – Ronald Coase, economist and Nobel laureate
“The goal is to turn data into information, and information into insight.” – Carly Fiorina
“Data are becoming the new raw material of business.” – Craig Mundie, Senior Advisor to the CEO at Microsoft
“In God we trust. All others must bring data.” – W. Edwards Deming, statistician, professor, author, lecturer, and consultant
2. Agenda
• Data
• Big Data
• Hadoop
• Microsoft Azure HDInsight
• Hadoop Use Cases
• Demo
• Configure Hadoop Cluster / Azure Storage
• C# MapReduce
• Load and Analyze with Hive
• Use Pig Script to analyze data
• Excel Power Query
3. Houston, we have a “Data” problem.
• IDC estimates put the size of the “digital universe” at 40 zettabytes (ZB) by 2020, a 50-fold growth from the beginning of 2010.
• By 2020, emerging markets will supplant the developed world as the main producer of the world’s data.
• This flood of data is coming from many sources:
• The New York Stock Exchange generates about 1 terabyte of trade data per day
• Facebook hosts approximately one petabyte of storage
• The Large Hadron Collider produces about 15 petabytes of data per year
• The Internet Archive stores around 2 petabytes of data, growing at a rate of 20 terabytes per month
• Mobile devices and social networks contribute to the exponential growth of data.
4. 85% Unstructured, 15% Structured
• The data as we traditionally know it is structured.
• Structured data refers to information with a high degree of organization, such that inclusion in a relational database is seamless and it is readily searchable.
• Not all data we collect conforms to a specific, pre-defined data model.
• It tends to be the human-generated and people-oriented content that does not fit neatly into database tables.
• 85 percent of business-relevant information originates in unstructured form, primarily text.
• Lack of structure makes compilation a time- and energy-consuming task.
• These data sets are so large and complex that they are difficult to process using on-hand management tools or traditional data processing applications.
• This type of data is being generated by everything around us at all times: every digital process and social media exchange produces it, and systems, sensors and mobile devices transmit it.
5. Data Types
• Relational Data – SQL data
• Semi-Structured Data – JSON
• Un-Structured Data – Twitter feed
• Un-Structured Data – Amazon review
6. So What is Big Data?
• Big Data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
• Capturing and managing lots of information; working with many new types of data.
• Exploiting these masses of information and new data types in applications to extract meaningful value from big data.
• The process of applying serious computing to seriously massive and often highly complex sets of information.
• Big data is arriving from multiple sources at an alarming velocity, volume and variety.
• More data leads to more accurate analyses, and more accurate analyses may lead to more confident decision making.
7. Big Data infographic – Volume (scale of data), Velocity (analysis of data), Variety (different forms of data), Veracity (uncertainty of data). Figures cited on the slide include: an IDC estimate of 40 zettabytes of data by 2020; an estimated 100 terabytes of data per US company; 1 terabyte of trade data generated by the NY Stock Exchange per day; roughly 200 billion photos and a petabyte of storage at Facebook; 2.5 quintillion bytes of data created each day; 500 million tweets per day; 13 hours of video uploaded per minute; 20 billion network connections by 2016; an estimated 1.8 billion smartphones among 6 billion people with cell phones; petabytes of data per year from the Hadron Collider; about 150 exabytes of global healthcare data, growing by 2.4 exabytes per year; $600 billion a year lost to poor data quality; $3.1 trillion a year in costs to the US economy from poor data; 30% of data collected by marketers unusable for real-time decision making; and 1 in 3 leaders who don’t trust the information they use to make decisions.
The 4 V’s of Big Data
Volume: We currently see exponential growth in data storage because data is now much more than text: there are videos, music and large images on our social media channels. It is very common for enterprises to have terabytes and even petabytes of storage.
Velocity: Velocity describes the frequency at which data is generated, captured and shared. Recent developments mean that not only consumers but also businesses generate more data in much shorter cycles.
Variety: Today’s data no longer fits into neat, easy-to-consume structures. New types include content, geospatial data, hardware data points, location-based data, log data, machine data, metrics, mobile data, physical data points, process data, RFID, etc.
Veracity: This refers to the uncertainty of the data available. Veracity isn’t just about data quality, it’s about data understandability, and it has a direct impact on confidence in the data.
8. Big Data vs Traditional Data
              Traditional                    Big Data
Data Size     Gigabytes                      Petabytes
Access        Interactive and batch          Batch
Updates       Read and write many times      Write once, read many times
Structure     Static schema                  Dynamic schema
Integrity     High                           Low
Scaling       Nonlinear                      Linear
9. Data Storage
• Storage capacity of hard drives has increased massively over the years.
• On the other hand, the access speeds of the drives have not kept up.
• A drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so all the data could be read in about 5 minutes.
• Today, one-terabyte drives are the norm, but the transfer rate is around 100 MB/s, so it takes more than two and a half hours to read all the data, and writing is even slower.
• The obvious way to reduce the time is to read from multiple disks at once: with 100 disks each holding one hundredth of the data and working in parallel, we could read all the data in under 2 minutes.
• Move computation to the data rather than bringing the data to the computation.
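As a quick sanity check on the arithmetic above, here is a small sketch (not part of the original deck) that redoes the numbers: roughly 10,000 seconds to stream one terabyte off a single disk at 100 MB/s, versus about 100 seconds when the same data is spread across 100 disks read in parallel.

```java
// Back-of-the-envelope read-time arithmetic for 1 TB of data.
public class DiskReadTime {
    public static void main(String[] args) {
        double totalMegabytes = 1_000_000;   // 1 TB expressed in MB
        double mbPerSecond = 100;            // typical single-disk transfer rate

        double oneDiskSeconds = totalMegabytes / mbPerSecond;   // ~10,000 s
        double hundredDiskSeconds = oneDiskSeconds / 100;       // ~100 s in parallel

        System.out.printf("One disk:  %.1f hours%n", oneDiskSeconds / 3600);      // ~2.8 hours
        System.out.printf("100 disks: %.1f minutes%n", hundredDiskSeconds / 60);  // ~1.7 minutes
    }
}
```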
10. Why big data should matter to you
• The real issue is not that you are acquiring large amounts of data. It's what you do with
the data that counts. The hopeful vision is that organizations will be able to take data
from any source, harness relevant data and analyze it to find answers that enable
• cost reductions
• time reductions
• new product development and optimized offerings
• smarter business decision making.
• By combining big data and high-powered analytics, it is possible to:
• Determine root causes of failures, issues and defects in near-real time, potentially saving billions of
dollars annually.
• Send tailored recommendations to mobile devices while customers are in the right area to take
advantage of offers.
• Quickly identify customers who matter the most.
• Generate retail coupons at the point of sale based on the customer's current and past purchases.
11. OK, I Got Big Data, Now What?
• The huge influx of data raises many challenges.
• We need a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information.
• To analyze and extract meaningful value from these massive amounts of data, we need optimal processing power.
• We need parallel processing, and that requires many pieces of hardware.
• When we use many pieces of hardware, the chance that one will fail is fairly high.
• A common way to avoid data loss is replication: redundant copies of the data are kept.
• Data analysis tasks need to combine data; the data from one disk may need to be combined with data from the other 99 disks.
12. Challenges of Big Data
• Information Growth
• Over 80% of the data in the enterprise is unstructured, and it is growing at a much faster pace than traditional data
• Processing Power
• The approach of using a single, expensive, powerful computer to crunch information doesn’t scale for big data
• Physical Storage
• Capturing and managing all this information can consume enormous resources
• Data Issues
• Lack of data mobility, proprietary formats and interoperability obstacles can make working with big data complicated
• Costs
• Extract, transform and load (ETL) processes for big data can be expensive and time consuming
13. Hadoop
• Apache™ Hadoop® is an open source software project that enables the
distributed processing of large data sets across clusters of commodity
servers.
• It is designed to scale up from a single server to thousands of machines, with
a very high degree of fault tolerance.
• All the modules in Hadoop are designed with a fundamental assumption that
hardware failures (of individual machines, or racks of machines) are common
and thus should be automatically handled in software by the framework.
• Hadoop consists of the Hadoop Common package, which provides filesystem
and OS level abstractions, a MapReduce engine and the Hadoop Distributed
File System (HDFS).
14. History of Hadoop
• Hadoop is not an acronym; it’s a made-up name.
• Named after a stuffed yellow elephant belonging to the son of Doug Cutting, the project’s creator.
• 2002-2004: Nutch project – a web-scale, open source, crawler-based search engine.
• 2003-2004: Google published the GFS (Google File System) and MapReduce papers.
• 2005-2006: GFS- and MapReduce-style implementations were added to Nutch.
• 2006-2008: Yahoo hired Doug Cutting and his team; they spun out the storage and processing parts of Nutch to form Hadoop.
• 2009: Sorted 500 GB in 59 seconds (on 1,400 nodes) and 100 TB in 173 minutes (on 3,400 nodes).
15. Hadoop Modules
• Hadoop Common: The common utilities that support the other Hadoop
modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource
management.
• Hadoop MapReduce: A YARN-based system for parallel processing of
large data sets.
• Other Related Modules:
• Cassandra - scalable multi-master database with no single points of failure.
• HBase - A scalable, distributed database that supports structured data storage for large tables.
• Pig - A high-level data-flow language and execution framework for parallel computation.
• Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
• Zookeeper - A high-performance coordination service for distributed applications.
16. HDFS – Hadoop Distributed File System
• The heart of Hadoop is the HDFS.
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware.
• HDFS is designed on the following assumptions and goals:
• Hardware failure is the norm rather than the exception.
• HDFS is designed more for batch processing than for interactive use by users.
• Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size.
• HDFS applications use a write-once, read-many access model: a file, once created, written and closed, need not be changed.
• A computation requested by an application is much more efficient if it is executed near the data it operates on. In other words, moving computation is cheaper than moving data.
• Easily portable from one platform to another.
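To make the write-once, read-many model concrete, here is a minimal sketch using the standard Hadoop FileSystem API (the example is not from the deck; the NameNode address and file path are hypothetical). A file is created, written and closed exactly once, after which any number of clients can stream it back.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // hypothetical NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/clicks/part-0001.txt"); // hypothetical path

        // Write once: create the file, stream records into it, then close it.
        try (FSDataOutputStream out = fs.create(file, /* overwrite = */ true)) {
            out.write("user42,2014-07-01,click\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any number of clients can now open and stream the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```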
17. Hadoop Architecture
• NameNode:
• The NameNode stores the filesystem metadata, i.e. which file maps to which block locations and which blocks are stored on which DataNode.
• Secondary NameNode:
• The NameNode is a single point of failure; the Secondary NameNode periodically checkpoints the NameNode’s metadata.
• DataNode:
• The DataNode is where the actual data resides.
• All DataNodes send a heartbeat message to the NameNode every 3 seconds to say that they are alive.
• The DataNodes can talk to each other to rebalance data, move and copy data around, and keep replication high.
• Job Tracker / Task Tracker:
• The primary function of the JobTracker is resource management (managing the TaskTrackers), tracking resource availability and task life-cycle management (tracking progress, fault tolerance, etc.).
• The TaskTracker has the simple function of following the orders of the JobTracker and periodically updating the JobTracker with its progress status.
(Diagram: a NameNode and Secondary NameNode manage filesystem metadata; DataNodes across Rack 1 and Rack 2 store replicated blocks B1–B6; clients issue metadata operations to the NameNode and read data directly from the DataNodes, while replication takes place between DataNodes.)
18. HDFS - InputSplit
• InputFormat
• Split the input blocks and files into logical
chunks of type InputSplit, each of which is
assigned to a map task for processing.
• RecordReader
• A RecordReader uses the data within the
boundaries created by the input split to
generate key/value pairs.
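A minimal job driver (a sketch, not part of the original deck) shows where the InputFormat sits in practice: TextInputFormat carves the input files into InputSplits, one per map task, and its record reader turns each split into (byte offset, line) key/value pairs for the mapper. The built-in TokenCounterMapper and IntSumReducer classes are used so the driver stands alone; the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Input side: TextInputFormat produces one InputSplit per map task;
        // its LineRecordReader hands the mapper (offset, line) pairs.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/tweets"));       // hypothetical

        job.setMapperClass(TokenCounterMapper.class);  // emits (word, 1) for each token
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);      // sums the 1s per word

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/out/wordcount"));   // hypothetical

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```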
19. MapReduce
• Hadoop MapReduce is a software framework for easily
writing applications which process vast amounts of data
(multi-terabyte data-sets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• It is this programming paradigm that allows for massive
scalability across hundreds or thousands of servers in a
Hadoop cluster.
• The term MapReduce actually refers to two separate and
distinct tasks that Hadoop programs perform.
• The first is the map job, which takes a set of data and
converts it into another set of data, where individual
elements are broken down into tuples (key/value
pairs).
• The reduce job takes the output from a map as input
and combines those data tuples into a smaller set of
tuples.
(Diagram: Hadoop Map-Reduce flow – big data → map → reduce → result, e.g. emitting the key/value pair (tattoo, 1).)
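For comparison, here are hand-written map and reduce classes for the same word-count job (again a sketch, not taken from the deck). The mapper breaks each input line into (word, 1) tuples; the reducer combines the tuples for each word into a single count. They can be swapped into a driver like the one shown after the previous slide.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1) for every token in the line.
    public static class TokenizeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(word, total);
        }
    }
}
```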
20. Hadoop Distributions
• Microsoft Azure HDInsight
• IBM InfoSphere BigInsights
• Hortonworks
• Amazon Elastic MapReduce
• Cloudera CDH
21. Hadoop Meets The Mainframe
• BMC
• Control-M for Hadoop is an extension of BMC’s larger Control-M product suite that was
born in 1987 as an automated mainframe job scheduler.
• Compuware
• APM is an application performance management suite that spans the arc of enterprise data center computing, from the mainframe to distributed commodity servers.
• Syncsort
• Syncsort offers Hadoop Connectivity to move data between Hadoop and other platforms
including the mainframe, Hadoop Sort Acceleration, and Hadoop ETL for cross-platform
data integration.
• Informatica
• HParser can run its data transformation services as a distributed application on Hadoop’s
MapReduce engine.
22. Azure HDInsight
• HDInsight makes Apache Hadoop available as a service in the cloud.
• Process, analyze, and gain new insights from big data using the power
of Apache Hadoop
• Drive decisions by analyzing unstructured data with Azure HDInsight,
a big data solution powered by Apache Hadoop.
• Build and run Hadoop clusters in minutes.
• Analyze results with Power Pivot and Power View in Excel.
• Choose your language, including Java and .NET. Query and transform
data through Hive.
23. Azure HDInsight
• Scale elastically on demand
• Crunch all data – structured, semi-structured, unstructured
• Develop in your favorite language
• No hardware to acquire or maintain
• Connect on-premises Hadoop clusters with the cloud
• Use Excel to visualize your Hadoop data
• Includes NoSQL transactional capabilities
25. HDInsight
• The combination of Azure Storage and HDInsight provides an
ultimate framework for running MapReduce jobs.
• Creating an HDInsight cluster is quick and easy: log in to Azure,
select the number of nodes, name the cluster, and set
permissions.
• The cluster is available on demand, and once a job is completed,
the cluster can be deleted but the data remains in Azure Storage.
• Use PowerShell to submit MapReduce jobs
• Use C# to create MapReduce programs
• Supports Pig Latin, Avro, Sqoop and more.
26. Use cases
• A 360 degree view of the customer
• Businesses want to know how to utilize social media postings to improve revenue.
• Utilities: Predict power consumption
• Marketing: Sentiment analysis
• Customer service: Call monitoring
• Retail and marketing: Mobile data and location-based targeting
• Internet of Things (IoT)
• Big Data Service Refinery
27. Demo
• Configure HDInsight Cluster
• Create Mapper and Reducer Program using Visual Studio C#
• Upload Data to Blob Storage using Azure Storage Explorer
• Run Hadoop Job
• Export output to Power Query for Excel
• Hive Example with HDInsight
• Pig Script with HDInsight
28. (Recap: the Big Data infographic from slide 7 – Volume, Velocity, Variety, Veracity.)
29. Resources for HDInsight for Windows Azure
Microsoft: HDInsight
• Welcome to Hadoop on Windows Azure - the welcome page for the Developer Preview for the Apache Hadoop-based Services for Windows Azure.
• Apache Hadoop-based Services for Windows Azure How To Guide - Hadoop on Windows Azure documentation.
• Big Data and Windows Azure - Big Data scenarios that explore what you can build with Windows Azure.
Microsoft: Windows and SQL Database
• Windows Azure home page - scenarios, free trial sign up, development tools and documentation that you need get started building applications.
• MSDN SQL – MSDN documentation for SQL Database
• Management Portal for SQL Database - a lightweight and easy-to-use database management tool for managing SQL Database in the cloud.
• Adventure Works for SQL Database - Download page for SQL Database sample database.
Microsoft: Business Intelligence
• Microsoft BI PowerPivot – a powerful data mashup and data exploration tool.
• SQL Server 2012 Analysis Services - build comprehensive, enterprise-scale analytic solutions that deliver actionable insights.
• SQL Server 2012 Reporting - a comprehensive, highly scalable solution that enables real-time decision making across the enterprise.
Apache Hadoop:
• Apache Hadoop - software library providing a framework that allows for the distributed processing of large data sets across clusters of computers.
• HDFS - Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• Map Reduce - a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
Hortonworks:
• Sandbox - Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
30. About Me
Tomy Rhymond
Sr. Consultant, HMB, Inc.
ttr@hmbnet.com
http://tomyrhymond.wordpress.com
@trhymond
614.432.9492 (m)