BIG DATA
ASSIGNMENT
Submitted to: Mr. Vivek Gautam
Submitted by: Anuja Chatterjee
Roll No. 19DM039
PGDM
Birla Institute of Management Technology
December, 2019
Assignment – 1
Questions for “Big Data Analytics” Course
Big Data: Introduction to Big Data, its origination, explosion and
challenges
1. How will you define “Big Data”?
Ans: Big data refers to data assets characterized by such high volume, velocity and variety
that they require specific technologies and analytical methods to be transformed into value.
Such data cannot be captured, stored or processed with traditional tools.
2. What led to the origination of Big Data?
The term Big Data was coined by Roger Mougalas back in 2005. However, as early as 1663, John
Graunt provided the world with the first recorded statistical analysis of data in his book
‘Natural and Political Observations Made upon the Bills of Mortality’. The starting point of
modern data processing is often dated to 1889, when Herman Hollerith invented a computing
system to organize census data. One of the first electronic data-processing machines,
‘Colossus’, was developed by the British in 1943 to decipher German codes during World War II.
The first data centre was built by the United States government in 1965 to store millions of
tax returns and fingerprint sets; this initiative was the starting point of large-scale
electronic storage. In 2005, Yahoo created the now open-source Hadoop with the intention of
indexing the entire World Wide Web, as people began to realise how much data is generated each
day through social media and internet platforms. NoSQL databases also began to gain popularity
around this time. Although it seems like big data has been around for a long time and that we
are getting closer to the pinnacle, big data may still be in its formative stages; big data in
the near future may make today's big data seem like a paltry amount.
3. What is the difference between structured, unstructured and semi-
structured data?
Structured data: Data that is the easiest to search and organize, because it is usually contained
in rows and columns and its elements can be mapped into fixed, pre-defined fields, is known as
structured data. In structured data, entities can be grouped together to form relations. This
makes structured data easy to store, analyze and search, and until recently it was the only data
easily usable by businesses. It is usually managed with SQL (Structured Query Language).
Examples of structured data include financial data such as accounting transactions, address
details, demographic information, star ratings by customers, machine logs, location data from
smartphones and smart devices, etc.
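For illustration, the following minimal Python sketch (using the standard sqlite3 module and a hypothetical transactions table) shows how structured data fits into fixed, pre-defined columns and can be queried with SQL:

```python
# A minimal sketch of structured data: fixed, pre-defined columns queried with SQL.
# The table name, columns and sample rows are hypothetical, chosen only to illustrate the idea.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, city TEXT)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount, city) VALUES (?, ?, ?)",
    [("Asha", 120.50, "Delhi"), ("Ravi", 89.99, "Mumbai")],
)

# Because every row shares the same schema, searching and aggregating is straightforward.
for row in conn.execute("SELECT city, SUM(amount) FROM transactions GROUP BY city"):
    print(row)
```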
Unstructured Data: Unstructured data is data that cannot be contained in a row-column
database and doesn’t have an associated data model. Most of the data in the world today is
unstructured. The lack of structure makes unstructured data more difficult to search, manage
and analyse, which is why companies largely set unstructured data aside until the recent
proliferation of artificial intelligence and machine learning algorithms made it easier to
process. Instead of spreadsheets or relational databases, unstructured data is usually stored
in data lakes, NoSQL databases, applications and data warehouses. Examples: photos, video
and audio files, text files, social media content, satellite imagery, presentations, PDFs, open-
ended survey responses, websites, etc.
Semi-structured Data: The third category is semi-structured data. It has some defining or
consistent characteristics but doesn’t conform to a structure as rigid as that of a
relational database. It does carry some organizational properties, such as semantic tags
or metadata, that make it easier to organize, but there is still fluidity in the data. Email
messages are a good example: while the actual content is unstructured, an email does contain
structured data such as the name and email address of the sender and recipient, the time sent, etc.
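As a small illustration of this mix, the Python sketch below (using the standard email module on an invented message) separates the structured headers of an email from its unstructured body:

```python
# A minimal sketch of semi-structured data: an email has structured headers
# (sender, recipient, timestamp) but an unstructured free-text body.
# The message below is invented purely for illustration.
from email import message_from_string

raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Mon, 02 Dec 2019 10:15:00 +0530\n"
    "Subject: Quarterly numbers\n"
    "\n"
    "Hi Bob, sharing a few thoughts on the quarter in free-form text..."
)

msg = message_from_string(raw)
print(msg["From"], msg["To"], msg["Date"])   # structured fields, easy to index and query
print(msg.get_payload())                     # unstructured body, needs text analytics
```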
4. In the lecture, we have discussed how various businesses are taking
advantage of Big Data. Please provide one use-case each for the retail
and banking industries.
Retail Industry:
Personalizing Customer Experience
For retailers, big data creates opportunities to provide better customer experiences. Costco
uses its purchase data to keep customers healthy: when a California fruit-packing company
warned Costco about possible listeria contamination in fruits such as peaches and plums,
Costco was able to email only the specific customers who had purchased the affected items,
instead of sending a blanket email to its entire list.
Forecasting Demand in Retail
Beyond transaction data, some algorithms analyze social media and web-browsing trends to
predict the next big thing in the retail industry. Perhaps one of the most interesting data
points for forecasting demand is the weather.
Brands like Walgreens and Pantene worked with the Weather Channel to account for weather
patterns in order to customize product recommendations for consumers. Walgreens and
Pantene anticipated increases in humidity–a time when women would be seeking anti-frizz
products–and served up ads and in-store promotions to drive sales.
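As a rough sketch of how such a weather-driven rule might work (the forecast values and the humidity threshold below are invented for illustration; a real system would draw on actual forecasts and the retailer's own sales history):

```python
# A hedged sketch of a weather-driven promotion rule, loosely inspired by the
# Walgreens/Pantene example above. Forecast values and the humidity threshold
# are made up; a real system would pull forecasts from a weather service and
# sales data from the retailer's systems.
forecast = [
    {"city": "Houston", "humidity": 0.82},
    {"city": "Phoenix", "humidity": 0.21},
]

HUMIDITY_THRESHOLD = 0.70  # assumed cut-off for "frizz weather"

for day in forecast:
    if day["humidity"] >= HUMIDITY_THRESHOLD:
        print(f"{day['city']}: push anti-frizz product ads and in-store promotions")
    else:
        print(f"{day['city']}: run the default campaign")
```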
Banking Industry:
Business intelligence (BI) tools are capable of identifying potential risks associated with money
lending processes in banks. With the help of big data analytics, banks can analyze the market
trends and decide on lowering or increasing interest rates for different individuals across
various regions.
Data entry errors from manual forms can be reduced to a minimum, as big data analytics can
also point out anomalies in customer data.
With fraud detection algorithms, customers who have poor credit scores can be identified so
that banks don’t lend money to them. Yet another big application in banking is limiting the
incidence of fraudulent or dubious transactions that could promote anti-social activities or
terrorism.
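A much simplified sketch of this kind of rule-based screening is shown below; the score bands and interest-rate adjustments are assumptions made purely for illustration, not real lending policy:

```python
# An illustrative sketch of a simple analytics-driven lending rule:
# screen applicants on credit score and adjust the offered interest rate.
# The thresholds and rates are invented for this example.
def loan_decision(credit_score: int, base_rate: float = 0.10) -> str:
    if credit_score < 580:
        return "decline: credit score below lending threshold"
    if credit_score < 700:
        return f"approve at {base_rate + 0.03:.2%} (higher-risk band)"
    return f"approve at {base_rate:.2%}"

for score in (550, 650, 760):
    print(score, "->", loan_decision(score))
```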
5. In the lecture, we have observed various challenges associated with Big
Data, but what is the biggest challenge associated with “Big Data” in
your opinion?
Lack of understanding: Frequently, organizations do not know what big data really is, what
its advantages are, what infrastructure is required, and so on. Without a reasonable
understanding, a big data deployment project is in danger of being destined to fail.
Big data, being an enormous change for an organization, ought to be accepted by top
management first and then down the ladder. To ensure that big data is understood and
accepted at all levels, organizations need to organize training sessions and workshops.
Quality of data: Big data must be cleaned, prepared, verified, reviewed for compliance and
constantly maintained. The issue with these tasks is that data arrives so quickly that
organizations find it hard to carry out all of the data preparation activities needed to
guarantee adequate data quality.
Security: Big data deployment projects often leave security checks to a later stage, which is
not advisable.
Big data technologies are progressing, but their security features are still being overlooked,
in the optimistic belief that security will be enforced at the application level.
6. Develop a use case to implement big data in your assignment and
address the following questions.
a. What are the challenges of gathering "Big Data"?
Banking Industry:
Legal and regulatory challenges are the prime ones the banking industry faces in
implementing big data, which brings complexities and limitations due to its sheer size.
Many companies already have control and data-management procedures in place for small data,
and a comfort level that those controls are appropriate. Given the growing impact of
regulation and oversight, banks are steering clear of big data, or at least proceeding
judiciously, simply because of the risks.
Privacy and security:
Big data offers great potential to provide major steps forward for banks, but it also comes with
a large red flag concerning privacy and intrusion. The potential for abuse of this data is
significant, so banks need to get it right and use it only to increase customer satisfaction.
Organisational mindset:
Many banks are still driven mostly by past experience, intuition, SME knowledge and customer
experience. They need more data curiosity and data-driven thinking, and they need to invest
more in acquiring, storing and analyzing data.
Talent management:
Blending data scientists and visualization teams is a new workforce management paradigm.
Big data specialists need a solid business understanding, SAS/R/SQL/Python programming and
statistical knowledge, along with visualization skills.
Data Quality:
Data quality attributes (validity, accuracy, timeliness, reasonableness, completeness, and so
forth) must be clearly defined, measured, recorded, and made available to end users. For big
data quality and data management, banks need to create data quality metadata that includes
data quality attributes, measures, business rules, mappings, cleansing routines, data element
profiles, and controls.
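A minimal sketch of what such data quality metadata could look like in code is given below; the field names, rules and thresholds are assumptions chosen only to illustrate the idea, not a standard banking schema:

```python
# A sketch of data quality metadata: each monitored attribute carries its
# quality dimension, a human-readable business rule, and a measurable threshold.
from dataclasses import dataclass

@dataclass
class DataQualityRule:
    attribute: str        # data element being measured
    dimension: str        # validity, accuracy, timeliness, completeness, ...
    business_rule: str    # rule the data must satisfy
    threshold: float      # minimum acceptable pass rate

rules = [
    DataQualityRule("customer.pan_number", "validity", "matches PAN format", 0.99),
    DataQualityRule("transaction.timestamp", "timeliness", "arrives within 24h", 0.95),
    DataQualityRule("account.balance", "completeness", "not null", 1.00),
]

for r in rules:
    print(f"{r.attribute}: {r.dimension} >= {r.threshold:.0%} ({r.business_rule})")
```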
b. What benefits can you derive from data analysis?
Risk management: The banking industry is built on risk, so every loan and
investment needs to be evaluated. Big data can give banks new insights into their
systems, transactions, customers and environments to help them avoid certain risks.
Marketing automation: With the volumes of data available today, banks can gather
previously unimaginable information about each of their customers. This gives them
a better understanding of customers’ needs and helps them to address these needs
proactively. It also allows different departments within a bank, such as marketing,
sales and IT, to work more cohesively as a single unit.
Transaction channel identification: Banks benefit greatly from understanding whether
their customers withdraw all the cash available on payday, or prefer to keep their money
on their credit/debit cards. Obviously, the latter customers can be approached with offers
to invest in short-term loans with high payout rates, etc.
Fraud management: Knowing the usual spending patterns of an individual helps
raise a red flag if something outrageous happens. If a cautious investor who prefers to
pay with his card attempts to withdraw all the money from his account via an ATM,
this might mean the card was stolen and used by fraudsters. A call from the bank
requesting clearance for such an operation helps easily determine whether it is a legitimate
transaction or fraudulent behavior that the cardholder does not know of.
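A minimal sketch of this spending-pattern check is shown below; the sample amounts and the z-score cut-off are assumptions for illustration only:

```python
# A minimal sketch of spending-pattern fraud screening: flag a transaction
# that falls far outside a customer's usual range.
from statistics import mean, stdev

usual_spend = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0]   # recent card payments
new_transaction = 4_800.0                            # attempted ATM withdrawal

mu, sigma = mean(usual_spend), stdev(usual_spend)
z = (new_transaction - mu) / sigma

if z > 3:   # far beyond the customer's normal behaviour
    print(f"z-score {z:.1f}: hold the transaction and call the cardholder to confirm")
else:
    print("transaction looks consistent with past behaviour")
```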
Assignment – 5
Please mention 5 differences between HBase, MongoDB and Cassandra
Ans:
1. Protocol
   HBase: HTTP/REST (also Thrift)
   MongoDB: custom binary protocol (BSON)
   Cassandra: CQL3 and Thrift
2. Server OS
   HBase: Linux, Unix, Windows
   MongoDB: Linux, OS X, Solaris, Windows
   Cassandra: FreeBSD, Linux, OS X, Windows
3. Replication
   HBase: master-slave replication
   MongoDB: master-slave replication
   Cassandra: masterless ring
4. Key point
   HBase: handles billions of rows and millions of columns
   MongoDB: retains SQL-friendly properties such as queries and indexes
   Cassandra: stores very large data sets with an almost-SQL query language (CQL)
5. Popular use cases
   HBase: online log analytics, Hadoop, write-heavy applications, MapReduce
   MongoDB: operational intelligence, product data management, content management systems, IoT, real-time analytics
   Cassandra: sensor data, messaging systems, e-commerce websites, always-on applications, fraud detection for banks
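To make the protocol differences above concrete, the hedged Python sketch below contrasts MongoDB's document-style access (via pymongo) with Cassandra's CQL (via the cassandra-driver package). Host names, the keyspace and the collection/table names are placeholders, and the snippet assumes the servers are running and the Cassandra table already exists:

```python
# A hedged sketch of the two access models: MongoDB stores schemaless BSON
# documents, while Cassandra is queried with CQL, its SQL-like language.
# Requires the pymongo and cassandra-driver packages and live servers;
# all names below are placeholders.
from pymongo import MongoClient
from cassandra.cluster import Cluster

# MongoDB: insert and fetch a document (no fixed schema required).
mongo = MongoClient("localhost", 27017)
users = mongo["demo_db"]["users"]
users.insert_one({"name": "Anuja", "roll_no": "19DM039", "course": "PGDM"})
print(users.find_one({"roll_no": "19DM039"}))

# Cassandra: the same idea expressed in CQL against a pre-created keyspace/table.
session = Cluster(["localhost"]).connect("demo_keyspace")
session.execute(
    "INSERT INTO users (roll_no, name, course) VALUES (%s, %s, %s)",
    ("19DM039", "Anuja", "PGDM"),
)
print(session.execute("SELECT * FROM users WHERE roll_no = %s", ("19DM039",)).one())
```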