Before we dig deep into Hadoop, let's first learn what Big Data is.
What is Big Data?
Big Data is a term used to describe a collection of data that is so huge in volume and so complex that none of the traditional data management tools can store or process it efficiently.
Examples Of Big Data
The New York Stock Exchange generates about one terabyte of new trade data per day.
More than 500 terabytes of new data are ingested into Facebook's databases every day, in the form of photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
The data being collected can be structured, semi-structured, or unstructured.
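As a rough illustration of these three categories, here is a minimal Python sketch (all sample records below are invented for illustration):

```python
import csv
import io
import json

# Structured: rows that follow a fixed schema, e.g. a CSV of stock trades
structured = io.StringIO("symbol,price,volume\nAAPL,190.5,1200\nMSFT,410.2,800\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing records whose fields can vary, e.g. JSON
semi_structured = json.loads('{"user": "alice", "comment": "nice photo", "tags": ["travel"]}')

# Unstructured: free text, images, video -- no predefined schema at all
unstructured = "Flight logs, support emails, and video frames have no fixed schema."

print(rows[0]["symbol"])        # structured fields are addressed by column name
print(semi_structured["tags"])  # semi-structured fields may or may not be present
```

Structured data fits naturally into relational tables, while semi-structured and unstructured data are exactly what traditional tools struggle with and what Hadoop was built to handle.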
Now let's focus on how Hadoop solves the problems related to Big Data.
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
Why is Hadoop important?
- Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that’s a key consideration.
- Computing power. Hadoop’s distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.
- Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
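The distributed computing model described above is easiest to see in Hadoop's classic MapReduce word count. Real Hadoop jobs are typically written in Java and run across a cluster of nodes; the following is a minimal single-process Python sketch of the map, shuffle, and reduce phases only:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big tools", "hadoop stores big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["big"])  # 3
```

In a real cluster, the map calls run in parallel on the nodes that hold the data blocks, and the shuffle moves intermediate pairs across the network, which is where the "more nodes, more processing power" property comes from.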
Big Data Applications Across Industries
Healthcare : Big data in healthcare is transforming the way we identify and treat illnesses, improve quality of life, and avoid preventable deaths. In one neonatal unit, Big Data techniques were used to monitor babies' heartbeats and breathing patterns. Using this data, the unit was able to develop algorithms that predict infections 24 hours before any physical symptoms occur.
Retail : The way we buy and sell is evolving fast. Both online and offline, those retailers that are embracing a data-first strategy towards understanding their customers, matching them to products and parting them from their cash are reaping huge rewards.
Manufacturing : Advances in robotics and increasing levels of automation are dramatically changing the face of manufacturing.
Education : Increasingly large amounts of data are being generated about how we learn, and education establishments are now beginning to turn this data into insights that can identify better teaching strategies, highlight areas where students may not be learning efficiently, and transform the delivery of education.
Transportation, supply chain management and logistics : In warehouses, digital cameras are routinely used to monitor stock levels and the data provides alerts when restocking is needed.
Sports : Cameras installed around stadiums now track every player using pattern recognition, generating over 25 data points per player every second, while sensors on shoulder pads gather data on their performance.
How MNCs Are Using Big Data Analytics
Facebook : More than 500 terabytes of data are uploaded to Facebook's servers per day. To process such large chunks of data, Facebook uses Hive for parallel map-reduce operations and Hadoop for its data storage. Employees also use Cassandra, a fault-tolerant, distributed storage system aimed at managing large amounts of structured data across a variety of commodity servers. Facebook also uses Scuba to carry out real-time ad-hoc analysis on massive data sets. Hive is used to store large data sets in the Oracle data warehouse. Prism is used to create and manage multiple namespaces instead of the single one managed by Hadoop. Facebook also uses many other big data technologies, such as Corona and Peregrine, among others.
Oracle : Oracle customers use Oracle Advanced Analytics, which requires an Oracle database to be loaded with data. Oracle Advanced Analytics provides functionalities such as text mining, predictive analytics, statistical analysis, and interactive graphics, among others. HDFS data can be loaded into an Oracle data warehouse using Oracle Loader for Hadoop; this feature is used to link data and search query results from Hadoop to the Oracle data warehouse. Oracle Exadata Database Machine provides scalable, high-end performance for all database applications. Oracle is leveraging big data mainly to expand its business in database management systems.
Google : Google derives search results from its knowledge graph database, indexed pages, and Google bots crawling a plethora of web pages. User requests are processed in Google's application servers. An application server searches for results in GFS (the Google File System) and logs the search queries in a logs cluster for quality testing. Google uses Dremel, a query execution engine, to run near-real-time, ad-hoc queries over search data, an advantage that MapReduce does not offer. Google also launched BigQuery, which runs aggregation queries over billion-row tables in a matter of seconds. Google is highly advanced in its implementation of big data technologies.
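As a rough illustration (not BigQuery's actual API), the aggregation queries mentioned above boil down to grouping rows by a key and summing a column, which the engine then parallelizes across thousands of machines. A minimal Python sketch with invented sample rows:

```python
from collections import Counter

# Invented sample rows standing in for a huge table of search events
rows = [
    {"country": "US", "clicks": 3},
    {"country": "IN", "clicks": 5},
    {"country": "US", "clicks": 2},
]

# The shape of "SELECT country, SUM(clicks) ... GROUP BY country"
totals = Counter()
for row in rows:
    totals[row["country"]] += row["clicks"]

print(dict(totals))  # {'US': 5, 'IN': 5}
```

The point of systems like Dremel and BigQuery is that this same group-and-sum pattern stays interactive even when the table holds billions of rows instead of three.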
Microsoft : Microsoft builds its Hadoop-based big data solutions on the Hortonworks Data Platform. Microsoft uses big data in components such as SQL Server and HDInsight to improve applications like Excel and SQL Server Reporting Services (SSRS).