Google, Facebook, Twitter, LinkedIn, Netflix, and many others have already built their business models on Big Data. Now it is your turn. In this article you will discover which Big Data frameworks are available and which one best fits your Big Data strategy.
If you want to learn more about Digital Transformation, take our FREE online course to become a Digital Transformation Manager in 10 days! We also offer a Big Data certification training course.
Big Data Stack & Landscape
There are different components/products in the Big Data value chain (Ingestion ▷ Processing ▷ Analytics), and each of them fits into one of the layers of the framework (Data Sources / Data Lake / Data Warehouse / User Interface). The sections below show where each of these components fits.
Big Data Technology Frameworks
There are plenty of options for processing within a Big Data system. For instance, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine.
Engines and frameworks can often be swapped out or used in tandem. For instance, Apache Spark, another framework, can hook into Hadoop to replace MapReduce. This interoperability between components is one reason that big data systems have great flexibility.
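As a quick illustration of that interoperability, here is a minimal sketch, assuming a PySpark installation and an existing Hadoop/YARN cluster, of a Spark job that runs on the cluster's YARN resource manager and reads its input straight from HDFS in place of a MapReduce job (the HDFS path is a placeholder):

```python
from pyspark.sql import SparkSession

# Sketch only: in practice this is usually launched via spark-submit with
# HADOOP_CONF_DIR pointing at the cluster's configuration files.
spark = (SparkSession.builder
         .appName("spark-on-hadoop")
         .master("yarn")                     # reuse Hadoop's YARN instead of MapReduce
         .getOrCreate())

logs = spark.read.text("hdfs:///data/logs/*.log")   # placeholder HDFS path
print(logs.count())                                  # simple action to trigger the job

spark.stop()
```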
Batch-only frameworks
For batch-only workloads that are not time-sensitive, Apache Hadoop is a good choice and is likely less expensive to implement than some other solutions.
Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. It is designed to run on clusters built from commodity hardware, and all the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
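To make the MapReduce model concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets you express the map and reduce steps as plain scripts that read from stdin and write to stdout (the script names and any paths are illustrative, not part of Hadoop itself):

```python
#!/usr/bin/env python3
# mapper.py -- runs once per input split; emits "<word>\t1" for every word it sees
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts map output by key, so all counts for a word arrive together
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Submitted through the Hadoop Streaming jar (for example with -mapper mapper.py -reducer reducer.py plus HDFS input and output paths), the mapper runs in parallel on the nodes that hold the input blocks, which is exactly the data-locality behaviour described above.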
Stream-only frameworks
For stream-only workloads, Apache Storm has wide language support and can deliver very low-latency processing, but it can deliver duplicates and cannot guarantee ordering in its default configuration. Apache Samza integrates tightly with YARN and Kafka to provide flexibility, easy multi-team usage, and straightforward replication and state management.
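Neither Storm nor Samza offers a first-class Python API, so as a language-neutral sketch of the at-least-once behaviour mentioned above (the source of those possible duplicates), here is a hypothetical consumer built with the kafka-python client; the topic name, broker address, and process_event function are assumptions for illustration only:

```python
# Minimal at-least-once consumer sketch (illustrative only; Storm's acking and
# Samza's checkpointed offsets handle this kind of bookkeeping for you).
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",                            # assumed topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="demo-group",
    enable_auto_commit=False,            # commit only after successful processing
)

def process_event(raw_bytes):
    # placeholder for real processing; it must be idempotent because a crash
    # before commit() makes Kafka re-deliver already-processed records
    print(raw_bytes.decode("utf-8", errors="replace"))

for record in consumer:
    process_event(record.value)
    consumer.commit()                    # at-least-once: process first, then commit
```

Because the offset is committed only after processing, a crash between the two steps causes the same records to be re-delivered on restart; downstream logic therefore has to tolerate duplicates, which is the trade-off the paragraph above refers to.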
Hybrid frameworks
For mixed workloads, Apache Spark provides high-speed batch processing and micro-batch processing for streaming. It has wide support, integrated libraries and tooling, and flexible integrations. Apache Flink provides true stream processing with batch processing support. It is heavily optimized, can run tasks written for other platforms, and provides low-latency processing, but it is still in the early days of adoption.
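As a small illustration of the hybrid model, the following PySpark sketch uses one SparkSession for both a one-off batch aggregation and a micro-batch Structured Streaming word count (the file name, column names, host, and port are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("hybrid-demo").getOrCreate()

# Batch: read a static file once and aggregate it.
events = spark.read.csv("events.csv", header=True, inferSchema=True)  # placeholder file
events.groupBy("user_id").count().show()

# Micro-batch streaming: the same DataFrame API applied to an unbounded source.
lines = (spark.readStream
         .format("socket")               # toy source; Kafka is more typical in production
         .option("host", "localhost")
         .option("port", 9999)
         .load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```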
The best fit for your situation will depend heavily upon the state of the data to process, how time-bound your requirements are, and what kind of results you are interested in. There are trade-offs between implementing an all-in-one solution and working with tightly focused projects, and there are similar considerations when evaluating new and innovative solutions over their mature and well-tested counterparts.
Big Data Governance Frameworks
With this integration of disparate data systems comes the 5th V – Veracity, i.e. the correctness and accuracy of information. Behind any information management practice lies the core doctrines of Data Quality, Data Governance, and Metadata Management, along with considerations for Privacy and Legal concerns. Big Data needs to be integrated into the entire information landscape, not seen as a stand-alone effort or a stealth project done by a handful of Big Data experts.
I hope you have found value in this article and have learned the foundations of Big Data frameworks. If you liked this article, please share it! Thanks!
References:
https://en.wikipedia.org/wiki/Apache_Hadoop
http://enterprisearchitects.com/the-5v-s-of-big-data/
https://www.slideshare.net/adersberger/big-data-landscape-2016-58917032
https://www.linkedin.com/pulse/iot-big-data-analytics-tech-stack-mahesh-lalwani
https://www.linkedin.com/pulse/overview-apache-flink-4g-big-data-analytics-slim-baltagi
https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared