- Pig Design Patterns
- Pradeep Pasupuleti
- 2021-07-16 12:07:56
What this book covers
Chapter 1, Setting the Context for Design Patterns in Pig, lays a basic foundation for design patterns, Hadoop, MapReduce, and its ecosystem components. It gradually introduces Pig, its dataflow paradigm, and the language constructs and concepts required to make Pig work, with a few basic examples. It sets the context for understanding the workloads Pig is most suitable for and how Pig scores better on them. This chapter is more of a quick practical reference and points to additional references if you want to know more about Pig.
Chapter 2, Data Ingest and Egress Patterns, explains the data ingest and egress design patterns that deal with a variety of data sources. The chapter includes specific examples that illustrate techniques for integrating with external systems that emit multistructured and structured data, using Hadoop as a sink for ingestion. This chapter also explores patterns that output data from Hadoop to external systems. To explain these ingest and egress patterns, we have considered multiple data sources and formats, which include, but are not limited to, logfiles, JSON, XML, MongoDB, Cassandra, HBase, and other common structured data sources. After reading this chapter, you will be better equipped to program patterns related to ingest and egress in your enterprise context, and will be able to apply this knowledge to use the right Pig programming constructs or write your own UDFs to accomplish these patterns.
Chapter 3, Data Profiling Patterns, focuses on the data profiling patterns applied to a multitude of data formats and realizing these patterns in Pig. These patterns include different approaches to using Pig and applying basic and innovative statistical techniques to profile data and find data quality issues. You will learn about ways to program similar patterns in your enterprise context using Pig and write your own UDFs to extend these patterns.
Chapter 4, Data Validation and Cleansing Patterns, is about the data validation and cleansing patterns that are applied to various data formats. The data validation patterns deal with constraints, regex, and other statistical techniques. The data cleansing patterns deal with simple filters, Bloom filters, and other statistical techniques to make the data ready for transformations to be applied.
Chapter 5, Data Transformation Patterns, deals with data transformation patterns applied to a wide variety of data types ingested into Hadoop. After reading this chapter, you will be able to choose the right pattern for basic transformations and also learn about widely used concepts such as creating joins, summarization, aggregates, cubes, rolling up data, generalization, and attribute construction using Pig's programming constructs and also UDFs where necessary.
Chapter 6, Understanding Data Reduction Patterns, explains the data reduction patterns applied to the already ingested, scrubbed, and transformed data. After reading this chapter, you will be able to understand and use patterns for dimensionality reduction, sampling techniques, binning, clustering, and irrelevant attribute reduction, thus making the data ready for analytics. This chapter explores various techniques using the Pig language and extends Pig's capabilities to support more sophisticated data reduction.
Chapter 7, Advanced Patterns and Future Work, deals with the advanced data analytics patterns. These patterns cover the extensibility of the Pig language and explain, with use cases, the methods of integrating with executable code, MapReduce code written in Java, UDFs from PiggyBank, and other sources. The advanced analytics patterns cover natural language processing, clustering, classification, and text indexing.
Motivation for this book
The inspiration for writing this book has its roots in the job I do for a living, that is, heading the enterprise practice for Big Data where I am involved in the innovation and delivery of solutions built on the Big Data technology stack.
As part of this role, I am involved in the piloting of many use cases, solution architecture, and development of multiple Big Data solutions. In my experience, Pig has been a revelation of sorts, and it has a tremendous appeal for users who want to quickly pilot a use case and demonstrate value to the business. I have used Pig to prove rapid gains and to solve problems without a steep learning curve. At the same time, I have found that documented knowledge of using Pig in enterprises was nonexistent in some cases and widely scattered in cases where it was available. I personally felt the need for a reference book organized around use case patterns. Through this book, I wanted to share my experiences and lessons, and communicate to you the usability and advantages of Pig for solving your common problems from a pattern viewpoint.
One of the other reasons I chose to write about Pig's design patterns is that I am fascinated with the Pig language: its simplicity, versatility, and extensibility. My constant search for repeatable patterns for implementing Pig recipes in an enterprise context has inspired me to document them for wider usage. I wanted to spread the best practices that I learned while using Pig by contributing to a repository of Pig patterns. I'm intrigued by the unseen possibilities of using Pig in various use cases, and through this book, I plan to stretch the limit of its applicability even further and make Pig more pleasurable to work with.
This book presents the practical, implementation-oriented side of learning Pig. It provides specific reusable solutions to commonly occurring challenges in Big Data enterprises. Its goal is to guide you to quickly map the usage of Pig to your problem context and to design end-to-end Big Data systems from a design pattern outlook.
In this book, a design pattern is a group of enterprise use cases logically tied together so that they can be broken down into discrete solutions that are easy to follow and addressable through Pig. These design patterns address common enterprise problems involved in the creation of complex data pipelines, ingress, egress, transformation, iterative processing, merging, and analysis of large quantities of data.
This book enhances your capability to make better decisions on the applicability of a particular design pattern and use Pig to implement the solution.
Pig Latin has been the language of choice for implementing complex data pipelines, iterative processing of data, and conducting research. All of these use cases involve sequential steps in which data is ingested, cleansed, transformed, and made available to downstream systems. The successful creation of an intricate pipeline that integrates skewed data from multiple data platforms with varying structure forms the cornerstone of any enterprise that leverages Big Data and creates value out of it through analytics.
This book enables you to use these design patterns to simplify the creation of complex data pipelines using Pig: ingesting data from multiple data sources, then cleansing, profiling, validating, transforming, and finally presenting large volumes of data.
This book provides in-depth explanations and code examples using Pig and the integration of UDFs written in Java. Each chapter contains a set of design patterns that pose and then solve technical challenges that are relevant to the enterprise's use cases. The chapters are relatively independent of each other and can be completed in any order, since each addresses design patterns specific to a set of common steps in the enterprise. As an illustration, a reader who wants to solve a data transformation problem can go directly to Chapter 5, Data Transformation Patterns, and quickly start using the code and explanations in that chapter. The book recommends that you use these patterns for solving the same or similar problems you encounter, and that you create your own patterns if a design pattern is not suitable in a particular case.
This book's intent is not to be a complete guide to Pig programming but to be more of a reference book that brings a design pattern perspective to applying Pig. It also intends to empower you to make creative use of the design patterns and build interesting mashups with them.