- Pig Design Patterns
- Pradeep Pasupuleti
The scope of design patterns in Pig
This book deals with patterns that were encountered while solving real-world, recurrent Big Data problems in an enterprise setting. The need for these patterns stems from the evolution of Pig to address the emerging problems of high data volume and variety, and from the need for a pattern catalog that documents their solutions.
The emerging problems of handling large volumes of data typically involve determining whether the data can be used to generate analytical insights at all and, if so, how to generate those insights efficiently. Imagine yourself in the shoes of a data scientist who has been handed a massive volume of data that has no proper schema, is messy, and has not been documented for ages. You have been asked to integrate it with other enterprise data sources and produce spectacular analytical insights. How do you start? Would you start integrating the data, fire up your favorite analytics sandbox, and begin generating results? Or would it help to know beforehand that design patterns exist that can be applied systematically and sequentially in this scenario to reduce errors and increase the efficiency of Big Data analytics? The design patterns discussed in this book will definitely appeal to you in this case.
Design patterns in Pig are geared to enhance your ability to take a Big Data problem and quickly apply a pattern to solve it. Successful development of Big Data solutions using Pig requires considering issues early in the development lifecycle, and these patterns help uncover those issues. Reusing Pig design patterns helps identify and address such subtleties and prevents them from growing into major problems. A by-product of applying the patterns is the readability and maintainability of the resultant code. These patterns give developers a valuable communication tool: a common vocabulary for discussing problems in terms of what a pattern could solve, rather than explaining the internals of a problem verbosely. Design patterns for Pig are not a cookbook for success; they are rules of thumb. Reading the specific cases in this book about Pig design patterns may help you recognize problems early, saving you from the exponential cost of rework later on.
The popularity of a design pattern depends very much on the domain. For example, the State, Proxy, and Facade patterns of the Gang of Four book are very common in applications that communicate heavily with other systems. In the same way, enterprises that consume Big Data to derive analytical insights use patterns related to data pipelines, since this is a very common use case. These patterns specifically elaborate on the usage of Pig in data ingest, profiling, cleansing, transformation, reduction, analytics, and egress.
A few patterns discussed in Chapter 5, Data Transformation Patterns and Chapter 6, Understanding Data Reduction Patterns, adapt the existing patterns to new situations, and in the process modify the existing pattern itself. These patterns deal with the usage of Pig in incremental data integration and creation of quick prototypes.
These design patterns also go deeper and enable you to decide the applicability of specific language constructs of Pig for a given problem. The following questions illustrate this point better:
- What is the recommended usage of projections to solve specific patterns?
- In which pattern is the usage of scalar projections ideal to access aggregates?
- For which patterns is it not recommended to use `COUNT`, `SUM`, and `COUNT_STAR`?
- How to effectively use sorting in patterns where key distributions are skewed?
- Which patterns are related to the correct usage of spillable data types?
- When not to use multiple `FLATTEN` operators, which can result in a `CROSS` on bags?
- What patterns depict the ideal usage of the nested `FOREACH` method?
- Which patterns to choose for a `JOIN` operation when one dataset can fit into memory?
- Which patterns to choose for a `JOIN` operation when one of the joined relations has a key that dominates?
- Which patterns to choose for a `JOIN` operation when two datasets are already ordered?
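To make the last few questions concrete, the following sketch shows the specialized `JOIN` variants Pig provides for each of those three situations, along with the `FLATTEN` pitfall mentioned above. The relation and field names here are illustrative placeholders, not from the book:

```pig
-- Illustrative relations; schemas are assumed for the example.
big    = LOAD 'big_data'    AS (id:int, value:chararray);
small  = LOAD 'small_data'  AS (id:int, label:chararray);

-- When one dataset fits in memory: fragment-replicate join
-- (the smaller relation, listed last, is replicated to every mapper).
joined_repl = JOIN big BY id, small BY id USING 'replicated';

-- When one relation has a dominant (skewed) key: skewed join
-- splits the heavy key across reducers instead of overloading one.
joined_skew = JOIN big BY id, small BY id USING 'skewed';

-- When both inputs are already sorted on the join key: merge join
-- avoids the shuffle entirely.
joined_merge = JOIN big BY id, small BY id USING 'merge';

-- The FLATTEN caution: flattening two bags in one FOREACH
-- yields their cross product, which can explode in size.
grp     = GROUP big BY id;
crossed = FOREACH grp GENERATE group, FLATTEN(big.value), FLATTEN(big.id);
```

The `USING` clause is how Pig selects a join strategy; the default hash join is correct in all cases, so these variants are purely performance choices matched to the shape of the data.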