Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets. Hadoop and Big Data are all the rage now. But Hadoop does solve a real problem and it is a safe bet that it is here to stay.


Avro

Avro is a data serialization project and a set of data formats. It is similar to Thrift or Protocol Buffers. It’s expressive: you can describe data in terms of records, arrays, unions, and enums. It’s efficient, with a compact binary representation, so one of the benefits of logging in Avro is that you get much smaller data files. All the traditional properties of Hadoop data formats, such as being compressible and splittable, hold for Avro.
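To make those types concrete, here is a minimal sketch of what an Avro schema can look like. Avro schemas are written in JSON; the record name `LogEvent` and all of its fields are hypothetical, chosen only for illustration. The snippet uses plain Python and the standard `json` module to build and print the schema, so it does not need an Avro library installed.

```python
import json

# Hypothetical schema for a log event; every name here is illustrative only.
log_event_schema = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {"name": "timestamp", "type": "long"},
        # Enum: a fixed set of symbols.
        {"name": "level",
         "type": {"type": "enum", "name": "Level",
                  "symbols": ["DEBUG", "INFO", "WARN", "ERROR"]}},
        # Array: zero or more strings.
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        # Union: either null or a string, with null as the default.
        {"name": "user", "type": ["null", "string"], "default": None},
    ],
}

# Serialize the schema to the JSON text an Avro tool would read.
schema_json = json.dumps(log_event_schema, indent=2)
print(schema_json)
```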

One of the reasons Doug Cutting (creator of the Hadoop project) created the Avro project was that a lot of the formats in Hadoop were Java only. It is important for Avro to be interoperable across many languages, such as Java, C, C++, C#, Python, and Ruby, and to be usable by many tools.

One of the goals for Avro is a set of formats and serialization that’s usable throughout the data platform you’re using, not just in a subset of the components. So MapReduce, Pig, Hive, Crunch, Flume, Sqoop, and so on all support Avro.

Avro is dynamic, and one of its neat features is that you can read and write data without generating any code. It will use reflection and look at the schema that you’ve given it to create classes on the fly. These are called Avro generic formats. You can also specify formats for which Avro will generate optimal code.

Avro was designed with the expectation that you would change your schema over time. That’s an important quality in a big-data system because you generate lots of data, and you don’t want to constantly reprocess it. You’re going to generate data at one time and have tools process that data maybe two, three, or four years down the line. Avro can negotiate differences between schemata so that new tools can read old data and vice versa.
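The idea of negotiating between a writer’s schema and a reader’s schema can be sketched in a few lines. This is a simplified plain-Python model, not the Avro library: it assumes only two rules (a field the reader expects but the writer lacked is filled from the reader’s default, and a field the reader does not know is ignored). The function `resolve`, the `REQUIRED` sentinel, and the sample fields are all hypothetical names introduced for illustration.

```python
# Sentinel marking a reader field that has no default value (hypothetical helper).
REQUIRED = object()


def resolve(written_record, reader_fields):
    """Project a record written with one schema onto a reader's schema.

    reader_fields maps each field name to its default, or REQUIRED if none.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in written_record:
            resolved[name] = written_record[name]      # field present: keep it
        elif default is not REQUIRED:
            resolved[name] = default                   # field missing: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                                    # unknown fields are dropped


# A new tool reads old data: the added "country" field comes from its default.
old_record = {"user": "alice", "clicks": 3}
new_reader = {"user": REQUIRED, "clicks": 0, "country": "unknown"}
print(resolve(old_record, new_reader))
# {'user': 'alice', 'clicks': 3, 'country': 'unknown'}

# An old tool reads new data: the extra "country" field is ignored.
new_record = {"user": "bob", "clicks": 7, "country": "CA"}
old_reader = {"user": REQUIRED, "clicks": 0}
print(resolve(new_record, old_reader))
# {'user': 'bob', 'clicks': 7}
```

The real Avro rules cover more cases (type promotion, aliases, unions), so treat this only as a picture of why defaults make old and new data readable by both generations of tools.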

Avro forms an important basis for the following projects.


Crunch

You’re probably familiar with Pig and Hive and how to process data with them and integrate valuable tools. However, not all the data formats that you use will fit Pig and Hive.

Pig and Hive are great for a lot of logged data or relational data, but other data types don’t fit as well. You can still process poorly fitting data with Pig and Hive, which don’t force you into a relational model or a log structure, but you have to do a lot of work around it. You might find yourself writing unwieldy user-defined functions or doing things that are not natural in the language. People sometimes just give up and start writing raw Java MapReduce programs because that’s easier.

Crunch was created to fill this gap. It’s a higher-level API than MapReduce. It’s in Java. It’s lower level than, say, Pig, Hive, Cascading, or other frameworks you might be used to. It’s based on a paper that Google published called FlumeJava, and it’s a very similar API. Crunch has you combine a small number of primitives with a small number of types and effectively lets you create really lightweight UDFs, which are just Java methods and classes, to build complex data pipelines.

Crunch has a number of advantages.

  • It’s just Java. You have access to a full programming language.
  • You don’t have to learn Pig.
  • The type system is well-integrated. You can use Java POJOs, but there’s also native support for Hadoop Writables and Avro. There’s no impedance mismatch between the Java code you’re writing and the data that you’re analyzing.
  • It’s built as a modular library for reuse. You can capture your pipelines in Crunch code in Java and then combine them with an arbitrary machine-learning program later, so that someone else can reuse that algorithm.

The fundamental structure is a parallel collection: a distributed, unordered collection of elements. This collection has a parallel do operator, which you can imagine turns into a MapReduce job. So if you have a bunch of data that you want to process in parallel, you can use a parallel collection.

And there’s something called the parallel table, which is a subinterface of the collection, and it’s a distributed sorted map. It also has a group by operator you can use to aggregate all the values for a given key. We’ll go through an example that shows how that works.
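As a rough mental model of these two primitives, here is a small in-memory sketch in plain Python. It is not the Crunch API: `parallel_do` and `group_by_key` are hypothetical helper functions that only mimic the behaviour described above on ordinary lists, with no distribution involved, and the sample log lines are made up.

```python
from collections import defaultdict


def parallel_do(elements, fn):
    """Apply fn to every element; fn may return zero or more outputs."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs):
    """Collect all values for each key, returning a map sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


# Hypothetical input: one "page user" entry per line.
log_lines = ["home alice", "about bob", "home carol"]

# Emit a (page, 1) pair for every line, then group and aggregate per page.
pairs = parallel_do(log_lines, lambda line: [(line.split()[0], 1)])
grouped = group_by_key(pairs)
totals = {page: sum(ones) for page, ones in grouped.items()}

print(grouped)  # {'about': [1], 'home': [1, 1]}
print(totals)   # {'about': 1, 'home': 2}
```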

Finally, there’s a pipeline class, and pipelines are really for coordinating the execution of the MapReduce jobs that will actually do the back-end processing for this Crunch program.

Let’s take an example for which you’ve probably seen all the Java code before, word count, and see what it looks like in Crunch.

It’s a lot smaller and simpler. The first line creates a pipeline. We create a parallel collection of all the lines from a given file by using the pipeline class. Then we get a collection of words by running the parallel do operator on these lines.

We’ve got an anonymous function defined here that basically processes the input: it splits each line into words and emits each word for each map task.

Finally, we want to aggregate the counts for each word and write them out. There’s a line at the bottom, pipeline run. Crunch’s planner does lazy evaluation: we’re not going to create and run the MapReduce jobs until we’ve gotten a full pipeline together.
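The shape of that word count, and in particular the lazy evaluation, can be sketched with a toy model in plain Python. This is not Crunch code: `ToyPipeline` and `LazyCollection` are hypothetical classes that only imitate the steps described above (read lines, parallel do, count, write, run) on an in-memory list, and nothing is computed until `run()` is called.

```python
from collections import Counter


class LazyCollection:
    """Toy stand-in for a parallel collection; it only records how to compute."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that yields the data

    def parallel_do(self, fn):
        # Describe the next stage without running anything yet.
        return LazyCollection(
            lambda: [out for item in self._compute() for out in fn(item)])

    def count(self):
        # Describe a per-element count, sorted by element.
        return LazyCollection(
            lambda: sorted(Counter(self._compute()).items()))


class ToyPipeline:
    """Toy stand-in for a pipeline that defers all work until run()."""

    def __init__(self):
        self._outputs = []  # (collection, sink) pairs registered by write()

    def read_lines(self, lines):
        return LazyCollection(lambda: list(lines))

    def write(self, collection, sink):
        self._outputs.append((collection, sink))

    def run(self):
        for collection, sink in self._outputs:
            sink.extend(collection._compute())


pipeline = ToyPipeline()
lines = pipeline.read_lines(["to be or not", "to be"])
words = lines.parallel_do(lambda line: line.split())  # the anonymous function
counts = words.count()

output = []
pipeline.write(counts, output)
executed_before_run = list(output)  # still empty: nothing has run yet

pipeline.run()
print(output)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Deferring the work this way is what lets a planner see the whole pipeline before deciding how many jobs it actually needs.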

If you’re used to programming in Java and you’ve seen the Hadoop examples for writing word count in Java, you can tell that this is a more natural way to express it. This is among the simplest pipelines you can create, and you can imagine doing many more complicated things.

If you want to go even one step easier than this, there’s a wrapper for Scala. This is a very similar idea to Scalding, the Scala API built on top of Cascading. Since Scala runs on the JVM, it’s an obvious natural fit. Scala’s type inference ends up being really powerful in the context of Crunch.

This is the same program but written in Scala. We have the pipeline, and we can use Scala’s built-in functions that map really nicely to Crunch, so word count becomes a one-line program. It’s pretty cool and very powerful if you’re writing Java code already and want to build complex pipelines.

Cloudera ML

Cloudera ML (machine learning) is an open-source collection of libraries and tools that helps data scientists perform their day-to-day tasks, primarily from data preparation through model evaluation.

With built-in commands for summarizing, sampling, normalizing, and pivoting data, Cloudera ML has recently added a built-in clustering algorithm for k-means, based on an algorithm that was developed only a year or two ago. There are a couple of other implementations as well. It’s a home for tools you can use so you can focus on data analysis and modeling instead of on building or wrangling the tools.

It’s built using Crunch. It leverages a lot of existing projects. Take the vector formats, for example: a lot of ML involves transforming raw data that’s in a record format into vector formats for machine-learning algorithms. It leverages Mahout’s vector interface and classes for that purpose. The record format is just a thin wrapper around Avro and HCatalog record and schema formats, so you can easily integrate with existing data sources.
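To picture what “record format to vector format” means, here is a minimal sketch in plain Python. It does not use Cloudera ML or Mahout; `to_vector` is a hypothetical helper, the field names are made up, and the encoding (numeric fields copied as floats, categorical fields one-hot encoded) is just one common assumption about how such a conversion can work.

```python
def to_vector(record, numeric_fields, categories):
    """Turn a dict-like record into a flat list of floats.

    numeric_fields: names of fields copied as-is.
    categories: maps a categorical field name to its list of known values,
                each of which becomes one 0.0/1.0 slot (one-hot encoding).
    """
    vector = [float(record[name]) for name in numeric_fields]
    for name, values in categories.items():
        vector.extend(1.0 if record[name] == value else 0.0 for value in values)
    return vector


# Hypothetical record and field layout.
record = {"age": 30, "clicks": 4, "country": "CA"}
vector = to_vector(record,
                   numeric_fields=["age", "clicks"],
                   categories={"country": ["US", "CA", "UK"]})
print(vector)  # [30.0, 4.0, 0.0, 1.0, 0.0]
```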

For more information on Cloudera ML, visit the project’s GitHub page; there are a lot of examples with datasets that can get you started.

Cloudera Development Kit

Like Cloudera ML, the Cloudera Development Kit (CDK) is a set of open-source libraries and tools that make writing applications on Hadoop easier. Unlike ML, though, it’s not focused on machine learning for data scientists. It’s directed at developers trying to build applications on Hadoop. It’s really the plumbing of a lot of different frameworks and pipelines and the integration of a lot of different components.

The purpose of the CDK is to provide higher-level APIs on top of the existing Hadoop components in the CDH stack that codify a lot of patterns in common use cases.

CDK is prescriptive, has an opinion on the way to do things, and tries to make it easy for you to do the right thing by default, but its architecture is a system of loosely coupled modules. You can use modules independently of each other. It’s not an uber-framework that you have to adopt in whole. You can adopt it piecemeal. It doesn’t force you into any particular programming paradigm. It doesn’t force you to adopt a ton of dependencies. You can adopt only the dependencies of the modules you actually want.

Let’s look at an example. The first module in CDK is the data module, and the goal of the data module is to make it easier for you to work with datasets on Hadoop file systems. There are a lot of gory details to clean up to make this work in practice: you have to worry about serialization, deserialization, compression, partitioning, directory layout, communicating that directory layout and partitioning to other people who want to consume the data, and so on.

The CDK data module handles all this for you. It automatically serializes and deserializes data from Java POJOs, if that’s what you have, or Avro records if you use them. It has built-in compression, and built-in policies around file and directory layouts so that you don’t have to repeat a lot of these decisions and you get smart policies out of the box. It will automatically partition data within those layouts. It lets you focus on working with a dataset on HDFS instead of all the implementation details. It also has plugin providers for existing systems.
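To give a feel for the kind of partitioning and directory-layout decision being automated, here is a small sketch in plain Python. It is not the CDK API: the functions `partition_path` and `partition`, the `/data/events` root, and the Hive-style `key=value` directory names are all assumptions made for illustration, not CDK’s documented layout.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_path(dataset_root, record):
    """Derive a date-based directory for a record from its epoch timestamp."""
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return f"{dataset_root}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}"


def partition(dataset_root, records):
    """Group records by the directory they would be written to."""
    layout = defaultdict(list)
    for record in records:
        layout[partition_path(dataset_root, record)].append(record)
    return dict(layout)


# Hypothetical events; timestamps are seconds since the Unix epoch (UTC).
events = [
    {"timestamp": 1370044800, "user": "alice"},  # 2013-06-01 00:00:00 UTC
    {"timestamp": 1370131200, "user": "bob"},    # 2013-06-02 00:00:00 UTC
    {"timestamp": 1370048400, "user": "carol"},  # 2013-06-01 01:00:00 UTC
]

layout = partition("/data/events", events)
for path in sorted(layout):
    print(path, [event["user"] for event in layout[path]])
# /data/events/year=2013/month=06/day=01 ['alice', 'carol']
# /data/events/year=2013/month=06/day=02 ['bob']
```

Having one agreed rule like this is what spares every producer and consumer of a dataset from reinventing, and then having to explain, its own layout.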

Imagine you’re already using Hive and HCatalog as a metadata repository, and you’ve already got a schema for what these files look like. CDK integrates with that. It doesn’t require you to define all the metadata for your entire data repository from scratch. It integrates with existing systems.

You can learn more about the various CDK modules and how to use them in the documentation.


In summary, working with data from various sources, preparing and cleaning it, and processing it through Hadoop involves a lot of work. Tools such as Crunch, Cloudera ML, and CDK make it easier to do this and to leverage Hadoop more effectively.



