AWS EMR, Data Aggregation and its best practices

Hiren Chafekar
6 min read · Dec 12, 2020

What is Big Data?

Big data is the field concerned with collecting, storing, processing, analyzing, and visualizing massive amounts of data so that companies can distill it into valuable business insights. Managing such data analysis platforms brings several challenges: installation and operational management, dynamically allocating processing capacity to accommodate variable load, and aggregating data from multiple sources for holistic analysis. The open source Apache Hadoop project and its ecosystem of tools help solve these problems, because Hadoop can scale horizontally to accommodate growing data volumes and can process unstructured and structured data within the same environment.

What is AWS EMR?

Amazon Elastic MapReduce (Amazon EMR) simplifies running Hadoop and makes it efficient to run big data applications on AWS. It removes the need to manage a Hadoop installation yourself, which is a daunting task, so any developer or business can do analytics without large capital expenditures. Users can start a performance-optimized Hadoop cluster in the AWS cloud within minutes, and the service lets them effortlessly expand or shrink a running cluster on demand. They can analyze and process vast amounts of data by using Hadoop’s MapReduce architecture to distribute the computational work across a cluster of virtual servers running in the AWS cloud, covering data collection, migration, and optimization along the way.
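As an illustration, a cluster like this can be launched programmatically. The sketch below uses boto3 (the AWS SDK for Python); the release label, instance types, and IAM role names are placeholders you would adjust for your own account and region.

```python
# A minimal sketch of launching an EMR cluster with boto3. The release label,
# instance types, and IAM role names are example values, not requirements.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-aggregation-cluster",
    ReleaseLabel="emr-6.2.0",            # example release; pick a current one
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster alive on demand
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR instance profile
    ServiceRole="EMR_DefaultRole",       # default EMR service role
    VisibleToAllUsers=True,
)
print("Cluster started:", response["JobFlowId"])
```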

Figure 1: Data Flow

What is Data Aggregation?

Data aggregation refers to techniques for gathering individual data records (for example, log records) and combining them into larger data files. In web server log aggregation, for instance, a single file collects all the recent visits. AWS EMR is very effective when working with data records aggregated this way, for several reasons.

Aggregation reduces the time required to upload data to AWS and improves data ingest scalability: instead of uploading many small files, we upload a smaller number of larger files. It also reduces the number of files stored on Amazon S3 (or HDFS), which in turn improves performance when processing the data on EMR. Finally, it yields a much better compression ratio, since compressing one large, highly compressible file is easier than compressing a large number of smaller files.
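In practice, the aggregation step can be as simple as bundling many small log files into one compressed object before uploading it. The sketch below shows the idea with Python’s gzip module and boto3; the paths, bucket, and key names are hypothetical.

```python
# A minimal sketch of the aggregation idea: combine many small log files into
# one larger GZIP bundle before uploading to S3, so EMR reads a few big
# objects instead of thousands of tiny ones.
import glob
import gzip
import boto3

def aggregate_and_upload(pattern, bundle_path, bucket, key):
    # Concatenate all matching log files into one compressed bundle.
    with gzip.open(bundle_path, "wb") as bundle:
        for path in sorted(glob.glob(pattern)):
            with open(path, "rb") as src:
                bundle.write(src.read())
    # Upload the single large object instead of many small ones.
    boto3.client("s3").upload_file(bundle_path, bucket, key)

aggregate_and_upload(
    pattern="/var/log/webserver/access-*.log",
    bundle_path="/tmp/access-logs-2020-12-12.gz",
    bucket="my-log-bucket",
    key="logs/2020/12/12/access-logs.gz",
)
```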

Data Aggregation Best Practices

Hadoop splits data into multiple chunks before processing it, and each map task processes one of those chunks. When the data lives on HDFS, the files are already separated into multiple blocks by the HDFS framework, so Hadoop simply assigns one map task to each HDFS block.

Figure 2: Hadoop Split Logic

While the same split logic applies to data stored on Amazon S3, the mechanism is different. Because data on Amazon S3 is not separated into blocks the way it is on HDFS, Hadoop splits it by reading your files with multiple HTTP range requests. This is simply a way for HTTP to request a portion of a file rather than the whole file (for example, GET FILE X Range: bytes=0–10000). The split size Hadoop uses to read files from Amazon S3 depends on the Amazon Machine Image (AMI) version; recent versions of Amazon EMR use larger split sizes than older ones. For instance, if a file on Amazon S3 is about 1 GB and the Amazon S3 split size is 64 MB, Hadoop reads the file by issuing about 16 HTTP range requests in parallel (1 GB / 64 MB = 16). Irrespective of where the data is stored, if the compression algorithm does not allow splitting, Hadoop will not split the file; instead, it uses a single map task to process the entire compressed file. A 1 GB GZIP file, for example, is processed by one mapper. On the other hand, if your files can be split (plain text, or a compression format that permits splitting, such as some versions of LZO), Hadoop splits them into multiple chunks and processes the chunks in parallel.

Figure 3: AWS EMR pulling compressed data from S3
Figure 4: AWS EMR using HTTP Range Requests
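For illustration, the kind of ranged read Hadoop performs against S3 can be reproduced with boto3’s get_object call and its Range parameter. The bucket, key, and 64 MB split size below are assumptions for the example, not values EMR exposes.

```python
# Illustrative only: a ranged read against Amazon S3, the same style of
# request Hadoop issues when it splits an uncompressed file.
import boto3

s3 = boto3.client("s3")
SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB, one split per map task (assumed)

def read_split(bucket, key, split_index):
    start = split_index * SPLIT_SIZE
    end = start + SPLIT_SIZE - 1
    # The Range header asks S3 for only bytes start..end of the object;
    # Hadoop issues one such request per split and processes them in parallel.
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{end}",
    )
    return resp["Body"].read()

chunk = read_split("my-log-bucket", "logs/2020/12/12/access-logs.txt", 0)
print(len(chunk), "bytes fetched for split 0")
```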

Best Practice 1: Aggregated Data Size

The suitable aggregated file size depends on the compression algorithm you are using. For example, if your log files are compressed with GZIP, it is often best to keep the aggregated file size to 1–2 GB. The reasoning is that since GZIP files cannot be split, Hadoop assigns a single mapper to process each file; and because a single thread is limited in how much data it can pull from Amazon S3 at any given time, reading the entire file from Amazon S3 into the mapper becomes the main bottleneck in your processing workflow. On the other hand, if your data files can be split, multiple mappers can process a single file, and a suitable size for such files is around 2 GB to 4 GB.
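To make that trade-off concrete, the rough sketch below estimates how many map tasks will read a file given its size and whether its codec can be split. The 64 MB split size and the codec table are illustrative assumptions, not EMR defaults.

```python
# A rough sketch of the reasoning above: estimate how many map tasks will
# read a file, given its size and whether its compression codec is splittable.
SPLIT_SIZE_MB = 64  # assumed split size

SPLITTABLE = {
    "none": True,    # plain text
    "bzip2": True,   # splittable
    "lzo": True,     # splittable when indexed
    "gzip": False,   # NOT splittable: one mapper reads the whole file
}

def estimated_mappers(file_size_mb, codec):
    if not SPLITTABLE.get(codec, False):
        return 1                                     # single-mapper bottleneck
    return max(1, -(-file_size_mb // SPLIT_SIZE_MB)) # ceiling division

print(estimated_mappers(1024, "gzip"))  # -> 1  (so keep GZIP bundles ~1-2 GB)
print(estimated_mappers(4096, "lzo"))   # -> 64 (splittable, so files can be larger)
```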

Best Practice 2: Controlling Data Aggregation Size

If a distributed log collector is being used, the customer is usually limited to aggregating data based on time. For example, suppose a customer uses Flume to aggregate the organisation’s important data so that it can later be exported to Amazon S3. With time-based aggregation, the customer cannot control the size of the files created, because the size of each aggregated file depends on the rate at which data flows through the distributed log collector. Since many distributed log collector frameworks are open source, customers can write plugins for their chosen log collector to add the ability to aggregate based on file size.
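As an illustration of what such a plugin would do, here is a minimal, hypothetical sketch of size-based rotation in Python. Real Flume plugins are written in Java, so this only demonstrates the logic, not an actual Flume interceptor or sink.

```python
# Hypothetical size-based rotation: roll the aggregate file over once it
# reaches a target size, instead of rolling on a timer.
import os

TARGET_SIZE = 2 * 1024**3  # roll over at ~2 GB (assumed target)

class SizeBasedRoller:
    def __init__(self, directory):
        self.directory = directory
        self.index = 0
        self.current = self._open_new()

    def _open_new(self):
        path = os.path.join(self.directory, f"aggregate-{self.index:05d}.log")
        self.index += 1
        return open(path, "ab")

    def append(self, record: bytes):
        self.current.write(record)
        # Once the current aggregate reaches the target size, start a new one.
        if self.current.tell() >= TARGET_SIZE:
            self.current.close()
            self.current = self._open_new()
```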

Best Practice 3: Data Compression Algorithms

Because it depends on the size of your aggregated data files, the compression algorithm becomes an important choice. For example, GZIP compression is acceptable if the aggregated data files are between roughly 500 MB and 1 GB. However, if your data aggregation creates files larger than 1 GB, it is best to select a compression algorithm that supports splitting.
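As a small illustration of that rule of thumb, the helper below picks a codec from the expected aggregate size. The thresholds and the choice of bzip2 as the splittable fallback are assumptions based on the sizes mentioned above, not the only valid options.

```python
# Encode the rule of thumb: GZIP for aggregates up to ~1 GB, a splittable
# codec (e.g. bzip2 or indexed LZO) for anything larger.
def choose_codec(expected_size_gb: float) -> str:
    if expected_size_gb <= 1:
        return "gzip"   # acceptable: one mapper per ~1 GB file
    return "bzip2"      # splittable, so large files still parallelize

for size in (0.5, 1.0, 2.5, 8.0):
    print(f"{size:>4} GB -> {choose_codec(size)}")
```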

Best Practice 4: Data partitioning

Data partitioning is an essential optimization for your processing workflow. Without data partitioning in place, your processing job must read or scan the entire data set and apply additional filters to skip unnecessary data. Such an architecture might work for a low volume of data, but scanning the whole data set is a very time-consuming and expensive approach for larger data sets. Data partitioning lets you create distinct buckets of data and eliminates the need for a data processing job to read the entire data set. Three considerations determine how you partition your data (a sketch of a time-partitioned S3 layout follows the list):
1. Data type (time series)

2. Data processing frequency (per hour, per day, etc.)

3. Data access and query patterns (query on time vs. query on geolocation)
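As an illustration of time-based partitioning, the sketch below writes each aggregate under a Hive-style, date-partitioned S3 prefix so a downstream job can list and read only the partitions it needs. The bucket name and key layout are hypothetical.

```python
# Write each aggregate under a date-partitioned prefix so jobs can read a
# single day's data without scanning the whole data set.
from datetime import datetime
import boto3

s3 = boto3.client("s3")
BUCKET = "my-log-bucket"  # hypothetical bucket

def partitioned_key(event_time: datetime, filename: str) -> str:
    # e.g. logs/year=2020/month=12/day=12/access-logs.gz
    return (
        f"logs/year={event_time:%Y}/month={event_time:%m}/"
        f"day={event_time:%d}/{filename}"
    )

def upload_partitioned(local_path: str, event_time: datetime):
    key = partitioned_key(event_time, local_path.rsplit("/", 1)[-1])
    s3.upload_file(local_path, BUCKET, key)

# A daily job then only lists one prefix rather than the full data set:
# s3.list_objects_v2(Bucket=BUCKET, Prefix="logs/year=2020/month=12/day=12/")
upload_partitioned("/tmp/access-logs-2020-12-12.gz", datetime(2020, 12, 12))
```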

