Data Aggregation in Java using Collections Framework
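Before defining data aggregation, let’s look at a problem. You are given the daily consumption of different types of candies by a candy lover, a volunteer in a small survey. The input has one record per date and candy, giving the amount consumed.
The desired output is a single table with one row per candy and one column per date, plus a total for every row and every column. That table is the aggregated data. Both tables are listed in full further down.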
What is Data Aggregation?
Let’s understand it with an example. The input table gives the amount of each kind of candy consumed per day. On Aug 28, 2022, the volunteer ate only two kinds of candy, KitKat and Hershey’s, whereas on Aug 29, 2022, the volunteer ate four kinds: KitKat, Skittles, Alpenliebe and Cadbury. The amounts tell us more: the most eaten candy on Aug 28, 2022, is KitKat, and on Aug 29, 2022, it is Cadbury. Even so, the input table does not directly answer a question such as: which candy is the most popular each day, or overall?
With an input this small, we can still answer that question by scanning the rows for matching dates. But imagine running the survey for a month or a quarter and offering the volunteer 100 more brands of candy to choose from. The data would grow so quickly that answering the question just by looking at the table would be almost impossible. The records might also be scattered, so that the rows for a given date are not consecutive as they are in the input table. Reading answers off the raw data would then be even harder.
To answer such statistical questions efficiently, we need to organize the data: categorize it so that the transformed data answers the question “which candy is more popular each day?” at a glance. For instance, in the data after aggregation, the column under each date shows that KitKat was eaten the most on Aug 28 and Cadbury on Aug 29. We can now also answer the following questions:
- On what date was a particular kind of candy eaten the most? (By looking at the row of that candy.)
- Which candy is the most popular overall? (By looking at the last “Total” column.)
- Which day witnessed the most candy consumption? (By looking at the last “Total” row.)
For the data in this example, the answers are:
- Alpenliebe was eaten the most on Aug 27 and Aug 29 (20 on each day).
- KitKat is the most popular candy overall, with a total of 70.
- Aug 29 turned out to be the day when the most candies were consumed, with a total of 110. Maybe we can declare it “Candy Day”.
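As a rough illustration of how these questions map onto the aggregated table, the sketch below hard-codes that table as a nested `Map` and reads the answers off its rows, columns and totals. The class name `CandyStats` and the helpers `row`, `column`, `keysWithMax`, `totalsByCandy` and `totalsByDate` are hypothetical names chosen for this example; it is one possible approach under those assumptions, not a definitive implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CandyStats {

    static final String[] DATES = {"27-08-2022", "28-08-2022", "29-08-2022"};

    /** Builds one row of the aggregated table: date -> amount, in DATES order. */
    static Map<String, Integer> row(int... amounts) {
        Map<String, Integer> r = new LinkedHashMap<>();
        for (int i = 0; i < DATES.length; i++) {
            r.put(DATES[i], amounts[i]);
        }
        return r;
    }

    /** The aggregated table of this example, hard-coded: candy -> (date -> amount). */
    static Map<String, Map<String, Integer>> aggregatedTable() {
        Map<String, Map<String, Integer>> t = new LinkedHashMap<>();
        t.put("Kitkat", row(10, 30, 30));
        t.put("Cadbury", row(0, 0, 45));
        t.put("Alpenliebe", row(20, 0, 20));
        t.put("Hershey's", row(0, 25, 0));
        t.put("skittles", row(20, 0, 15));
        return t;
    }

    /** Returns every key whose value equals the maximum, so ties are not hidden. */
    static List<String> keysWithMax(Map<String, Integer> values) {
        int max = Collections.max(values.values());
        List<String> keys = new ArrayList<>();
        values.forEach((k, v) -> { if (v == max) keys.add(k); });
        return keys;
    }

    /** Reads one column: candy -> amount eaten on the given date. */
    static Map<String, Integer> column(Map<String, Map<String, Integer>> t, String date) {
        Map<String, Integer> col = new LinkedHashMap<>();
        t.forEach((candy, r) -> col.put(candy, r.getOrDefault(date, 0)));
        return col;
    }

    /** The "Total" column: candy -> sum over all dates. */
    static Map<String, Integer> totalsByCandy(Map<String, Map<String, Integer>> t) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        t.forEach((candy, r) ->
            totals.put(candy, r.values().stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    /** The "Total" row: date -> sum over all candies. */
    static Map<String, Integer> totalsByDate(Map<String, Map<String, Integer>> t) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map<String, Integer> r : t.values()) {
            r.forEach((date, amount) -> totals.merge(date, amount, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> t = aggregatedTable();
        System.out.println(keysWithMax(column(t, "28-08-2022"))); // [Kitkat]
        System.out.println(keysWithMax(column(t, "29-08-2022"))); // [Cadbury]
        System.out.println(keysWithMax(t.get("Alpenliebe")));     // [27-08-2022, 29-08-2022]
        System.out.println(keysWithMax(totalsByCandy(t)));        // [Kitkat]
        System.out.println(keysWithMax(totalsByDate(t)));         // [29-08-2022]
    }
}
```

Returning a list rather than a single key is a deliberate choice here: Alpenliebe peaks on two different dates, and a single-valued "maximum" would silently drop one of them.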
These are the benefits of aggregating data. Aggregation is a technique for summarizing the data we have so that it can be analyzed, which makes the raw data more meaningful. We are now in a much better position to answer the questions above.
The task is to transform the given input table of candy consumption per date into an aggregated table in which the data collected for each candy is combined into a single value per day (refer to the “After Aggregation” table). The following is the code for the above problem:
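One possible way to perform this transformation with the Collections Framework is sketched below. It is a minimal sketch under stated assumptions rather than a definitive implementation: the class name `CandyAggregator`, the helper type `Consumption` and the method names are hypothetical, the nine input records are hard-coded in `sampleData`, and dates are parsed into `LocalDate` so that the date columns sort chronologically. The grouping itself relies on `Map.computeIfAbsent` and `Map.merge`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CandyAggregator {

    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd-MM-yyyy");

    /** One raw record: the amount of a candy eaten on a date (hypothetical helper type). */
    static final class Consumption {
        final LocalDate date;
        final String candy;
        final int amount;

        Consumption(String date, String candy, int amount) {
            this.date = LocalDate.parse(date, FMT);
            this.candy = candy;
            this.amount = amount;
        }
    }

    /** The nine records of the input table. */
    static List<Consumption> sampleData() {
        return List.of(
            new Consumption("27-08-2022", "skittles", 20),
            new Consumption("27-08-2022", "Kitkat", 10),
            new Consumption("27-08-2022", "Alpenliebe", 20),
            new Consumption("28-08-2022", "Kitkat", 30),
            new Consumption("28-08-2022", "Hershey's", 25),
            new Consumption("29-08-2022", "Kitkat", 30),
            new Consumption("29-08-2022", "skittles", 15),
            new Consumption("29-08-2022", "Alpenliebe", 20),
            new Consumption("29-08-2022", "Cadbury", 45));
    }

    /** Groups records into candy -> (date -> summed amount); record order does not matter. */
    static Map<String, Map<LocalDate, Integer>> aggregate(List<Consumption> records) {
        Map<String, Map<LocalDate, Integer>> table = new LinkedHashMap<>();
        for (Consumption c : records) {
            table.computeIfAbsent(c.candy, k -> new TreeMap<>())
                 .merge(c.date, c.amount, Integer::sum);
        }
        return table;
    }

    /** The "Total" column: candy -> sum over all dates. */
    static Map<String, Integer> totalsByCandy(Map<String, Map<LocalDate, Integer>> table) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        table.forEach((candy, row) ->
            totals.put(candy, row.values().stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    /** The "Total" row: date -> sum over all candies, in chronological order. */
    static Map<LocalDate, Integer> totalsByDate(Map<String, Map<LocalDate, Integer>> table) {
        Map<LocalDate, Integer> totals = new TreeMap<>();
        for (Map<LocalDate, Integer> row : table.values()) {
            row.forEach((date, amount) -> totals.merge(date, amount, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Map<LocalDate, Integer>> table = aggregate(sampleData());
        Map<String, Integer> byCandy = totalsByCandy(table);
        Map<LocalDate, Integer> byDate = totalsByDate(table);

        // Header row: one column per date, then the row total.
        System.out.printf("%-12s", "Candy/Date");
        byDate.keySet().forEach(d -> System.out.printf("%12s", FMT.format(d)));
        System.out.printf("%8s%n", "Total");

        // One row per candy; a candy not eaten on a date counts as 0.
        table.forEach((candy, row) -> {
            System.out.printf("%-12s", candy);
            byDate.keySet().forEach(d -> System.out.printf("%12d", row.getOrDefault(d, 0)));
            System.out.printf("%8d%n", byCandy.get(candy));
        });

        // Footer row: per-date totals and the grand total.
        System.out.printf("%-12s", "Total");
        byDate.values().forEach(v -> System.out.printf("%12d", v));
        System.out.printf("%8d%n", byDate.values().stream().mapToInt(Integer::intValue).sum());
    }
}
```

Because the outer map is a `LinkedHashMap`, this sketch prints the candies in the order they first appear in the input, so the row order may differ from other listings of the same table. The values and totals do not depend on the order of the input records.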
| Date | Candy | Consumption |
|---|---|---|
| 27-08-2022 | skittles | 20 |
| 27-08-2022 | Kitkat | 10 |
| 27-08-2022 | Alpenliebe | 20 |
| 28-08-2022 | Kitkat | 30 |
| 28-08-2022 | Hershey's | 25 |
| 29-08-2022 | Kitkat | 30 |
| 29-08-2022 | skittles | 15 |
| 29-08-2022 | Alpenliebe | 20 |
| 29-08-2022 | Cadbury | 45 |

After Aggregation

| Candy/Date | 27-08-2022 | 28-08-2022 | 29-08-2022 | Total |
|---|---|---|---|---|
| Kitkat | 10 | 30 | 30 | 70 |
| Cadbury | 0 | 0 | 45 | 45 |
| Alpenliebe | 20 | 0 | 20 | 40 |
| Hershey's | 0 | 25 | 0 | 25 |
| skittles | 20 | 0 | 15 | 35 |
| Total | 50 | 55 | 110 | 215 |