Aggregating Data With Bash

Bash scripting is not always the best way to work with data. There are many technologies that might work better for the particular project or data that you are working with. When most people hear about data mining they think of tools like SQL, Python, or R, which are all great, flexible tools for working with data. However, sometimes you are stuck in a purely Bash/Unix environment and need to write some Bash scripting to work with text-based data.

There is a lot that can be done with basic Bash tools and scripts; however, there are some hurdles that you will need to overcome to provide data analysis through Bash scripting. Bash, of course, is the shell that comes with many Unix operating systems. If you have little or no experience using a terminal for development, I would highly recommend going through some tutorials to familiarize yourself with the basics.

Bash provides many built-in tools that make analyzing raw text data much easier. However, to work with data you are really going to need tools that allow you to aggregate and overlap segments of data. Luckily, Bash provides some programs that let us build these tools on our own. In this post I'd like to talk about aggregating data, which is a little easier and more straightforward than overlapping data. Aggregating is also important for getting high-level counts on different cuts of your dataset.

For this post I am going to be using AWK. AWK is a very useful text-processing language and will be the main technology allowing us to sum up segmented data. I am choosing AWK because it comes with most Unix/Linux distros and is an easy tool for illustrating how to build a basic data aggregator. If you were building this tool for more day-to-day data analysis, I would recommend a language like Python, which will give you a little more flexibility. I have a more realistic Python example here if you are interested.

We will want an aggregator that scales well with large amounts of data, so for my design I chose to expect sorted input so that we do not have to load more than two rows of the data into memory. This creates more processing overhead because we need to sort our input first, but it avoids memory issues when you are working with very large datasets. The basic logic behind the script is this: as the program loops through each line of input, it checks that row's groupBy column. If that column is the same as the last row's, it adds the sum column to a counter variable; if it is not, it prints the running total for the previous group and starts counting the new one, with the final group printed when the end of the file is reached. Here is my data aggregator in AWK, saved as sum.awk:

BEGIN{
     # Expect tab delimited data
     FS="\t"; OFS="\t";
     # Initialize variables (no fields are available yet in BEGIN)
     holder=""; counter=0;
}
{
     # When the groupBy column changes, print the finished group
     if ($groupBy != holder && NR != 1) {
          print holder,counter;
          holder=$groupBy;
          counter=$sum;
     }
     # Otherwise keep summing the current group
     else {
          holder=$groupBy;
          counter=counter+$sum;
     }
}
END{
     # Output the final group
     print holder,counter;
}

This script takes the group-by column and the sum column as AWK variables and outputs a total for each group or segment to the terminal. Now that we have our aggregation script in order, we need some data to work with. I am going to use a modified version of the sample Coffee Chain business dataset that comes with Tableau Desktop to work through how to sum and analyze data. Here is a link to the tab-delimited dataset that I am using. You can also get this dataset in a slightly different format from Tableau's website.
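
To give you a feel for how the script is invoked, here is a minimal example. It assumes the script above is saved as sum.awk, that the input file has a header row to strip, and that the column numbers come from the dataset listed below; it is the same pattern used in the loops later in this post.

# Total column 20 (Sales) for each value of column 3 (Market)
sed '1d' coffee_chain_data.txt \
| sort -t$'\t' -k3,3 \
| awk -f sum.awk -v groupBy=3 -v sum=20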

One of the major drawbacks to using Bash is that we will have to remember which column represents which data as we perform our analysis. This is not an issue if you are using something like SQL, because the fields in each record of the dataset are already named. Here are the columns that we will be working with (a one-liner for regenerating this kind of reference follows the list):

1       Area Code
2       Date
3       Market
4       Market Size
5       Product
6       Product Line
7       Product Type
8       State
9       Type
10      Budget COGS
11      Budget Margin
12      Budget Profit
13      Budget Sales
14      COGS
15      Inventory
16      Margin
17      Marketing
18      Number of Records
19      Profit
20      Sales
21      Total Expenses
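
If you do not want to maintain a listing like this by hand, you can regenerate it from the file's header row with standard Unix tools. This is just a convenience one-liner; it assumes the first line of coffee_chain_data.txt holds the tab delimited column names.

# Number the column names from the header row
head -n 1 coffee_chain_data.txt | tr '\t' '\n' | cat -n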

For my examples, I will ignore most of these columns in favor of generating basic high-level cuts and counts of the underlying data. Let's take a look at Profit, Sales, and Total Expenses cut by Market, State, and Product. We can get all of those totals with a few simple Bash for loops:

# Loop through each dimension: Market, Product, State
for group in 3 5 8
do
     # Print the group header (only the header row is needed)
     cat coffee_chain_data.txt | awk -v group=${group} 'BEGIN{FS=OFS="\t"} NR==1{print "\n"$group":"; exit}'
     # Loop through the sum columns: Profit, Sales, Total Expenses
     for sumCol in 19 20 21
     do
          # Print the sum header (only the header row is needed)
          cat coffee_chain_data.txt | awk -v sumCol=$sumCol 'BEGIN{FS=OFS="\t"} NR==1{print "\n"$sumCol":"; exit}'
          # Drop the header, sort on the group column, and aggregate the data
          cat coffee_chain_data.txt | sed '1d' \
          | sort -t$'\t' -k$group,$group | awk -f sum.awk -v groupBy=$group -v sum=$sumCol
     done
done

The output of this command will look like this:

Market:

Profit:
Central    96844
East    83083
South    34762
West    110838

Sales:
Central    265045
East    178576
South    103926
West    272264

Total Expenses:
Central    69033
East    50113
South    31002
West    79514

Product:

Profit:
Amaretto    9602
Caffe Latte    11883
Caffe Mocha    31790
...

We could improve efficiency by modifying our aggregation script to sum multiple columns at once. That would let us read and sort the dataset once per dimension, three passes in total, instead of once per dimension and measure combination, nine passes. If you find yourself aggregating data in Bash quite a bit, a more flexible summing script like that is essential. For the sake of simplicity I will keep the single-column version above, but a rough sketch of a multi-column variant follows for anyone who wants to build on it.
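
Here is what such a modification might look like. This is only a sketch of the idea, not part of the workflow above: the file name sum_multi.awk and the comma separated sums variable are my own naming choices, and you would call it on sorted, header-stripped input with something like awk -f sum_multi.awk -v groupBy=3 -v sums="19,20,21".

BEGIN{
     # Expect tab delimited data
     FS="\t"; OFS="\t";
     # sums is a comma separated list of column numbers, e.g. "19,20,21"
     n=split(sums, cols, ",");
     holder="";
}
{
     # When the groupBy column changes, print the finished group and reset the totals
     if ($groupBy != holder && NR != 1) {
          line=holder;
          for (i=1; i<=n; i++) { line=line OFS total[i]; total[i]=$(cols[i]); }
          print line;
     }
     # Otherwise keep summing each requested column
     else {
          for (i=1; i<=n; i++) total[i]=total[i]+$(cols[i]);
     }
     holder=$groupBy;
}
END{
     # Output the final group
     line=holder;
     for (i=1; i<=n; i++) line=line OFS total[i];
     print line;
}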

You can also cut the data by month (or any other dimension for that matter) by concatenating that field onto your groupBy column. Here is an example of total Market Sales by month:

# Print aggregated data grouped by Date and Market
cat coffee_chain_data.txt | awk 'BEGIN{FS="\t";OFS="\t"}{if (NR!=1) { $3=$2"-"$3; print $0; }}' \
| sort -t$'\t' -k3,3 | awk -f sum.awk -v groupBy=3 -v sum=20
01/01/2012-Central    10346
01/01/2012-East    6780
01/01/2012-South    3976
01/01/2012-West    10453
02/01/2012-Central    10503
02/01/2012-East    6920
02/01/2012-South    4082
02/01/2012-West    10587
...

This would allow you to trend the aggregated Market counts over a timeline. You could use the same method to get counts by Market and Product, by State and Type, or any other aggregation of the dataset that you can think of.
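
For instance, a State and Type cut of Sales follows exactly the same pattern as the month example above. This is an untested sketch using the column numbers from the listing earlier in the post.

# Print total Sales (20) grouped by State (8) and Type (9)
cat coffee_chain_data.txt | awk 'BEGIN{FS="\t";OFS="\t"}{if (NR!=1) { $8=$8"-"$9; print $0; }}' \
| sort -t$'\t' -k8,8 | awk -f sum.awk -v groupBy=8 -v sum=20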

This may not be the best way to work with this kind of data, but if you find yourself needing to do data analysis in a pure Bash environment, this method can go a long way, and I hope you have found this post useful. I plan to get another post up about overlapping various segments soon, so keep an eye out for that one. If you have any comments or questions, or just want to let me know how you work with data, don't hesitate to leave a comment below.