Group and sum a value Groupby and functools in Python

5 min readOct 15, 2023

I couldn’t resist writing this blog post after getting myself involved in the world of time series data!! I want to share this exercise (and make a note to my future self), while playing a role of a fictional cookie shop business owner. The goal for this exercise is to create a dictionary that records total cookies sold on each day, which then storing them in a database. We choose to work with Cassandra, as the database. Cassandra is a NoSQL (non-relational database), which is a different topic for another day. So, why did we use timeseries? Well, maybe because imagine if we need to pinpoint the exact date and time of our store experiences its highest sales (to plan on the crew shifts schedules), or simply just wanting to know our sales figures daily (again, this is just for the sake of pretending and dreaming).

Let’s start with dictionary in Python. According to ChatGPT, dictionary is “an unordered and mutable collection of key-value pairs. Each key in a dictionary is unique and associated with a specific value.”
Characteristics of dictionary are:

  1. It has key-value pairs
  2. It is unordered (but you can sort it, which I will show you how)
  3. Mutable (it can be modified → add, update, delete)
  4. It can contain various types (from primitive type like string, int, float, etc to object)
  5. You can access the value associated with specific key using square bracket ([])

It looks like JSON, but they are not the same thing. Let’s pretend we have a booth in one of the Christmas markets in the city, and we need to keep track of how many cookies were sold in certain date. Here is the example of data showing how many cookies were sold in each time stamp (assuming we got this number based on the till report):

Img 1. Data containing number of cookies sold at certain timestamp

Now, what does this number mean, you asked? The value of time from data above is the Epoch timestamp. It is measured in miliseconds. If you ever wonder how to convert Epoch time to human-readable date, just use this converter.

Back to our store problem, img 1 shows that we have a dictionary called data. Next, we need to know when the first cookie was sold and when we closed the last sale. So, we need to put a date on that dictionary (you can double check the date conversion using converter above). let’s sort the data based on timestamp, so we know when the event starts and ends. Luckily, in order to use function sorted(), we don’t need to import anything, as it is a built-in function that is available by standard Python installation.

Sorted function to take arguments of data and key as the basis of sorting, which in this case, time. The last line prints the first value of time (in ms) and convert the date using datetime, which we need to import in the beginning.

We can refactor the above function to make the code cleaner, so let’s define a function to validate the date (I should definitely refactor all my variable later..It’s kinda confusing at the moment).

data has unsorted timeline, while sorted_data has ordered timeline based on ascending value

Voila, we know when the first sale and the last sale was (in term of date). Next, we want to aggregate the data based on the date. Once we group them into the same date, we want to calculate the sum of the cookies sold daily. In this example, I only have 5days of data, just to make things simple, with the first date to be on Aug 29, 2023 at 11:51:48 pm MST time (for some reason, we decided to have our store opening at 11:50 pm!!Whoops; otherwise, we can move our store to London, UK, where it was dated for Aug 30 at 5:51 am..Does it make more sense???Not really, but it’s okay, some coffee shops open as early as 5:30 am anyway).

We will be using itertools library and import groupby to our data into the same date. What itertools does is to iterate for efficient looping. This library can be used to store the grouped data in a list, while taking a map as a key argument, to access the value of each key. Additionally, you can combine functools library to keep adding the value (in our case, the total number of cookies), and save it in a total variable. Here is the example of how to use group by with functools, reduce function:

groupby was used in for loop to iterate data in sorted_data, through a key of validate.

As we see in above code, we have for loop that uses groupby to iterate sorted_data variable, and group the value based on the key of validate (which is in above definition, it is a sorted date in the format of YYYY-MM-DD).

The functools.reduce function takes in 3 parameters: (lambda) function, a list, and initial value of a variable (initializer). In the code above, the lambda function adds 2 values together, starts from initiliazer, and keep adding number from ‘values’ variable, until end of the list. In short, it works like this code ‘sum =+ values[i]’. The last line prints the result of ‘key:value’ which looks like below:

result of print (f”{key}:{sum}”)

Additionally, ‘values’ groups the data by date, and keep all ‘sold’ number in an array, if we have more than 1 data in one date. If we print values, through each iteration, it will look like this:

values keeps all data for each date, from the top, date recorded was 2023–08–29, to the end, date recorded was 2023–09–02.

Hope this helps you learning a bit of python library!
Until next time :)

--

--

Julie Gunawan
Julie Gunawan

Written by Julie Gunawan

Tech enthusiast, interested in backend and DevOps

No responses yet