collections package in Python

The collections module is one of my favorite and most-used modules in Python. What do you think? I use a couple of its types a lot, and the rest whenever I see an opportunity. There are plenty of good resources on collections online, but I still wanted to share my perspective, along with some practical scenarios where these types can be put to good use. So, here we go 🚀.

Why do we need these additional data types?

list, tuple, dict, and set are the built-in container data types in Python. They are extremely useful, each with a distinct purpose when it comes to holding data. They do have a few aspects, not exactly limitations, that can become a bottleneck in some practical scenarios. For example, inserting an object at the beginning of a list takes O(n) time, because every existing element has to shift one position to the right.
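To make the cost of inserting at the front of a list a bit more concrete, here is a rough sketch that compares it with deque (covered further down). The size N and the helper names are my own assumptions, and the timings will differ from machine to machine, so treat the numbers as an illustration only.

```python
from collections import deque
from timeit import timeit

N = 20_000  # hypothetical size, picked only for illustration


def fill_list_front(n):
    """Insert n items at the front of a list (each insert shifts the rest)."""
    items = []
    for i in range(n):
        items.insert(0, i)
    return items


def fill_deque_front(n):
    """Insert n items at the front of a deque (no shifting needed)."""
    items = deque()
    for i in range(n):
        items.appendleft(i)
    return items


# Both approaches produce the same order of items
assert fill_list_front(5) == list(fill_deque_front(5))

list_seconds = timeit(lambda: fill_list_front(N), number=1)
deque_seconds = timeit(lambda: fill_deque_front(N), number=1)
print(f"list.insert(0, x): {list_seconds:.4f}s")
print(f"deque.appendleft:  {deque_seconds:.4f}s")
```

On most machines the gap should widen as N grows, since the list version does more shifting with every insert.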

Another reason the built-in types may be the second choice is the readability and maintainability of code. Counting how often each item occurs in an iterable, for example, is a very common use case. We can write custom logic for it, but over time there is a good chance it turns into spaghetti 🍝.

This is where Python's collections module shines without a doubt.

Different containers in the collections module

Below are most of the types present in the collections module (OrderedDict is the one left out here):

defaultdict - Similar to a dictionary, but looking up a missing key does not raise a KeyError. Instead, it calls the default factory supplied at creation (for example int or list), stores the result under that key, and returns it.
Counter - A very handy dict subclass that counts how many times each hashable item occurs in an iterable.
ChainMap - Again similar to a dictionary, but it groups multiple dictionaries into a single view that can be searched as one mapping, without copying them.
namedtuple - Overcomes the drawback of a plain tuple, where items can be accessed by index only. How does it do that? namedtuple lets us name the fields while still allowing access by index.
deque - Similar to a list, but designed for adding and removing items at both ends of the sequence in O(1) time. This saves a notable amount of time, especially with large sequences.
UserString, UserList, UserDict - Wrapper classes around str, list, and dict objects that make subclassing easier.
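Since namedtuple does not get its own section below, here is a minimal sketch of the "name or index" idea. The Car record and its field names are hypothetical, invented just for this illustration.

```python
from collections import namedtuple

# Hypothetical record type; the field names are assumptions for illustration
Car = namedtuple("Car", ["model", "scale", "price"])

car = Car(model="Roadster", scale="1:64", price=4.99)

print(car.model)  # access a field by name
print(car[0])     # the same field, accessed by index

# It still unpacks like a regular tuple
model, scale, price = car

# namedtuples are immutable; _replace returns a modified copy
cheaper = car._replace(price=3.99)
print(cheaper)
```

The original car is left untouched by _replace, which is handy when the same record is passed around in several places.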

Let's now take a look at each, and see where it makes sense to use them. A word of caution, these are some hypothetical examples that I created, just to kick-start thinking along those lines.

We shall use the image on the cover as a starting point 🚗. "Cold Wheels" is a chain of stores that sells die-cast cars. It has retail outlets all over the world. The core development team of Cold Wheels exposes multiple APIs that are consumed by their Customer support team as well as the Sales team.

defaultdict

The Customer support team spends time understanding the feedback that they receive. Let's assume all feedback is stored in a table. The management wants to know the sentiment of each region. For the moment, let's assume the development team is yet to build sophisticated models for sentiment analysis. Instead, they use a custom piece of logic to understand whether the feedback is positive or negative, by simply counting the number of positive adjectives versus negative.

Sample Feedback
Great store! People are friendly. Fun place to visit. Pro tip: Park at the Webster bank over the bridge and walk over!
Nice hobby and toy store friendly staff and owner.
Great place to go and have some fun..staff are great patient and always willing to help..I highly recommend this place for everone kids all ages....
Nice family run business....Fantastic customer service....Great toys for everyone and assortment of all scales of model trains...just to name a few...

Let's use defaultdict to store the word count. We can initialize it with int as the default factory, so that a word that is not yet present as a key starts at 0. With a normal dictionary, incrementing the count of a new word would raise a KeyError, and we would need to handle it explicitly. defaultdict neatly removes those hassles.

from collections import defaultdict

feedback = [
  "Great store! People are friendly. Fun place to visit. Pro tip: Park at the Webster bank over the bridge and walk over!",
  "Nice hobby and toy store friendly staff and owner.",
  "Great place to go and have some fun..staff are great patient and always willing to help..I highly recommend this place for everone kids all ages",
  "Nice family run business....Fantastic customer service....Great toys for everyone and assortment of all scales of model trains...just to name a few."
]

# Split each feedback string into a list of words
# (split() keeps punctuation attached, e.g. "store!")
split_feedback = [text.split() for text in feedback]

# int() returns 0, so every new word starts with a count of 0
word_count = defaultdict(int)

# Iterate through each sublist (one per feedback) in the list of lists
for words in split_feedback:
    # Iterate through each word in the feedback
    for word in words:
        # Convert the word to lowercase to ensure case-insensitivity
        word = word.lower()
        # Increment the word count
        word_count[word] += 1

# Print the word count dictionary
print(dict(word_count))

Isn't it cool 😎?
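To see what defaultdict actually saves us, here is a small side-by-side sketch with a plain dictionary. The three-word list is made up for the illustration. It also shows one thing to keep in mind: merely reading a missing key from a defaultdict inserts that key.

```python
from collections import defaultdict

words = ["great", "nice", "great"]  # made-up sample for illustration

# Plain dict: incrementing a key that does not exist yet raises KeyError
plain = {}
try:
    plain["great"] += 1
except KeyError:
    print("plain dict raised KeyError for a new word")

# The usual workaround with a plain dict
for word in words:
    plain[word] = plain.get(word, 0) + 1

# defaultdict(int): missing keys start at int(), which is 0
counts = defaultdict(int)
for word in words:
    counts[word] += 1

print(plain == dict(counts))

# Reading a missing key also creates it with the default value
counts["missing"]
print("missing" in counts)
```

If that last behavior is unwanted, counts.get("missing", 0) reads the value without creating the key.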

Counter

Let's consider the same use case of understanding the sentiment, but use Counter this time. Now we need just one line of code to perform the counting.

from collections import Counter

feedback = [
  "Great store! People are friendly. Fun place to visit. Pro tip: Park at the Webster bank over the bridge and walk over!",
  "Nice hobby and toy store friendly staff and owner.",
  "Great place to go and have some fun..staff are great patient and always willing to help..I highly recommend this place for everone kids all ages",
  "Nice family run business....Fantastic customer service....Great toys for everyone and assortment of all scales of model trains...just to name a few."
]

# Split each feedback string into a list of words
split_feedback = [text.split() for text in feedback]

# We then flatten it as a single list
flat_list = [
  word.lower() for each_feedback in split_feedback
  for word in each_feedback
]

# Create a Counter object with the flattened list
# This one line is sufficient to count all the occurrences.
# We can then write some custom logic to filter out 
# positive and negative words from an imaginary master list
word_count = Counter(flat_list)

Feedback courtesy: https://amatosnewbritain.com/testimonials
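The "custom logic" mentioned in the comments could look roughly like the sketch below. The POSITIVE_WORDS and NEGATIVE_WORDS sets stand in for the imaginary master list, the sentiment_score helper is my own naming, and the two sample sentences are made up, so none of this is a real sentiment model. I also use a regular expression instead of split() here so that punctuation does not stick to the words.

```python
import re
from collections import Counter

# Hypothetical master lists; a real one would be much longer
POSITIVE_WORDS = {"great", "nice", "friendly", "fun", "fantastic"}
NEGATIVE_WORDS = {"bad", "rude", "slow", "expensive"}


def sentiment_score(texts):
    """Return (positive word count) minus (negative word count)."""
    # Lowercase everything and keep only runs of letters/apostrophes
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    word_count = Counter(words)
    # Counter returns 0 for words it has never seen, so no KeyError here
    positive = sum(word_count[word] for word in POSITIVE_WORDS)
    negative = sum(word_count[word] for word in NEGATIVE_WORDS)
    return positive - negative


# Made-up sample sentences, not real customer feedback
sample = ["Great store! People are friendly.", "Nice staff, but slow checkout."]
print(sentiment_score(sample))
```

A score above zero would hint at mostly positive feedback, below zero at mostly negative.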

ChainMap

We shall take the feedback example further to show the usefulness of a ChainMap. We receive feedback at each store level that gets pushed to the central table on a daily basis. Each store sends the feedback as a dictionary.

When the Customer Service team wants to analyze all the feedback to have a holistic view, they accumulate all feedback in one data store so that it's easier to perform any operation.

ChainMap is a handy one for such an operation. Let's see it in action.

from collections import ChainMap

feedback_storeA = {
  "Store A": [
    "Great store! People are friendly. Fun place to visit. Pro tip: Park at the Webster bank over the bridge and walk over!",
    "Nice hobby and toy store friendly staff and owner.",
    "Great place to go and have some fun..staff are great patient and always willing to help..I highly recommend this place for everone kids all ages",
    "Nice family run business....Fantastic customer service....Great toys for everyone and assortment of all scales of model trains...just to name a few."
  ]
}

feedback_storeB = {
  "Store B": [
    "Great store! People are friendly. Fun place to visit. Pro tip: Park at the Webster bank over the bridge and walk over!",
    "Nice hobby and toy store friendly staff and owner.",
    "Great place to go and have some fun..staff are great patient and always willing to help..I highly recommend this place for everone kids all ages",
    "Nice family run business....Fantastic customer service....Great toys for everyone and assortment of all scales of model trains...just to name a few."    
  ]
}

# Group all feedback into a single dict-like view (nothing is copied)
all_feedback = ChainMap(feedback_storeA, feedback_storeB)

# An empty list to store just the feedback
feedback_without_storename = []

# Iterate through each value list and 
# append it to the above list
for d in all_feedback.maps:
  for value in d.values():
    feedback_without_storename.extend(value)

print(feedback_without_storename)

# Use either defaultdict or Counter as explained above
# to understand sentiment.
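A few ChainMap behaviors are worth knowing before relying on it. The sketch below uses shortened, made-up store dictionaries (including a hypothetical "correction" dictionary) to show that lookups search the underlying dicts in order, that the ChainMap is a live view, and that the first dict wins when a key repeats.

```python
from collections import ChainMap

# Hypothetical, shortened per-store dictionaries
store_a = {"Store A": ["Great store!"]}
store_b = {"Store B": ["Nice staff."]}

all_feedback = ChainMap(store_a, store_b)

# A lookup searches the underlying dicts from left to right
print(all_feedback["Store B"])

# ChainMap is a view: a key added to store_a later is visible immediately
store_a["Store C"] = ["Fun place to visit."]
print("Store C" in all_feedback)

# If the same key appears in several dicts, the first one wins
correction = {"Store A": ["Corrected feedback"]}
patched = ChainMap(correction, store_a, store_b)
print(patched["Store A"])

# values() yields one list per visible key, which we can flatten
flat = [text for texts in all_feedback.values() for text in texts]
print(sorted(flat))
```

Because duplicate keys are hidden by the first match, iterating over .maps as in the example above is the safer route if two stores could ever send the same key.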

So far I have used these three data types quite a bit and found them useful. The other three I use quite rarely.

Can you share your experiences using deque, namedtuple, and the remaining ones? I'd be curious to see some real examples 😀. Otherwise, hope you had a good read 📜