Whether you work as an application developer or a data engineer, you will regularly need to clean data, tidy it up, and bring it into a prescribed format. At such times, the following functions prove very useful.
I use these functions quite a lot, and they save me a ton of effort. They keep my code clean and less bug-prone. All four fall under the category of pure functions, provided the function you pass to them is itself pure.
First of all, what does a pure function mean? It means that no matter how many times you pass the same set of inputs to a function, it returns the same output. For example, say I have a function that returns its input in uppercase. Irrespective of the number of times I pass "sam", it will always return "SAM". On the surface most functions behave like that, with a few exceptions such as rand(), which returns a different value on each call, and time(), which returns a different time on each call. These are impure.
Another constraint of pure functions is that they must not introduce any side effects. This means a pure function should not change the value of any variable outside of the function, and it should not output anything to another destination (e.g., printing a value to the screen or writing to a log file).
All a pure function does is return one or more values. This helps keep the code clean and maintainable, as each function concisely does one thing without affecting the outside world.
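As a rough illustration of the difference, here is a minimal sketch. The names to_upper, to_upper_and_log, and call_log are made up for this example and do not come from the employee code used later.

```python
import random

def to_upper(name):
    # Pure: the result depends only on the argument,
    # and nothing outside the function is touched.
    return name.upper()

# Hypothetical module-level state, used only to demonstrate a side effect
call_log = []

def to_upper_and_log(name):
    # Impure: returns the same value as to_upper,
    # but also mutates call_log, which lives outside the function.
    call_log.append(name)
    return name.upper()

def add_noise(value):
    # Impure: the result depends on hidden random state,
    # so the same input can produce different outputs.
    return value + random.random()

print(to_upper("sam"))                     # SAM
print(to_upper("sam") == to_upper("sam"))  # True

to_upper_and_log("sam")
print(call_log)                            # ['sam']
```

Calling to_upper any number of times leaves the program state untouched, while every call to to_upper_and_log grows call_log, which is exactly the kind of side effect a pure function avoids.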
Let's dig into the handy functions that I mentioned at the start. To explain them in detail, let's consider a hypothetical scenario, though I'm sure you come across cases like this quite often in your work life.
I used a spreadsheet of sample employee data from https://www.thespreadsheetguru.com/blog/sample-data to walk through each of the four functions.
Let's use pandas to read the spreadsheet into a dataframe. We then print the columns present to understand what's in there.
    import pandas as pd

    # Read data from spreadsheet and store it in a dataframe
    employee = pd.read_excel("employee-sample.xlsx")

    # Columns present in dataframe
    print(employee.columns.values.tolist())
Whenever there is a need to perform the same action on every item in a set of data, you can think of using map.
The basic syntax of map is map(function, *iterables). Here the function is what performs the action, and the items in the iterables are what get passed to the function one by one.
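The *iterables part of the signature means map can take more than one iterable, in which case the function receives one item from each per call. A minimal sketch, using made-up salary and bonus values rather than the spreadsheet data (bonus_amount is a hypothetical helper):

```python
# Hypothetical sample values, not taken from the spreadsheet
salaries = [50000, 64000, 72000]
bonus_percentages = [10, 0, 15]

def bonus_amount(salary, percentage):
    # Called once per pair: (50000, 10), (64000, 0), (72000, 15)
    return salary * percentage // 100

# In Python 3, map returns a lazy iterator, so wrap it in list() to see the values
bonuses = list(map(bonus_amount, salaries, bonus_percentages))
print(bonuses)  # [5000, 0, 10800]
```

With several iterables, map stops as soon as the shortest one is exhausted.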
To see map in action, let's print the values in the Gender column. We notice that the values are either 'Male' or 'Female'. Let's assume that there is a need to return the shortened form 'M' for 'Male' and 'F' for 'Female'. Here map comes in very handy.
    # Sample values of Gender column in dataframe
    print(employee['Gender'].head())

    # Use case 1 for map function
    # For each element passed, the function
    # returns the shortened form
    def gender_shortform(gender):
        # Any value other than 'Male' is treated as 'Female'
        return 'M' if gender == 'Male' else 'F'

    # Store values of Gender column in a variable
    employee_gender = employee['Gender']
    print(list(map(gender_shortform, employee_gender)))
In the above example, gender_shortform is the function that takes each value from employee_gender, checks it, and returns the appropriate short form. We come across many such scenarios where a column of data needs to be massaged or translated, especially with numerical and string data. map is an ideal candidate for these.
As the name suggests, we use filter when we need a subset of records/data based on a condition. Like map, filter takes in a function and an iterable. The function returns a boolean value based on one or more conditions, and filter keeps only the items for which it returns True. In our employee spreadsheet, assume we need the employees who have received a bonus. This can be handled easily using filter like so.
    # Example usage of filter function
    # Return items whose value is greater than 0.00
    def filter_employee_with_bonus(item):
        return item > 0.00

    employee_bonus = employee['Bonus %']

    # Find the number of employees who received Bonus
    print(len(list(filter(filter_employee_with_bonus, employee_bonus))))
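The same idea can be tried without the spreadsheet. The sketch below uses a made-up list of bonus percentages (has_bonus is a hypothetical name) and compares filter with the equivalent list comprehension, which many Python developers find easier to read:

```python
# Hypothetical bonus percentages, not taken from the spreadsheet
bonus_percentages = [0.15, 0.0, 0.20, 0.0, 0.07]

def has_bonus(percentage):
    # Predicate: True keeps the item, False drops it
    return percentage > 0.0

# filter returns a lazy iterator in Python 3, so materialize it with list()
with_bonus = list(filter(has_bonus, bonus_percentages))
print(with_bonus)       # [0.15, 0.2, 0.07]
print(len(with_bonus))  # 3

# Equivalent list comprehension
same_result = [p for p in bonus_percentages if p > 0.0]
print(same_result == with_bonus)  # True

# Passing None as the function keeps only truthy items, which drops the 0.0 entries
truthy_only = list(filter(None, bonus_percentages))
print(truthy_only)      # [0.15, 0.2, 0.07]
```

Which form to use is largely a matter of taste; filter reads well when the predicate already exists as a named function.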
zip is a function that combines iterables by pairing up their items position by position. It is useful in many scenarios where you need to build a new data structure out of existing ones. For example, if I want to analyze each employee's job title and department, I can quickly use zip to create a list of tuples holding those details.
    # Create one tuple per employee with a few
    # details about that employee
    print(list(zip(employee['Full Name'],
                   employee['Job Title'],
                   employee['Department'])))
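Two behaviors of zip are worth knowing when building new structures this way: its output can be fed straight into dict, and it silently stops at the shortest input. A small sketch with invented names and titles (none of these values come from the spreadsheet):

```python
# Hypothetical sample values, not taken from the spreadsheet
names = ["Asha Rao", "Ben Ortiz", "Chen Li"]
titles = ["Analyst", "Engineer", "Manager"]
departments = ["Finance", "IT"]  # deliberately one item short

# Pair items by position
rows = list(zip(names, titles))
print(rows)
# [('Asha Rao', 'Analyst'), ('Ben Ortiz', 'Engineer'), ('Chen Li', 'Manager')]

# Two zipped iterables convert directly into a lookup table
title_by_name = dict(zip(names, titles))
print(title_by_name["Ben Ortiz"])  # Engineer

# zip stops at the shortest iterable, so 'Chen Li' is dropped here
truncated = list(zip(names, departments))
print(truncated)
# [('Asha Rao', 'Finance'), ('Ben Ortiz', 'IT')]
```

With dataframe columns of equal length the truncation never triggers, but it is an easy source of silently missing rows when the inputs come from different places.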
reduce is another function that's commonly used when there is a need to calculate a single result from a column of values. reduce must be imported from the functools module. The syntax is quite similar to map and filter: the first argument is the function (or a lambda), followed by an iterable, and an optional third argument is an initializer. The initializer argument is unique to reduce. It is used as the starting value: on the first call, the function receives the initializer as its first argument and the first item of the iterable as its second. After that, the result of each call becomes the first argument of the next one.
Let's assume we need the sum of Annual Salary of all employees plus an operational cost of 100000. We can then write a one-line reduce call like so.
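    from functools import reduce

    # a is the running total, b is the next salary
    def calculate_ctc(a, b):
        return a + b

    # Start from the operational cost of 100000 and add each salary
    ctc = reduce(calculate_ctc, employee['Annual Salary'], 100000)
    print(ctc)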
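To see how the initializer flows through reduce, the sketch below records the arguments of every call. The salary values are made up, and add_and_record is a hypothetical helper that is deliberately impure (it appends to a list) purely so the calls can be inspected:

```python
from functools import reduce

# Hypothetical salaries, not taken from the spreadsheet
salaries = [50000, 64000, 72000]

calls = []  # records (running_total, salary) for each call

def add_and_record(running_total, salary):
    # Impure on purpose: the append exists only to trace the calls
    calls.append((running_total, salary))
    return running_total + salary

ctc = reduce(add_and_record, salaries, 100000)

print(calls)
# [(100000, 50000), (150000, 64000), (214000, 72000)]
print(ctc)  # 286000

# For plain addition, the built-in sum with a start value gives the same result
print(sum(salaries, 100000))  # 286000

# With an empty iterable, reduce simply returns the initializer
print(reduce(add_and_record, [], 100000))  # 100000
```

The first recorded call shows the initializer arriving as the first argument, and each later call receives the previous result in that position.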
The prime advantage I find in using these functions is the reduced amount of code, which makes managing and troubleshooting it a lot easier. The examples above may not be the best places to apply these functions. They are meant to showcase how to use them and to kindle some ideas. How the logic is implemented is purely the developer's choice.
Here is the complete code in replit.
Share your experiences using these pure functions. Have you seen any performance gains, or faced criticism from co-workers for using them?