Frequency Distribution Analysis using Python Data Stack – Part 2

This article continues Frequency Distribution Analysis using Python Data Stack – Part 1. Here we analyze real production business survey data.

Application Configuration File

The configuration (config) file config.py is shown in Code Listing 3. It holds the general settings for the network server activities (Priority) analysis, the TV Network selection survey and the Hotel Ratings survey. A config file is an excellent place to store local and global application settings without hardcoding them in the application code. A web search will turn up many Python programs that follow the bad practice of hardcoding such values.

Many Data Science programs need default values for their algorithm parameters. Python developers often hardcode these parameters, even in the constructors (__init__()) of their classes. This is a very bad practice for Continuous Integration and implementation.
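As a minimal sketch of the difference, assume a hypothetical report class with two hypothetical settings, top_n_classes and percent_decimals (neither appears in the article's config.py). A SimpleNamespace stands in for the imported config module so that the snippet runs on its own.

```python
from types import SimpleNamespace

# Stand-in for "import config"; in a real project these names would live in config.py.
# Both setting names are hypothetical and chosen only for this illustration.
config = SimpleNamespace(top_n_classes=4, percent_decimals=1)


class HardcodedReport:
    """Anti-pattern: the default values are buried in the constructor."""

    def __init__(self):
        self.top_n = 4
        self.decimals = 1


class ConfiguredReport:
    """Preferred: the default values come from the settings object."""

    def __init__(self, settings=config):
        self.top_n = settings.top_n_classes
        self.decimals = settings.percent_decimals


# Changing a setting requires no edit to the class itself.
custom = SimpleNamespace(top_n_classes=2, percent_decimals=0)
report = ConfiguredReport(custom)
print(report.top_n, report.decimals)
```

With the second class, a new default only touches the settings, so the class code and its tests stay unchanged.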


    # network activities log file settings
    network_activities_file = "network_activities.csv"
    data_file_network_activities = "network_activities.csv"
    column_name_priority = "Priority"
    column_name_message = "Message"
    network_activities_image1 = "network_activities1.png"
    network_activities_image2 = "network_activities2.png"
    network_activities_image3 = "network_activities3.png"
    info_class = "Info"
    error_class = "Error"
    warning_class = "Warning"
    alert_class = "Alert"
    plot_style = "ggplot"

    # priority column settings
    plot_x_label_priority = "Priority"
    plot_y_label_priority = "Percent Frequency"
    plot_title_priority = "Priority Percent Frequency Distribution"
    plot_legend_priority = "Priority Amount of Messages"

    # error column settings
    plot_x_label_error = "Percent Frequency"
    plot_y_label_error = "Error Message"
    plot_title_error = "Error Percent Frequency Distribution"
    plot_legend_error = "Error Amount of Messages"

    # warning column settings
    plot_x_label_warning = "Percent Frequency"
    plot_y_label_warning = "Warning Message"
    plot_title_warning = "Warning Percent Frequency Distribution"
    plot_legend_warning = "Warning Amount of Messages"

    # tv network settings
    data_file_tv_networks = "tv_networks.csv"
    image_file_tv_networks = "tv_networks_image.png"
    column_name_tv_networks = "Network"
    plot_x_label_tv_networks = "TV Networks"
    plot_y_label_tv_networks = "Percent Frequency"
    plot_title_tv_networks = "TV Network Percent Frequency Distribution"
    plot_legend_tv_networks = "Amount of TV Networks"

    # hotel ratings settings
    data_file_hotel_ratings = "hotel_ratings.csv"
    image_file_hotel_ratings = "hotel_ratings_image.png"
    column_name_hotel_ratings = "Rating"
    plot_x_label_hotel_ratings = "Hotel Rating"
    plot_y_label_hotel_ratings = "Percent Frequency"
    plot_title_hotel_ratings = "Hotel Rating Percent Frequency Distribution"
    plot_legend_hotel_ratings = "Amount of Hotel Rating"

Code Listing 3. Program configuration file (config.py).

Frequency Distribution Subclass

The frequency_distribution_subclass.py module contains the FrequencyDistribution subclass, which inherits from FrequencyDistributionLibrary(object). The code is provided in Code Listing 4. As you can see, the constructor initializes the superclass library, gets the project directory path and sets the data file path. The frequency_distribution_target() method prints a frequency summary table, then builds and saves a vertical bar chart image for any target column. The frequency_distribution_message() method does the same but builds and saves a horizontal bar chart image.


    import os

    from frequency_distribution_superclass import FrequencyDistributionLibrary


    class FrequencyDistribution(FrequencyDistributionLibrary):
        """
        frequency distribution subclass inherited from frequency distribution superclass library
        """

        def __init__(self, data_file):
            """
            frequency distribution class constructor
            :param data_file: data file name
            """
            # initialize frequency distribution superclass library
            FrequencyDistributionLibrary.__init__(self)
            # get project directory path
            self._project_directory_path = self.get_project_directory_path()
            # set data file path
            self._data_file_path = os.path.join(self._project_directory_path, data_file)

        def frequency_distribution_target(self, image_file, column_target, plot_x_label, plot_y_label,
                                          plot_title, plot_legend):
            """
            print frequency summary table, build and save vertical bar chart image
            :param image_file: image file name
            :param column_target: column target name
            :param plot_x_label: plot x-axis label name
            :param plot_y_label: plot y-axis label name
            :param plot_title: plot title
            :param plot_legend: plot legend
            :return none
            """
            # set image file path
            image_file_path = os.path.join(self._project_directory_path, image_file)
            # get x and y axes list data
            x_axis, y_axis = self.load_x_y_axis_data(self._data_file_path, column_target)
            # print frequency summary table
            self.print_frequency_summary_table(plot_x_label, plot_y_label, x_axis, y_axis)
            # build and save vertical bar chart image
            self.build_bar_chart_vertical(x_axis, y_axis, image_file_path, plot_x_label, plot_y_label,
                                          plot_title, plot_legend)

        def frequency_distribution_message(self, image_file, column_message, column_target, column_target_class,
                                           plot_x_label, plot_y_label, plot_title, plot_legend):
            """
            print frequency summary table, build and save horizontal bar chart image
            :param image_file: image file name
            :param column_message: column message name
            :param column_target: column target name
            :param column_target_class: column target class name
            :param plot_x_label: plot x-axis label name
            :param plot_y_label: plot y-axis label name
            :param plot_title: plot title
            :param plot_legend: plot legend
            :return none
            """
            # set image file path
            image_file_path = os.path.join(self._project_directory_path, image_file)
            # get x and y axes list message data for the selected target class
            x_axis, y_axis = self.load_x_y_axis_data(self._data_file_path, column_message, column_target,
                                                     column_target_class)
            # print message frequency summary table
            self.print_frequency_summary_table(plot_x_label, plot_y_label, x_axis, y_axis)
            # build and save message horizontal bar chart image
            self.build_bar_chart_horizontal(x_axis, y_axis, image_file_path, plot_x_label, plot_y_label,
                                            plot_title, plot_legend)

Code Listing 4. Frequency distribution subclass FrequencyDistribution(FrequencyDistributionLibrary).
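Both methods rely on load_x_y_axis_data(), which belongs to the superclass from Part 1 and is not repeated here. The sketch below is not that implementation. It is a standard-library stand-in, under the assumption that the method returns the classes of a column together with their percent frequencies, optionally restricted to the rows of one target class. The function name percent_frequency and the tiny sample log are made up for the illustration.

```python
import csv
import io
from collections import Counter


def percent_frequency(rows, column, filter_column=None, filter_value=None):
    """Return (classes, percents) sorted by descending percent frequency."""
    if filter_column is not None:
        # keep only the rows that belong to the requested target class
        rows = [row for row in rows if row[filter_column] == filter_value]
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    ordered = counts.most_common()
    classes = [name for name, _ in ordered]
    percents = [round(100.0 * count / total, 1) for _, count in ordered]
    return classes, percents


# Tiny made-up log, only to exercise the function.
sample = io.StringIO(
    "Priority,Message\n"
    "Error,Interface X0 Link Is Down\n"
    "Error,Interface X0 Link Is Down\n"
    "Error,Interface X8 Link Is Down\n"
    "Warning,Interface X0 Link Is Up\n"
)
rows = list(csv.DictReader(sample))

# two-argument form: distribution of the Priority column
print(percent_frequency(rows, "Priority"))
# four-argument form: distribution of Message among the Error rows only
print(percent_frequency(rows, "Message", "Priority", "Error"))
```

The two calls mirror how frequency_distribution_target() and frequency_distribution_message() pass their column arguments in Code Listing 4.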

Now that we have defined and organized the superclass and subclass objects, we are ready to create the client program to do the Frequency Distribution Analysis for the network server activities log file. The code shown in Code Listing 5 provides a complete analysis including:

  • Print priority frequency summary table, build and save vertical bar chart image for priority column (Table 2 and Figure 1)
  • Print error frequency summary table, build and save horizontal bar chart image for error column (Table 3 and Figure 2)
  • Print warning frequency summary table, build and save horizontal bar chart image for warning column (Table 4 and Figure 3)

You can see how clean this code is. All the work is done by the subclass FrequencyDistribution(FrequencyDistributionLibrary), and of course nothing has been hardcoded. Any necessary change made to the config file parameters will not affect the program production code. Moving default algorithm parameter values to the config file is necessary for any Data Analytics program deployment and maintenance.


    import time

    from frequency_distribution_subclass import FrequencyDistribution
    import config


    def main():
        # create frequency distribution class object
        frequency_distribution = FrequencyDistribution(config.data_file_network_activities)
        # print frequency summary table, build and save vertical bar chart image for priority column
        frequency_distribution.frequency_distribution_target(
            config.network_activities_image1, config.column_name_priority, config.plot_x_label_priority,
            config.plot_y_label_priority, config.plot_title_priority, config.plot_legend_priority)
        # print frequency summary table, build and save horizontal bar chart image for error column
        frequency_distribution.frequency_distribution_message(
            config.network_activities_image2, config.column_name_message, config.column_name_priority,
            config.error_class, config.plot_x_label_error, config.plot_y_label_error,
            config.plot_title_error, config.plot_legend_error)
        # print frequency summary table, build and save horizontal bar chart image for warning column
        frequency_distribution.frequency_distribution_message(
            config.network_activities_image3, config.column_name_message, config.column_name_priority,
            config.warning_class, config.plot_x_label_warning, config.plot_y_label_warning,
            config.plot_title_warning, config.plot_legend_warning)


    if __name__ == '__main__':
        start_time = time.time()
        main()
        end_time = time.time()
        print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

Code Listing 5. Program code for Network Server Activities Frequency Distribution Analysis.

Table 2 shows the Priority frequency summary table. The Priority column contains four classes: Info, Error, Warning and Alert.

Priority    Percent Frequency
Info        33.90%
Error       30.70%
Warning     26.80%
Alert       8.70%


Table 2. Priority frequency summary table.

The bar chart of the Priority Percent Frequency Distribution is shown below in Figure 1. It shows that Error messages (red light) make up 30.7% and Warning messages (yellow light) make up 26.8% of all messages. The system admin team needs to know about these messages for network server maintenance and optimization.

Figure 1. Priority Percent Frequency Distribution.

Table 3 shows the Error Message frequency summary table. Nine classes have been defined.

Error Message                Percent Frequency
Interface X0 Link Is Down    30.80%
Interface X8 Link Is Down    20.50%
Interface X5 Link Is Down    17.90%
Interface X4 Link Is Down    15.40%
Interface X9 Link Is Down    5.10%
Interface X2 Link Is Down    2.60%
Interface X3 Link Is Down    2.60%
Interface X6 Link Is Down    2.60%
Interface X7 Link Is Down    2.60%


Table 3. Error Message frequency summary table.

The horizontal bar chart of the Error Percent Frequency Distribution is shown in Figure 2. The figure shows that the most frequent Error messages are: Interface X0 Link Is Down (30.8%), Interface X8 Link Is Down (20.5%), Interface X5 Link Is Down (17.9%) and Interface X4 Link Is Down (15.4%). This information is very important for the network admin team, which can start looking for the right solutions to fix these server errors.

Figure 2. Error Percent Frequency Distribution.

Table 4 shows the Warning Message frequency summary table. Seven classes have been defined.

Warning Message            Percent Frequency
Interface X0 Link Is Up    32.40%
Interface X5 Link Is Up    20.60%
Interface X8 Link Is Up    20.60%
Interface X4 Link Is Up    17.60%
Interface X1 Link Is Up    2.90%
Interface X9 Link Is Up    2.90%
Wan IP Changed             2.90%


Table 4. Warning Message frequency summary table.

The horizontal bar chart of the Warning Message Frequency Distribution is shown in Figure 3. As the figure shows, the most frequent Warning messages are: Interface X0 Link Is Up (32.4%), Interface X8 Link Is Up (20.6%), Interface X5 Link Is Up (20.6%) and Interface X4 Link Is Up (17.6%). The message Interface X0 Link Is Up is critical and requires some attention from the network admin team.

Figure 3. Warning Percent Frequency Distribution.

The Network Server Activities Frequency Distribution Analysis provides the admin team with the following conclusions:

  1. The server Error and Warning messages are the second and third most frequent priority classes, with 30.7% and 26.8% respectively.
  2. The most frequent Error messages, in descending percent order, are:
     • Interface X0 Link Is Down (30.8%)
     • Interface X8 Link Is Down (20.5%)
     • Interface X5 Link Is Down (17.9%)
     • Interface X4 Link Is Down (15.4%)
  3. The most frequent Warning messages, in descending percent order, are:
     • Interface X0 Link Is Up (32.4%)
     • Interface X8 Link Is Up (20.6%)
     • Interface X5 Link Is Up (20.6%)
     • Interface X4 Link Is Up (17.6%)
  4. The Error messages have the same class sequence as the Warning messages.
  5. The Warning messages Interface X0 Link Is Up (32.4%), Interface X8 Link Is Up (20.6%) and Interface X5 Link Is Up (20.6%) are the most likely to occur at this point.
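The claim that the Error and Warning messages share the same class sequence can be checked directly against Tables 3 and 4. The following sketch copies the top four percentages from those tables and compares the two rankings. The helper ranked() and the variable names are illustrative, not part of the article's code. Because X5 and X8 tie at 20.60% in Table 4, the comparison treats tied interfaces as one group.

```python
# Percent frequencies copied from Table 3 (errors) and Table 4 (warnings).
error_pct = {"X0": 30.8, "X8": 20.5, "X5": 17.9, "X4": 15.4}
warning_pct = {"X0": 32.4, "X5": 20.6, "X8": 20.6, "X4": 17.6}


def ranked(frequencies):
    """Group interfaces by percent, highest first, keeping ties together."""
    groups = {}
    for interface, percent in frequencies.items():
        groups.setdefault(percent, set()).add(interface)
    return [groups[percent] for percent in sorted(groups, reverse=True)]


# Do both tables name the same four interfaces?
same_interfaces = set(error_pct) == set(warning_pct)

# The orders agree if every warning group covers a run of consecutive error ranks.
error_order = sorted(error_pct, key=error_pct.get, reverse=True)
position = 0
consistent = True
for group in ranked(warning_pct):
    if set(error_order[position:position + len(group)]) != group:
        consistent = False
    position += len(group)

print(ranked(warning_pct))
print(same_interfaces, consistent)
```

On the table data both checks hold, so the two sequences match up to the X5/X8 tie among the warnings.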

Let’s apply this logic to real business examples so you can see how quickly you can develop and analyze any business frequency data.

Example 1. TV Networks Frequency Distribution Analysis

Table 5 shows ten rows of the TV Network data file (tv_networks.csv). This file contains the most-watched television networks for a period of time.

Network
CBS
CBS
ABC
CBS
FOX
NBC
CBS
NBC
FOX
NBC


Table 5. Ten rows of the TV Network data file.
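As a small worked example, the percent frequency of the ten sample rows in Table 5 can be computed by hand or with a few lines of standard-library Python. This is only an illustration of the calculation on the sample. The figures reported for the complete data file are different because they cover all rows.

```python
from collections import Counter

# The ten rows shown in Table 5.
networks = ["CBS", "CBS", "ABC", "CBS", "FOX", "NBC", "CBS", "NBC", "FOX", "NBC"]

# Count each network, then convert the counts to percentages of the sample.
counts = Counter(networks)
percents = {name: 100.0 * count / len(networks) for name, count in counts.most_common()}

for name, percent in percents.items():
    print(f"{name:<4}{percent:6.2f}%")
```

In this sample CBS appears 4 times out of 10 (40%), NBC 3 times (30%), FOX twice (20%) and ABC once (10%).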

The ‘tv network settings’ section of the config file (Code Listing 6) includes the seven required parameter settings.


    # tv network settings
    data_file_tv_networks = "tv_networks.csv"
    image_file_tv_networks = "tv_networks_image.png"
    column_name_tv_networks = "Network"
    plot_x_label_tv_networks = "TV Networks"
    plot_y_label_tv_networks = "Percent Frequency"
    plot_title_tv_networks = "TV Network Percent Frequency Distribution"
    plot_legend_tv_networks = "Amount of TV Networks"

Code Listing 6. Config file section for ‘tv network settings’.

The main program is shown in Code Listing 7 and provides the complete Frequency Distribution Analysis for the TV Network data file. This code includes:

  • Create frequency distribution class object
  • Print frequency summary table, build and save vertical bar chart image for TV Networks column

    import time

    from frequency_distribution_subclass import FrequencyDistribution
    import config


    def main():
        # create frequency distribution class object
        frequency_distribution = FrequencyDistribution(config.data_file_tv_networks)
        # print frequency summary table, build and save vertical bar chart image for tv networks column
        frequency_distribution.frequency_distribution_target(
            config.image_file_tv_networks, config.column_name_tv_networks, config.plot_x_label_tv_networks,
            config.plot_y_label_tv_networks, config.plot_title_tv_networks, config.plot_legend_tv_networks)


    if __name__ == '__main__':
        start_time = time.time()
        main()
        end_time = time.time()
        print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

Code Listing 7. Main program for TV Network Frequency Distribution Analysis.

The program generates the TV Network frequency summary table (Table 6) and the percent frequency distribution vertical bar chart (Figure 4).

TV Network    Percent Frequency
CBS           30.00%
NBC           27.00%
FOX           24.00%
ABC           19.00%


Table 6. TV Network frequency summary table.

As you can see, based on the TV Network data file the most-watched network is CBS with 30%, followed by NBC with 27%.

Figure 4. TV Network Percent Frequency Distribution.

Example 2. Hotel Ratings Frequency Distribution Analysis

This example uses a Hotel Rating survey stored in the data file hotel_ratings.csv. Table 7 shows ten of its records.

Rating
Poor
Very Good
Excellent
Poor
Poor
Excellent
Average
Very Good
Average
Very Good


Table 7. Ten rows of the Hotel Rating survey data file.

The ‘hotel ratings settings’ section of the config file (Code Listing 8) includes the seven required parameter settings.


    # hotel ratings settings
    data_file_hotel_ratings = "hotel_ratings.csv"
    image_file_hotel_ratings = "hotel_ratings_image.png"
    column_name_hotel_ratings = "Rating"
    plot_x_label_hotel_ratings = "Hotel Rating"
    plot_y_label_hotel_ratings = "Percent Frequency"
    plot_title_hotel_ratings = "Hotel Rating Percent Frequency Distribution"
    plot_legend_hotel_ratings = "Amount of Hotel Rating"

Code Listing 8. Config file section for ‘hotel ratings settings’.

The main program shown in Code Listing 9 provides the complete Frequency Distribution Analysis for the Hotel Rating survey data file. This code includes:

  • Create frequency distribution class object
  • Print frequency summary table, build and save vertical bar chart image for Hotel Rating column

    import time

    from frequency_distribution_subclass import FrequencyDistribution
    import config


    def main():
        # create frequency distribution class object
        frequency_distribution = FrequencyDistribution(config.data_file_hotel_ratings)
        # print frequency summary table, build and save vertical bar chart image for hotel ratings column
        frequency_distribution.frequency_distribution_target(
            config.image_file_hotel_ratings, config.column_name_hotel_ratings, config.plot_x_label_hotel_ratings,
            config.plot_y_label_hotel_ratings, config.plot_title_hotel_ratings, config.plot_legend_hotel_ratings)


    if __name__ == '__main__':
        start_time = time.time()
        main()
        end_time = time.time()
        print("Program Runtime: " + str(round(end_time - start_time, 1)) + " seconds" + "\n")

Code Listing 9. Main program for Hotel Rating Frequency Distribution Analysis.
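The two example programs differ only in the suffix of the config names they read (tv_networks versus hotel_ratings). One possible refinement, sketched here under that naming assumption, is a small helper that collects the settings for a given suffix. The helper settings_for() and the TARGET_FIELDS tuple are hypothetical and not part of the article's code. A SimpleNamespace stands in for the imported config module so that the snippet runs on its own.

```python
from types import SimpleNamespace

# Stand-in for "import config", holding a copy of the hotel ratings settings.
config = SimpleNamespace(
    data_file_hotel_ratings="hotel_ratings.csv",
    image_file_hotel_ratings="hotel_ratings_image.png",
    column_name_hotel_ratings="Rating",
    plot_x_label_hotel_ratings="Hotel Rating",
    plot_y_label_hotel_ratings="Percent Frequency",
    plot_title_hotel_ratings="Hotel Rating Percent Frequency Distribution",
    plot_legend_hotel_ratings="Amount of Hotel Rating",
)

# Setting prefixes in the argument order of frequency_distribution_target().
TARGET_FIELDS = ("image_file", "column_name", "plot_x_label",
                 "plot_y_label", "plot_title", "plot_legend")


def settings_for(settings, suffix):
    """Return the data file and the six chart arguments for one survey."""
    data_file = getattr(settings, f"data_file_{suffix}")
    arguments = [getattr(settings, f"{field}_{suffix}") for field in TARGET_FIELDS]
    return data_file, arguments


data_file, arguments = settings_for(config, "hotel_ratings")
print(data_file)
print(arguments)
# With the article's class this would become:
# FrequencyDistribution(data_file).frequency_distribution_target(*arguments)
```

A misspelled suffix raises AttributeError at lookup time, which surfaces a configuration mistake before any chart is built.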

The program generates the Hotel Rating frequency summary table (Table 8) and the percent frequency distribution vertical bar chart (Figure 5).

Hotel Rating    Percent Frequency
Very Good       38.80%
Excellent       28.80%
Average         16.50%
Poor            9.60%
Terrible        6.30%


Table 8. Hotel Rating frequency summary table.

As you can see, based on the Hotel Rating survey data file 38.8% of the guests rated the hotel ‘Very Good’ and 28.8% rated it ‘Excellent’. The guest messages for the ‘Poor’ and ‘Terrible’ ratings were not provided.

Figure 5. Hotel Rating Percent Frequency Distribution.

Recommendations

I would like to recommend reading OOP materials applied to Data Analytics projects.

Conclusions

Based on the information provided in this article, we can summarize the following two conclusions:

  1. Applying Object-Oriented programming to your Data Analytics projects makes your code more organized, reusable and robust. It also allows an easy implementation of the required Continuous Integration and Unit Test standard methodologies.
  2. Storing local and global Data Analytics program settings in a configuration file is a good design and development practice. Required default values of the algorithm parameters should be stored in the configuration file. Hardcoding these values in the program code is a bad programming and deployment practice.
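To illustrate the Unit Test point in the first conclusion, here is a minimal sketch using the standard unittest module. The function percent_frequency() is a hypothetical stand-in for the kind of calculation a frequency distribution class performs. It is not taken from the article's library.

```python
import unittest
from collections import Counter


def percent_frequency(values):
    """Map each class to its share of the observations, in percent."""
    counts = Counter(values)
    total = len(values)
    return {name: round(100.0 * count / total, 1) for name, count in counts.items()}


class PercentFrequencyTest(unittest.TestCase):
    def test_shares_sum_to_one_hundred(self):
        shares = percent_frequency(["Poor", "Poor", "Excellent", "Average"])
        self.assertEqual(shares["Poor"], 50.0)
        self.assertAlmostEqual(sum(shares.values()), 100.0)

    def test_empty_input_gives_empty_table(self):
        self.assertEqual(percent_frequency([]), {})


# Run the test case programmatically and keep the outcome for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PercentFrequencyTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful())
```

Small, pure functions such as this one are easy to cover with tests, which is one practical benefit of keeping the calculation separate from file handling and plotting.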

Feel free to send Ernest any questions about his paper.
