
Demo-PY2: Data management and visualization in Python with Pandas

After the first introduction to the syntax in Demo-PY1: Python tutorial, Demo-PY2 shows in the next step how data management and visualization are carried out with the help of the Python library Pandas. A time series data set on electricity consumption in Germany from the Open Power System Data (OPSD) platform serves as an application example. The data is analyzed in the interactive, web-based application environment Jupyter Notebook. In the last step of the demo, the data is visualized using the Matplotlib library.

Motivation

In data analysis, reading in and processing the data is the first and most complex step, as the data is rarely available in the required form. In many cases the external data source (e.g. a CSV file) contains more data than is required, or individual data values are missing or have the wrong format.

Python provides three powerful libraries for processing and visualizing data: NumPy, Pandas and Matplotlib. We illustrate their typical use with a specific application.

Why pandas?

Pandas is an open source library for the Python programming language and offers powerful, user-friendly data structures and functions for accessing data tables, see pandas.pydata.org. In this section, the Pandas library will play a central role, as it enables large CSV files with time series data to be converted into Pandas data structures ("DataFrames") and the data to be processed further. Pandas functions such as iloc(), loc() and resample() are used to select rows, columns and cells and to group and aggregate time series data.

Why Matplotlib (and Seaborn)?

Matplotlib is a Python program library for the graphical representation of mathematical functions and its range of functions is comparable to MATLAB's plotting functions.

Seaborn is a Matplotlib extension that is specially tailored to Pandas DataFrames. With the help of the Seaborn functions, graphical representations can be embellished and designed more professionally, e.g. different color schemes can be selected.

The OPSD data set

Open Power System Data is an open platform for energy researchers that provides energy data collected across Europe in the form of CSV files and SQLite databases. The data can be downloaded and used free of charge. The platform is operated by a consortium (Europa-Universität Flensburg, TU Berlin, DIW Berlin and Neon Neue Energieökonomik) with the aim of supporting energy research.

As a basis for our demo, we use an OPSD time series data set for electrical energy generation in Germany covering the period 01.01.2016 - 31.12.2018 with an hourly resolution.

We want to answer questions such as: How do electricity consumption and capacity data for electricity as a whole, solar and wind energy change over the years? When in the course of a year is consumption highest?

Create Jupyter Notebook

A Jupyter notebook is created for data management and data visualization. We open the Jupyter Notebook application from the program menu and create a new Python 3 notebook named elab2go-Demo-PY2 using the "New" menu item.

The details of using Jupyter Notebooks are described in the section Use Jupyter Notebooks.

Data management with pandas

Pandas has two data structures, Series and DataFrames, and a multitude of functions with which one can prepare and manipulate these data structures. The Pandas functions can be divided into functions for loading and saving data, functions for data preparation, functions for aggregating data, and functions for visualizing ("plotting") the data.

A Series is a one-dimensional indexable object that stores data in a row or column. The simplest Series is created from a list of data values:
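A minimal sketch (the values are arbitrary example data):

  import pandas as pd

  # A Series created from a list of values; Pandas automatically
  # assigns a default integer index (0, 1, 2, ...)
  s = pd.Series([58.2, 61.5, 59.8, 60.1])
  print(s)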

A DataFrame represents a tabular data structure with an ordered collection of columns, each of which can have a different data type. DataFrames can be generated in different ways, for example from several Series or from a dictionary whose values are lists of the same length, as in the following example:

Example: Generate DataFrame from Series or Dictionary
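A minimal sketch with arbitrary example values:

  import pandas as pd

  # Variant 1: DataFrame from a dictionary whose values are lists of equal length
  df1 = pd.DataFrame({'Year': [2016, 2017, 2018],
                      'Consumption': [540.0, 535.1, 538.7]})

  # Variant 2: DataFrame from several Series
  s1 = pd.Series([2016, 2017, 2018], name='Year')
  s2 = pd.Series([540.0, 535.1, 538.7], name='Consumption')
  df2 = pd.concat([s1, s2], axis=1)

  print(df1)
  print(df2)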


Preparation of the OPSD data record

The OPSD data set, which is available in the form of a CSV file time_series_60min.csv, is now read in, processed and output in tabular form with the help of the Pandas data structures and functions.

1. Import the program libraries

In the first code cell of the Jupyter Notebook we import the required program libraries: Pandas, Matplotlib and Seaborn. In Python, the import statement can be used either to import a complete program library or only individual functions of a program library. During the import, alias names are assigned to the respective libraries or functions: for Pandas we assign the alias pd, etc.
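The corresponding code cell might look like this; pd is the alias mentioned above, while plt and sns are the usual aliases for Matplotlib and Seaborn:

  import pandas as pd                # data structures and data management
  import matplotlib.pyplot as plt    # plotting
  import seaborn as sns              # nicer default styles for Matplotlib plots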

2. Read in data

In the second code cell, the OPSD data stored in the CSV file time_series_60min_xs.csv is read into the working memory of the Python program using the function read_csv() from the Pandas library. The read_csv() function saves the data in a DataFrame named opsd_full.
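A sketch of how this code cell might look, laid out so that the line numbers match the explanation below (the file name is taken from the text):

  # Read the OPSD CSV file into a DataFrame
  opsd_full = pd.read_csv('time_series_60min_xs.csv', index_col=0, parse_dates=True)
  # Show the first 3 rows of the DataFrame
  opsd_full.head(3)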

  • Line 2: The function read_csv() is called with three parameters. The first parameter is the name of the CSV file to be read in. The second parameter index_col=0 specifies that the first column of the file is used as the index column (row labels) of the DataFrame. The third parameter parse_dates=True specifies that the values of the index column are interpreted and parsed as date values. This is important so that we can extract the year, month and day later.
  • Line 4: The function head() shows the specified number of rows (here: 3) from the beginning of the DataFrame.

The output after executing this code block looks similar to the one shown. The index column of the DataFrame is the TimestampUTC column and is highlighted as the row header.


3. Select data columns

We only need the columns TimestampUTC, Consumption, Solar and Wind from the downloaded CSV file. In the third code cell, the Pandas function iloc() is used to select the required columns from the DataFrame opsd_full and to copy them into a new DataFrame named opsd.
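A sketch of how this code cell might look:

  # Select all rows, but only columns 2, 4 and 5, and copy them into a new DataFrame
  opsd = opsd_full.iloc[:, [2, 4, 5]].copy()
  opsd.head(3)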

  • opsd_full.iloc[:, [2, 4, 5]] means that all rows, but only columns 2, 4 and 5, are extracted.
  • .copy() means that a new DataFrame is created by copying. This ensures that the original DataFrame opsd_full is not changed unintentionally.

The output after executing this block of code looks like this:


4. Export the DataFrame to a CSV file

In the fourth code cell, the simplified data record is exported to a new CSV file as a backup. The Pandas function to_csv() is used for the export. The option index=True means that the index column or row header (in our case: TimestampUTC) is also exported.
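A sketch of how this code cell might look; the file name opsd_de.csv is only a placeholder:

  # Export the simplified DataFrame to a new CSV file, including the index column
  opsd.to_csv('opsd_de.csv', index=True)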

5. Divide columns of data by 1000

In the fifth code cell, we divide the consumption columns by 1000. This ensures that the consumption data is displayed in GWh instead of MWh. Pandas provides a number of arithmetic operations (add, sub, mul, div) that can be applied to the elements of a DataFrame. Here we use the Pandas function div() to divide all elements of the selected columns by 1000.
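A sketch of how this code cell might look, assuming all three data columns are converted:

  # Divide the selected columns by 1000 so that the values are in GWh instead of MWh
  cols = ['Consumption', 'Solar', 'Wind']
  opsd[cols] = opsd[cols].div(1000)
  opsd.head(3)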

The output after executing this block of code looks like this:


6. Extract the year, month and day from the date column

We add three new columns to the data set by extracting the relevant components (year, month and day) from the index column of the DataFrame opsd.
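A sketch of how this code cell might look; the column names Year, Month and Day are assumptions:

  # Extract year, month and day from the datetime index and add them as new columns
  opsd['Year'] = opsd.index.year
  opsd['Month'] = opsd.index.month
  opsd['Day'] = opsd.index.day
  opsd.head(3)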

The output after executing this block of code looks like this:


7. Select date ranges (lines)

With the help of the Pandas function loc(), we select two date ranges that will be compared with each other. Similar to iloc(), the loc() function is used to select ranges from a DataFrame. While the iloc() function uses the integer position of a row or column as a filter, with loc() the selection is made via row and column labels.
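A sketch of how this code cell might look, assuming the two date ranges are the years 2017 and 2018 that are compared later in the demo; the names opsd_2017 and opsd_2018 are assumptions:

  # Select two date ranges via the row labels of the datetime index
  opsd_2017 = opsd.loc['2017-01-01':'2017-12-31']
  opsd_2018 = opsd.loc['2018-01-01':'2018-12-31']
  opsd_2017.head(3)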

The output after executing this block of code looks like this:


8. Evaluations on a daily, weekly and monthly basis

So far, we have looked at the hourly consumption values as they were provided in the downloaded file. To consider a larger period (months and years), we need the consumption values per day, week and month. The next step is therefore to change the time scale, so-called resampling. Resampling refers to the process of converting a time series from one frequency to another. In Pandas, the frequency conversion is carried out with the help of the method resample().

Example: Calculate daily sums, weekly averages, monthly sums
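A sketch of how this code cell might look, laid out so that lines 1, 4 and 8 correspond to the explanation below. The names opsd_tag and opsd_month reappear later in the demo; opsd_woche is an assumption:

  cols = ['Consumption', 'Solar', 'Wind']

  # Daily sums: group the hourly values by day ('D') and add them up
  opsd_tag = opsd[cols].resample('D').sum()

  # Weekly averages: resample the daily values on a weekly basis ('W')
  # and use the mean value as the aggregating function
  opsd_woche = opsd_tag.resample('W').mean()

  # Monthly sums: group the hourly values by month ('M')
  opsd_month = opsd[cols].resample('M').sum()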

  • Line 1: Select the columns for which we want to change the time scale.
  • Line 4: Resample the selected columns. The function resample() is called with the parameter 'D', i.e. the data is grouped by days. The sum() function is then executed on the grouped elements, i.e. the hourly values of a day are added up to a daily total.
  • Line 8: Here, the daily values are resampled on a weekly basis (parameter 'W'), and the mean value is used as the aggregating function. That means we calculate the average weekly consumption.

Data visualization with matplotlib

In order to be able to use Matplotlib, the library must first be imported:
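For example:

  # Import the pyplot interface of Matplotlib under the usual alias plt;
  # in a Jupyter notebook, %matplotlib inline can be used to display plots inline
  import matplotlib.pyplot as plt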

The syntax of Matplotlib is simple and comparable to the syntax of MATLAB's plotting functionality. To create a diagram, a Figure object is first created, to which multiple subplots can be added. The most important plotting commands are plot for two-dimensional diagrams and plot_surface (from the mplot3d toolkit) for three-dimensional surface diagrams. The plot command receives the x and y coordinates of the data to be displayed as parameters, and optionally a string with formatting information, as in the following example:

Example: Draw the sine and cosine functions on the same diagram.
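A sketch with exactly the five lines described below; plt refers to the Matplotlib import from above, and figsize is specified in inches:

  import numpy as np
  x = np.linspace(0, 10, 40)
  y1, y2 = np.sin(x), np.cos(x)
  plt.figure(figsize=(4, 2))
  plt.plot(x, y1, 'r*', x, y2, 'b+')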

  • Line 1: Import the NumPy library
  • Line 2: Generate a discretization x of the interval [0, 10] with 40 data points
  • Line 3: Calculate the function values y1, y2 at the points of the array x; y1 and y2 are each arrays with 40 elements.
  • Line 4: Create a figure with a given size of 4x2 inches (the figsize parameter is specified in inches).
  • Line 5: Draw the two discretized functions on the same diagram. The first function is shown in red with an asterisk as a marker, the second function in blue with a + sign as a marker.

The output after executing this block of code looks similar to the one shown. The diagram can be further embellished by adding labels for the axes, a title, a legend, etc.


When using Matplotlib with Pandas, it is not necessary to explicitly specify the x and y coordinates of the diagram to be displayed, as in our mini example. The plot command is integrated as a method of the Pandas data structures, so that a DataFrame df can be visualized directly with df.plot().

Data visualization 1: Development of daily electricity consumption

We first visualize the column "Consumption" of the DataFrame "opsd_tag", i.e. the daily electricity consumption in Germany from 01/01/2016 to 12/31/2018.
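A sketch of how this code cell might look, with six lines matching the description below; the title, axis labels and y-range are placeholders:

  plt.figure(figsize=(12, 4))
  opsd_tag['Consumption'].plot()
  plt.title('Daily electricity consumption in Germany 2016-2018')
  plt.xlabel('Date')
  plt.ylabel('Consumption (GWh)')
  plt.ylim(0, 2000)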

  • Line 1: The size of the diagram is set to 12x4 inches.
  • Line 2: The "Consumption" column is displayed graphically using the plot command. The values for the x coordinates are implicitly taken from the index column of the DataFrame (here: TimestampUTC).
  • Lines 3-6: Define the title, the axis labels and the range of values for the y-axis.

The output after executing this block of code looks like this:


Data visualization 2: Development of monthly electricity consumption

In the next step, we want to see how the monthly consumption of electricity, solar and wind energy has developed from 2016 to 2018. We now use the aggregated monthly data as the data source, i.e. the previously created DataFrame "opsd_month".
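A sketch of such a code cell, written here as a small loop so that the line numbers match the description below; the colors, markers and labels are placeholders:

  plt.figure(figsize=(12, 4))
  for col, style in [('Consumption', 'k-o'), ('Solar', 'y-*'), ('Wind', 'b-+')]:
      opsd_month[col].plot(style=style, label=col)

  plt.legend()
  plt.title('Monthly electricity consumption, solar and wind energy 2016-2018')
  plt.xlabel('Month')
  plt.ylabel('Energy (GWh)')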

  • Lines 2-3: Plot the columns "Consumption", "Solar" and "Wind" of the DataFrame with different colors and markers.
  • Lines 5-8: Create the legend and title and define the axis labels.

The output after executing this block of code looks like this:


Compare the electricity consumption in 2017 vs. 2018

We use the plot() function of Matplotlib again to compare the electricity consumption in 2017 and 2018. The two graphs are placed side by side as subplots of the same figure. We also add titles and axis labels to the charts.
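A sketch, assuming the DataFrames opsd_2017 and opsd_2018 selected in step 7; titles and labels are placeholders:

  # Two subplots side by side in the same figure, sharing the y-axis
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

  opsd_2017['Consumption'].plot(ax=ax1)
  ax1.set_title('Electricity consumption 2017')
  ax1.set_xlabel('Date')
  ax1.set_ylabel('Consumption (GWh)')

  opsd_2018['Consumption'].plot(ax=ax2)
  ax2.set_title('Electricity consumption 2018')
  ax2.set_xlabel('Date')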

The output after executing this block of code looks like this:


YouTube video

The use of the created Jupyter Notebook is illustrated by a 5-minute video (screencast with additional explanations).