Python: Numpy array data manipulation
In the previous tutorial (please go to Python: Pandas DataFrame data manipulation), one method to manipulate a huge amount of data was explained. Now, in this tutorial, the second method of data manipulation will be explained. In this method, the library numpy is used which sorts data into arrays.
But...what is an array? An array is a data structure used to store data. To understand it better, let's remember what a matrix is. A matrix is a rectangular (sometimes square) array arranged in rows and columns. An example of a rectangular matrix can be the following:
1 | 2 | 3 | 4 |
---|---|---|---|
5 | 6 | 7 | 8 |
Now, let's copy the array into another array in order to keep the original one.
#Copying the array to another array array_copy = array_data.copy() print(array_copy)
Let's now learn how to delete specific rows and columns. For this purpose, the function numpy.delete() is needed.
#Removing one or multiple rows delete_rows = np.delete(array_data, [100, 525, 9461], axis=0) print(delete_rows)
#Removing one or multiple columns delete_columns = np.delete(array_data, [2,7,11], axis=1) print(delete_columns)
Please note that in the code above, axis=0 stands for deleting rows, while axis=1 stands for columns. After running the code, we will get the following:
Since it is not visible that we just deleted specific rows and columns, let's prove it! In order to prove it, we can ask Python to give us the dimensions of the new arrays ('delete_rows' and 'delete_columns') and the array 'array data'. For this, we will use the function numpy.size().
#Comparing arrays
#Getting the number of columns rows1 = np. size(array_data, 1) print(rows1) rows2 = np. size(delete_columns, 1) print(rows2)
#Getting the number of rows columns1 = np. size(array_data, 0) print(columns1) columns2 = np. size(delete_rows, 0) print(columns2)
In the picture above, variables 'rows1' and 'columns1' are the dimensions of the original array 'array data', while the variables 'rows1' and 'columns1' are the dimensions of the same array after deleting rows [100, 525, 9461], and columns [2,7,11]. Notice that the number of rows and columns has decreased as it should since we have deleted 3 rows and 3 columns.
Now, let's get specific values according to a certain condition. For example, we will get the data only for one specific country: Peru. We can use the function input in order for the user to enter his/her desired country.
#Getting data per country (Selecting all rows where the first column is equal to 'country') country = input('Please type your country of interest:') array_data_country = array_data[array_data[:,0] == country] print(array_data_country)
Like in the previous tutorial about the pandas library, we will export the user-input array into Excel. The easiest way to achieve this is by converting the array into DataFrame (using the pandas.DataFrame() function), and then exporting it to Excel (using the df.to_excel() function) as shown in the previous tutorial.
However, there is another way to export a numpy array, which will be shown in this tutorial. The first thing we must do is to import the library xlswriter. Then, we should 'open' an Excel workbook and sheet using Python coding. For this, there are special functions in Python as shown:
#Exporting array to Excel import xlsxwriter
#Creating the Excel file
#Creating the workbook workbook = xlsxwriter.Workbook('Vaccinations by country.xlsx') #Creating the worksheet worksheet = workbook.add_worksheet()
After creating the Excel file, before starting exporting the array, we should first set a header for it. For this, we create a list in Python with the different column names for the header, and then we must iterate through each element of that list and write it in the first row in Excel.
#Iterating over the header to create the first row in Excel row_header = 0 for column_header, data_header in enumerate(header): worksheet.write(row_header, column_header, data_header)
Please do not forget that Python starts counting from 0! That is why we set the row equal to zero. The function enumerate(header) tells Python to iterate through each element of the list header. Then, the function worksheet.write() will write the iterated data in the first row in Excel. Now, it is time to do the same, but for the data itself! We will apply the same logic as for the header.
#Iterating over the array data to export to Excel column = 0
for row, data in enumerate(array_data_country): try: worksheet.write_row(row+1, column, data) except: pass
The final step to do is to close the Excel file using the function workbook.close().
workbook.close()
After running this code, you will find the created Excel file in the same directory as your Python file.
The complete code will look like this:
#Importing the library import numpy as np import pandas as pd import xlsxwriter
#Importing the .txt file df = pd.read_csv('vaccinations.txt', header = None, skiprows = (1), sep = ',', quotechar = None, quoting = 3) print(df)
#Converting the DataFrame into array array_data = np.array(df) print(array_data)
#Selecting the first 5 rows (from 0 to 4) first_rows = array_data[:5,:] print(first_rows)
#Selecting the first 4 columns (from 0 to 3) first_columns = array_data[:,:4] print(first_columns)
#Copying the array to another array array_copy = array_data.copy() print(array_copy)
#Removing one or multiple columns delete_columns = np.delete(array_data, [2,7,11], axis=1) print(delete_columns)
#Removing one or multiple rows delete_rows = np.delete(array_data, [100, 525, 9461], axis=0) print(delete_rows)
#Comparing arrays
#Getting the number of columns rows1 = np. size(array_data, 1) print(rows1) rows2 = np. size(delete_columns, 1) print(rows2)
#Getting the number of rows columns1 = np. size(array_data, 0) print(columns1) columns2 = np. size(delete_rows, 0) print(columns2)
#Getting data per country (Selecting all rows where the first column is equal to 'country') country = input('Please type your country of interest:') array_data_country = array_data[array_data[:,0] == country] print(array_data_country)
#Exporting array to Excel
#Creating the Excel file
#Creating the workbook workbook = xlsxwriter.Workbook('Vaccinations by country.xlsx') #Creating the worksheet worksheet = workbook.add_worksheet()
#Creating a list for the header in Excel header = ['Country','Country iso code','Date','Total vaccinations','People vaccinated','People fully vaccinated','Total boosters','Daily vaccinations raw','Daily vaccinations','Total vaccinations per hundred','People vaccinated per hundred','People fully vaccinated per hundred','Total boosters per hundred','Daily vaccinations per million','Daily people vaccinated','Daily people vaccinated per hundred']
#Iterating over the header to create the first row in Excel row_header = 0 for column_header, data_header in enumerate(header): worksheet.write(row_header, column_header, data_header) #Iterating over the array data to export to Excel column = 0
for row, data in enumerate(array_data_country): try: worksheet.write_row(row+1, column, data) except: pass workbook.close()
Congratulations! Now you became an expert in data manipulation with Pandas DataFrame and Numpy array!
Comments
Post a Comment