Python: Numpy array data manipulation

 In the previous tutorial (please go to Python: Pandas DataFrame data manipulation), one method to manipulate a huge amount of data was explained. Now, in this tutorial, the second method of data manipulation will be explained. In this method, the library numpy is used which sorts data into arrays

But...what is an array? An array is a data structure used to store data. To understand it better, let's remember what a matrix is. A matrix is a rectangular (sometimes square) array arranged in rows and columns. An example of a rectangular matrix can be the following:


1
234
5678


As seen above, the matrix has 2 rows and 4 columns. In other words, it has dimensions 2x4. In case the number of rows and columns are the same, the matrix becomes square. In programming, if the matrix has only one row, it is commonly known as a row-vector, and if it has only one column, it is a column-vector.

You can get one specific element of the matrix by knowing its position. For example, the element in position 2, 4 is the number 8. How to understand this? Well, it is very simple! The first number 2 means the row (second row), and the number 4 is the column (fourth column) where the element is located. Said this, when you work with an array in Python, think about matrices! Let's start coding!

For this exercise, we will work with the data we downloaded from the web (please go to Python: Downloading data from the web). The first thing we must do is to import the text file into Python. We will use the library pandas for this.


#Importing library
import pandas as pd

#Importing the .txt file
df = pd.read_csv('vaccinations.txt', header = None, skiprows = (1), sep = ',', quotechar = None, quoting = 3)
print(df)



Since our objective is to work with the numpy library, we will convert the DataFrame to an array.


 #Converting the DataFrame into array
array_data = np.array(df)
print(array_data)  




If you pay attention to the figure above, you will notice that every line is inside the square brackets ('[ ]') and each element is separated by the delimiter comma (',')). This means that everything which is inside the square brackets represents a row, and each element inside represents a column. 

Python starts counting the number of rows and columns from 0. This means, for example, the element in that array with position 2, 1 will be 'AFG', or the element with position 0, 2  will be '2021-02-22'.

Now, let's start manipulating data. In a similar way to DataFrame, we can get the first 5 rows (with all columns) and the first 5 columns (with all rows).


#Selecting the first 5 rows (from 0 to 4)
first_rows = array_data[:5,:]
print(first_rows)

#Selecting the first 4 columns (from 0 to 3)
first_columns = array_data[:,:4]
print(first_columns)







But...how to understand the code above? Well, it is very simple! For example, in the first case, [:5,:] the comma ',' delimiters the rows and columns. In order to select all rows or columns, the colon ':' is used. In the case of the rows, :5 means that all first 5 rows will be taken into account. Since Python starts counting from 0, the first rows will be 0, 1, 2, 3, and 4. Please notice that row 5 is not included!

Now, let's copy the array into another array in order to keep the original one.


#Copying the array to another array
array_copy = array_data.copy()
print(array_copy) 


Let's now learn how to delete specific rows and columns. For this purpose, the function numpy.delete() is needed.



#Removing one or multiple rows
delete_rows = np.delete(array_data, [100, 525, 9461], axis=0)
print(delete_rows)

#Removing one or multiple columns
delete_columns = np.delete(array_data, [2,7,11], axis=1)
print(delete_columns)


Please note that in the code above, axis=0 stands for deleting rows, while axis=1 stands for columns. After running the code, we will get the following:






Since it is not visible that we just deleted specific rows and columns, let's prove it! In order to prove it, we can ask Python to give us the dimensions of the new arrays ('delete_rows' and 'delete_columns') and the array 'array data'. For this, we will use the function numpy.size().


#Comparing arrays

#Getting the number of columns
rows1 = np. size(array_data, 1)
print(rows1)
rows2 = np. size(delete_columns, 1)
print(rows2)

#Getting the number of rows
columns1 = np. size(array_data, 0)
print(columns1)
columns2 = np. size(delete_rows, 0)
print(columns2)





In the picture above, variables 'rows1' and 'columns1' are the dimensions of the original array 'array data', while the variables 'rows1' and 'columns1' are the dimensions of the same array after deleting rows [100, 525, 9461], and columns  [2,7,11]. Notice that the number of rows and columns has decreased as it should since we have deleted 3 rows and 3 columns.

Now, let's get specific values according to a certain condition. For example, we will get the data only for one specific country: Peru. We can use the function input in order for the user to enter his/her desired country.


#Getting data per country (Selecting all rows where the first column is equal to 'country')
country = input('Please type your country of interest:')
array_data_country = array_data[array_data[:,0] == country]
print(array_data_country)




Like in the previous tutorial about the pandas library, we will export the user-input array into Excel. The easiest way to achieve this is by converting the array into DataFrame (using the pandas.DataFrame() function), and then exporting it to Excel (using the df.to_excel() function) as shown in the previous tutorial.

However, there is another way to export a numpy array, which will be shown in this tutorial. The first thing we must do is to import the library xlswriter. Then, we should 'open' an Excel workbook and sheet using Python coding. For this, there are special functions in Python as shown:


#Exporting array to Excel
import xlsxwriter

#Creating the Excel file

#Creating the workbook
workbook = xlsxwriter.Workbook('Vaccinations by country.xlsx')
#Creating the worksheet
worksheet = workbook.add_worksheet()



After creating the Excel file, before starting exporting the array, we should first set a header for it. For this, we create a list in Python with the different column names for the header, and then we must iterate through each element of that list and write it in the first row in Excel. 


#Iterating over the header to create the first row in Excel
row_header = 0
for column_header, data_header in enumerate(header):
    worksheet.write(row_header, column_header, data_header)


Please do not forget that Python starts counting from 0! That is why we set the row equal to zero. The function enumerate(header) tells Python to iterate through each element of the list header. Then, the function worksheet.write() will write the iterated data in the first row in Excel. Now, it is time to do the same, but for the data itself! We will apply the same logic as for the header.


#Iterating over the array data to export to Excel
column = 0

for row, data in enumerate(array_data_country):
    try:
        worksheet.write_row(row+1, column, data)
    except:
        pass


The final step to do is to close the Excel file using the function workbook.close().


workbook.close()


After running this code, you will find the created Excel file in the same directory as your Python file.





The complete code will look like this:



#Importing the library
import numpy as np
import pandas as pd
import xlsxwriter

#Importing the .txt file
df = pd.read_csv('vaccinations.txt', header = None, skiprows = (1), sep = ',', quotechar = None, quoting = 3)
print(df)

#Converting the DataFrame into array
array_data = np.array(df)
print(array_data)                 

#Selecting the first 5 rows (from 0 to 4)
first_rows = array_data[:5,:]
print(first_rows)

#Selecting the first 4 columns (from 0 to 3)
first_columns = array_data[:,:4]
print(first_columns)

#Copying the array to another array
array_copy = array_data.copy()
print(array_copy)

#Removing one or multiple columns
delete_columns = np.delete(array_data, [2,7,11], axis=1)
print(delete_columns)

#Removing one or multiple rows
delete_rows = np.delete(array_data, [100, 525, 9461], axis=0)
print(delete_rows)

#Comparing arrays

#Getting the number of columns
rows1 = np. size(array_data, 1)
print(rows1)
rows2 = np. size(delete_columns, 1)
print(rows2)

#Getting the number of rows
columns1 = np. size(array_data, 0)
print(columns1)
columns2 = np. size(delete_rows, 0)
print(columns2)

#Getting data per country (Selecting all rows where the first column is equal to 'country')
country = input('Please type your country of interest:')
array_data_country = array_data[array_data[:,0] == country]
print(array_data_country)

#Exporting array to Excel

#Creating the Excel file

#Creating the workbook
workbook = xlsxwriter.Workbook('Vaccinations by country.xlsx')
#Creating the worksheet
worksheet = workbook.add_worksheet()

#Creating a list for the header in Excel
header = ['Country','Country iso code','Date','Total vaccinations','People vaccinated','People fully vaccinated','Total boosters','Daily vaccinations raw','Daily vaccinations','Total vaccinations per hundred','People vaccinated per hundred','People fully vaccinated per hundred','Total boosters per hundred','Daily vaccinations per million','Daily people vaccinated','Daily people vaccinated per hundred']

#Iterating over the header to create the first row in Excel
row_header = 0
for column_header, data_header in enumerate(header):
    worksheet.write(row_header, column_header, data_header)
    
#Iterating over the array data to export to Excel
column = 0

for row, data in enumerate(array_data_country):
    try:
        worksheet.write_row(row+1, column, data)
    except:
        pass
    
workbook.close()


Congratulations! Now you became an expert in data manipulation with Pandas DataFrame and Numpy array!

Comments

Popular posts from this blog

Python: Tracking any phone number

Python: Pandas DataFrame data manipulation

Python Machine Learning: Linear Regression (I)