What's up Internet! I am Manish from RebellionRider.com. Welcome back once again. In my previous video we did the data analysis of worldwide COVID-19 data, but today we will get a little more specific and analyze country-specific coronavirus data. Here is what you are going to learn in this tutorial:
• How to collect data using web scraping
• How to fix the bad header of a data frame by giving it proper column names
• How to deal with extra rows and columns
• How to fix a value using regular expressions (RegEx)
• How to change the data type of an entire column
• How to sort the data of a data frame by a column's values
• And, in the end, how to save this cleaned data in Microsoft Excel and CSV files
Make sure to watch this video till the end, because projects like this could help you build your resume. The country whose data we will be using today for the demonstration is none other than my own, India, but you can follow the same steps for any other country, as the process is identical. Now, that being said, let's start the tutorial! But before that, subscribe to my channel if you haven't done so, and press the bell icon to get notified. And don't forget to share and like this video.

Project Plans
1. Collect the Data
The technique that we will be using for collecting data is web scraping. Data collection is always the first step of any data processing project. In today's project, data collection is not going to be a problem, mainly because on my previous video Mr.
Tejas Mane commented saying he was doing this same project but running into errors. He also mentioned the website that he was using. That website is Wikipedia, and we will use it as well. The data from this website is not the cleanest, but it looks consistent and it's free for all of us to use, so it fits the bill.
2. Data Cleaning
This step depends directly on the first step, data collection, so we will see how much we need to clean the data once we start coding.
3. Data Storage
In the end we will save this data into either a Microsoft Excel file or a CSV file. My aim behind these tutorials is to show you how easy it is to process data and make it ready for any type of data analysis. Now that I have shown you the road map of this project, let's jump onto the computer and start writing the Python code. And yes, I will leave the link to my GitHub page in the description of this video; from there you can download all the resources used in this tutorial, including the Python script that I will be creating.

Project Setup
First, let's create a folder which will serve as our project directory. I will create this folder on my desktop and name it 'RebellionRider'. Done. Now let's open up the editor. My choice of editor for this project is again Spyder. Now let's create a script and save it into the project directory we just created. Let's name this script main.py. Let's also set this directory as our console directory: click here and select 'Set Console Working Directory'. Done! Now let's write the code.

Data Collection
Data collection using web scraping!
Since we are taking data from a website, the technique we will use for collecting the data is web scraping. Which Python library we need for the scraping depends entirely on the structure of the data. Therefore, we will first go to the website, find the COVID-19 specific data, and, more importantly, look at the structure of that data. Then we can decide which Python library best suits our project. So here we are. Let's search for the data that we need. Here is our desired data. The data is stored in a table format. The Python library best suited for scraping such data is Python pandas, so we will use the pandas library in combination with the requests library. Let's do that. First, let's clean this script by deleting whatever we have in it. Great!

import requests
import pandas as pd

Here I just imported the two libraries we are going to use. The first is requests; its only purpose is to fetch the data from the web page. The second is pandas, which I imported under the alias pd. We are going to use it to do all the heavy lifting. Let's do that.

# Step 1: Data Collection
url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic_in_India'
req = requests.get(url)

Before I explain this code, let me quickly execute it. Done! Here I created a variable url, into which I saved the URL of the web page that we are using. The next variable I created is req, the request variable. It holds the response returned by the get() function of the requests library. In simple words, this variable now holds the entire content of the URL, including the HTML, CSS, and all the metadata. If you scroll down a little and double-click on the attribute 'text', you can see the data of the website. Let's close this window. Now we have to extract the coronavirus data that we need from this attribute, and for that we need its path value, which you can get from here. req.text is the one that we need. Now let's write the code and get the desired data.

data_list = pd.read_html(req.text)

Here I called the read_html() function of the pandas library, and as an argument I passed the text attribute of the response, the same value that I mentioned earlier. On execution it will fetch all the tables from the page and save them into the variable data_list. Let's execute this code and see where we have reached. Execution successful. Now let's come to the variable explorer. Here you can see that the datatype of our variable data_list is list. Let's double-click on this variable and see what it is holding right now.
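Since read_html() returns one DataFrame per HTML table on the page, and the position of the table we want can change whenever Wikipedia edits the page, here is a minimal, hedged sketch of picking the relevant frame by its header text instead of by position. The find_covid_table helper and the toy frames are my own illustration, not part of the original script:

```python
import pandas as pd

def find_covid_table(frames, keyword="Deaths"):
    """Return the first DataFrame whose column labels mention `keyword`.

    Searching by content instead of hard-coding a list index keeps the
    script working even if the page's tables are reordered.
    """
    for df in frames:
        # Column labels may be tuples (MultiIndex) or plain strings,
        # so flatten everything into one searchable string.
        header_text = " ".join(str(col) for col in df.columns)
        if keyword.lower() in header_text.lower():
            return df
    return None

# Toy stand-in for data_list: two scraped tables, only one is relevant.
frames = [
    pd.DataFrame({"Year": [2019], "Population": [1366000000]}),
    pd.DataFrame({"State": ["Kerala"], "Total Deaths": [25]}),
]
target = find_covid_table(frames)
```

Alternatively, pandas' read_html() accepts a match= argument that keeps only the tables containing the given text, which achieves something similar at scrape time.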
So right now this variable is holding a list of data frames. Each of these data frames contains the data that was stored in one table on the web page that we are using. From this list we have to find the one data frame that has the coronavirus data. I have already done that search for you: the data frame at index number 6 has our data. Let me show you. Here it is. Now let's get this data frame out of the list.

target_df = data_list[6]

On execution of this statement, the data frame at index number 6 will be fetched and saved into a new variable, target_df. Let's execute this code and see whether we have our targeted data separated from the list. Execution done; the variable target_df is created. Let's open this variable and see what it contains. Finally, we have successfully fetched our desired data. That is how we do web scraping in just six lines of code using Python pandas. Now let's move on to the next step of data processing, which is data cleaning.

Data Cleaning
Data cleaning is the most important step in data processing. Here we make the data ready for further processing, so take your time and clean your data as thoroughly as possible. Now let's take a look at our data frame and see what refinements we need to make. First, we need to fix the header of our data frame and give these columns some logical names. Second, this data frame has 9 columns in total, but we only need the data from the first four, so we will get rid of all the extra ones. Third, we will remove some extra rows of data. Let me show you what I am talking about: scroll down to the end of the data frame, and here, as you can see, we have two rows that we need to delete. Let's first fix these three issues and then look for others. Let's close this and write some code.

# Data Cleaning
# 1. Column Names
target_df.columns = ['Country Name', 'Total Cases', 'Total Deaths', 'Total Recoveries', 'Col5', 'Col6',
'Col7', 'Col8', 'Col9']

Just a second. Great, now it's looking much better. Here I have assigned logical names only to the columns that we need, and generic names to the rest, because we will be getting rid of them anyway. Let's execute and see the result. Execution done, and as you can see, we have fixed the header: our columns now have better names. On your screen is the before and after of the data frame; it's already looking much better. Now let's fix the next issue, the extra columns. The easiest way to fix this is to extract only the columns that we need. Let's do that.

# 2. Extra Columns
target_df = target_df[['Country Name', 'Total Cases', 'Total Deaths', 'Total Recoveries']]

In this statement I am extracting the four columns whose data we need and saving the result back into the variable target_df. On execution, target_df will hold only our desired data. Let's execute the code and see. Execution done. Now let's look at the data frame. Great, now our data frame holds only the data that we need. Next let's fix the third issue, the extra rows at the bottom of the data frame.

# 3. Extra Rows
target_df = target_df.drop([36, 37])

Let's find the index numbers of those two rows: 36 and 37. You know what, we cannot simply delete these two rows like this. Say you write this statement to drop the two rows using their index numbers. It is a valid statement: on execution, the rows at index numbers 36 and 37 will be removed. There is no problem with it, unless Wikipedia decides to add some extra rows in between. In that case this statement will delete the rows at the index numbers we specified, regardless of whether those index numbers still correspond to the last two rows that we want to delete. So how can we deal with this problem?
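The header-renaming and column-extraction steps above can be sketched on a small toy frame. The toy data below is illustrative only; it just mimics a scraped table with a messy header and extra columns:

```python
import pandas as pd

# Toy frame standing in for the scraped table: unhelpful column labels.
raw = pd.DataFrame(
    [["Kerala", 101, 1, 52], ["Delhi", 95, 2, 40]],
    columns=["Unnamed: 0", "c1", "c2", "c3"],
)

# 1. Give every column a name; the list length must match the column count.
raw.columns = ["Country Name", "Total Cases", "Total Deaths", "Total Recoveries"]

# 2. Keep only the columns we need by selecting them with a list of labels.
clean = raw[["Country Name", "Total Cases", "Total Deaths"]]
```

Selecting with a double-bracketed list of labels returns a new DataFrame containing just those columns, which is why assigning it back effectively drops the rest.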
Let's assume that the last two rows will always contain useless data. In that case, we can fetch the index number of the last row dynamically instead of hard-coding it. Let me show you how. Let me first comment out this statement. Done.

last_idx = target_df.index[-1]

Here target_df is the variable holding our data frame, and index is its index attribute; after it I have specified -1 inside square brackets. On execution, this statement returns the index number of the last row of the data frame target_df and saves it into the variable last_idx. Let me execute this statement and show you. Execution done. Here you can see that the variable last_idx was created successfully, and the value stored inside it is 37, which is the index number of the last row. Now we can use this variable to drop the last two rows like this:

target_df = target_df.drop([last_idx, last_idx - 1])

This statement will always drop the last two rows of the data frame. You just future-proofed this script by replacing a static statement with a dynamic one. See how easy it is. Let's execute this Python program. Done. Now let's see our data frame. The last two rows are gone; the data frame has already started looking clean. Nevertheless, let's see what else we need to fix here. As we can see, most of the values stored in the last three columns ('Total Cases', 'Total Deaths', and 'Total Recoveries') look like integers, but they are actually of the string object data type.
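The dynamic row-dropping described above can be checked on toy data. This is my own sketch; the index labels 36 and 37 simply mimic the scraped frame:

```python
import pandas as pd

# Toy frame whose last two rows are junk, with non-contiguous index labels.
df = pd.DataFrame(
    {"State": ["Kerala", "Delhi", "Total", "Notes"],
     "Total Cases": [101, 95, 196, 0]},
    index=[0, 1, 36, 37],
)

# Option 1: drop by the last index labels, fetched dynamically.
last_idx = df.index[-1]
dropped = df.drop([last_idx, last_idx - 1])

# Option 2: positional slicing ignores index labels entirely.
sliced = df.iloc[:-2]
```

Option 2 (or equivalently df.drop(df.index[-2:])) is even safer than last_idx - 1, because it does not assume the last two index labels are consecutive integers.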

We can check that like this:

print(target_df.dtypes)

As you can see, none of the columns of the data frame has an integer data type, so we need to fix that. Let's come back to our data frame first. Great! We could easily convert the data type of the last column, 'Total Recoveries', from string object to integer, as every single value in this column complies with all the norms of being an integer. But the same is not true of the other two columns. For example, in column 'Total Deaths', row number 16, we have the value 97[c]: c is a character and the square brackets are special characters, which cannot be converted to an integer. Similarly, in column 'Total Cases' we have the value 52,870[b]: the comma and square brackets are special characters and b is an alphabetic character, which again cannot be converted to an integer. Therefore, before converting the data type of these columns, we first need to fix these issues, and to do that we are going to take the help of regular expressions. Let me show you how. Let's first fix the alphabetic characters and the square brackets in both these cells; after that we will deal with the comma.

target_df['Total Cases'] = target_df['Total Cases'].str.replace('\[.*\]', '')
target_df['Total Deaths'] = target_df['Total Deaths'].str.replace('\[.*\]', '')

Here I wrote two similar statements. Though they look alike, they target two different columns. The first statement removes the unwanted characters from all the values stored in column 'Total Cases', and the second does the same for all the values stored in column 'Total Deaths'. Now let's come to the function that I used in these statements, replace. Here I used the str.replace() function of the pandas library. This function takes two parameters: the first is the old value that you want to replace, and the second is the new value with which you want to replace the old one. In our case we want to replace every occurrence of characters enclosed within a
square bracket with an empty string. To accomplish this, I have taken the help of a regular expression. The regex that best suits the string we want to replace is '\[.*\]'. This regular expression will match any string that has anything written inside square brackets, regardless of its position. If it finds any such occurrence, it will replace it with the second argument of the function call, which is an empty string: just a pair of single quotes with nothing between them, not even a space. Now let's get back to the editor and execute this code. Here we are. Let's run this program once again. Done. Let's check the data frame and see if it has done its work. The values are looking good: no more unwanted square brackets or anything. Now let's get rid of this comma. To do that, let's copy the replace call that targeted the column 'Total Cases' and modify it.

target_df['Total Cases'] = target_df['Total Cases'].str.replace('\[.*\]', '')

The only thing we need to change here is the first parameter of the function call. This time we are targeting the unwanted comma, so the first parameter is going to be a comma; the second parameter stays the same, an empty string.

target_df['Total Cases'] = target_df['Total Cases'].str.replace(',', '')

Done! Let's execute the code and see the result. Execution done! Let's quickly check the data frame. The comma is also gone. Now these columns are all set to have their data type converted, so let's do that. As we want to change the data type of these columns from string objects to integers, we will use the to_numeric() function of the Python pandas library, like this. Let's close this data frame first.

target_df['Total Cases'] = pd.to_numeric(target_df['Total Cases'])

On execution, this statement will change the data type of the column 'Total Cases' from object to integer. Before I execute the code, let me write the statements for converting the data types of the remaining two columns, 'Total Deaths' and 'Total Recoveries'.

target_df['Total Deaths'] = pd.to_numeric(target_df['Total Deaths'])
target_df['Total Recoveries'] = pd.to_numeric(target_df['Total Recoveries'])

Done! Now let's execute the code. Execution is successful. Let's check the data set. Wow, what just happened here? The data type of the first column is still the same, and we got an extra new column added to our data frame. I must have messed up somewhere. Let's check. Ok, silly me. Python is a case-sensitive language, and I misspelled the column name: the name of the column is 'Total Cases' with a capital C, but I spelled 'Total cases' with a small c. Let's change it. Great, now let's try to execute this code once again. Ok, execution done. Now let's check the data frame. Already looking pretty. Here is the before and after of this data set; hope you can see the difference. I just noticed that I mistakenly named the first column 'Country Name', though these are actually the names of states, so this column should be named 'States', not 'Country Name'. Let's change it. Done! Let's execute this code. Now let's check the data frame. Great, now it's correct. Everything is good with this data set except that it is sorted by the state names. Let's sort this data by total cases so that we can see which state is most affected by the coronavirus. Close this window. Great!
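The regex cleanup and the to_numeric() conversion above can be checked together on a few toy values. One caveat worth knowing: since pandas 1.4, Series.str.replace() treats its pattern as a literal string by default, so on newer pandas versions the bracket pattern needs an explicit regex=True:

```python
import pandas as pd

# Toy values mimicking the scraped cells, footnote markers and all.
cases = pd.Series(["52,870[b]", "1,039", "97[c]"])

# Strip footnote markers like "[b]"; pass regex=True explicitly so the
# pattern is interpreted as a regular expression on pandas >= 1.4.
cases = cases.str.replace(r"\[.*\]", "", regex=True)

# Strip thousands separators; a plain comma needs no regex at all.
cases = cases.str.replace(",", "", regex=False)

# Convert the cleaned strings to integers. Adding errors="coerce" would
# turn any value the regexes missed into NaN instead of raising an error.
cases = pd.to_numeric(cases)
```

This mirrors the cleanup done on 'Total Cases' above, just on standalone data so each step is easy to inspect.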
We can sort the data by any column using the sort_values() function of the Python pandas library, like this:

# 7. Sort The Data
target_df = target_df.sort_values(by='Total Cases', ascending=False)

On execution, this function call will sort the data in descending order using the values stored in the column 'Total Cases'. Let's see. Execution successful. Let's check the data again. As you can see, we have successfully sorted the data in descending order. Comment and let me know where your state is in this list. Now let's get back to the editor and do one last thing, exporting the data, and then we will wrap up this tutorial. Done!

Export Data
Now that we have cleaned the data, we need to save it so that we can process it further. The two most widely used file formats for saving data are CSV and XLSX, the Microsoft Excel file format. You use the xlsx format for exporting the data when your dataset is relatively small, like ours. When you have a huge dataset, you save it as CSV or JSON, or you save it directly into a database. Now let me show you how to export this data in both CSV and XLSX format. But first, again, close this window. Great.

target_df.to_excel('covid_wiki_india.xlsx')
target_df.to_csv('covid_wiki_india.csv')

Here I have written two statements; you can use either of them. The first statement will export the dataset into Microsoft Excel format, and the second will export our data into CSV format. On execution, these statements will create the file in the base directory where you have saved this Python script. For example, in my case I have saved this Python script into a folder named 'RebellionRider' on my desktop. Here it is. If I execute this code right now, files with the names covid_wiki_india.xlsx and covid_wiki_india.csv will be created and saved into that RebellionRider folder. Let's say I want to export my file in Excel format.
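The sort and export steps above can be verified on toy data without touching the file system, because to_csv() called with no path returns the CSV text as a string. The toy frame is my own illustration:

```python
import pandas as pd

df = pd.DataFrame({"State": ["Kerala", "Delhi", "Goa"],
                   "Total Cases": [101, 95, 250]})

# Sort so the most affected state comes first.
df = df.sort_values(by="Total Cases", ascending=False)

# With no path argument, to_csv returns the CSV content as a string,
# which is handy for inspecting the output before writing a real file.
csv_text = df.to_csv(index=False)

# df.to_excel("covid_wiki_india.xlsx")  # requires the openpyxl package
```

Note that to_excel() needs an Excel writer engine such as openpyxl installed, whereas to_csv() works with pandas alone; index=False leaves the row labels out of the exported file.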

Therefore, I will comment out the last statement, which saves the file in CSV format. Done! Now I will execute. Hmm, seems like my internet is not working. Let's try to execute this code once again. Ok, my internet stopped working again; it's kind of a usual problem here. I will try to run this code again when I have internet. Two hours later: ok, I should have internet now. Let's try to run this code. Voila, execution done. Now let's go to our project directory and see if we have our desired file. Here we are. As you can see, we have our Excel file, which means we have successfully exported the data. That's how we process data for analysis or analytics. In case you don't know the difference between analysis and analytics, I have done a video on it; the link is on your screen as well as in the description. Hope you liked this video. If so, please hit the thumbs-up button, and share it with your friends. Most importantly, subscribe to the channel if you haven't already. Thanks for watching. This is Manish from RebellionRider.com.