In this post, we’ll build a web scraping project in Python, with full code. We’re going to scrape Covid-19 statistics and convert the scraped HTML table to a Python dict / JSON using BeautifulSoup, list comprehensions, the Python requests library, and lxml.

 




The code below follows along with the video:

Transcript / Cheat Sheet:
 
Step 1: Search for the Required Link:
 
Finding the right link plays an important role in any web scraping project, since the whole code depends on the link being scraped. If you decide to change the link at a later stage, a lot of code will need to be reworked.
Best Practices:
  • Verify the authenticity of the link to be scraped.
  • The data you want to scrape shouldn’t be encrypted or obfuscated in Inspect Element / Page Source.
  • The site shouldn’t block you for scraping (the major issue): if it does, you’ll need a workaround, such as modifying your requests with a User-Agent header or adding a proxy.
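The User-Agent and proxy workaround mentioned above can be sketched as follows. The User-Agent string and proxy address here are placeholders for illustration; swap in your own values. The request is built but not sent, so you can inspect exactly what headers the server would receive:

```python
import requests

# Hypothetical values for illustration -- replace with your own.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
proxies = {
    "http": "http://203.0.113.10:8080",   # placeholder proxy address
    "https": "http://203.0.113.10:8080",
}

# Build the request without sending it, to show what would go over the wire.
# In real use you would call:
#   requests.get(url, headers=headers, proxies=proxies, timeout=10)
prepared = requests.Request(
    "GET", "https://example.com", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```

Many sites only check for a browser-like User-Agent, so try that first and add the proxy only if you are still blocked.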
Step 2: Analyze the Data :
 

Once we have the required link, the next step is to analyze the classes, tags, and ids where our data resides. In this post we are scraping data from an HTML table, so we are interested in the id attached to that table. Later we’ll use this table id to filter the table out of the whole page source and convert it to JSON / a Python dict using a list comprehension and BeautifulSoup.
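One quick way to confirm the table’s id is to parse the page source and list the id of every table element. A minimal sketch, using a tiny inline HTML snippet as a stand-in for the real page source (on the live site you’d use requests.get(url).text instead):

```python
import bs4

# Stand-in for the real page source.
html = """
<html><body>
  <table id="main_table_countries_today"><tr><th>Country</th></tr></table>
  <table><tr><th>no id here</th></tr></table>
</body></html>
"""

# "html.parser" is built in; use 'lxml' if installed, as in the main code.
soup = bs4.BeautifulSoup(html, "html.parser")

# Collect the id (if any) of every table on the page.
table_ids = [t.get("id") for t in soup.find_all("table")]
print(table_ids)  # -> ['main_table_countries_today', None]
```

If several tables share similar markup, this tells you at a glance which id to pass to soup.find.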

 
Step 3: Install the required Libraries:
 
  • pip install requests
  • pip install bs4
  • pip install lxml
Step 4: Get Your Hands Dirty with Coding:
import requests
import bs4

# Fetch the page containing the Covid-19 statistics table
page = requests.get("https://covid-19.hackanons.com/test.html")

soup = bs4.BeautifulSoup(page.text, 'lxml')

# Filter out the table we identified in Step 2 by its id
table = soup.find('table', id="main_table_countries_today")

# Column names come from the <th> cells; strip the ",Other" suffix
# so "Country,Other" becomes "Country"
headers = [heading.text.replace(",Other", "") for heading in table.find_all('th')]

# Every <tr> is one row of the table
table_rows = table.find_all('tr')

# Pair each row's <td> cells with the column names to build one dict per row
results = [{headers[index]: cell.text for index, cell in enumerate(row.find_all("td"))}
           for row in table_rows]

# Print the row for India (the header row produces an empty dict,
# so use .get to check for the key safely)
for row in results:
    if row.get("Country") == "India":
        print(row)
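Since results is already a list of dicts, converting it to JSON is a single call to the standard library. A short sketch, with a hypothetical row standing in for the scraped data:

```python
import json

# Hypothetical row, shaped like the dicts the scraper builds.
results = [{"Country": "India", "TotalCases": "44,000,000"}]

# json.dumps serializes the list of dicts; indent=2 pretty-prints it.
print(json.dumps(results, indent=2))

# Writing to a file works the same way with json.dump:
with open("covid_stats.json", "w") as f:
    json.dump(results, f, indent=2)
```

Note that the cell values are still strings (with thousands separators), so convert them to numbers before doing any arithmetic on them.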
