Web Scraping in Python – Need, Introduction, Use cases, Tools


Today, "data" may mean little to an individual user, but it is enormously valuable to companies and organizations. The amount of data generated by users across different platforms grows significantly every day. According to one survey, approximately 2.5 quintillion bytes of data are generated daily, and it is estimated that each user in the world generates around 1.7 MB of data every second.

As data scientists, it is our job to use this data to make the user experience better. While performing data science tasks, it is very common to use data from the internet. We can access this data in the form of CSV (Comma-Separated Values) files, or we can use a platform's API (Application Programming Interface).

But sometimes APIs do not allow us to access certain private pieces of information, such as messages or product details. In those cases, we can use a handy technique called "web scraping" on a particular site or page.

In this article, we'll see what web scraping is, why we use it, and which tools we can use for it. I'll also mention a platform where you can start your journey toward data science.

What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database.

We can extract the data and store it in formats such as XML (.xml) or CSV (.csv), or directly in a database.

The need for Web scraping

As we saw earlier, one reason to use this technique is the limited access that APIs provide. The second is that if we want information from a website, we cannot simply copy and paste the data displayed on each page: it is a very tedious job that can take many hours, or sometimes days, to complete.

Use Cases

  • E-commerce Portals
  • Market Research/Analysis
  • Alternative Data for Finance
  • Business Automation
  • Price Intelligence
  • Travel Websites
  • Social Websites

Packages used for Web scraping

Most developers use the Python programming language for this because it offers a variety of packages for web scraping. Another reason is that string manipulation in Python is very easy, so we can work with the extracted data conveniently.

  1. Pattern

  2. Scrapy

  3. Mechanize

  4. Beautiful Soup

  5. Requests

The last two packages are the most popular.

So, let's start scraping data from Amazon.

Step 1:  Here, I'm using the BeautifulSoup package for parsing, urllib for fetching the page, and this page for scraping data.

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = "https://www.amazon.in/s?k=iphones&ref=nb_sb_noss_2"

Step 2:  Download and parse the page.

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

The uReq call opens a connection to the URL and grabs the webpage; the .read() method returns the raw HTML that was downloaded, and uClient.close() releases the connection. After that, we parse the HTML with BeautifulSoup.

Step 3:  Now, we want the container that holds all the details we need. To find it, follow these steps:

  • Go to the link, or whichever page you want to scrape.
  • Find the HTML division (div) tag that contains all your required data.
  • Copy the class name from that div tag.
  • Put this class name in the findAll function. Sometimes there is an 'id' instead of a 'class', so use whichever is present.
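As a small sketch of what this looks like, here is the findAll call run against a tiny inline sample instead of the live page. The class name "s-result-item" is an assumption for illustration; copy the actual class name from your browser's inspector, since Amazon changes its markup frequently.

```python
from bs4 import BeautifulSoup

# Minimal sample standing in for the Amazon results HTML.
# NOTE: "s-result-item" is an assumed class name -- use the real one
# you copied from the page you are scraping.
sample_html = """
<div class="s-result-item"><span>iPhone 11</span></div>
<div class="s-result-item"><span>iPhone XR</span></div>
"""
page_soup = BeautifulSoup(sample_html, "html.parser")

# findAll returns a list of matching div tags -- one per product.
containers = page_soup.findAll("div", {"class": "s-result-item"})
print(len(containers))  # -> 2 products in this sample
```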


So, if you check the length of the list that findAll returns, you get a number indicating how many products are on that URL's page. In our case, details for 16 iPhones are available on the first page.

Step 4:  Now print the HTML of one container so you know where to take the values from.
print(containers[0].prettify())


Do the same for the price and rating as you did for the product name in the container.
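A hedged sketch of extracting those values from a single container, again using a simplified inline sample. The class names a-text-normal, a-price-whole, and a-icon-alt are assumptions based on Amazon's markup at the time of writing; verify them in your browser's inspector before relying on them.

```python
from bs4 import BeautifulSoup

# One simplified product container; all class names below are
# assumptions -- check the real markup in the inspector.
container_html = """
<div class="s-result-item">
  <span class="a-text-normal">Apple iPhone 11 (64GB)</span>
  <span class="a-price-whole">46,999</span>
  <span class="a-icon-alt">4.6 out of 5 stars</span>
</div>
"""
container = BeautifulSoup(container_html, "html.parser")

# Pull the text out of each span and strip surrounding whitespace.
name = container.find("span", {"class": "a-text-normal"}).text.strip()
price = container.find("span", {"class": "a-price-whole"}).text.strip()
rating = container.find("span", {"class": "a-icon-alt"}).text.strip()
print(name, "|", price, "|", rating)
```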


Great, now you have all the values you want. But that is only for one container. Do the same for all the items listed on the page, then create a .csv file and store all the values in it.
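Putting it together, the loop-and-save step can be sketched as below with Python's built-in csv module. The sample HTML and class names are again assumptions standing in for the live page; in the real script, page_soup would come from the parsed Amazon page in Step 2.

```python
import csv
from bs4 import BeautifulSoup

# Tiny sample page; swap in the real class names you copied
# from the inspector when scraping the live site.
sample_html = """
<div class="s-result-item">
  <span class="a-text-normal">iPhone 11</span>
  <span class="a-price-whole">46,999</span>
</div>
<div class="s-result-item">
  <span class="a-text-normal">iPhone XR</span>
  <span class="a-price-whole">40,999</span>
</div>
"""
page_soup = BeautifulSoup(sample_html, "html.parser")
containers = page_soup.findAll("div", {"class": "s-result-item"})

# Write one CSV row per product container.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])  # header row
    for container in containers:
        name = container.find("span", {"class": "a-text-normal"}).text.strip()
        price = container.find("span", {"class": "a-price-whole"}).text.strip()
        writer.writerow([name, price])
```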


Get the full code at my GitHub account.

If you're thinking of making your career as a data scientist or machine learning developer, then this is the best place to start learning how to build a dataset using web scraping. Machine learning is about 90% data pre-processing and 5% data modeling. So, it is very important that you start in an appropriate way.

I would like to recommend Coding Blocks, a great institute for anyone who really wants to learn something. They provide excellent data science courses for beginner, intermediate, and advanced-level programmers.


In addition, they have well-trained mentors who personally take care of your progress. You'll also get a verified certificate, goodies, and gifts from the Coding Blocks community. The main benefit of joining CB is that their placement cell helps students get into top companies around the world. So, I think this is the best place for you.
You can also get an additional discount by using the code CBCA760 on the payment page.

Bharat Vora
(Campus Ambassador of CB)

If you really like this💯, then follow🌈 me by clicking the Follow💥 button next to the comment section.🤩🥰
Stay connected with me 😃
Thank you 💙😇
