Today, data is enormously valuable to companies and organizations, yet most users give it away without a second thought. The amount of data generated by users across different platforms is growing rapidly. According to one survey, approximately 2.5 quintillion bytes of data are generated every day, and it is estimated that around 1.7 MB of data is created every second for every person in the world.
As data scientists, it's our job to use this data to make the user experience better. While working as a data scientist, it is very common to use data from the internet. We can access this data in the form of CSV (Comma Separated Values) files, or we can use a platform's API (Application Programming Interface).
But sometimes APIs do not allow us to access private pieces of information such as messages, product details, etc. In those cases, we can use a handy technique called "Web Scraping" on a particular site or page.
In this article, we'll see what web scraping is, why we use it, and which tools we can use for it. I'll also mention a platform where you can start your journey toward data science.
What is Web Scraping?
Web scraping is a technique used to extract large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database.
We can extract the data and store it in formats such as XML (.xml) or CSV (.csv), or load it into a database.
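As a minimal sketch of that last step, here is how scraped records (hypothetical product names and prices, invented for illustration) might be written to a local CSV file using Python's standard library:

```python
import csv

# Hypothetical records that a scraper might have collected.
products = [
    {"name": "iPhone 12", "price": "54,999"},
    {"name": "iPhone 13", "price": "64,999"},
]

# Write the records to a local CSV file, one row per product.
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(products)
```

The same dictionaries could just as easily be inserted into a database table instead of a CSV file.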
The need for Web scraping
As we saw earlier, one reason to use this technique is the limited access that APIs provide. The second reason is that if we want information from a website, we cannot simply copy and paste the data displayed on its pages; that is a very tedious job that may take many hours or sometimes days to complete. Common use cases include:
- E-commerce Portals
- Market Research/Analysis
- Alternative Data for Finance
- Business Automation
- Price Intelligence
- Travel Websites
- Social Websites
Packages used for Web scraping
Mostly, developers use the Python programming language for this because it offers various packages for web scraping. Another reason is that string manipulation in Python is very easy, so we can easily work with the data.
- Pattern
- Scrapy
- Mechanize
- Beautiful Soup
- Requests
The last two packages are the most popular.
So, let's start scraping data from Amazon.
Step 1: Here, I'm using the BeautifulSoup package (with urllib to fetch the page) and this page for scraping data.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
my_url = "https://www.amazon.in/s?k=iphones&ref=nb_sb_noss_2"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
uReq opens a connection to the URL and grabs the webpage. The .read() function dumps the gathered data into page_html, and uClient.close() closes the connection. After that, we parse the HTML webpage with BeautifulSoup.
containers = page_soup.find_all("div", {"class": "s-result-item"})  # the class name depends on Amazon's current markup
print(containers[0].prettify())
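To go one step further, the containers can be looped over to pull out individual fields. The snippet below is a sketch that parses a small inline HTML sample instead of the live page, since the real class names on Amazon's results page change frequently; the names product-name and product-price here are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A small inline HTML sample standing in for a real results page;
# the class names below are assumptions for illustration only.
sample_html = """
<div class="s-result-item">
  <span class="product-name">iPhone 12</span>
  <span class="product-price">54,999</span>
</div>
<div class="s-result-item">
  <span class="product-name">iPhone 13</span>
  <span class="product-price">64,999</span>
</div>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")
containers = page_soup.find_all("div", {"class": "s-result-item"})

# Extract the text of each field from every container.
products = []
for container in containers:
    name = container.find("span", {"class": "product-name"}).get_text(strip=True)
    price = container.find("span", {"class": "product-price"}).get_text(strip=True)
    products.append({"name": name, "price": price})

print(products)
```

For the live page, you would pass page_html from the earlier step to BeautifulSoup and substitute the class names you find by inspecting the page in your browser's developer tools.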
I would like to recommend Coding Blocks, a great institute for anyone who really wants to learn something. They provide data science courses for beginner, intermediate, and advanced level programmers.
You can also get an additional discount by using the code CBCA760 on the payment page.