Simple Web Scraping Using Python 3 (Part 1)
Requirements
- Python 3.6 or later (Anaconda recommended)
- Pip
- Beautiful Soup
- Scrapy (for Part 2)
- git
- Basic knowledge of Python
- Basic knowledge of HTML
Installation
In your terminal, type:
easy_install pip
pip install beautifulsoup4
or
python -m pip install beautifulsoup4
Version Check
- For Mac users, Python may already be installed on macOS. Open up Terminal and type
python --version
You should see a Python version of 3.6 or later. If it reports 2.x, try python3 --version instead.
- For Windows users, please install Python through the official website.
Basic HTML
1. <!DOCTYPE html>: HTML documents must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The metadata and script declarations of the HTML document are between <head> and </head>.
4. The visible part of the HTML document is between the <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.
Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
Also, HTML tags sometimes come with id or class attributes. The id attribute specifies a unique id for an HTML tag, and its value must be unique within the HTML document. The class attribute groups tags so that the same styles can be applied to every tag with that class. We can use these ids and classes to locate the data we want.
For more information on HTML tags, id and class, please refer to the W3Schools tutorials.
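To make the difference between id and class concrete, here is a minimal sketch that looks up tags by each attribute with Beautiful Soup. The HTML snippet, the id value main-title and the class value note are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<html>
  <body>
    <h1 id="main-title">Hello</h1>
    <p class="note">First note</p>
    <p class="note">Second note</p>
  </body>
</html>
"""

doc_soup = BeautifulSoup(html, "html.parser")

# An id is unique in the document, so find() is enough to get the one tag.
title = doc_soup.find(id="main-title").text

# A class can be shared by many tags, so find_all() returns a list of them.
notes = [p.text for p in doc_soup.find_all("p", class_="note")]

print(title)
print(notes)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.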
Web Scraping Rules & Regulations
- First, read the website's terms and conditions.
- Do not send repeated requests for the same data to the same website.
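One simple way to avoid requesting the same page repeatedly while you experiment is to save it locally after the first download. The sketch below is only an illustration: the function name fetch_once, the cache_file argument and the optional fetcher argument are all assumptions, not part of any library.

```python
import os
from urllib.request import urlopen


def fetch_once(url, cache_file, fetcher=None):
    """Return the page bytes, downloading only when no saved copy exists."""
    # Reuse the saved copy if we already downloaded this page.
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return f.read()

    if fetcher is None:
        # Default: a real download with urllib.
        def fetcher(u):
            with urlopen(u) as response:
                return response.read()

    data = fetcher(url)

    # Save the bytes so later runs read from disk instead of the network.
    with open(cache_file, "wb") as f:
        f.write(data)
    return data
```

Calling fetch_once(page_url, "page.html") twice contacts the website only the first time; delete the saved file when you want a fresh copy.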
Acknowledgment
I created a simple web page, My Favourite Quotes, for this web scraping tutorial. The site is now secured and any scraping attempt will be blocked, so for practical work please create your own simple web page. Follow the steps below to scrape data from a website.
Inspecting the Page
Step 1
Press Ctrl + Shift + C to open Inspect Element, then point at the part of the page you want to scrape.
<div class="quotes">
<p class="aquote">
I hear and i forget.<br> I see and i remember.<br> I do and i understand.
</p>
<p class="author">
Confucious
</p>
</div>
As we can see, each quote sits in a <div> with the class quotes, which contains two <p> paragraphs with the classes aquote and author.
Code Begins
Step 1
Import Python libraries.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
Step 2
Save the page link in a variable.
page='http://adilshehzad.me/my-fav-quotes-for-webcrawling-test'
Step 3
Request the URL using uReq.
client=uReq(page)
Step 4
Read the response using its built-in .read() method.
page_html=client.read()
Step 5
Close the connection held in the client variable, then parse the HTML with Beautiful Soup.
client.close()
page_soup=soup(page_html,"html.parser")
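Opening, reading and closing the URL can also be written with a with block, which closes the connection automatically even if reading fails. This is a minimal sketch; the function name read_page and the 10-second timeout are assumptions chosen for illustration.

```python
from urllib.request import urlopen


def read_page(url, timeout=10):
    """Open the URL, read the whole response, and close it automatically."""
    # The with block closes the response even if read() raises an error.
    with urlopen(url, timeout=timeout) as client:
        return client.read()
```

The returned bytes can then be passed to Beautiful Soup in the same way as page_html, for example soup(read_page(page), "html.parser").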
Step 6
Now we need some HTML knowledge. As explained above, the page uses only <div> and <p> tags, with the class names quotes, aquote and author.
quotes=page_soup.findAll("div" ,{"class":"quotes"})
This returns a list of every <div> with the class quotes on the page. To get the quote text and author name out of each one, we loop over the list.
for quote in quotes:
    fav_quote=quote.findAll("p" ,{"class":"aquote"})
    aquote=fav_quote[0].text.strip()
    fav_author=quote.findAll("p" ,{"class":"author"})
    author=fav_author[0].text.strip()
    print(aquote)
    print(author)
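Since the tutorial page itself blocks scraping, the loop can be tried offline on a string of HTML. The sketch below reuses the markup from the Inspecting the Page section and adds a second, made-up entry so the loop has more than one item to visit. The function name extract_quotes and the placeholder quote are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Markup from the "Inspecting the Page" section, plus one made-up entry.
sample_html = """
<div class="quotes">
  <p class="aquote">I hear and i forget.<br> I see and i remember.<br> I do and i understand.</p>
  <p class="author">Confucious</p>
</div>
<div class="quotes">
  <p class="aquote">Placeholder quote.</p>
  <p class="author">Placeholder Author</p>
</div>
"""


def extract_quotes(html):
    """Return a list of (quote, author) tuples found in the HTML."""
    page_soup = BeautifulSoup(html, "html.parser")
    results = []
    # find_all is the current name for findAll; both behave the same.
    for quote in page_soup.find_all("div", {"class": "quotes"}):
        aquote = quote.find("p", {"class": "aquote"}).text.strip()
        author = quote.find("p", {"class": "author"}).text.strip()
        results.append((aquote, author))
    return results


for aquote, author in extract_quotes(sample_html):
    print(aquote)
    print(author)
```

Collecting the pairs in a list, rather than only printing them, keeps every quote available for later steps such as saving to a file.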
Step 7
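We have now scraped the data from the website, but it is only printed to the terminal, so let's save it to a .txt file. The loop runs inside the with block so that every quote is written, and the file is closed when the block ends.
with open('output.txt', 'a') as file1:
    for quote in quotes:
        aquote=quote.findAll("p" ,{"class":"aquote"})[0].text.strip()
        author=quote.findAll("p" ,{"class":"author"})[0].text.strip()
        print(aquote, file=file1)
        print(author, file=file1)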
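If the quotes have already been collected as a list of (quote, author) pairs, the saving step can be kept in a small helper of its own. This is a sketch under that assumption; the function name save_quotes is hypothetical.

```python
def save_quotes(pairs, path="output.txt"):
    """Write each (quote, author) pair to the file, one item per line."""
    # "w" starts a fresh file on every run; use "a" instead to append.
    # The with block closes the file once all lines are written.
    with open(path, "w", encoding="utf-8") as out:
        for aquote, author in pairs:
            print(aquote, file=out)
            print(author, file=out)
```

For example, save_quotes([("Placeholder quote.", "Placeholder Author")]) writes two lines to output.txt. Writing with "w" avoids duplicated entries when the script is run more than once.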
Finally, we did it :)
Wrapping Up
That wraps up Part 1 of this web scraping tutorial. Part 2 will begin with Scrapy.