Simple Web Scraping Using Python 3 (Part 1)

Adil Shehzad
3 min read · Feb 23, 2020


Requirements

  • Python 3.6 or later (Anaconda recommended)
  • pip
  • Beautiful Soup
  • Scrapy (for Part 2)
  • Git
  • Basic Python knowledge
  • Basic HTML knowledge

Installation

In your Terminal Type

easy_install pip
pip install beautifulsoup4

or

python -m pip install beautifulsoup4

Version Check

  • For Mac users, open Terminal and type python --version (or python3 --version). Make sure it reports Python 3.x; the Python that ships with macOS may still be 2.7, in which case install Python 3 separately. A quick check is shown after this list.
  • For Windows users, please install Python through the official website.
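
A minimal sketch to confirm that Python 3 and Beautiful Soup are both available (assuming the install commands above succeeded):

import sys
import bs4

# Print the interpreter version and the installed Beautiful Soup version
print(sys.version)
print(bs4.__version__)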

Basic HTML

1. <!DOCTYPE html>: HTML documents must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The meta and script declarations of the HTML document are between <head> and </head>.
4. The visible part of the HTML document is between the <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.

Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for the cells within a row.

Also, HTML tags sometimes come with id or class attributes. The id attribute specifies a unique identifier for an HTML tag, and its value must be unique within the HTML document. The class attribute is used to apply the same style to all tags that share that class. We can make use of these ids and classes to locate the data we want.

For more information on HTML tags, id and class, please refer to W3Schools Tutorials.
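
As a preview of how these attributes are used later in this tutorial, here is a minimal sketch; the tag, id, and class names are just illustrative:

from bs4 import BeautifulSoup

html = '<div id="main"><p class="author">Confucius</p></div>'
page_soup = BeautifulSoup(html, "html.parser")

# Locate elements by id and by class attribute
print(page_soup.find(id="main").name)                 # div
print(page_soup.find("p", {"class": "author"}).text)  # Confucius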

Web Scraping Rules & Regulations

  1. First, read the website's terms and conditions.
  2. Do not hammer the same website with repeated requests; space your requests out (see the sketch below).
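
One simple way to respect rule 2 is to pause between requests. A minimal sketch, assuming you are fetching several pages in a loop (the URL list here is made up):

import time
from urllib.request import urlopen

# Hypothetical list of pages to fetch politely
pages = ["http://example.com/page1", "http://example.com/page2"]

for url in pages:
    html = urlopen(url).read()
    # ... parse html here ...
    time.sleep(2)  # wait a couple of seconds before the next request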

Acknowledgment

For this tutorial I created a simple web page, My Favourite Quotes. That site is now protected, so any web scraping attempt against it will be blocked; for hands-on practice, please create your own simple web page. Follow the steps below to scrape its data.

Inspecting the Page

Step 1

Press Ctrl + Shift + C to open Inspect Element, then point at the part of the page you want to scrape.

<div class="quotes">
  <p class="aquote">
    I hear and I forget.<br> I see and I remember.<br> I do and I understand.
  </p>
  <p class="author">
    Confucius
  </p>
</div>

As we can see, we have a <div> and two <p> tags here, each with a class attribute.
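
Before touching the network, you can try Beautiful Soup directly on this snippet as a plain string; this is only a local sketch of what the following steps do against the live page:

from bs4 import BeautifulSoup

snippet = """
<div class="quotes">
  <p class="aquote">I hear and I forget.<br> I see and I remember.<br> I do and I understand.</p>
  <p class="author">Confucius</p>
</div>
"""

page_soup = BeautifulSoup(snippet, "html.parser")
print(page_soup.find("p", {"class": "aquote"}).text.strip())
print(page_soup.find("p", {"class": "author"}).text.strip())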

Code Begins

Step 1

Import Python libraries.

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

Step 2:

Save the page link in a variable.

page='http://adilshehzad.me/my-fav-quotes-for-webcrawling-test'

Step 3

Request the URL using uReq.

client=uReq(page) 
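
Since the live page now blocks scraping attempts, the request may fail. A minimal sketch of handling that, assuming you want the script to exit gracefully rather than crash:

from urllib.error import HTTPError, URLError

try:
    client = uReq(page)
except (HTTPError, URLError) as err:
    print("Could not fetch the page:", err)
    raise SystemExit(1)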

Step 4

Read the response using the built-in .read() method.

page_html=client.read()

Step 5

Close the connection stored in the client variable, then parse the HTML with Beautiful Soup.

client.close()
page_soup=soup(page_html,"html.parser")
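
An equivalent sketch for Steps 3 through 5, using a with block so the connection is closed automatically instead of calling client.close() yourself:

with uReq(page) as client:
    page_html = client.read()

page_soup = soup(page_html, "html.parser")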

Step 6

Now we need some of that HTML knowledge. As explained above, the page uses <div> and <p> tags with the class names quotes, aquote, and author.

quotes=page_soup.findAll("div" ,{"class":"quotes"})

That gives us the quote blocks, but we want to extract every quote together with its author, so we use a loop.

for quote in quotes:
    fav_quote=quote.findAll("p", {"class":"aquote"})
    aquote=fav_quote[0].text.strip()

    fav_author=quote.findAll("p", {"class":"author"})
    author=fav_author[0].text.strip()
    print(aquote)
    print(author)
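
For reference, Beautiful Soup can also do this with CSS selectors via select(); this sketch is equivalent to the loop above:

texts = page_soup.select("div.quotes p.aquote")
authors = page_soup.select("div.quotes p.author")
for q, a in zip(texts, authors):
    print(q.text.strip())
    print(a.text.strip())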

Step 7

We have now scraped the data from the website, but it is only printed on the console, so let's save it to a .txt file.

file1 = open('output.txt', 'a')
print(aquote, file=file1)  # place these two prints inside the loop above to save every quote
print(author, file=file1)
file1.close()
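
A slightly tidier sketch that writes every quote in one pass, using a with block so the file is closed automatically:

with open('output.txt', 'a') as file1:
    for quote in quotes:
        aquote = quote.findAll("p", {"class": "aquote"})[0].text.strip()
        author = quote.findAll("p", {"class": "author"})[0].text.strip()
        print(aquote, file=file1)
        print(author, file=file1)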

Finally, we did it :)

Wrapping Up

That wraps up Part 1 of web scraping; Part 2 will begin with Scrapy.
