Simple Web Scraping Using Python 3 (Part 1)
- Python 3.6 or later (Anaconda recommended)
- Beautiful Soup
- Scrapy (for Part 2)
- Working knowledge of Python
- Basic knowledge of HTML
In your terminal, type:
easy_install pip
pip install beautifulsoup4
If the pip command is not on your PATH, run it through Python instead:
python -m pip install beautifulsoup4
- For Mac users, Python may already be installed on macOS. Open Terminal and type python --version. You should see a Python 3 version such as 3.7.x; if it reports 2.x, try python3 --version or install Python 3.
- For Windows users, please install Python through the official website.
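To confirm the setup from inside Python rather than from the terminal, a minimal sketch like the following may help. It assumes only the 3.6 minimum listed above; the variable names are illustrative and not part of any library.

```python
import importlib.util
import sys

# The tutorial asks for Python 3.6 or later.
python_ok = sys.version_info >= (3, 6)

# find_spec returns None when a top-level package cannot be found.
bs4_installed = importlib.util.find_spec("bs4") is not None

print("Python version:", sys.version.split()[0])
print("Python 3.6 or later:", python_ok)
print("Beautiful Soup installed:", bs4_installed)
```

If the last line prints False, run the install command again with the same interpreter you use to run your scripts.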
1. <!DOCTYPE html>: HTML documents must start with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The meta and script declarations of the HTML document are between <head> and </head>.
4. The visible part of the HTML document is between <body> and </body>.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.
Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table cells.
Also, HTML tags sometimes come with id and class attributes. The id attribute specifies a unique id for an HTML tag, and the value must be unique within the HTML document. The class attribute is used to define equal styles for HTML tags with the same class. We can make use of these ids and classes to help us locate the data we want.
For more information on HTML tags, id and class, please refer to W3Schools Tutorials.
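As a rough illustration of how ids and classes help locate data, the sketch below parses a small made-up page with Beautiful Soup. The HTML, the id main-title and the class note are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = """<!DOCTYPE html>
<html>
<head><title>Demo</title></head>
<body>
  <h1 id="main-title">My Page</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
  <a href="https://example.com">A link</a>
</body>
</html>"""

page = BeautifulSoup(html, "html.parser")

# An id is unique, so find() returns the single matching tag.
title = page.find(id="main-title").get_text()

# A class can be shared, so find_all() returns every matching tag.
notes = [p.get_text() for p in page.find_all("p", class_="note")]

# Attributes of a tag are read like dictionary keys.
link = page.find("a")["href"]

print(title)
print(notes)
print(link)
```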
Web Scraping Rules & Regulations
- First, read the website's Terms and Conditions.
- Do not request the same data from the same website again and again.
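One possible way to follow the second rule is to keep each downloaded page in memory and to pause between real requests. The helper below is only a sketch: fetch_once, its min_delay parameter and the two-second default are assumptions made for this example, not part of urllib or Beautiful Soup.

```python
import time
from urllib.request import urlopen

_cache = {}               # url -> page bytes already downloaded
_state = {"last": None}   # time of the most recent real request


def fetch_once(url, min_delay=2.0, opener=urlopen):
    """Return the page at url, downloading it at most once per run."""
    if url in _cache:
        return _cache[url]
    # Keep real requests at least min_delay seconds apart.
    if _state["last"] is not None:
        wait = min_delay - (time.monotonic() - _state["last"])
        if wait > 0:
            time.sleep(wait)
    with opener(url) as response:
        _cache[url] = response.read()
    _state["last"] = time.monotonic()
    return _cache[url]
```

Calling fetch_once twice with the same URL only contacts the site the first time; the second call returns the stored copy.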
I created a simple web page, My Favourite Quotes, for this web scraping tutorial. The site is currently secured and any scraping attempt will be blocked, so for the practical work please create your own simple web page. Then follow the steps below to scrape the website data.
Inspecting the Page
Press Ctrl + Shift + C to open Inspect Element, then point at the part of the page you want to scrape.
I hear and i forget.<br> I see and i remember.<br> I do and i understand.
As we can see, the page is built from <div> and <p> tags that carry class attributes.
Import Python libraries.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
Save the page link in a variable.
Request the URL using uReq.
Read the response using the built-in .read() method.
Close the connection held in the client variable, then parse the HTML with the Beautiful Soup command.
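The four steps above might look roughly like the sketch below. Because the original page blocks scraping, it writes a tiny stand-in HTML file and opens it through a file:// URL, so it runs without a network connection. The names my_url, uClient, page_html and page_soup are assumptions for illustration, as is the stand-in markup.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen as uReq

from bs4 import BeautifulSoup as soup

# Stand-in for your own page: a small HTML file on disk.
sample_html = (
    '<html><body><div class="quote">'
    '<p class="aquote">I hear and i forget.</p>'
    "</div></body></html>"
)
sample_path = Path(tempfile.mkdtemp()) / "quotes.html"
sample_path.write_text(sample_html, encoding="utf-8")

# 1. Save the page link in a variable (replace with your own page's URL).
my_url = sample_path.as_uri()

# 2. Request the URL.
uClient = uReq(my_url)

# 3. Read the raw page content.
page_html = uClient.read()

# 4. Close the connection, then parse the HTML.
uClient.close()
page_soup = soup(page_html, "html.parser")

print(page_soup.p.text)
```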
Now we need to apply some HTML knowledge. As already explained, the page uses only <p> and <div> tags, with classes named quote, aquote and author.
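A minimal sketch of pulling out one quote and its author follows. The class names quote, aquote and author come from the page described above, but the exact nesting (a div with class quote holding two paragraphs) and the author text are assumptions, so adjust the lookups to match your own page.

```python
from bs4 import BeautifulSoup as soup

# Assumed markup: class names are from the tutorial, the nesting is a guess.
page_html = """
<div class="quote">
  <p class="aquote">I hear and i forget.<br> I see and i remember.<br> I do and i understand.</p>
  <p class="author">Example Author</p>
</div>
"""
page_soup = soup(page_html, "html.parser")

# Every block that holds one quote.
quotes = page_soup.find_all("div", {"class": "quote"})

# Take the first block and read its two paragraphs.
first = quotes[0]
text = first.find("p", {"class": "aquote"}).get_text(" ", strip=True)
author = first.find("p", {"class": "author"}).get_text(strip=True)

print(text)
print(author)
```

Passing " " as the separator to get_text joins the pieces around each <br> with a single space.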
We have finally scraped one quote with its author name, but we want to scrape all the data, so we use a loop here.
for quote in quotes:
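One way the body of such a loop could look is sketched below. The sample markup, the placeholder quotes and authors, and the results list are assumptions for illustration; only the class names come from the page described earlier.

```python
from bs4 import BeautifulSoup as soup

# Placeholder page with two quote blocks (assumed structure).
page_html = """
<div class="quote"><p class="aquote">First sample quote.</p><p class="author">Author One</p></div>
<div class="quote"><p class="aquote">Second sample quote.</p><p class="author">Author Two</p></div>
"""
quotes = soup(page_html, "html.parser").find_all("div", {"class": "quote"})

results = []
for quote in quotes:
    # Each iteration handles one quote block.
    text = quote.find("p", {"class": "aquote"}).get_text(" ", strip=True)
    author = quote.find("p", {"class": "author"}).get_text(strip=True)
    results.append((text, author))
    print(text, "-", author)
```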
We have now scraped the data from the website, but it is only printed to the command prompt, so we need to save it in a .txt file.
file1 = open('output.txt', 'a')
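Saving the scraped pairs could be done along these lines. The rows list holds placeholder data, and the "quote - author" line format is an assumption; the file name and append mode match the line above.

```python
# Placeholder data standing in for the scraped (quote, author) pairs.
rows = [
    ("First sample quote.", "Author One"),
    ("Second sample quote.", "Author Two"),
]

# Mode "a" appends, so earlier runs are kept; "with" closes the file for us.
with open("output.txt", "a", encoding="utf-8") as file1:
    for text, author in rows:
        file1.write(f"{text} - {author}\n")

# Read the file back to confirm what was saved.
with open("output.txt", encoding="utf-8") as check:
    saved_lines = check.read().splitlines()

print(saved_lines[-len(rows):])
```

Because the file is opened in append mode, running the script twice stores every quote twice; use mode "w" if you prefer a fresh file on each run.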
Finally, we did it :)
That wraps up Part 1 of web scraping. Part 2 will begin with Scrapy.