King Somto
12 Apr 2022 · 6 min read
In today’s article, we will learn how to use the BeautifulSoup library in Python to scrape data off the internet. Before we go into that, we need to build a strong foundation that will simplify the whole process. So let us get into what Data Scraping is and why it is an important skill to have.
Data Scraping
Data Scraping is simply an automated way of pulling desired information off a website and inserting it into your local file or spreadsheet. Data Scraping is also called Web Scraping, Web Data Extraction, and Web Harvesting.
Imagine you are working on a project and you need the names of all the vendors on an e-commerce website. Instead of visiting the website and going through the arduous task of copying and pasting each vendor manually into your local file, you can simply employ Data Scraping to do it for you automatically.
Whether you need background information on your employees, real-time data on the latest roles for your job board, or prices for a particular stock, you can use data scraping to speed up the search without the hassle of visiting each site to check every time.
Some people use the power of data scraping for bad reasons. Hackers use data scrapers to harvest large numbers of email addresses and then use that information for malicious acts, and cybercriminals can turn scraped data into a wide range of fraudulent activities. Data scraping is a powerful tool, but it can be dangerous in the wrong hands.
If you do not have Python installed on your computer, you can download it from https://www.python.org/downloads/. To be sure Python is installed on your computer, run the following command in your Command Prompt:
python -V
Please note that the ‘V’ in ‘-V’ is case-sensitive; a lowercase -v runs Python in verbose mode instead of printing the version.
This will show you the version of Python you have installed.
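If Python is installed, you should see output like the following (the exact version on your machine will differ):
Python 3.10.4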
Now that Python is installed, we need to install the libraries that are essential to the success of our project. For today, we will use Pip to install the external libraries we need. Pip is a package management system for Python that is used to install and manage software packages.
To check if you have Pip, type this in Command Prompt:
pip --version
Note that pip’s version flag is --version (or an uppercase -V); a lowercase -v means verbose here too and will not print the version.
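If Pip is installed, the command prints something of this general form (the version and path will differ on your machine):
pip 22.0.4 from C:\Python310\Lib\site-packages\pip (python 3.10)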
Now that pip is installed, you will need to install the required libraries. These are:
requests: this library allows us to send HTTP (Hypertext Transfer Protocol) requests to the URL (Uniform Resource Locator) of the webpage we want to access. It lets us get the HTML content of the page.
lxml: this is a parser that provides more structure to the HTML we have downloaded.
beautifulsoup4 (bs4): this allows us to search through our parsed HTML and retrieve the information we need.
Now that you know the libraries you need and why you need them, all that is left is to type these commands in Command Prompt:
pip install requests
pip install lxml
pip install bs4
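If you prefer, all three can be installed in a single command:
pip install requests lxml bs4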
You are all set to begin coding your very own web scraper!
Our goal is to build a web scraper that will return Billboard’s Hot 100 songs. The first block of our code will simply be made up of code that imports all the necessary libraries.
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; BeautifulSoup loads the parser by name, but importing it confirms it is installed
The next thing we want to do is to request the HTML content of the webpage we want to scrape. So we visit https://www.billboard.com and go on to the webpage that contains the Hot 100, which is https://www.billboard.com/charts/hot-100. To request the HTML content of this webpage, we type in the following code:
URL = 'https://www.billboard.com/charts/hot-100/'
r = requests.get(URL)
First of all, we created a variable called URL which stores the address of the webpage for us. Then we sent an HTTP GET request to the URL and stored the server’s response in a variable called r.
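Before parsing, it is worth checking that the request actually succeeded, since some sites reject requests that look automated. Here is a minimal sketch; the browser-like User-Agent header is an illustrative assumption, not something the original request is guaranteed to need:
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like header, in case the site blocks bare requests
r = requests.get(URL, headers=headers)
r.raise_for_status()  # raises an HTTPError for any 4xx/5xx response
print(r.status_code)  # 200 means the page was fetched successfully
With the response in hand, we need to parse the HTML content into a more structured form.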
doc = BeautifulSoup(r.content, 'lxml')
structured_doc = doc.prettify()
In the block of code above, we created an instance of BeautifulSoup and specified the parser we want to use. We passed two arguments into the object:
r.content: the raw HTML content of the webpage.
'lxml': the parser we have chosen to use.
The prettify function makes the HTML more readable. As the name implies, it prettifies the markup, making it easier to understand and break down.
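To see what prettify produces, print a slice of it; the 500-character cutoff below is arbitrary, just enough to peek at the top of the document:
print(structured_doc[:500])  # first 500 characters of the prettified HTML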
We should create a list that we will use to store the songs we scrape from the webpage. We name it song_list rather than list, since list would shadow Python’s built-in type. An empty list is created by typing the code below.
song_list = []
We need to go to the webpage and ‘inspect’ the HTML code that is responsible for the content we are interested in.
From the picture above, we can see a list of the Hot 100 songs. If we look closely, we can see that the first song is formatted differently from the others; in other words, its HTML code will differ from theirs too. To inspect it, we hover the pointer over what we want to inspect, right-click, and select Inspect, and the result will be displayed at the side.
After inspecting, our browser should show us the code. It will be highlighted as shown below:
From the code in the top-right of the screen, we can see that the song ‘Heat Waves’ sits in an h3 tag. To access it, this will be our code:
first_song = doc.find('h3').text.strip()
This line of code is responsible for several things. First, the find function goes through the parsed document looking for the first instance of its argument (‘h3’) and returns that element. The .text attribute then gives us just the element’s text without the tags, and finally the strip function removes the surrounding whitespace. The result is stored in the variable first_song.
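To see each step in isolation, you can run something like this (the printed title assumes the chart still looks as it did when this article was written):
tag = doc.find('h3')       # the first <h3> element in the parsed document
print(tag.text)            # its text content, tags removed but whitespace intact
print(tag.text.strip())    # the same text with leading/trailing whitespace removed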
The next step is to append the song to our list. We do this by typing in the code below.
song = '1. ' + first_song
song_list.append(song)
Now we need to go back to our browser and inspect the code for the second song. The second song, based on the picture above, is ‘We Don’t Talk About Bruno’, and just by looking at the webpage we can tell that it is formatted like all the subsequent songs. Inspecting the code, we will observe that all the songs from number 2 down have these HTML tags/attributes in common:
<h3></h3>
class="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only"
This observation is key to writing the next line of code, which we can see below. (Note that ‘a-no-trucate’ appears to be the site’s own spelling; the class string we pass must match the page’s markup exactly.)
ss = doc.find_all('h3', class_='c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only')
The code above uses a method called find_all (older code sometimes spells it findAll) that returns every element matching the given tag and class, and stores the results in our ss variable. Note that when passing the arguments we typed class_ and not class: the word class is a reserved word in Python, so BeautifulSoup appends an underscore to let Python know we are not referring to the reserved word.
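If that long class string feels brittle, an equivalent approach is a CSS selector that matches on the tag plus just a couple of its classes; this sketch assumes those class names are still present in the page’s markup:
ss = doc.select('h3.c-title.a-no-trucate')  # match <h3> elements carrying both classes
Either way, with the matching elements in ss, we can number and collect them: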
num = 2
for i in ss:
    songs = str(num) + '. ' + i.text.strip()
    song_list.append(songs)
    num = num + 1
The code above loops over every song that matched our tag-and-class search. We use .text to get just the text content and not the tags, then strip() to remove excess whitespace, before finally appending the numbered value to our list.
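A slightly more idiomatic way to write the same loop is with enumerate, which manages the counter for us; this version also uses an f-string for the numbering:
for num, i in enumerate(ss, start=2):  # start at 2 because song 1 was added separately
    song_list.append(f'{num}. {i.text.strip()}')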
Now all that is left is creating a file where we will store our list. To do this we type in the following code.
f = open('Billboard_Hot_100.txt', 'w', encoding='utf-8')  # 'w' creates or overwrites the file; utf-8 handles non-ASCII titles
The code above creates a text file named Billboard_Hot_100.txt and opens it for writing.
f.write('BILLBOARD HOT 100\n')
The code above simply instructs Python to write ‘BILLBOARD HOT 100’ in our newly created text file and then move to the next line.
for i in song_list:
    f.write(i)
    f.write('\n')
f.close()
The code above uses a for loop to instruct Python to write each member (song) of our list to the text file on its own line. Finally, the file is closed. Let us check our computer to see whether the file was created and whether our mission to scrape the Hot 100 off Billboard’s website and into our locally created file was successful.
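As an aside, the more idiomatic way to handle files in Python is a with statement, which closes the file automatically even if an error occurs partway through:
with open('Billboard_Hot_100.txt', 'w', encoding='utf-8') as f:
    f.write('BILLBOARD HOT 100\n')
    for i in song_list:
        f.write(i + '\n')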
As you can see from the images above, our project has been completed successfully, so a big kudos to you all! Do you think you can use what you have learned to make the file more comprehensive? How about adding the names of the artists responsible for these songs, or the number of weeks each song has been charting?
Here is the entire code. Cheers!
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; BeautifulSoup loads the parser by name

URL = 'https://www.billboard.com/charts/hot-100/'
r = requests.get(URL)

doc = BeautifulSoup(r.content, 'lxml')
structured_doc = doc.prettify()

song_list = []

first_song = doc.find('h3').text.strip()
song = '1. ' + first_song
song_list.append(song)

ss = doc.find_all('h3', class_='c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only')

num = 2
for i in ss:
    songs = str(num) + '. ' + i.text.strip()
    song_list.append(songs)
    num = num + 1

f = open('Billboard_Hot_100.txt', 'w', encoding='utf-8')
f.write('BILLBOARD HOT 100\n')
for i in song_list:
    f.write(i)
    f.write('\n')
f.close()
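Save the script (as, say, billboard_scraper.py; the filename is yours to choose) and run it with python billboard_scraper.py. If the scrape succeeds, Billboard_Hot_100.txt should begin with lines like these, reflecting whatever the chart held when you ran it (the titles shown are the ones from this article):
BILLBOARD HOT 100
1. Heat Waves
2. We Don't Talk About Bruno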