Web Data Mining with Python: Discover and extract information from the web using Python by Dr. Ranjana Rajnish & Dr. Meenakshi Srivastava


Author: Dr. Ranjana Rajnish & Dr. Meenakshi Srivastava
Language: eng
Format: epub
ISBN: 9789355513663
Publisher: BPB Publications
Published: 2023-02-15T00:00:00+00:00


Handling images

We may often need to scrape images, for example to prepare a dataset. Images can be scraped from any Web page and stored on our hard drive. We will now see how Python (BeautifulSoup) can be used to scrape images using the following code. We use the Web page https://rubikscode.net/ and extract all of its images.

Example 1:

1. import requests
2. from bs4 import BeautifulSoup
3. import os
4.
5. url = 'https://rubikscode.net/'
6.
7. ur = requests.get(url)
8.
9. soup = BeautifulSoup(ur.text, 'html.parser')
10.
11. images = soup.find_all('img')
12.
13. for image in images:
14.     print(image['src'])

The code begins by importing the “requests” library and, from the “bs4” package, the “BeautifulSoup” class. We also import the “os” module, which is used whenever code needs to interact with the underlying operating system; it is needed when the images are stored on the hard drive, although this example only prints the image links. In Line 5, we store the URL of the website from which images are to be scraped in the variable “url”. In Line 7, we pass “url” to the get() method of “requests”, which connects to the server and retrieves the page at that URL. The response is stored in the variable “ur”. At this stage, ur.text holds a large amount of HTML text. Line 9 uses “BeautifulSoup” to build a parse tree from the page source so that data can be extracted in a hierarchical and more readable manner; the tree is stored in the “soup” variable (conventionally, we use the name “soup”, but any name can be used). If you print the value of “soup” at this stage, you will see the entire Web page with its HTML tags.

Now, we want only the images from the page, and each image link is specified in an <img> tag. So, using find_all('img') in Line 11, we find all <img> tags in the soup object. We then iterate over the list stored in “images” using a for loop in Line 13 and, in Line 14, print the src attribute of each tag, which holds the image link. Running the code prints one image link per line.
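Example 1 only prints the links. The following is a minimal sketch, not the authors' code, of how those links might be turned into files on disk, which is where the “os” module comes in. The helper names collect_image_urls, local_filename and download_images, the default folder name "images", the fallback file name "image" and the 10-second timeout are all assumptions made for illustration. The sketch also assumes that some <img> tags may have no src attribute and that some src values are relative paths, so it uses image.get('src') and urljoin() rather than image['src'].

```python
import os
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def collect_image_urls(html, base_url):
    """Return absolute http(s) URLs for every <img> tag that has a src."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for image in soup.find_all('img'):
        src = image.get('src')  # None when the tag has no src attribute
        if not src:
            continue
        absolute = urljoin(base_url, src)  # resolves relative paths
        # Skip inline data: URIs and other schemes that cannot be fetched over HTTP.
        if urlparse(absolute).scheme in ('http', 'https'):
            urls.append(absolute)
    return urls


def local_filename(image_url, folder):
    """Build a local path from the last segment of the URL path.

    The fallback name 'image' is a hypothetical choice for URLs
    whose path ends in a slash.
    """
    name = os.path.basename(urlparse(image_url).path) or 'image'
    return os.path.join(folder, name)


def download_images(page_url, folder='images'):
    """Fetch the page and save each image it references into `folder`."""
    import requests  # imported here so the helpers above need no network library

    os.makedirs(folder, exist_ok=True)  # create the target folder if missing
    page = requests.get(page_url, timeout=10)
    page.raise_for_status()
    for image_url in collect_image_urls(page.text, page_url):
        response = requests.get(image_url, timeout=10)
        if response.ok:
            # Image data is binary, so write response.content in 'wb' mode.
            with open(local_filename(image_url, folder), 'wb') as handle:
                handle.write(response.content)
```

Calling download_images('https://rubikscode.net/') would save the images into an images folder under the current directory. In this sketch, two images whose URLs end in the same file name overwrite each other; a fuller version would need a rule for making names unique.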


