How to get the next page on BeautifulSoup?
In this article, we are going to see how to get the next page with BeautifulSoup.
Modules Needed
- BeautifulSoup: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. To install this module, type the command below in the terminal.
pip install bs4
- requests: This library allows you to send HTTP/1.1 requests extremely easily. To install this module, type the command below in the terminal.
pip install requests
Approach:
Getting the next page with BeautifulSoup means that we first scrape the content of one page and then, if that page contains links we also want to scrape, follow them. We scrape the sample website, find the links on it, call the requests.get() method again for each link, and create a soup of that page as well. In this way we can get to the next page with BeautifulSoup.
Let’s go through the script step by step:
Step 1: Import all dependencies.
from bs4 import BeautifulSoup
import requests
Step 2: We need to request the page URL with requests.
page=requests.get(sample_website)
Step 3: Create a soup of the page by passing the response content and the HTML parser to the BeautifulSoup constructor.
soup = BeautifulSoup(page.content, 'html.parser')
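To see what Step 3 produces without making a network request, the same constructor can be given an HTML string directly. This is only a minimal sketch: the sample_html markup below is made up for illustration and is not taken from the sample website.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="https://www.geeksforgeeks.org/first/">First</a>
    <a href="/second/">Second</a>
    <a name="no-href">Anchor without a link</a>
  </body>
</html>
"""

# BeautifulSoup accepts str or bytes; html.parser ships with Python
soup = BeautifulSoup(sample_html, 'html.parser')

# Text of the <title> tag
page_title = soup.find('title').string

# href=True keeps only <a> tags that actually carry an href attribute
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

print(page_title)
print(hrefs)
```

The anchor without an href is left out of hrefs, which is why the later loop can index i['href'] safely.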
Step 4:
We search the parse tree and find the links. For each URL we want, we use the requests module and BeautifulSoup again to create the soup of the next page. This is how we get the next page with BeautifulSoup.
Python3
for i in soup.find_all('a', href=True):

    # check all links which contain the
    # "www.geeksforgeeks.org" string
    if "www.geeksforgeeks.org" in i['href']:

        # call get method to request next url
        nextpage = requests.get(i['href'])

        # create soup for next url
        nextsoup = BeautifulSoup(nextpage.content, 'html.parser')

        # we can scrape anything from the next page;
        # here we are scraping the title string
        # of the next url page
        print("next url title : ", nextsoup.find('title').string)
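The substring check above only follows links whose href already contains the host name, so relative links such as /about/ are skipped. One possible way to cover them, assuming that same-site links are what we want, is to resolve each href against the page URL with urllib.parse.urljoin and compare host names. The helper name same_site_links and the sample markup are hypothetical and only serve as an illustration.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def same_site_links(html, base_url):
    """Return absolute URLs of links that point to the same host as base_url."""
    soup = BeautifulSoup(html, 'html.parser')
    base_host = urlparse(base_url).netloc
    links = []
    for a in soup.find_all('a', href=True):
        # urljoin leaves absolute URLs unchanged and resolves relative ones
        absolute = urljoin(base_url, a['href'])
        # keep same-host links only, and skip duplicates
        if urlparse(absolute).netloc == base_host and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical markup for illustration
html = """
<a href="/about/">About</a>
<a href="https://www.geeksforgeeks.org/python/">Python</a>
<a href="https://example.com/elsewhere">Other site</a>
<a href="/about/">About again</a>
"""

for url in same_site_links(html, 'https://www.geeksforgeeks.org/some-article/'):
    print(url)
```

Each returned URL can then be passed to requests.get() in the same way as in the loop above.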
Below is the full implementation:
Python3
from bs4 import BeautifulSoup
import requests

# sample website
sample_website = 'https://www.geeksforgeeks.org/different-ways-to-remove-all-the-digits-from-string-in-java/'

# call get method to request the page
page = requests.get(sample_website)

# create the soup with the BeautifulSoup
# constructor and the html parser
soup = BeautifulSoup(page.content, 'html.parser')

# search the parse tree with
# the find_all method
for i in soup.find_all('a', href=True):

    # check all links which contain the
    # "www.geeksforgeeks.org" string
    if "www.geeksforgeeks.org" in i['href']:

        # call get method to request next url
        nextpage = requests.get(i['href'])

        # create soup for next url
        nextsoup = BeautifulSoup(nextpage.content, 'html.parser')

        # we can scrape anything from the next page;
        # here we are scraping the title string
        # of the next url page
        print("next url title : ", nextsoup.find('title').string)
Output:
next url title : GeeksforGeeks | A computer science portal for geeks
next url title : Analysis of Algorithms | Set 1 (Asymptotic Analysis) - GeeksforGeeks
next url title : Analysis of Algorithms | Set 2 (Worst, Average and Best Cases) - GeeksforGeeks
next url title : Analysis of Algorithms | Set 3 (Asymptotic Notations) - GeeksforGeeks
next url title : Analysis of algorithms | little o and little omega notations - GeeksforGeeks
next url title : Lower and Upper Bound Theory - GeeksforGeeks
next url title : Analysis of Algorithms | Set 4 (Analysis of Loops) - GeeksforGeeks
next url title : Analysis of Algorithm | Set 4 (Solving Recurrences) - GeeksforGeeks
next url title : Analysis of Algorithm | Set 5 (Amortized Analysis Introduction) - GeeksforGeeks
next url title : What does 'Space Complexity' mean? - GeeksforGeeks
next url title : Pseudo-polynomial Algorithms - GeeksforGeeks
next url title : Polynomial Time Approximation Scheme - GeeksforGeeks
next url title : A Time Complexity Question - GeeksforGeeks
.................................................................
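The script above sends one request per matching link, requests the same URL again if it appears more than once on the page, and assumes that every page has a title. A slightly more defensive variant might look like the following sketch. The function names fetch_html and next_page_titles, the 10-second timeout, and the limit parameter are illustrative choices and are not part of the original script.

```python
from bs4 import BeautifulSoup
import requests


def fetch_html(url):
    """Download a page and return its body, raising on HTTP error codes."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.content


def next_page_titles(start_url, site='www.geeksforgeeks.org',
                     fetch=fetch_html, limit=None):
    """Return (url, title) pairs for pages linked from start_url.

    `fetch` is any callable that maps a URL to HTML, so a stub can be
    swapped in instead of real network access.
    """
    soup = BeautifulSoup(fetch(start_url), 'html.parser')
    seen = set()
    results = []
    for a in soup.find_all('a', href=True):
        url = a['href']
        # same substring filter as the original script, plus de-duplication
        if site not in url or url in seen:
            continue
        seen.add(url)
        try:
            next_soup = BeautifulSoup(fetch(url), 'html.parser')
        except requests.RequestException:
            continue  # skip pages that fail to download
        title_tag = next_soup.find('title')
        # a page without a <title> yields None instead of an AttributeError
        title = title_tag.get_text(strip=True) if title_tag else None
        results.append((url, title))
        if limit is not None and len(results) >= limit:
            break
    return results
```

Calling next_page_titles(sample_website, limit=5) would request the sample page and stop after collecting five linked pages.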