Extract Author’s information from a GeeksforGeeks article using Python
In this article, we are going to write a Python script to extract author information from a GeeksforGeeks article.
Modules needed
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It does not come built in with Python. To install it, run the command below in the terminal.
pip install bs4
- requests: Requests lets you send HTTP/1.1 requests with very little code. It also does not come built in with Python. To install it, run the command below in the terminal.
pip install requests
Approach:
- Import the modules
- Define a function that makes a GET request to a URL
- Initialize the article title
- Pass the URL to getdata()
- Scrape the data with the help of requests and Beautiful Soup
- Find the required details and filter them.
Stepwise execution of the script:
Step 1: Import all dependencies
Python
# import module
import requests
from bs4 import BeautifulSoup
Step 2: Create a function that fetches a URL
Python3
# link for extract html data
# Making a GET request
def getdata(url):
    r = requests.get(url)
    return r.text
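The getdata() function above returns whatever the server sends back, including error pages. A slightly more defensive variant might look like the sketch below. The `session` and `timeout` parameters are hypothetical additions for illustration and are not part of the article's function.

```python
import requests


def getdata(url, session=None, timeout=10):
    """Fetch url and return the response body as text.

    session and timeout are hypothetical additions, not part of the
    article's getdata().
    """
    # Use the caller's session if one is given, otherwise the requests
    # module itself, which exposes the same get() call.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Passing a `requests.Session()` as `session` would reuse one connection for both the article request and the profile request made later.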
Step 3: Now merge the article name into the URL, pass the URL to the getdata() function, and parse the returned HTML with Beautiful Soup.
Python3
# input article by geek
article = "optparse-module-in-python"

# url
# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# display html code
print(soup)
Output: the HTML source of the article page.
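The step above uses a variable named url, but the text does not show how it is built. One possible way to merge the article name into a URL is sketched here. The base address and the helper name `build_article_url` are assumptions made for illustration, not values taken from the article.

```python
from urllib.parse import quote, urljoin

# Assumption: articles live directly under this base address.
# The article text does not show the URL it used.
BASE_URL = "https://www.geeksforgeeks.org/"


def build_article_url(article, base_url=BASE_URL):
    """Join the base address and an article slug into one URL."""
    # quote() percent-encodes characters that are not safe in a URL path.
    slug = quote(article.strip("/"))
    return urljoin(base_url, slug + "/")


print(build_article_url("optparse-module-in-python"))
```

The result could then be assigned to url before calling getdata(url).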
Step 4: Extract the author’s name from the HTML document.
Python
# traverse author name
for i in soup.find('div', class_="author_handle"):
    Author = i.get_text()
print(Author)
Output:
kumar_satyam
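If the page has no div with the class author_handle, soup.find() returns None and the loop above fails. A minimal sketch that handles that case is shown below. It runs on an inline HTML string that is a hypothetical stand-in for the article page, so the real markup may differ, and `extract_author_handle` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div class="author_handle"><a href="#">kumar_satyam</a></div>
</body></html>
"""


def extract_author_handle(html):
    """Return the text of the first div.author_handle, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_="author_handle")
    if tag is None:
        return None
    # strip=True removes surrounding whitespace from the collected text.
    return tag.get_text(strip=True)


print(extract_author_handle(SAMPLE_HTML))
```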
Step 5: Now build the author’s profile URL from the author name and fetch its HTML.
Python3
# now get author information
# with author name
# pass the url
# into getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')
Step 6: Extract the author’s information.
Python3
# traverse information of author
name = soup.find(
    'div',
    class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText'
).get_text()

author_info = []
for item in soup.find_all(
        'div',
        class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :")
print(name)
print("Author information :")
print(author_info)
Output:
Author name : Satyam Kumar
Author information :
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
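The scraped values keep the whitespace of the page, such as the leading newline in the LinkedIn entry above. One way to tidy the list is sketched here. `clean_fields` is a hypothetical helper name, not part of the article's script.

```python
def clean_fields(fields):
    """Strip surrounding whitespace and drop entries that end up empty."""
    cleaned = [field.strip() for field in fields]
    return [field for field in cleaned if field]


raw = ["LNMI patna", "\nhttps://www.linkedin.com/in/satyam-kumar-174273101/"]
print(clean_fields(raw))
```

Calling clean_fields(author_info) before printing would remove the stray newline without changing the values themselves.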
Complete code:
Python3
# import module
import requests
from bs4 import BeautifulSoup


# link for extract html data
# Making a GET request
def getdata(url):
    r = requests.get(url)
    return r.text


# input article by geek
article = "optparse-module-in-python"

# url
# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# traverse author name
for i in soup.find('div', class_="author_handle"):
    Author = i.get_text()

# now get author information
# with author name
# pass the url
# into getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')

# traverse information of author
name = soup.find(
    'div',
    class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText'
).get_text()

author_info = []
for item in soup.find_all(
        'div',
        class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :", name)
print("Author information :")
print(author_info)
Output:
Author name : Satyam Kumar
Author information :
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']