Spoofing IP address when web scraping using Python
In this article, we are going to scrape a website using Requests by rotating proxies in Python.
Modules Required
- The Requests module allows you to send HTTP requests; it returns a response object containing data such as the status code and page content.
Syntax:
requests.get(url, parameter)
- JSON (JavaScript Object Notation) is a format for structuring data. It is mainly used for storing and transferring data between the browser and the server. Python supports JSON with a built-in package called json, which provides all the necessary tools for working with JSON objects, including parsing, serializing, and deserializing.
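As a quick illustration of the json package, the snippet below parses a JSON string of the shape returned later in this article (the sample values are made up for the example):

```python
import json

# A sample JSON string, as an API might return it.
raw = '{"curl": "78.47.16.54:80", "protocol": "http"}'

# json.loads() parses (deserializes) the string into a Python dict.
data = json.loads(raw)
print(data["curl"])    # prints: 78.47.16.54:80

# json.dumps() serializes the dict back into a JSON string.
print(json.dumps(data))
```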
Approach
- Manually create a set of HTTP proxies if you do not have a rapidapi subscription. (Here the create_proxy() function is used to generate a set of HTTP proxies using rapidapi.)
- Iterate the set of proxies and send a GET request using requests.get(url, proxies=proxies) to the website along with the proxies as parameters.
Syntax:
requests.get(url, proxies=proxies)
- If the proxy is working correctly, the call should return a response object for the URL.
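The last step above can be sketched as a small helper. Note that check_proxy() and its timeout value are illustrative additions, not part of the article's program:

```python
import requests


def check_proxy(proxy, url="https://ipecho.net/plain", timeout=5):
    """Return True if `url` can be fetched through `proxy`, else False."""
    try:
        page = requests.get(url,
                            proxies={"http": proxy, "https": proxy},
                            timeout=timeout)
        # A working proxy returns a normal response for the URL.
        return page.status_code == 200
    except requests.RequestException:
        # Dead proxies typically raise ProxyError or ConnectTimeout.
        return False
```

For example, check_proxy('http://78.47.16.54:80') returns True only while that proxy is alive.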
Apart from the code itself, there are a few more setups that need to be done; the details are given below.
Using Rapidapi to get a set of proxies:
- Firstly, you need to buy a subscription to this API on rapidapi, then go to the dashboard, select Python, and copy the api_key.
- Initialize the headers with the API key and the rapidapi host.
Syntax:
headers = {
    'x-rapidapi-key': "paste_api_key_here",
    'x-rapidapi-host': "proxy-orbit1.p.rapidapi.com"
}
- Send a GET request to the API along with the headers.
Syntax:
response = requests.request("GET", url, headers=headers)
- This returns JSON; after parsing the text with json.loads(), the proxy server address can be found under the "curl" key.
Syntax:
response = json.loads(response.text)
proxy = response['curl']
Sending Proxy in requests.get() as parameter:
Send a GET request using requests.get() along with a proxy to https://ipecho.net/plain, which returns the IP address of the current session.
Syntax:
# Note: Opening https://ipecho.net/plain in a browser shows the current IP address of the session.
proxy = 'http://78.47.16.54:80'
page = requests.get('https://ipecho.net/plain', proxies={"http": proxy, "https": proxy})
print(page.text)
Program:
Python3
import requests
import json


# Gets a proxy from rapidapi to help build a set of proxies.
# Use this function only if you have a rapidapi key.
def create_proxy():

    # Endpoint of the proxy-orbit1 API on rapidapi
    # (check your rapidapi dashboard for the exact path).
    url = "https://proxy-orbit1.p.rapidapi.com/v1/"

    # Initialise the headers and paste the API key
    # of proxy-orbit1 from rapidapi.
    headers = {
        'x-rapidapi-key': "paste_api_key_here",
        'x-rapidapi-host': "proxy-orbit1.p.rapidapi.com"
    }

    # Sends a GET request to the above url along with the api
    # keys, which returns an object containing data in json
    # format that is then parsed using json.loads.
    response = requests.request("GET", url, headers=headers)
    response = json.loads(response.text)

    # The proxy server ip address is present in the 'curl' key.
    proxy = response['curl']
    return proxy


# Main Function
if __name__ == "__main__":

    # Create an empty set and call the create_proxy()
    # function to generate a set of proxies from rapidapi.
    # An Orbit Proxy rapidapi key is required.
    proxies = set()
    print("Creating Proxy List")
    for __ in range(10):
        proxies.add(create_proxy())

    # If you do not have rapidapi then create a set of
    # proxies manually, for example:
    # proxies = {'http://78.47.16.54:80', ...}

    # Iterate the proxies and check whether each one is working.
    for proxy in proxies:

        print("\nChecking proxy:", proxy)
        try:
            # https://ipecho.net/plain returns the ip address
            # of the current session if a GET request is sent.
            page = requests.get('https://ipecho.net/plain',
                                proxies={"http": proxy, "https": proxy})
            print("Status OK, Output:", page.text)

        except OSError as e:
            # The proxy returned a connection error.
            print(e)