Hello Hackers!
I'm back with a new tutorial on scripting with Python; this time I'll show you how to do web scraping.
I needed to grab some proxies from the internet to test proxychains, so I decided to write a Python script that would scrape the proxies for me.
ProxyScraping
First of all, you'll need Python and a couple of libraries:
- BeautifulSoup, for parsing the scraped web content
- requests, for making the HTTP requests to the site
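If you don't already have these installed, they can usually be added with pip (lxml is included because the script below uses it as BeautifulSoup's parser):

pip install requests beautifulsoup4 lxml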
For this script I chose www.samair.ru/proxy/ because it's an easy website to scrape.
First we import what we need:
import requests
from bs4 import BeautifulSoup
When I looked at the source code of www.samair.ru, I saw that the link "You can do it there" redirected me to a page listing all the proxies and ports, which can easily be copied and pasted.
On the screenshot below you can see that "You can do it there" is a link to another page.
Here is the source code of the page with all the proxies:
We can see that the IP and port live inside the
<pre></pre>
tag. So we need to tell BeautifulSoup to follow the "You can do it there" link, and when it finds a <pre> tag, to grab its content.
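If you've never used BeautifulSoup, here is a minimal, self-contained sketch of the two calls this script relies on, find_all and find, run against a made-up HTML snippet (the address shown is just a placeholder):

from bs4 import BeautifulSoup

#A made-up page mimicking the structure we are after
html = '<a href="/proxy/list.html">You can do it there</a><pre>127.0.0.1:8080</pre>'
soup = BeautifulSoup(html, "lxml")
#find_all returns every <a> tag that has an href attribute
for a in soup.find_all('a', href=True):
    print(a.text, "->", a['href'])
#find returns the first matching tag; .text strips the markup
print(soup.find("pre").text)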
So here is the code:
#url is defined at module level so the other functions can access it
url = "http://www.samair.ru"
#Here we go to http://www.samair.ru/proxy/
r = requests.get(url + "/proxy")
#We parse the whole page into a BeautifulSoup object called soup
soup = BeautifulSoup(r.text, "lxml")
#We collect all the links into the variable links
links = soup.find_all('a', href=True)
Now we define a function to follow the "You can do it there" link; the function takes the variable links as an argument.
def followLink(links):
    #We loop through the links
    for i in links:
        if i.text == "You can do it there":
            #Return the base url + the link from href
            #ex: http://www.samair.ru/proxy/ip-port/668758911.html
            return url + i['href']
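One caveat: href attributes can be relative or absolute, and concatenating url + i['href'] only works when the href starts with a /. A slightly more defensive variant (just a sketch, not required for this site) lets urllib.parse.urljoin handle the joining:

from urllib.parse import urljoin

def followLink(links):
    for i in links:
        if i.text == "You can do it there":
            #urljoin copes with both relative and absolute hrefs
            return urljoin(url, i['href'])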
Now we write the function that will grab all the IPs and ports.
# The function takes a url as an argument
def scrap(url):
    #Here we go to the page with all the proxies
    r = requests.get(url)
    #We parse the content of the page
    soup = BeautifulSoup(r.text, "lxml")
    #We dump the text inside <pre></pre> into the variable pre
    pre = soup.find("pre").text
    #We return a list of "ip:port" strings, one per line
    return pre.split("\n")
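Note that soup.find("pre") returns None when no <pre> tag exists (for instance if the site changes its layout), and calling .text on it would then raise an AttributeError. A defensive variant could look like this:

def scrap(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    pre = soup.find("pre")
    #Bail out with an empty list if the page layout changed
    if pre is None:
        return []
    return pre.text.split("\n")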
Now we want to write all the proxies to a file.
# write each ip:port to the file as "http ip:port"
def parseToFile(proxies):
    #We open proxy.txt once in append mode; the with statement
    #closes the file for us when the block ends
    with open("proxy.txt", "a") as file:
        for i in proxies:
            #We skip empty lines
            if i:
                #Write the proxy as: http ip:port
                file.write("http " + i + "\n")
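One thing worth double-checking before feeding proxy.txt to proxychains: the ProxyList section of proxychains.conf normally expects the host and port separated by whitespace (http 1.2.3.4 8080) rather than ip:port. If that applies to your setup, a one-line tweak to the write call takes care of it:

#Write "http ip port" instead of "http ip:port"
file.write("http " + i.replace(":", " ") + "\n")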
So we have all the functions needed to scrape the content of the page.
We can now define a main function and run it:
def main():
    #Follow the "You can do it there" link to the proxy list page
    proxyPage = followLink(links)
    proxies = scrap(proxyPage)
    parseToFile(proxies)

if __name__ == "__main__":
    main()
The whole script looks like this:
import requests
from bs4 import BeautifulSoup

#url is defined at module level so the other functions can access it
url = "http://www.samair.ru"
#Here we go to http://www.samair.ru/proxy/
r = requests.get(url + "/proxy")
#We parse the whole page into a BeautifulSoup object called soup
soup = BeautifulSoup(r.text, "lxml")
#We collect all the links into the variable links
links = soup.find_all('a', href=True)

def followLink(links):
    #We loop through the links
    for i in links:
        if i.text == "You can do it there":
            #Return the base url + the link from href
            #ex: http://www.samair.ru/proxy/ip-port/668758911.html
            return url + i['href']

# The function takes a url as an argument
def scrap(url):
    #Here we go to the page with all the proxies
    r = requests.get(url)
    #We parse the content of the page
    soup = BeautifulSoup(r.text, "lxml")
    #We dump the text inside <pre></pre> into the variable pre
    pre = soup.find("pre").text
    #We return a list of "ip:port" strings, one per line
    return pre.split("\n")

# write each ip:port to the file as "http ip:port"
def parseToFile(proxies):
    #We open proxy.txt once in append mode; the with statement
    #closes the file for us when the block ends
    with open("proxy.txt", "a") as file:
        for i in proxies:
            #We skip empty lines
            if i:
                #Write the proxy as: http ip:port
                file.write("http " + i + "\n")

def main():
    #Follow the "You can do it there" link to the proxy list page
    proxyPage = followLink(links)
    proxies = scrap(proxyPage)
    parseToFile(proxies)

if __name__ == "__main__":
    main()
If you run this script, you'll end up with a proxy.txt file in your directory containing all the scraped proxies.
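If you want to quickly check that a scraped proxy actually answers before relying on it, requests can route a single call through it. This is only a sketch; the address is a placeholder you'd swap for a line from proxy.txt:

import requests

proxy = "127.0.0.1:8080"  #placeholder: use an ip:port from proxy.txt
try:
    r = requests.get("http://httpbin.org/ip",
                     proxies={"http": "http://" + proxy}, timeout=5)
    print("proxy works:", r.text)
except requests.RequestException as e:
    print("proxy failed:", e)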
Edit: Tutorial updated.