Hello Hackers!
I'm back with a new tutorial on scripting with Python; this time I'll show you how to do web scraping.
I needed to grab some proxies from the internet to test proxychains, so I decided to write a Python script that would scrape the proxies for me.
ProxyScraping
First of all, you'll need Python and a couple of libraries:
- BeautifulSoup, for parsing the scraped web content
- requests, for making the HTTP requests to the site
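If you don't already have these installed, they can usually be added with pip (lxml is included because the script below uses it as BeautifulSoup's parser):

pip install requests beautifulsoup4 lxml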
For this script I chose www.samair.ru/proxy/ because it's an easy website to scrape.
First we import what we need:
import requests
from bs4 import BeautifulSoup
When I looked at the source code of www.samair.ru, I saw that the link "You can do it there" redirected me to a page listing all the proxies and ports, which can easily be copied and pasted.
On the screenshot below you can see that "You can do it there" is a link to another page.
Here is the source code of the page with all the proxies:
We can see that the IP and port live inside the
<pre></pre>
tag. So we need to tell BeautifulSoup to follow the "You can do it there" link, and when it finds a <pre> tag, to grab its content.
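If you've never used BeautifulSoup, here is a minimal, self-contained sketch of the two calls this script relies on, find_all and find, run against a made-up HTML snippet (the address shown is just a placeholder):

from bs4 import BeautifulSoup

#A made-up page mimicking the structure we are after
html = '<a href="/proxy/list.html">You can do it there</a><pre>127.0.0.1:8080</pre>'
soup = BeautifulSoup(html, "lxml")
#find_all returns every <a> tag that has an href attribute
for a in soup.find_all('a', href=True):
    print(a.text, "->", a['href'])
#find returns the first matching tag; .text strips the markup
print(soup.find("pre").text)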
So here is the code:
#url is defined at module level so the other functions can access it
url = "http://www.samair.ru"
#Here we go to http://www.samair.ru/proxy/
r = requests.get(url + "/proxy")
#We parse the whole page into a BeautifulSoup object called soup
soup = BeautifulSoup(r.text, "lxml")
#We collect all the links into the variable links
links = soup.find_all('a', href=True)
Now we define a function to follow the "You can do it there" link; the function takes the variable links as an argument.
def followLink(links):
    #We loop through the links
    for i in links:
        if i.text == "You can do it there":
            #Return the base url + the link from href
            #ex: http://www.samair.ru/proxy/ip-port/668758911.html
            return url + i['href']
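One caveat: href attributes can be relative or absolute, and concatenating url + i['href'] only works when the href starts with a /. A slightly more defensive variant (just a sketch, not required for this site) lets urllib.parse.urljoin handle the joining:

from urllib.parse import urljoin

def followLink(links):
    for i in links:
        if i.text == "You can do it there":
            #urljoin copes with both relative and absolute hrefs
            return urljoin(url, i['href'])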
Now we write the function that will grab all the IPs and ports.
# The function takes a url as an argument
def scrap(url):
    #Here we go to the page with all the proxies
    r = requests.get(url)
    #We parse the content of the page
    soup = BeautifulSoup(r.text, "lxml")
    #We dump the text inside <pre></pre> into the variable pre
    pre = soup.find("pre").text
    #We return a list of "ip:port" strings, one per line
    return pre.split("\n")
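Note that soup.find("pre") returns None when no <pre> tag exists (for instance if the site changes its layout), and calling .text on it would then raise an AttributeError. A defensive variant could look like this:

def scrap(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    pre = soup.find("pre")
    #Bail out with an empty list if the page layout changed
    if pre is None:
        return []
    return pre.text.split("\n")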
Now we want to write all the proxies to a file.
# write each ip:port to the file as "http ip:port"
def parseToFile(proxies):
    #We open proxy.txt once in append mode; the with statement
    #closes the file for us when the block ends
    with open("proxy.txt", "a") as file:
        for i in proxies:
            #We skip empty lines
            if i:
                #Write the proxy as: http ip:port
                file.write("http " + i + "\n")
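One thing worth double-checking before feeding proxy.txt to proxychains: the ProxyList section of proxychains.conf normally expects the host and port separated by whitespace (http 1.2.3.4 8080) rather than ip:port. If that applies to your setup, a one-line tweak to the write call takes care of it:

#Write "http ip port" instead of "http ip:port"
file.write("http " + i.replace(":", " ") + "\n")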
So we have all the functions needed to scrape the content of the page.
We can now define a main function and run it:
def main():
    #Follow the "You can do it there" link to the proxy list page
    proxyPage = followLink(links)
    proxies = scrap(proxyPage)
    parseToFile(proxies)

if __name__ == "__main__":
    main()
The whole script looks like this:
import requests
from bs4 import BeautifulSoup

#url is defined at module level so the other functions can access it
url = "http://www.samair.ru"
#Here we go to http://www.samair.ru/proxy/
r = requests.get(url + "/proxy")
#We parse the whole page into a BeautifulSoup object called soup
soup = BeautifulSoup(r.text, "lxml")
#We collect all the links into the variable links
links = soup.find_all('a', href=True)

def followLink(links):
    #We loop through the links
    for i in links:
        if i.text == "You can do it there":
            #Return the base url + the link from href
            #ex: http://www.samair.ru/proxy/ip-port/668758911.html
            return url + i['href']

# The function takes a url as an argument
def scrap(url):
    #Here we go to the page with all the proxies
    r = requests.get(url)
    #We parse the content of the page
    soup = BeautifulSoup(r.text, "lxml")
    #We dump the text inside <pre></pre> into the variable pre
    pre = soup.find("pre").text
    #We return a list of "ip:port" strings, one per line
    return pre.split("\n")

# write each ip:port to the file as "http ip:port"
def parseToFile(proxies):
    #We open proxy.txt once in append mode; the with statement
    #closes the file for us when the block ends
    with open("proxy.txt", "a") as file:
        for i in proxies:
            #We skip empty lines
            if i:
                #Write the proxy as: http ip:port
                file.write("http " + i + "\n")

def main():
    #Follow the "You can do it there" link to the proxy list page
    proxyPage = followLink(links)
    proxies = scrap(proxyPage)
    parseToFile(proxies)

if __name__ == "__main__":
    main()
If you run this script, you'll end up with a proxy.txt file in your directory containing all the scraped proxies.
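If you want to quickly check that a scraped proxy actually answers before relying on it, requests can route a single call through it. This is only a sketch; the address is a placeholder you'd swap for a line from proxy.txt:

import requests

proxy = "127.0.0.1:8080"  #placeholder: use an ip:port from proxy.txt
try:
    r = requests.get("http://httpbin.org/ip",
                     proxies={"http": "http://" + proxy}, timeout=5)
    print("proxy works:", r.text)
except requests.RequestException as e:
    print("proxy failed:", e)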
Edit: Tutorial updated.