I have been into web scraping for a long time, ever since I first realized people will pay for large, neatly organized data sets. I came across something interesting and wanted to pass some knowledge on to others. Lynx is a command-line browser. I first discovered it when I had finished my first Linux install, only to realize I needed a desktop and couldn’t quite figure out how to get one going (KDE or Gnome; this was back when you could get free Linux CDs from computer shops).
Our script uses Lynx and its dump option. The dump option pretty much “dumps” the text of the webpage (the stuff you see) to standard output instead of opening the interactive browser.
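To make the dump option concrete, here is a minimal sketch of a wrapper around it. The function name dump_text and the example.invalid URL in the usage line are made up for illustration; the only real piece is the `lynx -dump` call itself.

```shell
# dump_text URL
# Print the rendered text of a page (what you would see in the browser)
# to stdout. dump_text is a hypothetical helper name, not part of Lynx.
dump_text() {
    if ! command -v lynx >/dev/null 2>&1; then
        echo "lynx is not installed" >&2
        return 127
    fi
    # -dump: write the formatted page to stdout and exit
    lynx -dump "$1"
}
```

Something like `dump_text 'http://example.invalid/' > page.txt` would then save the visible text of a page to a file.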
REGULAR EXPRESSIONS
Regular expressions have been around for a very long time, since the 1950s. Many different variants of the “language” have come about since. When you are working your way through tons of data looking for something specific, “regex” (regular expressions) is the way to go. Regular expressions aren’t hard to learn, but it helps to know that the regex you use in PHP isn’t necessarily the regex you will use with grep on the command line.
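As a rough illustration of the flavor difference: in PHP's PCRE functions you would probably reach for `\d`, but POSIX extended syntax (grep -E) has no such shorthand, so the digits are spelled out as `[0-9]`. The pattern below is my own guess at a reasonable IPv4 matcher, not necessarily the one from the script, and the names IP_RE and extract_ips are made up for the example.

```shell
# Dotted-quad pattern in POSIX extended syntax (grep -E).
# grep -E has no \d shorthand, so each digit is written as [0-9].
# Deliberately loose: it would also accept something like 999.999.999.999.
IP_RE='([0-9]{1,3}\.){3}[0-9]{1,3}'

# extract_ips: read text on stdin, print each matched address on its own line.
# -o prints only the matching part of a line, -E enables extended regex.
extract_ips() {
    grep -oE "$IP_RE"
}
```

For example, `printf 'proxy 10.0.0.1 is up\n' | extract_ips` prints just the address and drops the surrounding words.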
BASH SCRIPTING
All you need is a little bit of bash scripting knowledge. You can find some nice bash tutorials over at Ryan’s bash tutorials website. The samair.ru website has a list of free proxies. I’m not sure if they scan for them themselves or rip them from somewhere else, but we are gonna rip the list from them (with our script at the end) with a little help from bash.
We use a simple while loop that increments a counter from 1 to 11. During the loop we use the counter variable in the URL that we dump with Lynx. We also use a simple Linux command, truncate, to clear out the file the proxy site’s contents get dumped into. The truncate -s 0 command empties our file, called plist, so each run starts fresh.
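The loop described above could be sketched roughly like this. The function name scrape_pages and its arguments are made up for the example, and the URL is passed in as a prefix and suffix because the real page URL pattern isn't shown here; treat it as an outline under those assumptions rather than the exact script.

```shell
# scrape_pages URL_PREFIX URL_SUFFIX OUTFILE [LAST_PAGE]
# Dump pages 1..LAST_PAGE (default 11) into OUTFILE with lynx -dump.
# The prefix/suffix split is an assumption about how the page number
# appears in the URL.
scrape_pages() {
    local prefix=$1 suffix=$2 outfile=$3 last=${4:-11}
    local page=1

    # Empty the output file (creating it if needed) so old results are gone.
    truncate -s 0 "$outfile"

    while [ "$page" -le "$last" ]; do
        # The counter is spliced into the URL to pick the page,
        # and each dump is appended to the same file.
        lynx -dump "${prefix}${page}${suffix}" >> "$outfile"
        page=$((page + 1))
    done
}
```

A call would look something like `scrape_pages 'http://example.invalid/proxy-' '.htm' plist`, with the placeholder URL parts swapped for the real ones.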
At the end of the script we simply cat the plist file and pipe it through grep to pull out the IPs using our regular expression. There are many ways to go about this, and as I sit here writing about it I am thinking of modifying it. But it’s a simple script and doesn’t need more attention; it does its job and the job is done. You can use this script to get the proxies from the 11 free pages samair.ru provides. Possibly good to use with proxychains.
It’s my first post and I wasn’t quite sure if the formatting would be okay, so I linked it. I have added the code to the post now. Is there some kind of syntax highlighting I am missing in the little toolbar?
Nice post. But why would you do this when you could parse the HTML with nokogiri in Ruby or BeautifulSoup in Python? This is indeed quick ’n’ dirty and gets the job done, but if you haven’t checked out either of those, I strongly recommend you do.
Scraping is one of my favourite pastimes. I think Ruby does it better @Joe_Schmoe @oaktree
I mean, BS4 is a big install and nokogiri takes more CPU/mem to compile than a DO droplet or an RPi typically has available. I can see why someone might want to use a pure bash solution.
The WWW::Mechanize library is what I use in Perl. Basically, WWW::Mechanize is kind of like a browser in a script (@nugget please correct me if I’m wrong).
This is quick and dirty, as @pry0cc stated, but I would think you could do more if you used Perl and WWW::Mechanize, for example adding an option to grab all the links in the page. Of course there are other ways of doing this, like Python and Ruby, but this is just my preference and my two cents.
All that said, this is a really great post. Good job! Kudos to you.