Simplified Scraping with Lynx, Regex, and Bash


(fxbg) #1

LYNX

I have been into web scraping for a long time, ever since I first realized people will pay for large, neatly organized data. I came across something interesting and wanted to pass some knowledge on to others. Lynx is a command-line browser. I first discovered it when I had completed my first Linux install, only to realize I needed a desktop and couldn’t quite figure out how to get one going (KDE or Gnome; back when you could get free Linux CDs from computer shops).

Our script uses Lynx and its dump option. The dump option pretty much “dumps” the rendered text of the webpage (the stuff you see) to standard output.
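To get a feel for what the dump option produces without touching the network, here is a minimal sketch that renders a small local HTML file. The file name and its contents are made up for illustration; only lynx -dump itself comes from the post.

```shell
#!/bin/bash
# Build a throwaway HTML page (hypothetical content, not from any real site).
workdir=$(mktemp -d)
page="$workdir/sample.html"
printf '%s\n' '<html><body><h1>Proxy list</h1><p>10.0.0.1:8080</p></body></html>' > "$page"

# -dump renders the page and writes the formatted text to standard output.
if command -v lynx >/dev/null 2>&1; then
  dump=$(lynx -dump "$page")
  printf '%s\n' "$dump"
else
  dump=""
  echo "lynx is not installed; skipping the dump" >&2
fi

rm -rf "$workdir"
```

The output should contain the heading and the address as plain text, with the HTML tags gone, which is what makes it easy to grep afterwards.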

REGULAR EXPRESSIONS

Regular expressions have been around a very long time, since the 1950s. Many different variants of the “language” have come about. When you are working your way through tons of data looking for something more specific, “regex” (regular expressions) is the way to go. It’s not hard to learn regular expressions, but it helps to know that the regex you use in PHP isn’t necessarily the regex you will use with grep on the command line.
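As a small illustration of how flavors differ even within grep, the sketch below matches the same IP:port string once with extended syntax (grep -E) and once with basic syntax (plain grep), where the braces and parentheses need backslashes. The sample line is invented for the demo.

```shell
#!/bin/bash
# Invented sample text containing one address.
sample='proxy 10.0.0.1:8080 is up'

# Extended regex (grep -E): {1,3} and ( ) work without backslashes.
ere_match=$(printf '%s\n' "$sample" | grep -Eo '[0-9]{1,3}(\.[0-9]{1,3}){3}:[0-9]{1,5}')

# Basic regex (plain grep): the same pattern needs \{ \} and \( \).
bre_match=$(printf '%s\n' "$sample" | grep -o '[0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}:[0-9]\{1,5\}')

printf 'ERE: %s\nBRE: %s\n' "$ere_match" "$bre_match"
```

PHP's preg functions use a Perl-compatible flavor, so shorthand like \d that works there should not be assumed to work with grep -E; sticking to [0-9] keeps the pattern portable.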

BASH SCRIPTING

All you need is a little bit of bash scripting knowledge. You can find some nice bash tutorials over at Ryan’s bash tutorials website. The samair.ru website has a list of free proxies. I’m not sure whether they scan for them themselves or rip them from somewhere else, but we are gonna rip the list from them (with our script at the end) with the help of bash.

We use a simple while loop that counts from 1 to 11. Inside the loop, the counter variable goes into the URL that Lynx dumps. We also use a simple Linux command, truncate, to “bash” the file that the proxy site’s contents are dumped into. The truncate -s 0 command empties our file, called plist.
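The loop and the truncate step can be tried offline. This is a sketch only: it writes the URLs themselves into a temporary file instead of fetching them, and the temporary file stands in for plist.

```shell
#!/bin/bash
# Temporary stand-in for plist, pre-filled with leftover data.
outfile=$(mktemp)
echo "stale data from a previous run" > "$outfile"

# -s 0 sets the file size to zero bytes, so the old contents are gone.
truncate -s 0 "$outfile"

counter=1
while [ "$counter" -le 11 ]; do
  # The real script would run: lynx -dump "<url>" >> plist
  echo "http://samair.ru/list/ip-port/$counter.htm" >> "$outfile"
  ((counter++))
done

line_count=$(wc -l < "$outfile")
first_url=$(head -n 1 "$outfile")
echo "lines: $line_count, first: $first_url"
rm -f "$outfile"
```

Because the loop appends with >>, skipping the truncate would leave results from earlier runs in the file. Where truncate is not available, `: > plist` empties a file as well.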

At the end of the script we simply cat the plist file out to our screen and grep the IPs using our regular expression. There are many ways to go about this, and as I sit here and write about it I am thinking of modifying it. It was a simple script and doesn’t need more attention; it does its job and the job is done. You can use this script to get the proxies from the 11 free pages samair.ru provides. Possibly good to use with proxychains.

THE SCRIPT

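#!/bin/bash
# Empty the output file so results from earlier runs are not mixed in.
truncate -s 0 plist

# Dump pages 1 through 11 of the proxy list and append the text to plist.
counter=1
while [ "$counter" -le 11 ]
do
  lynx -dump "http://samair.ru/list/ip-port/$counter.htm" >> plist
  ((counter++))
done

# Print only the IP:port pairs. The quotes must be plain ASCII single quotes.
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}' plist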

(oaktree) #2

Hiya there! Please paste in snippets in addition to (or instead of) just linking a pastebin.


(fxbg) #3

It’s my first post, wasn’t quite sure if the formatting would be okay so I linked it. I added the code to the post now. Is there some kind of syntax highlighting I am missing in the little toolbar?


(oaktree) #4

There is… If you see the edit I made you can see how to do it from now on.


(fxbg) #5

Excellent, thank you


(Command-Line Ninja) #6

Nice post. But why would you do this when you could parse HTML with nokogiri in ruby? Or beautifulsoup in python? This is indeed quick n dirty, and gets the job done, but if you haven’t checked out either of those, I strongly recommend you check them out.

Scraping is one of my favourite pastimes. I think ruby does it better :stuck_out_tongue: @Joe_Schmoe @oaktree


(Full Snack Developer) #7

I mean, BS4 is a big install and nokogiri takes more CPU/mem to compile than a DO droplet or an RPi typically has available. I can see why someone might want to use a pure bash solution.


(A Scrub) #8

WWW::Mechanize library is what I use in Perl. Basically WWW::Mechanize is kind of like a browser in a script. (@nugget please correct me if I’m wrong).

Documentation:
http://search.cpan.org/~oalders/WWW-Mechanize-1.86/lib/WWW/Mechanize.pm

This is quick and dirty as @pry0cc stated, but I would think you could do more if you use Perl and WWW::Mechanize. For example, give the option of grabbing all the links in the page with Perl. Of course there are other ways of doing this like Python and ruby, but this is just my preference and my two cents.

Although, this is a really great post. Good Job! Kudos to you.

–Techno Forg–


(fxbg) #9

I normally just regex my way across the pages when scraping, with no libraries like beautifulsoup. I just figured this was more lightweight and hacky :slight_smile:


#10

This no longer works. The web site now uses JavaScript to defeat scrapers.

p.s. this could have been done in one line:

lynx -dump http://samair.ru/list/ip-port/{1,2,3,4,5,6,7,8,9,10,11}.htm |grep -Eo '[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}:[0-9]{1,5}'
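For anyone unsure what the braces do: bash expands them into one word per page before lynx ever runs. The sketch below shows the expansion using the shorter {1..11} sequence form and does not fetch anything. The dump_all helper is hypothetical, not something from this thread, and is only defined here, never called.

```shell
#!/bin/bash
# {1..11} is bash sequence expansion; it yields the same words as {1,2,...,11}.
urls=(http://samair.ru/list/ip-port/{1..11}.htm)

echo "${#urls[@]} URLs"
printf '%s\n' "${urls[0]}" "${urls[10]}"

# Hypothetical helper: dump each page in turn, one lynx call per URL.
dump_all() {
  local url
  for url in "${urls[@]}"; do
    lynx -dump "$url"
  done
}
```

As far as I recall from the Lynx man page, when several URLs are given on the command line only the last one is actually opened, so looping over the list and piping the combined output into grep may be the safer form. Treat that as something to verify on your own install.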


(fxbg) #11

You’re just missing the part where it bashes and saves to a file, plus I think your regex is broken. I just tested this script and it does still work.
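To show what goes wrong with the regex as posted in #10: an unescaped dot matches any character, so the pattern accepts strings that are not addresses at all. A quick sketch with an invented test string:

```shell
#!/bin/bash
# Pattern as posted in #10 (dots unescaped) versus the one from the script.
loose='[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}:[0-9]{1,5}'
strict='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}'

# Invented input that is clearly not an IP address.
sample='1a2b3c4:80'

loose_match=$(printf '%s\n' "$sample" | grep -Eo "$loose")
strict_match=$(printf '%s\n' "$sample" | grep -Eo "$strict")

echo "loose:  '${loose_match}'"
echo "strict: '${strict_match}'"
```

The loose pattern returns the whole junk string while the strict one returns nothing. Neither pattern limits each octet to 0-255 or the port to 65535, so both still accept something like 999.999.999.999:99999.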


(Ahmed) #12

hello,

Thank you very much. Can you please add more bash scripting how-tos?

best regards,


(oaktree) #13

JavaScript as anti-scrape is fairly easy to thwart once you look for the in-page JSON.
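A rough sketch of the idea, staying in bash: if the page ships its data as a JSON literal inside a script tag, you can cut that literal out of the raw HTML and grep it like any other text. The HTML below and the variable name proxies are made up for the example; a real site will use its own names and layout.

```shell
#!/bin/bash
# Invented page source; a real one would come from curl or lynx -source.
html='<html><head><script>var proxies = {"list":["10.0.0.1:8080","10.0.0.2:3128"]};</script></head><body></body></html>'

# Cut out the object literal assigned to the (hypothetical) proxies variable.
json=$(printf '%s\n' "$html" | sed -n 's/.*var proxies = \({.*}\);.*/\1/p')
printf '%s\n' "$json"

# Reuse the IP:port pattern from the script on the extracted JSON.
addresses=$(printf '%s\n' "$json" | grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}')
printf '%s\n' "$addresses"
```

For anything beyond a flat list, passing the extracted JSON to a proper parser is probably less fragile than more regex.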