[Python] Trouble Downloading PDFs with Selenium


#1

Objective

  • Scrape course evaluation PDFs from my university’s website.

Technologies

  • Python 2.7
  • Selenium Webdriver

Problem

This code lets me download 101 PDFs, which get named download.pdf, download(1).pdf, download(2).pdf, etc. It works up to download(100).pdf; after that the Chrome driver pops up a Save As dialog and the program slowly starts to crash.

Attempted Solutions

  1. I’ve tried renaming the download.pdf file that gets downloaded and moving it, but for some reason the next download ends up being named download(1).pdf, even though download.pdf should be an available name.

  2. I’ve tried moving all files from the download directory to a permanent directory, but all that does is move the problem to another directory.

Code

#EDIT: Solution

Thanks to usandfriends and @oaktree for helping me figure out that I needed to use cookies. I rewrote the script so that it uses urllib instead of Selenium. Thanks guys!
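For anyone landing here later, a rough sketch of the cookie-plus-urllib idea might look like the following. This is not the rewritten script itself, just an illustration: the function names, the example URL, and the file name are all made up, and it assumes you already have the session cookie as a plain `name=value` string. It is written for Python 3; on Python 2.7 the same calls live in `urllib2`.

```python
import os
import urllib.request  # on Python 2.7 the equivalent module is urllib2


def build_request(url, cookie_header):
    """Return a GET request that carries the session cookie."""
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


def save_pdf(data, directory, filename):
    """Write raw PDF bytes under a name we choose ourselves."""
    path = os.path.join(directory, filename)
    with open(path, "wb") as handle:
        handle.write(data)
    return path


def download_pdf(url, cookie_header, directory, filename):
    """Fetch one PDF with the authenticated cookie and store it on disk."""
    response = urllib.request.urlopen(build_request(url, cookie_header))
    try:
        return save_pdf(response.read(), directory, filename)
    finally:
        response.close()


# Hypothetical usage (URL and cookie value are placeholders):
# download_pdf("https://example.edu/evals/1.pdf", "session=abc123",
#              ".", "eval_001.pdf")
```

Because the script picks the file name, the browser's download(N).pdf numbering never comes into play.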

Thank You!

Thank you for taking the time to read this. I’m new to 0x00sec and have really loved it so far. This is a great community and I plan to stay. Thanks in advance for your help.


(oaktree) #2

Ya can’t just GET the pdf and write to disk?


#3

I probably could, but you can’t access the URL until you’ve entered your student credentials. I was using Selenium instead of urllib so that I could enter my username and password.


(oaktree) #4

Well once you’ve authed you have the cookie and can do whatever.


#5

How would I go about getting that cookie? Could I run Wireshark while authing and get the cookie that way?


(oaktree) #6

Selenium doesn’t let you manipulate your cookies?

Disclaimer: Idk anything about selenium. I just know cookies exist.
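For reference, Selenium's Python bindings do expose the browser's cookies: `driver.get_cookies()` returns a list of dicts with `name` and `value` keys, among others. A minimal sketch of turning that list into a `Cookie` header for another HTTP client could look like this. The helper name is made up for illustration, and the `driver` in the usage comment is assumed to be a WebDriver that has already logged in.

```python
def cookie_header_from_selenium(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return "; ".join(
        "{0}={1}".format(cookie["name"], cookie["value"]) for cookie in cookies
    )


# Hypothetical usage after logging in through the browser:
# header = cookie_header_from_selenium(driver.get_cookies())
```

The resulting string can then be sent as the `Cookie` header of any later request.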


#7

There are a ton of ways. I think the easiest is opening up the dev console in your browser of choice and just copying and pasting the cookie, but you could also use an intercepting proxy to catch things on the fly (my preferred method). There are also browser plugins that let you manage cookies way more easily than the default dev console shit.
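If you go the copy-and-paste route, the value you copy is usually one long `name=value; name=value` string. A small sketch of splitting that into a dict so you can inspect it or pick out the cookie you need (the function name and the sample values are placeholders, not anything from a real site):

```python
def parse_pasted_cookies(raw):
    """Split a pasted Cookie header value into a {name: value} dict."""
    # Tolerate a leading "Cookie:" if the whole header line was copied.
    if raw.lower().startswith("cookie:"):
        raw = raw[len("cookie:"):]
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.strip().split("=", 1)
        cookies[name] = value
    return cookies


# Example with made-up values:
# parse_pasted_cookies("session=abc123; csrftoken=xyz")
```

Splitting on the first `=` only keeps values that themselves contain `=` intact.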


(oaktree) #8