[Python] Trouble Downloading PDFs with Selenium

Scrim · November 12, 2017, 11:25pm

Objective

Scrape course evaluation PDFs from my university’s website.

Technologies

Python 2.7
Selenium Webdriver

Problem

This code downloads 101 PDFs named download.pdf, download(1).pdf, download(2).pdf, and so on. It works up through download(100).pdf; after that, the Chrome driver pops up a Save As dialog and the program slowly starts to crash.

Attempted Solutions

I’ve tried renaming the download.pdf file that gets downloaded and moving it, but for some reason the next download ends up being named download(1).pdf, even though download.pdf should be an available name.
I’ve tried moving all files from the download directory to a permanent directory, but all that does is move the problem to another directory.

Code

https://gist.github.com/zachbellay/d0449c67edd1db7dff61eeb47b7c8d76

download_pdf_from_id_list.py

from selenium import webdriver
from time import sleep
import yaml
import os
import shutil
import textract
import re
from selenium.common.exceptions import NoSuchElementException

(This file has been truncated; the full script is in the gist linked above.)

EDIT: Solution

Thanks to usandfriends and @oaktree for helping me figure out that I needed to use cookies. I rewrote the script to use urllib2 instead of Selenium. Thanks, guys!

https://gist.github.com/zachbellay/46bf13c354f2e6acae873637be70e868

download_pdfs.py

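import urllib2

# Build the opener once and reuse it; the Cookie header carries the
# session values copied from an already-authenticated browser session.
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'SimpleSAMLAuthToken=EXAMPLE; PHPSESSID=EXAMPLE'))

# valid_ids.txt holds one class ID per line.
with open('valid_ids.txt') as f:
    class_ids = f.read().splitlines()

for class_id in class_ids:
    url = 'https://evaluations.scu.edu/?vclass=' + str(class_id) + '&vtrm=3820'
    response = opener.open(url)
    # Name each file after its class ID so no two downloads share a name.
    filename = str(class_id) + '.pdf'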

(This file has been truncated; the full script is in the gist linked above.)
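Since the preview cuts off before anything is written to disk, here is a rough sketch of how the whole loop might look. It is not the contents of the gist: the helper names (`build_url`, `make_opener`, `save_pdf`, `download_all`) are made up for illustration, the import fallback is only there so the same code runs on Python 3 as well as 2.7, and the cookie values are still placeholders that have to be copied from a logged-in session.

```python
import os

try:
    from urllib.request import build_opener  # Python 3
except ImportError:
    from urllib2 import build_opener  # Python 2.7, as used in the thread

BASE_URL = 'https://evaluations.scu.edu/'  # taken from the script above


def build_url(class_id, term='3820'):
    """Build the evaluation URL for one class ID (term 3820 is from the post)."""
    return BASE_URL + '?vclass=' + str(class_id) + '&vtrm=' + str(term)


def make_opener(cookie_header):
    """Return an opener that sends the given Cookie header on every request."""
    opener = build_opener()
    opener.addheaders.append(('Cookie', cookie_header))
    return opener


def save_pdf(opener, class_id, out_dir='.'):
    """Fetch one PDF and write it to <out_dir>/<class_id>.pdf in binary mode."""
    response = opener.open(build_url(class_id))
    try:
        data = response.read()
    finally:
        response.close()
    path = os.path.join(out_dir, str(class_id) + '.pdf')
    with open(path, 'wb') as out:  # 'wb' because a PDF is binary data
        out.write(data)
    return path


def download_all(ids_path, cookie_header, out_dir='.'):
    """Download one PDF per non-empty line of the ID file."""
    opener = make_opener(cookie_header)
    with open(ids_path) as f:
        class_ids = [line.strip() for line in f if line.strip()]
    return [save_pdf(opener, class_id, out_dir) for class_id in class_ids]
```

Something like `download_all('valid_ids.txt', 'SimpleSAMLAuthToken=EXAMPLE; PHPSESSID=EXAMPLE')` would then run the whole batch. Because each file is named after its class ID, the download(1).pdf, download(2).pdf numbering never comes into play.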

Thank You!

Thank you for taking the time to read this. I’m new to 0x00sec and have really loved it so far. This is a great community and I plan to stay. Thanks in advance for your help.

oaktree · November 12, 2017, 11:57pm

Ya can’t just GET the pdf and write to disk?

Scrim · November 13, 2017, 12:00am

I probably could, but you can’t access the url until you’ve entered your student credentials. The reason why I was using Selenium instead of urllib was so that I could enter my username and password.

oaktree · November 13, 2017, 12:01am

Well once you’ve authed you have the cookie and can do whatever.

Scrim · November 13, 2017, 12:04am

How would I go about getting that cookie? Could I run Wireshark while authing and get the cookie that way?

oaktree · November 13, 2017, 12:05am

Selenium doesn’t let you manipulate your cookies?

Disclaimer: Idk anything about selenium. I just know cookies exist.
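On the Selenium question: the Python bindings do expose the session's cookies through `driver.get_cookies()`, which returns a list of dicts with `name` and `value` keys. A minimal sketch of turning that list into a `Cookie` header, assuming the two cookie names from the solution script are the ones the site checks. The helper name `cookie_header_from_selenium` is made up for illustration.

```python
def cookie_header_from_selenium(cookies, wanted=None):
    """Join Selenium cookie dicts into a single Cookie header value.

    `cookies` is the list returned by driver.get_cookies(); each item is a
    dict with at least 'name' and 'value' keys. If `wanted` is given, only
    cookies with those names are kept.
    """
    pairs = []
    for cookie in cookies:
        if wanted is None or cookie['name'] in wanted:
            pairs.append(cookie['name'] + '=' + cookie['value'])
    return '; '.join(pairs)


# Hypothetical usage once the Selenium-driven login has finished:
#   header = cookie_header_from_selenium(
#       driver.get_cookies(),
#       wanted=('SimpleSAMLAuthToken', 'PHPSESSID'))
#   opener.addheaders.append(('Cookie', header))
```

That keeps Selenium for the login step only and hands the actual downloads to a plain HTTP client.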

Sirius · November 13, 2017, 6:35pm

There are a ton of ways. I think the easiest is opening up the dev console in your browser of choice and just copying and pasting the cookie, but you could also use an intercept proxy to catch things on the fly (my preferred method). There are also browser plugins that let you manage cookies way more easily than the default dev console shit.