For my sins, I wrote a website for a friend’s company that relies on scraping information off another company’s website. The company I’m doing this for has a paid account on the third party’s website, so there’s nothing ethically dubious going on here: I’m basically pulling out information that my client had already put into the third party site.
I couldn’t figure out the third party site’s authentication system, so instead of pulling in a page and parsing it using BeautifulSoup, I use Selenium to attach to it like a web browser.
The third party site, however, is terribly written. It’s full of tables nested within tables, missing closing tags, and everything else that reminds you of the old FrontPage-designed sites that only worked on IE. They don’t consistently use ids or names or anything else to help me find the right bits of data, so I’ve had to wing it and parse things out using regular expressions all over the place. Worse, every now and then they change things around in a way that breaks my scraping.
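To give a flavour of the regular-expression approach, here is a minimal sketch. The HTML fragment, the field names and the helper functions are all made up for illustration (the real site's markup isn't shown here); the point is just matching a label and grabbing whatever text follows in the next cell, without relying on ids or closing tags.

```python
import re

# Hypothetical fragment in the style described above: nested tables,
# unclosed <td>/<tr> tags, and no ids or names to hook onto.
SAMPLE_HTML = """
<table><tr><td><table>
<tr><td><b>Account:</b><td>12345
<tr><td><b>Balance:</b><td>$1,204.50
</table></table>
"""

# A bold label ending in a colon, then the text of the next cell,
# read up to whatever tag happens to come next.
FIELD_RE = re.compile(
    r"<b>\s*([^<:]+):\s*</b>\s*<td[^>]*>\s*([^<]*)",
    re.IGNORECASE,
)


def extract_fields(html):
    """Return a {label: value} dict for every 'label: value' cell pair found."""
    return {label.strip(): value.strip() for label, value in FIELD_RE.findall(html)}


def require_fields(fields, expected):
    """Fail loudly if the site changed and an expected label has gone missing."""
    missing = [name for name in expected if name not in fields]
    if missing:
        raise ValueError("page layout may have changed; missing: " + ", ".join(missing))
    return fields
```

Checking for the expected labels up front at least turns a silent layout change into an obvious error instead of quietly scraping garbage.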
Until now, I scraped using the Selenium “standalone” jar, attaching to a Java process that pretends to be a web browser without actually being one. That matters because I run the scraping process on my web server, which is a headless Linode and, like most web servers, doesn’t even have X11 running on it. (Some components of X11 got installed on it a while back because something needed something that needed something that needed fonts, and voila, suddenly I’ve got bits of X11 installed.)
This method has worked great for several years – when I’m debugging at home I use the version of the Selenium webdriver that fires up a Chrome or a Firefox instance and scrapes, but then when it’s working fine I switch over to the version that connects to a java -jar selenium-standalone.jar process. I don’t know what the official term is, so I’m just going to call it “the selenium process”.
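The switch between a local browser for debugging and the selenium process on the server can be sketched roughly like this. It assumes the Selenium 4 Python bindings, and the SCRAPER_REMOTE_URL environment variable and both function names are hypothetical. The Firefox options are only a placeholder too: which browser the server will actually accept depends on how it was started.

```python
import os

# Hypothetical environment variable: set it on the server to the URL of the
# selenium process, leave it unset at home to get a local browser instead.
REMOTE_URL_VAR = "SCRAPER_REMOTE_URL"


def resolve_remote_url(env=None):
    """Return the selenium process URL from the environment, or None if unset."""
    env = os.environ if env is None else env
    return env.get(REMOTE_URL_VAR) or None


def make_driver(remote_url=None):
    """Start a local Firefox when debugging, or attach to the selenium process."""
    # Imported here so the URL lookup above works even without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()  # placeholder browser choice
    if remote_url is None:
        return webdriver.Firefox(options=options)
    return webdriver.Remote(command_executor=remote_url, options=options)
```

With that in place the scraping code itself just calls make_driver(resolve_remote_url()) and doesn't care which of the two it got.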
I tried a bunch of different ideas, followed a bazillion web links, and tried a bunch of things from those places. Nothing worked. Eventually I had to give up and install Firefox on my web server, along with “geckodriver”, the separate driver program that Selenium uses to launch Firefox. Fortunately Selenium knows how to launch Firefox in a headless manner (although installing it did drag in even more bits of X11 that I don’t actually want or need). That actually worked on the site, once I figured out how to put the geckodriver file somewhere on the path and get the geckodriver.log file written somewhere useful. So I’ve got it done for now. Until the next gratuitous change.
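For the record, a headless Firefox setup along these lines looks something like the sketch below. It assumes a recent Selenium 4 release of the Python bindings, where the Firefox Service object accepts log_output (older releases used log_path instead). The log file location and the function names are made up for illustration.

```python
import shutil

# Assumed location for the driver log; any writable path will do.
LOG_FILE = "/var/log/scraper/geckodriver.log"


def find_geckodriver(search_path=None):
    """Return the full path to geckodriver, or None if it isn't on the path.

    search_path is a PATH-style string; None means use the real PATH.
    """
    return shutil.which("geckodriver", path=search_path)


def make_headless_firefox(log_file=LOG_FILE):
    """Launch Firefox without a display, logging geckodriver output to log_file."""
    # Imported here so the path check above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    driver_path = find_geckodriver()
    if driver_path is None:
        raise RuntimeError("geckodriver not found on PATH")

    options = Options()
    options.add_argument("-headless")  # run Firefox with no visible window

    service = Service(executable_path=driver_path, log_output=log_file)
    return webdriver.Firefox(service=service, options=options)
```

Checking for geckodriver before launching gives a clearer error than whatever Selenium reports when it can't find the driver on its own.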