Scraping login-protected websites with cURL

Posted February 23rd, 2009 by Juozas

Scraping websites with XPath is very easy (read here), but how do you scrape a user's friends list from a social website if it can be viewed only when the user is logged in?

What we need is an algorithm that POSTs the login and password fields to the website's login form and then reuses the same session ID (PHPSESSID on PHP sites) for all further calls. For example, if the login form is POSTed with session ID 123, then every later request carrying session ID 123 can access users-only pages. This works because the server-side application (PHP or another language) marks the session data for that ID as logged in, for example loggedin=true.
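To make the mechanism concrete, here is a minimal sketch of the session ID round trip at the HTTP header level. The helper name extractSessionId and the sample response headers are made up for illustration. PHPSESSID is PHP's default session cookie name, and other sites may well use a different one.

```php
<?php
// Hypothetical helper: pull the session ID out of raw HTTP response headers.
function extractSessionId(string $rawHeaders, string $cookieName = 'PHPSESSID'): ?string
{
    // Matches a header line such as "Set-Cookie: PHPSESSID=123; path=/"
    $pattern = '/^Set-Cookie:\s*' . preg_quote($cookieName, '/') . '=([^;\r\n]+)/mi';

    return preg_match($pattern, $rawHeaders, $m) === 1 ? $m[1] : null;
}

// Made-up headers, roughly what a login page might send back.
$loginResponseHeaders = "HTTP/1.1 302 Found\r\n"
    . "Set-Cookie: PHPSESSID=123; path=/\r\n"
    . "Location: /secure/profile.php\r\n";

$sessionId = extractSessionId($loginResponseHeaders);

// Every later request sends the same ID back in its Cookie header,
// which is how the server recognises the logged-in session.
$cookieHeader = 'Cookie: PHPSESSID=' . $sessionId;
echo $cookieHeader, "\n";
```

In practice you rarely need to do this by hand, because cURL's cookie handling stores and resends the cookie for you.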

But how do you handle all this work with cookies and session IDs? Luckily, PHP has the cURL extension, which simplifies connecting to remote addresses, using cookies, staying in one session, POSTing data, etc. It's a really powerful library that lets you control essentially every part of an HTTP request, including its headers.

For crawling secure pages, I've created a very simple Secure_Crawler class, which works like this:

include 'crawler.php';

$crawler = new Secure_Crawler();

// Log in to the website
$crawler->login('my_username', 'secure_password');

// Get the content of a users-only page
$content = $crawler->get('http://www.example.com/secure/profile.php');

// modifications...

If you look at the class source code, you will see that it works as follows:

  1. When a Secure_Crawler instance is created, the default cURL options are set
  2. The login() method POSTs the given credentials to the login page (the URL is hard-coded)
  3. The get(url) method loads the page at the given URL (the session data from the login is reused)

The class itself is easy to extend: as long as you pass the cookie file to the cURL handle, the login information will (should) be reused and all users-only content will be available.
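The three points above, together with the cookie file, could be sketched roughly like this. This is not the original crawler.php: the class name Secure_Crawler_Sketch, the constructor arguments, and the form field names username and password are assumptions, and the login URL is passed in here instead of being hard-coded.

```php
<?php
// Illustrative sketch only; not the original crawler.php.
class Secure_Crawler_Sketch
{
    private $curl;     // cURL handle shared by all requests
    private $loginUrl; // URL the login form posts to (assumption)

    public function __construct(string $loginUrl, string $cookieFile)
    {
        $this->loginUrl = $loginUrl;
        $this->curl = curl_init();
        curl_setopt_array($this->curl, self::defaultOptions($cookieFile));
    }

    // 1. Default options: return the body as a string, follow redirects, keep cookies.
    public static function defaultOptions(string $cookieFile): array
    {
        return [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved to this file
            CURLOPT_COOKIEFILE     => $cookieFile, // cookies are loaded from this file
        ];
    }

    // 2. POST the credentials to the login form.
    public function login(string $username, string $password)
    {
        curl_setopt_array($this->curl, [
            CURLOPT_URL        => $this->loginUrl,
            CURLOPT_POST       => true,
            CURLOPT_POSTFIELDS => http_build_query([
                'username' => $username, // assumed field name
                'password' => $password, // assumed field name
            ]),
        ]);

        return curl_exec($this->curl);
    }

    // 3. Load a page with a GET request; the session cookie from login() is sent along.
    public function get(string $url)
    {
        curl_setopt_array($this->curl, [
            CURLOPT_URL     => $url,
            CURLOPT_HTTPGET => true, // switch the handle back from POST to GET
        ]);

        return curl_exec($this->curl);
    }
}
```

Because all requests go through one handle with the cookie options set, the session cookie received during login() is sent automatically by every later get() call.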

Using a similar class, I've pseudo-reverse-engineered an API. I needed to enter information into another website by hand multiple times per day, because they didn't offer any remote services (like XML-RPC or REST), so I created a class that mimics API functionality. From the outside it looks like a normal API object, but inside, the code actually POSTs everything to the real website.
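A wrapper of that kind might look roughly like the sketch below. Everything in it is hypothetical: the class name Form_Backed_Api, the addEntry() method, the form URL, the field names, and the "Entry saved" success marker are placeholders, since the real ones depend on the target site's HTML. The transport is injected as a callable so the example runs without a network connection.

```php
<?php
// Hypothetical facade: looks like an API client, but submits the site's HTML form.
class Form_Backed_Api
{
    private $post; // callable(string $url, array $fields): string

    public function __construct(callable $post)
    {
        $this->post = $post;
    }

    // To the caller this looks like a remote API call.
    public function addEntry(string $title, string $body): bool
    {
        // Assumed form URL and field names; read the real ones from the site's form.
        $html = ($this->post)('http://www.example.com/entries/add.php', [
            'title' => $title,
            'body'  => $body,
        ]);

        // Assumed success marker in the returned page.
        return strpos($html, 'Entry saved') !== false;
    }
}

// Stand-in transport for demonstration only. A real one would POST the
// fields through the logged-in cURL handle and return the response body.
$fakePost = function (string $url, array $fields): string {
    return '<p>Entry saved: ' . htmlspecialchars($fields['title']) . '</p>';
};

$api = new Form_Backed_Api($fakePost);
var_dump($api->addEntry('Hello', 'First post'));
```

Keeping the form details inside one class means that when the website changes its form, only that class needs to be updated.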

All websites work differently, so you need to spend some time analysing how login form submission is handled on that specific website. For example, you may need to use SSL or even your own certificate. The details differ from site to site, but the basics are the same: it will work as long as you are patient enough to tweak its workflow.
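For the SSL case, a few extra cURL options are usually enough. The sketch below assumes a hypothetical helper named sslOptions, and the file paths and URL are placeholders. Whether a CA bundle or a client certificate is needed at all depends on the site.

```php
<?php
// Hypothetical helper that builds extra cURL options for HTTPS sites.
function sslOptions(?string $caBundle = null, ?string $clientCert = null, ?string $clientKey = null): array
{
    $options = [
        CURLOPT_SSL_VERIFYPEER => true, // verify the server's certificate
        CURLOPT_SSL_VERIFYHOST => 2,    // check that the certificate matches the host name
    ];

    if ($caBundle !== null) {
        $options[CURLOPT_CAINFO] = $caBundle; // custom CA bundle file
    }
    if ($clientCert !== null) {
        $options[CURLOPT_SSLCERT] = $clientCert; // your own (client) certificate
    }
    if ($clientKey !== null) {
        $options[CURLOPT_SSLKEY] = $clientKey; // private key for the client certificate
    }

    return $options;
}

// Placeholder URL and certificate path, for illustration only.
$ch = curl_init('https://www.example.com/login.php');
curl_setopt_array($ch, sslOptions(null, '/path/to/client.pem'));
```

These options can be merged into the default options of the crawler so that every request uses them.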

Comments (3)

  1. Joshua

    Very handy, thanks!

  2. Cory

    great tool!
    Curious, what code would you put if, after you login, you end up on a https page. I get a host error that the page does not exist when I try to crawl.

  3. Kęstutis

    Since I see you're Lithuanian, I'm writing in Lithuanian :) .
    I built a login with cURL. It works with pages that have the usual kind of login. But there is this site, http://www.skelbikas.lt, which uses the e-pasas login. When you log in, it goes to https://www.e-pasas.lt/Users/Auth.html, and then somehow redirects to http://www.skelbikas.lt/authorize.php?var1=PDD4LAjZawzIVw==&var2=6&var3=0 . I just can't figure out how to do the login for this one. Even if I specify that second URL, it still doesn't log in.
