Scraping the College Portal: An Engineering Case Study in Real-World Automation, UX, and Cloudflare Blocks
If you're an engineering student in India, the number 75 is basically a permanent background process running in your brain. It's the arbitrary, unforgiving line between academic survival and getting barred from writing your semester exams. Drop to 74.9%, and you're suddenly pleading with the department head or begging for medical certificates.
At St. Aloysius College (SOE), our official source of truth for this metric is a portal called btechconnect.staloysius.edu.in. To be fair to the developers who built it, the database is accurate. But using the actual portal feels like loading a webpage in 2012 over a dial-up connection. It has zero mobile optimization, crashes when everyone tries to log in before exams, and forces you to re-type your credentials literally every single time you open the tab.
But the real issue isn't the styling; it's the lack of empathy in how the data is presented. When you finally get the page to load, you are greeted with a dry, static HTML table that looks something like this:
- Engineering Chemistry: 32 conducted, 25 attended (78.1%)
That raw percentage doesn't actually tell you what you need to know in the moment. As a student, your internal monologue is usually a series of highly anxious, algebra-heavy questions:
- "I have a fever today. Can I skip this morning's double-slot lab without dropping below 75%?"
- "I'm currently at 68%. Exactly how many consecutive classes do I have to attend to get back to safety?"
- "There are only three weeks left in the semester. If I attend every single class from now on, is it mathematically possible for me to clear the bar, or am I already cooked?"
To answer these, we were constantly whipping out our phone calculators or scribbling algebra on the back of notebooks. It was an inefficient, stressful ritual. I decided to build a solution: B.Tech Connect — Attendance Tracker. A clean, modern, privacy-first web application that scrapes the legacy portal in the background and translates raw tables into actionable, real-time math, wrapped in a premium dark-mode dashboard.
Designing for clarity
From day one, I knew I didn't want to build a simple wrapper that just re-formatted the portal's layout. I wanted to build a proactive, intelligent dashboard that understands student anxiety.
- No annoying sign-ups: I refused to build another "Sign Up with Email" flow. Students hate filling out forms. It had to be: type your portal ID, hit enter, and see your dashboard. Simple.
- Absolute privacy: Storing university passwords on a database is a security nightmare. If my database ever leaked, I'd be in serious trouble, and rightfully so. The credentials had to be used on-the-fly to negotiate a session, then immediately discarded.
- The "Can I Bunk?" Math: The app had to answer two primary questions for every subject: "Can Skip" (safe-to-bunk classes) and "Must Attend" (catch-up classes).
- Timetable Context: WhatsApp is filled with outdated PDF timetables. The dashboard should automatically know what classes are scheduled today and highlight them.
I wanted the user experience to feel snappy, responsive, and "alive." Instead of a static spreadsheet, it had to feel like a premium financial dashboard—think Robinhood, but for tracking your academic credit.
Architecture & tech stack
I wanted to build this fast before the semester got too busy, so I kept the stack as lean as possible. No separate backend servers, no complex databases to manage. Just a single Next.js project.
Next.js 16 (App Router)
I chose Next.js because it's a fantastic full-stack framework. The App Router allowed me to keep my frontend code and scraper logic colocated in a single project. The API routes serve as our backend, letting us spin up serverless functions (or standard Node environments) to run our automation logic without deploying a separate Express or FastAPI server.
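Since the scraper lives behind an API route, the backend surface is tiny. Here is a hypothetical sketch of what such an App Router handler could look like; the file path `app/api/attendance/route.js` and the stubbed scrape step are illustrative, not the actual code:

```javascript
// Hypothetical sketch of an App Router handler (app/api/attendance/route.js).
// In the real file this would be `export async function POST`; the Playwright
// scrape step is stubbed out here so only the shape of the route is shown.
async function POST(request) {
  const { register_no, password } = await request.json();
  if (!register_no || !password) {
    return Response.json({ error: 'Missing credentials' }, { status: 400 });
  }

  // ...spin up Playwright, log into the portal, scrape the attendance table...
  const subjects = [{ code: 'CHE101', attended: 25, total: 32 }]; // stubbed result

  return Response.json({ subjects });
}
```

Because App Router handlers use the standard web `Request`/`Response` objects, the same function runs unchanged on serverless platforms or a plain Node server.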
Why standard JavaScript for the scraper?
I write a lot of TypeScript, but for the scraping layer, TS felt like a chore. The student portal changes its HTML structure randomly. When a selector breaks, I want to edit a JS file, hot-reload, and see the fix in milliseconds. Writing type definitions for messy HTML tables and casting every DOM node just slowed down my trial-and-error loop.
Backend Structure & Auth: Moving Away from Insecure Patterns
In early prototypes, developers often make the mistake of storing credentials or active portal session cookies in the browser's sessionStorage or localStorage. This is a massive security risk (susceptible to XSS attacks).
To solve this, I designed a server-side session handler using the jose library for secure, encrypted JWT cookies:
- When a student enters their credentials on our login page, a POST request is sent to /api/attendance.
- The server spins up a Playwright headless instance, logs into btechconnect, and retrieves the authenticated session cookies.
- Instead of sending these raw cookies back to the client, the server packages them inside a JWT encrypted with a server-side SESSION_SECRET and sets it as an HttpOnly, Secure, SameSite=Lax cookie named attendance_session.
- On subsequent requests, the client's browser automatically sends this cookie. The Next.js API decrypts it, extracts the target portal's session cookies, and directly calls the portal's APIs to fetch fresh data—completely bypassing the slow browser login flow.
Zero-Config Branch Derivation
To keep onboarding down to a single click, I didn't want to ask users "What is your branch?". I dug into our university's registration patterns and wrote a utility in route.js to auto-derive their branch based on their register number ranges:
```javascript
function deriveBranch(register_no) {
  const regNumber = parseInt(register_no, 10);
  if (isNaN(regNumber)) return "UNKNOWN";
  if (regNumber >= 25190101 && regNumber <= 25190157) return "CSE";
  if (regNumber >= 25191101 && regNumber <= 25191160) return "AIML";
  if (regNumber >= 25192101 && regNumber <= 25192151) return "ISE";
  if (regNumber >= 25195101 && regNumber <= 25195141) return "ECE";
  return "UNKNOWN";
}
```

This single piece of logic automatically hooks the user into the correct branch timetable on their first login!
Fighting Cloudflare
Everything was running beautifully on my local machine. Then, around April 2025, the college portal team implemented Cloudflare Turnstile. Suddenly, my serverless deployments on Vercel started returning 403 Forbidden errors. My automated scraper was hitting a brick wall.
This kicked off a two-week spiral of debugging. Locally, Playwright worked because it launched a real Chrome window on a residential IP. In the cloud, headless Chromium on an AWS or Vercel IP range is basically a giant flag waving "I AM A BOT" to Cloudflare.
I spent weeks researching stealth automated browsers. In scraper.js, I implemented Playwright Stealth Plugins to override default automation flags, disabled AutomationControlled, and overrode navigator.webdriver:
```javascript
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
```

I also set realistic viewport dimensions, mimicked an en-US locale, set the timezone to Asia/Kolkata, used a common Windows user-agent, and added randomized delays (using Math.random()) to mimic human typing speeds when filling out the login form.
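As one concrete example of that humanizing layer, a tiny jitter helper can drive the typing delays; the names here are illustrative, not the actual scraper.js code:

```javascript
// Hypothetical helper for human-like typing pauses (names are illustrative).
// Returns a delay in milliseconds, jittered uniformly between min and max.
function humanDelayMs(min = 80, max = 260) {
  return min + Math.random() * (max - min);
}

// Usage with Playwright: type one character at a time, re-rolling the
// pause for every keystroke instead of using one fixed delay.
async function typeLikeHuman(page, selector, text) {
  for (const char of text) {
    await page.type(selector, char, { delay: humanDelayMs() });
  }
}
```

Per-keystroke jitter matters because a perfectly constant inter-key delay is itself a bot signature.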
Despite these efforts, cloud provider IP ranges (AWS, DigitalOcean, Vercel) are heavily blacklisted by Cloudflare. Running headless Chromium from these IPs was reliably fingerprinted and challenged.
To solve this without paying for expensive residential proxy networks, I preserved the project as a fully functional local development application (which works beautifully on any domestic Wi-Fi connection) and archived the production cloud deployment. This was an invaluable lesson in the limitations of scraping as a backend strategy—at a certain scale, security walls require official APIs or user-cooperative scrapers (like browser extensions).
Handling inconsistent edge cases
Writing the attendance calculations seemed trivial at first, but edge cases quickly emerged. Solving the algebra for "Must Attend" classes was surprisingly tricky when dealing with discrete values.
Let's say your target is 75% (T = 0.75). You've attended A classes out of N conducted. You want to find the number of consecutive classes x you must attend to satisfy: (A + x) / (N + x) >= T.
Solving for x:

A + x >= T(N + x)

x(1 - T) >= T * N - A

x >= (T * N - A) / (1 - T)

Since classes are discrete integers, we take the ceiling: x = Math.ceil((T * N - A) / (1 - T)).
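In code, that ceiling works out like this (a standalone sketch with an illustrative function name, not the exact app code):

```javascript
// Catch-up formula: x = ceil((T*N - A) / (1 - T)), clamped at zero.
function mustAttend(attended, conducted, target = 0.75) {
  const x = Math.ceil((target * conducted - attended) / (1 - target));
  return Math.max(0, x); // already at or above target -> nothing to catch up
}

// At 34/50 (68%), a student needs ceil((37.5 - 34) / 0.25) = 14 straight
// classes: (34 + 14) / (50 + 14) = 48/64 = exactly 75%.
```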
But when I first implemented this, my console started throwing weird NaN and Infinity values. Why? If a student has 0 classes conducted so far (like in the first week of a semester), N is 0: the percentage calculation becomes 0/0, which is NaN in JavaScript, and the raw formulas can spit out negative or nonsensical numbers. And what happens if the maximum attendance a student can possibly reach by the end of the semester is mathematically lower than 75%?
In SubjectCard, I resolved these edge cases with rigorous safety guards:
```javascript
const catchUpClasses = Math.ceil((targetDecimal * total - attended) / (1 - targetDecimal));
const safeToBunk = Math.floor((attended - targetDecimal * total) / targetDecimal);
const displaySafeToBunk = Math.max(0, safeToBunk);
const displayCatchUp = Math.max(0, catchUpClasses);
```
If a student's maximum possible percentage (assuming they attend every remaining class until the last working day) drops below their target, the UI shifts to render a custom warning:
```javascript
const exactRemaining = calculateExactRemaining(code, branch, endDate);
const projectedTotal = total + exactRemaining;
const maxPossiblePercent = ((attended + exactRemaining) / projectedTotal) * 100;
```
Instead of displaying a confusing negative "Must Attend" number, the card renders a warning banner, "You're Cooked!", along with the exact number of extra classes needed beyond the remaining schedule, so the math always stays honest.
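A condensed sketch of that check, with the remaining-classes count passed in directly rather than going through calculateExactRemaining (function and field names are illustrative):

```javascript
// Sketch of the "You're Cooked" check. `remaining` is the number of classes
// left on the schedule until the last working day.
function cookedStatus(attended, total, remaining, target = 0.75) {
  const maxPossible = (attended + remaining) / (total + remaining);
  if (maxPossible >= target) return { cooked: false };
  // Total consecutive classes needed to ever hit the target...
  const mustAttend = Math.ceil((target * total - attended) / (1 - target));
  // ...minus what the schedule actually offers = the impossible shortfall.
  return { cooked: true, extraNeeded: mustAttend - remaining };
}
```

For example, at 20/40 with only 20 classes left, the best case is 40/60 (about 66.7%), so the student is cooked and would need 20 extra classes that simply do not exist on the calendar.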
UI/UX Design Thinking: Designing for Calm
Student dashboards are usually ugly, cluttered spreadsheets that scream "You are failing!" at you. Because attendance tracking is linked to anxiety, I wanted the UI/UX of this app to feel calm, focused, and premium.
- Password Anxiety: Explicit subtext on login: "We never store your password" — Establishes immediate trust and transparency.
- Cluttered Timetables: "Today's Hitlist" Section — Filters the master timetable down to show only classes scheduled for the current day, showing a clean timeline.
- Visual Stress: Harmonious Dark Palette (#09090b and #18181b) with soft glassmorphism — Reduces eye strain during late-night scrolling.
- Mathematical Friction: Isolated "Can Skip" and "Must Attend" giant metrics — Gives the student their core answers within 500ms of looking at the page.
Dynamic Global Health Indicator
To provide an instant overview, I wanted to avoid making the student read every single subject card to understand their standing. I built a Global Health Indicator using the student's profile photo:
```javascript
let profileRingColor = 'border-[#D9A02A]/30';
let profileGlow = 'shadow-[0_0_15px_rgba(217,160,42,0.15)]';

if (subjects && subjects.length > 0) {
  const lowestPercent = subjects.reduce((min, s) => {
    const percent = (s.attended / s.total) * 100;
    return percent < min ? percent : min;
  }, 100);

  if (lowestPercent < 73) {
    profileRingColor = 'border-[#FF453A]/80';
    profileGlow = 'shadow-[0_0_20px_rgba(255,69,58,0.4)]';
  } else if (lowestPercent >= 73 && lowestPercent < 75) {
    profileRingColor = 'border-[#FFD60A]/80';
  } else {
    profileRingColor = 'border-emerald-500/80';
  }
}
```

The profile picture's outer ring and ambient glow dynamically transition between green, yellow, and red based on the student's lowest subject percentage. It instantly signals whether they are fully safe, on the edge, or in danger.
User Autonomy: Personalizing the Dashboard
The official college database often imports names in rigid, all-caps strings or as raw register numbers. To make the dashboard feel personal, I added a feature that lets students simply click on their name in the header to edit it. This value is saved directly to their browser's localStorage and persists across sessions, giving them ownership of their dashboard.
Performance & Optimization: Perceived Speed
Scraping is slow. Logging into the student portal via Playwright takes anywhere from 4 to 8 seconds because of server lag on their end. To ensure this didn't ruin the user experience, I engineered several layers of optimization:
- Cookie Session Cache: Once a session cookie is obtained, we reuse it for up to 24 hours. The app checks if a valid session exists on mount; if it does, it directly fetches the data using rapid fetch() JSON requests, bringing page load speeds down to under 1.5 seconds.
- Optimistic UI and Bouncing Loaders: While a fresh login scraper runs, instead of showing a blank screen or a harsh spinner, we render a highly polished pulsing skeleton loader that mimics the final card layout, reducing the user's perceived wait time.
- Smart Re-validation Guard: When switching semesters, the application checks if the cached attendance data matches the requested semester. If it does, it skips the expensive backend API call entirely and renders instantly, preventing unnecessary load on our scraper server and the college portal.
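The re-validation guard boils down to a couple of cheap checks before any network call. A sketch under an assumed cache shape (`semester`, `fetchedAt`); the names are illustrative:

```javascript
// Hypothetical freshness guard: reuse cached data only if it belongs to the
// requested semester and the underlying session is under 24 hours old.
const MAX_SESSION_AGE_MS = 24 * 60 * 60 * 1000;

function shouldRefetch(cache, semester, now = Date.now()) {
  if (!cache) return true;                           // nothing cached yet
  if (cache.semester !== semester) return true;      // semester switch -> refetch
  return now - cache.fetchedAt > MAX_SESSION_AGE_MS; // stale session -> refetch
}
```

Cheap guards like this are what keep both our scraper and the college's portal from being hammered by redundant logins.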
What I Learned: Engineering Beyond the Code
- Security & Privacy are Non-Negotiable: When you build a tool that asks for college credentials, students will rightly hesitate. Securing the transport layer with HTTPS and using HttpOnly JWT cookies taught me how to architect secure sessions from the ground up.
- The "Uncooperative Backend" Dilemma: In personal projects, we usually write our own APIs. Working with a legacy, slow, third-party backend taught me how to write robust error handling. I had to build screenshots-on-fail and HTML logging to debug exactly why our scraper failed in serverless environments.
- Product Empathy: I realized that good software doesn't just calculate numbers; it manages emotion. Designing the "Can Skip" metric gave students immediate relief, while the "You're Cooked" warning gave them a realistic, albeit tough, reality check to take action.
Future Improvements: The Next Semester
If I were to rebuild this project or scale it further, there are several key architectural changes I would explore:
- 1. User-Cooperative Scraping (Chrome Extension): Instead of running Playwright on a cloud server (which is expensive and highly vulnerable to IP-based Cloudflare blocking), I would build a companion Chrome/Firefox Extension. The extension would scrape the data directly from the user's local machine (where they easily pass Cloudflare checks) and sync it to their local dashboard.
- 2. Predictive Timetable Analytics: Integrating the historical timetable trends to predict the likelihood of surprise classes or holiday cancellations, giving students even more accurate "Can Skip" projections.
- 3. Push Notifications: Alerting students when their attendance in a specific subject falls below 76%, acting as an early-warning system.
- 4. PWA Integration: Adding full offline support via service workers so students can check their schedule and cached attendance when walking through the college's low-connectivity basement labs.
Final Reflection: Solving Your Own Problems
Ultimately, this project represents the reason I got into software engineering.
There is a unique thrill in looking at a clunky, frustrating daily process, sitting down with a text editor, and building a tool that makes life easier for yourself and your peers. Even though our cloud deployment is paused due to the ever-escalating bot-detection wars of the modern web, the engineering journey—from solving mathematical ceiling edge cases to wrapping Playwright inside containerized Docker builds—has been incredibly rewarding.
It proved that with some reverse engineering, clean design, and focus, we can turn even the most boring, anxiety-inducing college spreadsheets into a premium, empowering user experience.
Written by Sid, an ISE Student at St. Aloysius College (SOE).
Find the repository on GitHub (https://github.com/sid20007).