Scraping the College Portal: An Engineering Case Study in Real-World Automation, UX, and Cloudflare Blocks

12 min read
Tags: Next.js, Playwright, Web Scraping, JavaScript

If you're an engineering student in India, the number 75 is basically a permanent background process running in your brain. It's the arbitrary, unforgiving line between academic survival and getting barred from writing your semester exams. Drop to 74.9%, and you're suddenly pleading with the department head or begging for medical certificates.

At St. Aloysius College (SOE), our official source of truth for this metric is a portal called btechconnect.staloysius.edu.in. To be fair to the developers who built it, the database is accurate. But using the actual portal feels like loading a webpage in 2012 over a dial-up connection. It has zero mobile optimization, crashes when everyone tries to log in before exams, and forces you to re-type your credentials literally every single time you open the tab.

But the real issue isn't the styling; it's the lack of empathy in how the data is presented. When you finally get the page to load, you are greeted with a dry, static HTML table that looks something like this:

That raw percentage doesn't actually tell you what you need to know in the moment. As a student, your internal monologue is usually a series of highly anxious, algebra-heavy questions:

To answer these, we were constantly whipping out our phone calculators or scribbling algebra on the back of notebooks. It was an inefficient, stressful ritual. I decided to build a solution: B.Tech Connect — Attendance Tracker. A clean, modern, privacy-first web application that scrapes the legacy portal in the background and translates raw tables into actionable, real-time math, wrapped in a premium dark-mode dashboard.

Designing for clarity

From day one, I knew I didn't want to build a simple wrapper that just re-formatted the portal's layout. I wanted to build a proactive, intelligent dashboard that understands student anxiety.

I wanted the user experience to feel snappy, responsive, and "alive." Instead of a static spreadsheet, it had to feel like a premium financial dashboard—think Robinhood, but for tracking your academic credit.

Architecture & tech stack

I wanted to build this fast before the semester got too busy, so I kept the stack as lean as possible. No separate backend servers, no complex databases to manage. Just a single Next.js project.

Next.js 16 (App Router)

I chose Next.js because it's a fantastic full-stack framework. The App Router let me keep my frontend code and scraper logic colocated in a single project. Its Route Handlers (the route.js files that take the place of API routes) serve as our backend, so the automation logic runs in serverless functions or a standard Node environment without deploying a separate Express or FastAPI server.

Why standard JavaScript for the scraper?

I write a lot of TypeScript, but for the scraping layer, TS felt like a chore. The student portal changes its HTML structure randomly. When a selector breaks, I want to edit a JS file, hot-reload, and see the fix in milliseconds. Writing type definitions for messy HTML tables and casting every DOM node just slowed down my trial-and-error loop.

Backend Structure & Auth: Moving Away from Insecure Patterns

In early prototypes, developers often make the mistake of storing credentials or active portal session cookies in the browser's sessionStorage or localStorage. This is a massive security risk: any script injected through an XSS hole can read both stores.

To solve this, I designed a server-side session handler using the jose library for secure, encrypted JWT cookies:
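
The jose setup itself isn't reproduced here (jose exposes EncryptJWT and jwtDecrypt for this kind of cookie). As a stand-in that avoids guessing at that configuration, the sketch below shows the same idea using only Node's built-in crypto module: the session payload is sealed with AES-256-GCM on the server, so the browser only ever holds an opaque value. The function names, the secret, and the payload fields are all assumptions for illustration.

```javascript
const { createCipheriv, createDecipheriv, randomBytes, scryptSync } = require("node:crypto");

// Hypothetical secret; a real deployment would read this from an environment variable.
const KEY = scryptSync("replace-with-a-long-random-secret", "session-salt", 32);

// Encrypts a small session payload into a string suitable for a cookie value.
function sealSession(payload, maxAgeSeconds = 3600) {
  const iv = randomBytes(12); // fresh nonce for every cookie
  const body = JSON.stringify({ ...payload, exp: Date.now() + maxAgeSeconds * 1000 });
  const cipher = createCipheriv("aes-256-gcm", KEY, iv);
  const encrypted = Buffer.concat([cipher.update(body, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Layout: iv (12 bytes) + auth tag (16 bytes) + ciphertext
  return Buffer.concat([iv, tag, encrypted]).toString("base64url");
}

// Returns the session object, or null if the cookie is tampered, malformed, or expired.
function openSession(cookieValue) {
  try {
    const raw = Buffer.from(cookieValue, "base64url");
    const decipher = createDecipheriv("aes-256-gcm", KEY, raw.subarray(0, 12));
    decipher.setAuthTag(raw.subarray(12, 28));
    const body = Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]);
    const session = JSON.parse(body.toString("utf8"));
    return session.exp > Date.now() ? session : null;
  } catch {
    return null;
  }
}
```

In a Next.js route, the sealed string would be set as an httpOnly, secure cookie so that page scripts cannot read it at all.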

Zero-Config Branch Derivation

To keep onboarding down to a single click, I didn't want to ask users "What is your branch?". I dug into our university's registration patterns and wrote a utility in route.js to auto-derive their branch based on their register number ranges:

snippet.javascript
function deriveBranch(register_no) {
  const regNumber = parseInt(register_no, 10);
  if (isNaN(regNumber)) return "UNKNOWN";
  if (regNumber >= 25190101 && regNumber <= 25190157) return "CSE";
  if (regNumber >= 25191101 && regNumber <= 25191160) return "AIML";
  if (regNumber >= 25192101 && regNumber <= 25192151) return "ISE";
  if (regNumber >= 25195101 && regNumber <= 25195141) return "ECE";
  return "UNKNOWN";
}

This single piece of logic automatically hooks the user into the correct branch timetable on their first login!

Fighting Cloudflare

Everything was running beautifully on my local machine. Then, around April 2025, the college portal team implemented Cloudflare Turnstile. Suddenly, my serverless deployments on Vercel started returning 403 Forbidden errors. My automated scraper was hitting a brick wall.

This kicked off a two-week spiral of debugging. Locally, Playwright worked because it launched a real Chrome window on a residential IP. In the cloud, headless Chromium on an AWS or Vercel IP range is basically a giant flag waving "I AM A BOT" to Cloudflare.

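I spent weeks researching stealth automated browsers. In scraper.js, I implemented Playwright Stealth Plugins to override default automation flags, disabled Chromium's AutomationControlled Blink feature (typically done with the --disable-blink-features=AutomationControlled launch argument), and overrode navigator.webdriver: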

snippet.javascript
await page.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});

I also set realistic viewport dimensions, mimicked an en-US locale, set the timezone to Asia/Kolkata, used a common Windows user-agent, and added randomized delays (using Math.random()) to mimic human typing speeds when filling out the login form.
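
A rough sketch of what those settings could look like, assuming the standard option names accepted by Playwright's browser.newContext() (viewport, locale, timezoneId, userAgent). The concrete viewport size, user-agent string, delay range, and the helper names humanDelay and typeLikeHuman are illustrative assumptions, not values taken from scraper.js.

```javascript
// Options in the shape accepted by browser.newContext(); values are illustrative.
const contextOptions = {
  viewport: { width: 1366, height: 768 },
  locale: "en-US",
  timezoneId: "Asia/Kolkata",
  userAgent:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
};

// Random whole-millisecond delay in the inclusive range [minMs, maxMs].
function humanDelay(minMs = 60, maxMs = 180) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Focuses a field and types the text one character at a time,
// passing a freshly randomized delay to Playwright for each key.
// `page` is a Playwright Page; `selector` is whatever matches the login field.
async function typeLikeHuman(page, selector, text) {
  await page.click(selector);
  for (const char of text) {
    await page.keyboard.type(char, { delay: humanDelay() });
  }
}
```

With a real browser this would be wired up as `const context = await browser.newContext(contextOptions)` followed by `typeLikeHuman(page, selector, value)` for each login field.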

Despite these efforts, cloud provider IP ranges (AWS, DigitalOcean, Vercel) are heavily blacklisted by Cloudflare. Running headless Chromium from these IPs was reliably fingerprinted and challenged.

Rather than pay for expensive residential proxy networks, I preserved the project as a fully functional local development application (which works beautifully on any home Wi-Fi connection) and archived the production cloud deployment. This was an invaluable lesson in the limitations of scraping as a backend strategy: once a site puts up serious bot protection, you need official APIs or user-cooperative scrapers (like browser extensions).

Handling inconsistent edge cases

Writing the attendance calculations seemed trivial at first, but edge cases quickly emerged. Solving the algebra for "Must Attend" classes was surprisingly tricky when dealing with discrete values.

Let's say your target is 75% (T = 0.75). You've attended A classes out of N conducted. You want to find the number of consecutive classes x you must attend to satisfy: (A + x) / (N + x) >= T.

Solving for x: A + x >= T(N + x), which simplifies to x(1 - T) >= T * N - A. Because T < 1, dividing by (1 - T) keeps the direction of the inequality: x >= (T * N - A) / (1 - T). Since classes are discrete integers, we take the ceiling: x = Math.ceil((T * N - A) / (1 - T)).
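
To make the formula concrete, here is a small worked example. The function name classesToAttend is made up for this illustration, and it assumes a target strictly between 0 and 1.

```javascript
// Smallest whole x with (attended + x) / (conducted + x) >= target.
// Returns the raw formula value, so it is negative when already above target.
function classesToAttend(attended, conducted, target = 0.75) {
  return Math.ceil((target * conducted - attended) / (1 - target));
}

// 30 attended out of 45 conducted is about 66.7%.
// (0.75 * 45 - 30) / (1 - 0.75) = 3.75 / 0.25 = 15
const needed = classesToAttend(30, 45);
console.log(needed); // 15

// Check: 15 more classes lands exactly on the target, 14 falls just short.
console.log((30 + needed) / (45 + needed) >= 0.75);         // true  (45 / 60)
console.log((30 + needed - 1) / (45 + needed - 1) >= 0.75); // false (44 / 59)
```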

But when I first implemented this, my console started throwing weird NaN and Infinity values. Why? If a student has 0 classes conducted so far (like in the first week of a semester), N is 0, and the percentage A / N becomes 0 / 0, which is NaN. A target of 100% makes the denominator 1 - T zero, which can produce Infinity, and a student who is already above the target gets a negative "Must Attend" count. And what if the max possible attendance they can achieve by the end of the semester is mathematically lower than 75%?

In SubjectCard, I resolved these edge cases with rigorous safety guards:

snippet.javascript
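// Consecutive classes to attend to reach the target (negative when already above it).
const catchUpClasses = Math.ceil((targetDecimal * total - attended) / (1 - targetDecimal));
// Classes that can be skipped while staying at or above the target (negative when below it).
const safeToBunk = Math.floor((attended - (targetDecimal * total)) / targetDecimal);

// Clamp to zero so the card never shows a negative count.
const displaySafeToBunk = Math.max(0, safeToBunk);
const displayCatchUp = Math.max(0, catchUpClasses);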
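
The "safe to bunk" formula comes from the mirror-image inequality: skipping y classes keeps you safe as long as A / (N + y) >= T, which rearranges to y <= (A - T * N) / T, hence the floor. The sketch below wraps both formulas in one function with explicit guards for the zero-classes and out-of-range-target cases. The name attendanceStats and its return shape are hypothetical; this is not the SubjectCard code.

```javascript
function attendanceStats(attended, total, targetDecimal = 0.75) {
  // A target of 0 or 1 would divide by zero in one of the formulas below.
  if (!(targetDecimal > 0 && targetDecimal < 1)) {
    throw new RangeError("targetDecimal must be strictly between 0 and 1");
  }
  // No classes conducted yet: attended / total would be 0 / 0 = NaN.
  const percent = total > 0 ? (attended / total) * 100 : null;
  const catchUp = Math.ceil((targetDecimal * total - attended) / (1 - targetDecimal));
  const safeToBunk = Math.floor((attended - targetDecimal * total) / targetDecimal);
  return {
    percent,
    catchUp: Math.max(0, catchUp),       // never show a negative "Must Attend"
    safeToBunk: Math.max(0, safeToBunk), // never show a negative "Safe to Bunk"
  };
}

console.log(attendanceStats(0, 0).percent);      // null (first week, nothing conducted)
console.log(attendanceStats(40, 45).safeToBunk); // 8  (40 / 53 is still above 75%)
console.log(attendanceStats(30, 45).catchUp);    // 15
```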

If a student's maximum possible percentage (assuming they attend every remaining class until the last working day) drops below their target, the UI shifts to render a custom warning:

snippet.javascript
const exactRemaining = calculateExactRemaining(code, branch, endDate);
const projectedTotal = total + exactRemaining;
const maxPossiblePercent = ((attended + exactRemaining) / projectedTotal) * 100;

Instead of displaying a "Must Attend" number that can no longer be reached, the card displays a warning banner: "You're Cooked!" with the exact number of extra classes needed beyond the remaining schedule.
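
One way the banner's number could be computed is sketched below, under the assumption that the count of remaining classes is already known (the `remaining` parameter stands in for whatever calculateExactRemaining returns). The function name projectOutcome and its return fields are hypothetical.

```javascript
function projectOutcome(attended, total, remaining, targetDecimal = 0.75) {
  const projectedTotal = total + remaining;
  // Best case: every remaining class is attended.
  const maxPossiblePercent =
    projectedTotal > 0 ? ((attended + remaining) / projectedTotal) * 100 : null;
  // Consecutive classes required to reach the target, ignoring the schedule.
  const needed = Math.max(
    0,
    Math.ceil((targetDecimal * total - attended) / (1 - targetDecimal))
  );
  // Classes required beyond what the schedule still has to offer.
  const shortfall = Math.max(0, needed - remaining);
  return { maxPossiblePercent, needed, shortfall, cooked: shortfall > 0 };
}

// 20 attended of 40 with only 10 classes left: best case is 30 / 50 = 60%.
// Reaching 75% would take 40 classes in a row, 30 more than the schedule has.
const outcome = projectOutcome(20, 40, 10);
console.log(outcome.needed, outcome.shortfall, outcome.cooked); // 40 30 true
```

The `cooked` flag agrees with comparing maxPossiblePercent against the target, since `needed` is the smallest streak that reaches it.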

UI/UX Design Thinking: Designing for Calm

Student dashboards are usually ugly, cluttered spreadsheets that scream "You are failing!" at you. Because attendance tracking is linked to anxiety, I wanted the UI/UX of this app to feel calm, focused, and premium.

Dynamic Global Health Indicator

To provide an instant overview, I wanted to avoid making the student read every single subject card to understand their standing. I built a Global Health Indicator using the student's profile photo:

snippet.javascript
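// Default (no subject data yet): gold ring and glow.
let profileRingColor = 'border-[#D9A02A]/30';
let profileGlow = 'shadow-[0_0_15px_rgba(217,160,42,0.15)]';

if (subjects && subjects.length > 0) {
  // Lowest attendance percentage across subjects. Subjects with no classes
  // conducted yet (total of 0) are skipped to avoid a NaN from 0 / 0.
  const lowestPercent = subjects.reduce((min, s) => {
    if (!s.total) return min;
    const percent = (s.attended / s.total) * 100;
    return percent < min ? percent : min;
  }, 100);

  if (lowestPercent < 73) {
    // Danger: red ring with a stronger red glow.
    profileRingColor = 'border-[#FF453A]/80';
    profileGlow = 'shadow-[0_0_20px_rgba(255,69,58,0.4)]';
  } else if (lowestPercent < 75) {
    // On the edge: yellow ring (glow keeps its default).
    profileRingColor = 'border-[#FFD60A]/80';
  } else {
    // Safe: green ring (glow keeps its default).
    profileRingColor = 'border-emerald-500/80';
  }
}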

The profile picture's outer ring shifts between green, yellow, and red based on the student's lowest subject percentage, and in this snippet the ambient glow turns red only in the danger state (otherwise it keeps its default gold). It instantly signals whether they are fully safe, on the edge, or in danger.

User Autonomy: Personalizing the Dashboard

The official college database often imports names in rigid, all-caps strings or as raw register numbers. To make the dashboard feel personal, I added a feature that lets students simply click on their name in the header to edit it. This value is saved directly to their browser's localStorage and persists across sessions, giving them ownership of their dashboard.
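
A minimal sketch of that persistence logic, assuming a storage key called "displayName" and two helper names invented for this example. The storage object is passed in so the same functions work with window.localStorage in the browser or with a stub elsewhere.

```javascript
const DISPLAY_NAME_KEY = "displayName"; // hypothetical key name

// Returns the saved custom name, or falls back to the name from the portal.
function loadDisplayName(storage, portalName) {
  const saved = storage.getItem(DISPLAY_NAME_KEY);
  return saved && saved.trim() ? saved : portalName;
}

// Saves a custom name; clearing the field reverts to the portal's name.
function saveDisplayName(storage, name) {
  const trimmed = name.trim();
  if (trimmed) {
    storage.setItem(DISPLAY_NAME_KEY, trimmed);
  } else {
    storage.removeItem(DISPLAY_NAME_KEY);
  }
}
```

In the browser this would be called as `saveDisplayName(window.localStorage, inputValue)` when the edit is confirmed. Only a display preference is stored this way; credentials and session data stay out of localStorage.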

Performance & Optimization: Perceived Speed

Scraping is slow. Logging into the student portal via Playwright takes anywhere from 4 to 8 seconds because of server lag on their end. To ensure this didn't ruin the user experience, I engineered several layers of optimization:

What I Learned: Engineering Beyond the Code

Future Improvements: The Next Semester

If I were to rebuild this project or scale it further, there are several key architectural changes I would explore:

Final Reflection: Solving Your Own Problems

Ultimately, this project represents the reason I got into software engineering.

There is a unique thrill in looking at a clunky, frustrating daily process, sitting down with a text editor, and building a tool that makes life easier for yourself and your peers. Even though our cloud deployment is paused due to the ever-escalating bot-detection wars of the modern web, the engineering journey—from solving mathematical ceiling edge cases to wrapping Playwright inside containerized Docker builds—has been incredibly rewarding.

It proved that with some reverse engineering, clean design, and focus, we can turn even the most boring, anxiety-inducing college spreadsheets into a premium, empowering user experience.

Written by Sid, an ISE Student at St. Aloysius College (SOE).

Find the repository on GitHub (https://github.com/sid20007).