What I Learned Adding Engagement Tracking to My Page View Counter

Evolving my privacy-first web analytics #

I built Doorman to do one thing only: count page views. I have been using it for the past couple of months across a few of my web projects, and it has served me well. During that time, though, it has become clear that the data I'm collecting is missing some critical signals.

I have no idea how long users stay on a given page (which matters for my blog), how far down they scroll, or whether I'm even counting bots as readers.

In this post, I will document my journey from basic page view counting to a more detailed analytics system that delivers actionable, useful insights, while still respecting user privacy.

In the end, Doorman will be able to:

  • Measure true engagement
  • Ensure insights are delivered even when users navigate away
  • Filter noise from bots and crawlers
  • Collect geographic insights while respecting privacy

Introducing Doorman #

My main reason for building Doorman was that neither Umami nor Plausible offered support for SQLite, which was a deal-breaker for me as I often host my web applications on very resource-constrained servers.

My initial implementation was quite simple: a Go web application serving a JS tracking script. The script triggered a client-side fetch() on page load, which sent basic info like the visited page, referrer, etc. to the server. There, user IPs would be hashed and stored in the database.
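
Roughly speaking, the server side of that first version was little more than two routes. Here's a simplified sketch rather than the actual Doorman code (the handler wiring and script path are illustrative; the /event path shows up later in a CORS error):

// simplified sketch of the original shape, not Doorman's actual code
package main

import (
    "log"
    "net/http"
)

func main() {
    mux := http.NewServeMux()

    // serve the embeddable JS tracking script
    mux.Handle("/doorman.js", http.FileServer(http.Dir("./static")))

    // receive page-view events sent by the tracking script
    mux.HandleFunc("/event", func(w http.ResponseWriter, r *http.Request) {
        // hash the visitor IP and store the page view here
        w.WriteHeader(http.StatusNoContent)
    })

    log.Fatal(http.ListenAndServe(":8080", mux))
}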

The main problem was that this approach told me what pages users visited, but nothing about how they interacted with them.

Part I: Measuring Real Engagement #

As I have already alluded to, knowing that someone viewed a given page doesn’t tell me much. I have no idea if they read anything or if they left after a couple of seconds. Imagine someone finds my blog post on Hacker News and opens it in a new tab but never bothers to read it. I’d still count that as a page view, even though nothing was actually read.

With that in mind, I needed to capture:

  1. Dwell time: how long the tab is open
  2. Active time: how long the user is actually engaged (not idle in another tab)
  3. Scroll depth: how much content they actually consumed

Why these metrics #

Dwell time gives me the maximum possible engagement window, but active time is more honest because it excludes periods when the user is idle or has switched to another tab. Scroll depth, meanwhile, shows how much of the content they actually consumed.

The approach I settled on involves tracking user interactions with the page, maintaining an active/idle state with an inactivity threshold (i.e. how long to wait before declaring a user inactive), and sending periodic updates to the backend via the Beacon API (more on that later).

// user is active when they interact with the page
["mousedown", "keydown", "touchstart", "click"].forEach(function (event) {
  document.addEventListener(event, markActive, { passive: true });
});

// mark as inactive after 30 seconds without interaction
var inactivityTimer;
var INACTIVITY_THRESHOLD = 30000;
function resetInactivityTimer() {
  clearTimeout(inactivityTimer);
  inactivityTimer = setTimeout(markInactive, INACTIVITY_THRESHOLD);
}

// heartbeat
setInterval(function () {
  if (sessionData.isActive || sessionData.activeTime > 0) {
    sendData(false); // false: not the final event
  }
}, 30000); // every 30s

This 30-second interval is fine for my traffic levels (~few hundred concurrent users max). At larger scale, you’d want to increase this interval or batch updates client-side.

Calculating scroll depth turned out to be a bit trickier than I initially anticipated. Here are some of the edge cases for which I had to account:

  • Pages shorter than the viewport (scroll is already at 100% on load)
  • Dynamic content that changes the document height
  • Horizontal scrolling, which should be ignored

In the end, I went with a solution that calculates the current scroll depth as a percentage of the document:

function calculateScrollDepth() {
  var windowHeight = window.innerHeight;
  var documentHeight = document.documentElement.scrollHeight;
  var scrollTop =
    window.scrollY || window.pageYOffset || document.documentElement.scrollTop;

  if (documentHeight <= windowHeight) {
    return 100; // entire page visible but doesn't mean they read it
  }

  var scrollPercent = Math.min(
    Math.round(((scrollTop + windowHeight) / documentHeight) * 100),
    100
  );

  return scrollPercent;
}

This calculates how far down the page the user has scrolled, expressed as a percentage from 0 to 100. The calculation accounts for the viewport height, so 100% means the user has reached the very bottom of the page. If the entire page is visible without scrolling, it simply returns 100%.

This still needs improvements for dynamic content, infinite scroll, and zoom levels, but it’s good enough for static blog posts.

Part II: Data Delivery via the Beacon API #

My original implementation used the fetch() API to send data to the analytics backend. Of course, this was fine when I just needed to send data on page load, but I soon realised that this approach would not work with the new architecture as pending requests are cancelled by the browser when users navigate away or close the tab.

This is exactly the problem the Beacon API was designed to solve. The browser queues the request and sends it even if the page unloads. It’s also non-blocking and returns immediately without waiting for a response from the server. Plus, browsers may batch multiple beacons for efficiency, though this is implementation-dependent.

It does have limitations, such as only allowing POST requests and the aforementioned lack of response handling, but for my use case those are not a problem.

As you can see below, I’m sending the data as a blob:

function sendData(isFinal) {
  sessionData.isFinal = isFinal; // flag whether this is the final (unload) event
  var payload = JSON.stringify(sessionData);

  // Try the Beacon API first (best-effort delivery)
  if (navigator.sendBeacon) {
    try {
      var blob = new Blob([payload], { type: "application/json" });
      return navigator.sendBeacon(TRACK_URL, blob);
    } catch (e) {
      // fall through to the fetch() fallback below
    }
  }

  return false;
}

As a fallback, I also added a fetch() request to send the same data in case there are any issues with the Beacon.

fetch(TRACK_URL, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: payload,
  keepalive: true,
}).catch(() => {});

The keepalive is important because it tells the browser to complete the request even if the page unloads, giving similar functionality to Beacon, but with more control.

Troubleshooting Beacon API CORS issues #

I realised some time after implementing the fallback to the fetch API that sendBeacon wasn’t actually working and that my tracking requests were all being sent via fetch.

I did some digging, and came across this in my browser’s console:

Access to fetch at 'http://localhost:8080/event' from origin 'http://localhost:8000' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: The value of the 'Access-Control-Allow-Origin' header in the response must not be the wildcard '*' when the request's credentials mode is 'include'.

I wasn’t entirely sure what was happening, but it was clearly an issue with the beacon. At this point, I was tempted to just use fetch and forget about it, but I had already committed to using the Beacon API, so I had to figure it out. Plus, the keepalive option has a 64KB quota.

Fortunately, a quick Google search led me to this post, where I discovered that the issue was caused by setting the content type for the beacon data. sendBeacon includes credentials by default, and setting a custom Content-Type (like application/json) triggers a CORS preflight, which means the server needs to send the right CORS headers - and even then, Chrome will still block it for security reasons.

In the end, I chose to omit the content type entirely and parse the raw JSON body on the backend.
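
On the backend, that just means reading the raw request body and unmarshalling it ourselves instead of trusting the Content-Type header. Roughly (a trimmed-down sketch; the payload struct and its field names are illustrative):

// sketch: parse the beacon payload from the raw body, ignoring Content-Type
func handleEvent(w http.ResponseWriter, r *http.Request) {
    defer r.Body.Close()

    body, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    // abbreviated payload; the real event carries more fields
    var ev struct {
        Page        string `json:"page"`
        ActiveTime  int    `json:"activeTime"`
        ScrollDepth int    `json:"scrollDepth"`
    }
    if err := json.Unmarshal(body, &ev); err != nil {
        http.Error(w, "invalid payload", http.StatusBadRequest)
        return
    }

    // ... store the event ...
    w.WriteHeader(http.StatusNoContent)
}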

Part III: Fighting the noise #

I have watched traffic patterns on my blog over the past few months and I am pretty sure that a not-insignificant percentage of my traffic is bot traffic. Some of them are harmless, like search engine crawlers or monitoring services, but others are malicious bots like those pesky, aggressive AI scrapers.

At least for my analytics, I wanted to filter out these bots just to focus on real users - I’ll leave the crawler bot blocking to the pros.

This is constantly evolving but for starters, I went with a simple list of known bot patterns that would be excluded.

var botPatterns = []string{
    "bot", "crawler", "spider", "scraper",
    "Googlebot", "Bingbot", "AhrefsBot",
    "GPTBot", "Claude-Web", "ChatGPT-User",
    "Pingdom", "Uptime", "Monitor",
}

func isBot(userAgent string) bool {
    ua := strings.ToLower(userAgent)
    for _, pattern := range botPatterns {
        if strings.Contains(ua, strings.ToLower(pattern)) {
            return true
        }
    }
    return false
}
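
In the handler, this boils down to checking the User-Agent header before recording anything. Something along these lines (a sketch of the wiring, not Doorman's exact code):

// sketch: drop known bots before recording the event
if isBot(r.UserAgent()) {
    w.WriteHeader(http.StatusNoContent)
    return
}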

Of course, it’s inevitable that some bots will slip through - maybe that can be an improvement for the future - but I would generally prefer false negatives (counting some bots) over false positives (excluding real users).

In addition to this, I added some rudimentary behavioural analysis to estimate the likelihood of a visitor being a bot. For example, engagement timing anomalies - such as visitors that stay on the page for a very short period, or those that instantly scroll to the bottom - can tell me a lot about their bot-iness. I then used a scoring system to track this, not to exclude those visits, but to record alongside each page view whether it was likely a bot.

func calculateBotiness(pv *models.PageVisit) (int, string) {
    score := 0
    reasons := []string{}

    if pv.DwellTime > 0 && pv.ActiveTime == 0 {
        score += 30
        reasons = append(reasons, "no_active_time")
    }

    // instant scroll to bottom
    if pv.ScrollDepth == 100 && pv.ActiveTime < 2 {
        score += 20
        reasons = append(reasons, "instant_scroll")
    }
   
    // speed reader
    if pv.ScrollDepth > 80 && pv.DwellTime < 3 {
        score += 25
        reasons = append(reasons, "too_fast")
    }

    // no scroll
    if pv.ScrollDepth == 0 && pv.DwellTime > 10 {
        score += 15
        reasons = append(reasons, "no_scroll_long_dwell")
    }

    reasonStr := strings.Join(reasons, ",")
    return score, reasonStr
}

Any entry with a bot score > 50 is flagged as “likely a bot” and stored with an is_bot flag for further analysis.
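
Putting it together, the stored record now carries the engagement metrics alongside the bot signals. A sketch of roughly what the stored page-visit record ends up looking like (field names and GORM tags are illustrative, not the exact schema):

// rough sketch of the page-visit model; names and tags are illustrative
type PageVisit struct {
    ID          uint   `gorm:"primaryKey"`
    Path        string
    Referrer    string
    IPHash      string `gorm:"index"`
    Country     string
    DwellTime   int    // seconds the tab was open
    ActiveTime  int    // seconds of actual engagement
    ScrollDepth int    // 0-100
    BotScore    int
    BotReasons  string
    IsBot       bool   // bot score > 50
    CreatedAt   time.Time
}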

Part IV: Geographic insights #

It’s also important for most webmasters to know where their traffic is coming from, not least to identify and handle potential DDoS attacks. To preserve the privacy-respecting aspect of Doorman, however, I had to geolocate users without storing raw IPs.

In the end, I decided to implement a system that hashes the IP (with a consistent, secret salt) immediately upon arrival on the server but still uses the raw version to look up geographic data via IP-API. The raw IP isn’t stored anywhere, even though we still rely on a third-party service.
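
The hashing itself is nothing fancy. A sketch of the idea (the helper name and how the salt is supplied are illustrative):

// sketch: hash the IP with a secret salt so raw IPs never reach the database
func hashIP(ip, salt string) string {
    sum := sha256.Sum256([]byte(salt + ip))
    return hex.EncodeToString(sum[:])
}

The lookup then uses the raw IP exactly once, before it is discarded: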

func getGeoData(ip string) (*GeoData, error) {
    url := fmt.Sprintf("http://ip-api.com/json/%s?fields=status,country", ip)

    client := &http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var geo GeoData
    if err := json.NewDecoder(resp.Body).Decode(&geo); err != nil {
        return nil, err
    }

    if geo.Status != "success" {
        return nil, fmt.Errorf("geo lookup failed")
    }

    return &geo, nil
}

Caching strategy #

Because IP-API has a rate limit (45 requests per minute on the free tier), I also implemented two caching layers - the optimist in me believes I will smash that rate limit soon enough.

The first layer is an in-memory cache that simply maps IP hash -> geo data (sketched after the snippet below). The second layer, shown below, queries the database for the same IP hash among entries from the last 24 hours (to limit how many rows we need to check) and reuses the geo data if a match is found.

var existingGeo PageView
err := h.DB.Select("country").
    Where("ip_hash = ? AND created_at > ?", ipHash, time.Now().Add(-24*time.Hour)).
    First(&existingGeo).Error

if err == nil && existingGeo.Country != "" {
    pageView.Country = existingGeo.Country
} else {
    geo, err := getGeoData(ip)
    if err == nil {
        pageView.Country = geo.Country
    }
}
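
The in-memory layer sits in front of that query. A sketch of what it could look like (the map, mutex, and lack of expiry are simplifications):

// sketch: first-level cache mapping IP hash -> country
var (
    geoCacheMu sync.RWMutex
    geoCache   = map[string]string{}
)

func cachedCountry(ipHash string) (string, bool) {
    geoCacheMu.RLock()
    defer geoCacheMu.RUnlock()
    country, ok := geoCache[ipHash]
    return country, ok
}

func cacheCountry(ipHash, country string) {
    geoCacheMu.Lock()
    defer geoCacheMu.Unlock()
    geoCache[ipHash] = country
}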

This significantly reduces our API calls while still keeping the data relatively fresh.

Concerns #

Of course, there’s no free lunch. These improvements increased the write volume and server response time (especially because of the geo lookup). I’m planning to address these issues by possibly batching page views instead of reading and updating every 30 seconds during long-running sessions. For SQLite, I may also consider switching to WAL mode and potentially archiving old data to separate databases.
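
Enabling WAL mode, at least, is a one-liner at startup. A sketch assuming GORM's SQLite driver and a database file named doorman.db (both assumptions):

// sketch: switch SQLite to write-ahead logging for better write concurrency
db, err := gorm.Open(sqlite.Open("doorman.db"), &gorm.Config{})
if err != nil {
    log.Fatal(err)
}
db.Exec("PRAGMA journal_mode=WAL;")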

Additionally, the geo lookup could be migrated to a self-hosted geolocation database to eliminate the reliance on a third party.
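
For example, a local MaxMind GeoLite2 database via the geoip2-golang package would remove the network call entirely. A rough sketch, not something Doorman does today:

// sketch: resolve the country from a local GeoLite2 database instead of IP-API,
// using the github.com/oschwald/geoip2-golang package
func getCountry(db *geoip2.Reader, ip string) (string, error) {
    record, err := db.Country(net.ParseIP(ip))
    if err != nil {
        return "", err
    }
    return record.Country.Names["en"], nil
}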

Conclusion #

Working on this was a crash course in trade-offs. More information always seems better, sure, but it’s possible to strike a healthy balance: collect what you need to gain valuable insights without putting users at risk.

This update to Doorman will either send me into a deep depression when I discover that 90% of my “readers” are scraper bots, or it will spur me on to post more regularly. I’m not planning to add much more functionality beyond what Doorman can already do. It was explicitly a learning project for my personal use, and I would still recommend one of the aforementioned, battle-tested options.

If you want to try out Doorman, you can check it out on GitHub. In theory, by the time I publish this, it should have a proper README set up.