Comment by AdriaanvRossum

7 years ago

Creator here. As a developer, I install analytics for clients, but I never feel comfortable installing Google Analytics because Google creates profiles of visitors and uses that information in its other products (like AdWords). As we all know, big corporations unnecessarily track users without their consent. I want to change that.

So I built Simple Analytics. To ensure that it's fast, secure, and stable, I built it entirely using languages that I'm very familiar with. The backend is plain Node.js without any framework, the database is PostgreSQL, and the frontend is written in plain JavaScript.

I learned a lot while coding it. For example, sending requests as JSON requires an extra (pre-flight) request, so in my script I use the "text/plain" content type, which does not trigger one. The script is publicly available (https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl...). It works out of the box with modern frontend frameworks by overriding the "history.pushState" function.
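The two techniques can be sketched roughly like this. This is a minimal illustration, not the actual Simple Analytics script; the endpoint URL and payload fields are made up:

```javascript
// 1. Wrap history.pushState so single-page-app navigations fire a pageview
//    callback. The history-like object is a parameter so the sketch can be
//    exercised outside a browser.
function wrapPushState(historyObj, onNavigate) {
  const original = historyObj.pushState;
  historyObj.pushState = function (...args) {
    const result = original.apply(this, args);
    onNavigate(args[2]); // third pushState argument is the new URL
    return result;
  };
}

// 2. POST the payload with a "text/plain" content type. That keeps it a CORS
//    "simple request", so the browser sends it directly instead of issuing
//    the extra OPTIONS pre-flight that "application/json" would trigger.
function sendPageview(url) {
  fetch("https://example.com/collect", {
    method: "POST",
    headers: { "Content-Type": "text/plain" },
    body: JSON.stringify({ url: url, ts: Date.now() }),
  });
}
```

In a browser you would call `wrapPushState(window.history, sendPageview)` once at script load.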

I am transparent about what I collect (https://simpleanalytics.io/what-we-collect) so please let me know if you have any questions. My analytics tool is just the start for what I want to achieve in the non-tracking movement.

We can be more valuable without exploiting user data.

First off: hats off for making a product that takes the rights of the end user seriously!

However, I am a bit confused as to who would want this product. The sort of questions this product answers seem quite limited:

1. What URLs are getting lots of hits?

2. What referrers are generating lots of hits?

3. What screen sizes are those hits coming from?

What decisions can be drawn from those questions? This seems useful only to perhaps some blog, where they're wondering what sort of content is successful, where to advertise more, and whether to bother making a mobile website.

Without the ability to track user sessions -- even purely in localStorage -- you can't correlate pageview events. For instance, how would I answer a question like:

- How many high-interest users do I have? By "high interest", I mean someone who visited at least three pages on my website.

- Is a mobile website really worthwhile? How much of an effect does being on mobile have on whether someone will be "high-interest"?

I should think some anonymized user ID system -- even if it rotates anonymous IDs -- should be able to answer these questions without compromising privacy.
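Such a rotating anonymous-ID system could look something like this; a purely hypothetical sketch, not anything the product actually does. The storage object is injected so the logic can be exercised without a browser:

```javascript
// A random ID kept only in the browser's localStorage and regenerated every
// day, so pageviews within a day can be correlated (e.g. to count
// "high-interest" visitors) while long-term tracking stays impossible.
function getDailyId(storage, now = Date.now()) {
  const day = Math.floor(now / 86400000); // days since the Unix epoch
  let record = null;
  try {
    record = JSON.parse(storage.getItem("anon_id"));
  } catch (e) {
    // corrupt or missing value: fall through and regenerate
  }
  if (!record || record.day !== day) {
    record = { day: day, id: Math.random().toString(36).slice(2, 10) };
    storage.setItem("anon_id", JSON.stringify(record));
  }
  return record.id;
}
```

In a browser you would pass `window.localStorage` as the storage object and attach the returned ID to each pageview event.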

Also, I'll leave it to others to point out it's unlikely this product is exempt from GDPR.

  • Since the creator points out that he doesn't store any IP addresses, he doesn't store any data that allows identifying an individual. For the GDPR to be applicable you need to store data that allows you to identify an individual. Thus when you use this, you don't have to think about GDPR.

    • I'm not so sure. By putting this service's code on your website, you transmit personal data (IP addresses) to this third party. That appears to make the GDPR applicable here? Transmission is considered "data processing" under the GDPR.

      Really, the central point that should be clear is that this is a question for lawyers. The GDPR is incredibly far-reaching.

      4 replies →

  • Hi,

    I might be able to help, because I wrote an analytics tool a while back that tracks these three properties and some other stuff.

    1. Knowing which URLs are being visited allows me to see if a particular campaign or blog site is popular.

    2. The referrer tells me where a user came from. This is helpful to know if I'm being linked on Reddit and should allocate more CPU cores from my host to the VMs responsible for a particular service.

    3. The screen size allows me to know what aspect ratios and sizes I should optimize for. My general rule is that any screen shape that can fit a 640x480 VGA screen without clipping should allow my website to be fully readable and usable.

    4. I also track a trimmed-down user agent: "Firefox", "Chrome", "IE", "Edge", "Safari", and "other". All include "(recent)" or "(old)" to indicate the version, and "other" includes the full user agent. This allows me to track which browsers people use and whether people use outdated browsers ("(old)" usually means about a year out of date; I try to adjust it regularly to keep the interval shorter).

    5. Page Load Speed and Connection. This is a number in 10ms steps and a string that's either "Mobile" or "Wired", chosen by a quick and dirty heuristic based on whether the connection appears throttled or slow, plus a few other factors. "Mobile" means people use my website with devices that can't or shouldn't be drawing much bandwidth; "Wired" means I could go nuts. This allows me to adjust the size of my webpage to fit my userbase.

    6. GeoIP: This is either "NAm", "SAm", "Eur", "Asi", "Chin", "OcA", "NAf", "SAf", "Ant" or "Other". I don't need to know more than the continent my users live on, it's good enough data. I track Chinese visitors separately since it interests me.
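As an aside, the user-agent trimming in point 4 could be sketched like this. The browser families, regexes, and "(recent)" version cutoffs below are illustrative placeholders, not the commenter's actual values:

```javascript
// Map a full user-agent string to a browser family plus a coarse
// "(recent)"/"(old)" tag. Order matters: Edge UAs also contain "Chrome",
// and Chrome UAs also contain "Safari".
function trimUserAgent(ua) {
  const families = [
    ["Edge", /Edg(?:e)?\/(\d+)/, 100],
    ["Firefox", /Firefox\/(\d+)/, 100],
    ["Chrome", /Chrome\/(\d+)/, 100],
    ["Safari", /Version\/(\d+).*Safari/, 14],
    ["IE", /MSIE (\d+)|Trident.*rv:(\d+)/, 12], // IE never reached v12, so always "(old)"
  ];
  for (const [name, re, recentMin] of families) {
    const m = ua.match(re);
    if (m) {
      const version = parseInt(m[1] || m[2], 10);
      return `${name} ${version >= recentMin ? "(recent)" : "(old)"}`;
    }
  }
  return `other: ${ua}`; // "other" keeps the full user agent, as described
}
```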

    Overall the tool is fairly accurate and high performance + low bandwidth (a full analytics run takes 4KB of bandwidth including the script and POST request to the server). It doesn't collect any personal data and doesn't allow accurate tracking of any individual.

    If I want to track high-interest users, I collate some attributes together (i.e. screen size, user agent, continent), which gets me a rough enough picture of high-interest activity for what I care about. You don't need to track specific user sessions; that stuff is covered under the GDPR and not necessary.

    Before anyone asks if they could have this tool: nope. It's proprietary and mine. The code I've written for it isn't hard; it's very minimal and fast. I wrote all of this over a weekend, and I use Influx + Grafana for the output. You can do that too.

    Both mine and the product of the HN post are likely not in the scope of the GDPR since no data is collected that can specifically identify a user.

How can I run this myself?

It absolutely isn't privacy-first if it requires running on someone else's machine and giving your users' data to them. Another issue: while your server is in the EU, the hosting company is subject to US law and all the stuff that comes with it (e.g. https://en.wikipedia.org/wiki/CLOUD_Act).

This looks great—have bookmarked it for future projects.

I would however be a little more skeptical with tools claiming to be privacy-first than I would be with GA (who I presume are not privacy-first). On that note, some quick questions:

- Any plans to open source? I've used Piwik/Matomo in the past, and while I'm not a massive fan of the code-quality of that project, it's at least auditable (and editable).

- You say you're transparent about what you collect—IPs aren't mentioned on that page[0]. Are IPs stored in full or how are they handled? I assume you log IPs?

- How do you discern unique page-views? You seem to be dogfooding and I see no cookies or localStorage keys set.

[0] https://simpleanalytics.io/what-we-collect

  • - No plans to go open source with the backend, but I do show the code that is run in the browser. The visualisation of the data is not super important, I think.

    - I don't save IPs, not even in the logs.

    - I don't have unique pageviews at the moment. I will in the future. If the referrer is the same as the current page, I will measure that as a non-unique view. What do you think?

    • If you don't go open source, will you at least offer paid self-hosting (similar to what e.g. Atlassian offers)?

      The idea of privacy is much easier to sell if the data never leaves your own server, instead of using some analytics provider that might be run by the CIA or the Russian mafia for all we can prove.

      1 reply →

    • > What do you think?

      Apart from the unfortunate non-open-source answer, this sounds great!

      I get others' concerns about wanting unique pageviews, but that metric is always a bit of a sketchy either-or for extremely privacy-conscious people. It's both an incredibly valuable metric, and also one that's difficult to square with complete privacy (basically it's always going to be pseudonymous at best).

      3 replies →

    • Have you considered using a shared-source license, where paying customers can inspect and build from the source, and where people can obtain the source freely for academic research and/or security reviews?

      Shared-source proprietary goes as far back as the Burroughs B5000 mainframe, whose customers got the source and could send in fixes/updates. Microsoft has a Shared Source program. Quite a few suppliers in embedded do it. There's also a company that sells UI software and gives the source to customers buying the higher-priced version.

      I will warn that people might still rip off and use your code. Given it's JavaScript, I think they can do that anyway with reverse engineering. It also sounds like they could build it themselves anyway. Like most software bootstrappers or startups, you're already in a race with other players that might copy you with clean slate implementations. So, I don't know if the risk is that big a deal or not. I figured I should mention it for fairness.

    • Doesn't seem like a very useful measure of uniqueness.

      What if you had one-day retention of IP addresses for per-day unique views? Seems like too important of a metric to eliminate completely, and one-day retention seems like a decent trade-off at the expense of being able to do unique analysis over longer time periods.

      8 replies →

    • > No plans to go open source with the backend

      You say that you do not store IP addresses, but why should anybody believe it?

      Modern security is based on proof, not on trust.

      1 reply →

> unnecessarily track users without their consent

Regardless of your intentions, you are collecting enough data to track users.

> I am transparent about what I collect ([URL])

That page doesn't mention that you are also collecting (and make no claim about storing) the globally-visible IP address (and any other data in the IP and TCP headers). This can be uniquely identifying; even when it isn't unique you usually only need a few bits of additional entropy to reconstruct[1] a unique tracking ID.

In my case, you're collecting and storing more than enough additional entropy to make a decent fingerprint because [window.innerWidth, window.innerHeight] == [847, 836]. Even if I resized the window, you could follow those changes simply by watching analytics events from the same IP that are temporally nearby (you are collecting and storing timestamps).

[1] An older comment where I discussed how this could be done (and why GA's supposed "anonymization" feature (aip=1) is a blatant lie): https://news.ycombinator.com/item?id=17170468

  • I think there's value in at least distributing the data that's collected. I may not like that the analytics provider has my data, but it seems like a lesser evil if that provider isn't also the world's largest ad company and they aren't using it to build profiles behind the scenes to track my every move across a significant part of the Internet.

    Given the choice between a lot of data about me given to a small provider and somewhat less data about me given to Google, I'd generally choose the former.

    • That's not a good way to make a decision. Big or small doesn't matter; what matters is who provides better security. When two parties, big or small, are collecting data, the party which can act on security vulnerabilities quickly and has great security engineers and dedicated teams (like Project Zero) is the much better choice. People nowadays assume that a small, indie developer is a good guy. I am just pointing out that this is a very bad bias to have. Technicalities matter; security robustness matters. Google might be collecting data, but their security is really good. Good effort by this dev, though.

      13 replies →

    • I think how the data is used is also a big factor.

      There is 'justice' in the blog creator using analytics data to improve the experience of blog visitors: a user's data will, theoretically and in aggregate, create a better experience for that user in the future. The class of 'users who browse this page' gets a benefit from the cost of providing data.

      Selling browsing information to advertisers is sort of 'anti-justice': blog visitor data is used to track those visitors elsewhere on the internet and to manipulate them more effectively into paying people money. The blog visitor's external online experience is made worse by browsing that blog.

  • Good comment! I only store the window.innerWidth metric. I updated the what-we-collect page (https://simpleanalytics.io/what-we-collect) to reflect the IP handling: we don't store them. And fingerprinting would definitely be tracking; not on my watch!

    • > We don't collect and store IPs.

      First, "IPs" might be confusing; "IP addresses" would be more accurate.

      More importantly, you have to collect IP addresses (or any other value in the packet headers[1][2]), even if you don't store them, if you want to receive any packets from the rest of the internet. Storage of those values is a separate issue entirely, and it's good to hear that you intend NOT to store IP addresses (and are updating the documentation)!

      Also, I strongly recommend using Drdrdrq's suggestion to lower the precision of the collected window dimensions, which should be done on the client, e.g. "Math.floor(window.innerWidth/50)*50". This kind of bit-reduction makes fingerprinting a lot harder.

      [1] https://en.wikipedia.org/wiki/IPv4#Header

      [2] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...
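The suggested bit-reduction as a tiny helper:

```javascript
// Snap a dimension to a 50px grid on the client before sending it, so many
// distinct window sizes collapse into the same bucket and carry far less
// fingerprinting entropy.
function bucket(px, step = 50) {
  return Math.floor(px / step) * step;
}
```

With a 50px step, the 847x836 window mentioned earlier in the thread reports as 800x800, indistinguishable from any other window in the same bucket.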

      1 reply →

    • There is absolutely no reason to collect and store window dimensions, other than for fingerprinting and tracking. Sure it might be an interesting piece of trivia for the dev, but it's not necessary for the dev to "make sure the website works great on all of those dimensions", since that much is already obvious and presumed when making websites these days.

      13 replies →

  • > That page doesn't mention that you are also collecting (and make no claim about storing) the globally-visible IP address (and any other data in the IP and TCP headers). This can be uniquely identifying; even when it isn't unique you usually only need a few bits of additional entropy to reconstruct[1] a unique tracking ID.

    This is true. The legal department for the healthcare web sites I maintain doesn't let me store or track IP addresses, even for analytics.

    I'm only allowed to tally most popular pages, display language chosen, and date/time. There might be one or two other things, but it's all super basic.

  • > That page doesn't mention that you are also collecting (and make no claim about storing) the globally-visible IP address

    I’m not the OP, but where is there evidence that they’re storing the IP? Sure it’s in the headers that they process but that doesn’t mean they’re storing it.

  • >Regardless of your intentions, you are collecting enough data to track users.

    I'd imagine it's difficult to do in-depth analytics without tracking users...

Hey I like the idea but have a question.

How are you storing all the information that analytics users want to know (what devices, what languages, what geolocations, what queries, what page navigations and clicks, etc.)?

After reading what you collect, I'm assuming you are doing a lot of JS sniffing of browser properties to gather this information, along with IP address analysis. Is that correct? Or what are your plans for these features if you don't have them now?

Overall though I'd say great design + sales pitch. I think if the product delivers on enough features you will have something here. Great job!

  • It doesn't look like they collect or store languages, geolocations, or devices beyond screen sizes.

    • And even if they did, if it is stored as aggregates, I (as a visitor) wouldn't mind.

Just a heads up: HN comments only use a (I think) small subset of Markdown for formatting, but your links will work as-is without having to wrap them in [] and add the ().

https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl...

https://simpleanalytics.io/what-we-collect

Anyway, cool project! I've always felt the same about using GA, given I actually like to pretend I have some sort of privacy these days and always have an adblocker on, so I hated setting it up for people. Definitely will be keeping an eye on this the next time someone asks me to set up GA.

> The script is publicly available

Nice. You might want to add an explicit copyright/license though. Make it less (or more) dangerous for other devs to read it...

I think it could actually be quite useful to "standardize" on a simple (open/libre) front end for analytics (with an implied back-end standard).

Good Sir, props to you for including a noscript/image tag in the default code. Google Analytics didn't do it for the longest time, and in fact may still not do it.

Whether on purpose, by accident, or simply through mental bias, they seriously misrepresent the number of people for whom JavaScript is blocked, not loading, disabled by default for unknown websites (me), or not available for any other reason.

Website owners and creators should at least have that information as a reliable metric to base their development choices on.

This is pretty much exactly what I have been looking for. I recently ditched Google Analytics and all other possible third-party resources (except for YouTube, for which I implemented a click-to-play system) on my blog (consto.uk).

I just have a quick question. What subset of the JavaScript implementation does the tracking pixel provide? If all that is missing is screen size, I might just choose that to avoid running third-party code. For performance, I combine, minify, and embed all scripts and styles into each page, which lets me achieve perfect scores in the Chrome Auditor.

If you could track page-load time and show a per-page distribution of it, I would buy this in a second.

Nice product!

Could I ask what tech you're using for the graph data? I'm working on a similar SaaS (not analytics) which requires graphs. I'm a DevOps engineer for an ISP, and I do a lot of work with things like Graphite/Carbon, Prometheus and so on - but I can't seem to settle on what to use for personal projects. Do you use a TSDB at all? Or are you just storing it in SQL for now?

Thanks for making something simple and elegant.

Main question: How are you handling Safari Intelligent Tracking Protection 2.0?

> like sending requests as JSON requires an extra (pre-flight) request, so in my script I use the "text/plain" content type, which does not require an extra request.

what are the security implications of this?

  • He's talking about skipping the CORS pre-flight by using a "simple" request. Avoiding the pre-flight is not a huge security vulnerability, afaik.

nice job! i like the direction you are taking with this project. it's still young, so we don't know X) you might get bitten too lol

anyways, wish you the best of luck with your endeavor. btw you might want to fix the links above.