Replace Google Analytics with a FOSS alternative?

The pulpproject.org website uses Google Analytics.

In general, this is already problematic and will become even more so, because Google’s data collection practices mean the transfer of user data from the EU to the US. There is also increasingly strict state-specific regulation of data collection in the USA.

Yesterday, I updated the Google Analytics preferences to anonymize IP addresses before sending data.

However, after chatting with @duck and @misc, I have been wondering whether we should just move from Google Analytics to a FOSS alternative.

As far as I am aware, we have never used Google Analytics for anything more than getting a general idea of our page views. From time to time, I look at the page view counter as a sign of what information Pulp users might be most interested in, so that I can do a better job at providing relevant information.

There are alternatives of varying complexity that can provide a range of metrics. However, I am strongly opposed to gathering information to the extent that users would have to opt in. You’ve probably all seen this tweet about the joys of trying to access a website in 2022.

I would prefer that we build a community where people feel safe to self-report any shortfalls we have. I’ve tried to fix any issues with the website that have been reported during my time helping out with Pulp, whether the issues relate to content or layout.

Discourse itself provides some nice anonymous statistics. We don’t require anyone to use any identifying information. People can use Tor or throwaway email accounts and still ask and answer questions here. The only requirement is to follow our general community code of conduct.

I know that there is also a separate special interest group focusing on telemetry in Pulp itself that is progressing in the same spirit in its nascent state.

Google Analytics and alternatives can provide statistics on devices, for example, desktop or mobile access to the site.
I consider the website a platform for information and updates. I don’t plan to move past a static site with text and images. Gathering device information is therefore not really relevant. What do you think?

I would appreciate feedback on whether you’d prefer Google Analytics or a FOSS alternative.
I would also appreciate it if you would +1 the collection of just page views.
If you think device type information or anything else would be interesting or relevant, please tell me why so I can understand.

Please let me know if there is anything I have not considered as part of this.

6 Likes

+1 to FOSS
+1 to collection of just page views

5 Likes

One thing that I tend to find useful on top of page views is the origin of the views. I can safely say “the USA will be the top country in terms of visits”, as this is always the case for the projects I work on (and IMHO, an issue). Still, knowing by how much, and whether there is interest in translated content or in a specific event in a specific geo, can IMHO be useful. That brings up the question of the granularity of the data: some countries span multiple time zones, so having something more precise than “China” or “USA” can also be useful.

That usually just requires the IP, although a more advanced system can also detect the supported languages as sent by the browser with JS.

Page views also bring up the question of granularity (e.g., per hour, per day, per week), and of whether backups are needed. I guess the retention period as well, but once aggregated, that’s hardly personal data any more.
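
To make the granularity point concrete, here is a rough sketch of per-day aggregation. It assumes the raw hits are available as (timestamp, path) pairs with timezone-aware timestamps; the function name `aggregate_daily` is hypothetical and not taken from any specific tool.

```python
from collections import Counter
from datetime import datetime, timezone


def aggregate_daily(hits):
    """Count page views per (UTC day, path).

    `hits` is an iterable of (timezone-aware datetime, path) pairs.
    Only the counts are kept, so the raw hits can be discarded afterwards.
    """
    counts = Counter()
    for timestamp, path in hits:
        day = timestamp.astimezone(timezone.utc).date().isoformat()
        counts[(day, path)] += 1
    return counts


# Example input (made-up data, for illustration only)
hits = [
    (datetime(2022, 5, 1, 9, 0, tzinfo=timezone.utc), "/docs/"),
    (datetime(2022, 5, 1, 17, 30, tzinfo=timezone.utc), "/docs/"),
    (datetime(2022, 5, 2, 8, 0, tzinfo=timezone.utc), "/"),
]
daily = aggregate_daily(hits)
```

Switching to per-hour or per-week buckets would only change how the `day` key is derived; the retention question then applies to the small table of counts rather than to individual requests.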

And one question that comes to my mind is the question of access, e.g., who would have access to the stats, and if that requires an account, where should it come from (e.g., Discourse, GitHub, a separate system)?

The device stats are IMHO not that useful unless there is a designer to use them. I would take for granted that:

  • people use their mobile
  • people use a computer

The website is not complex enough to require complex choices on resolution IMHO, so a responsive design that looks OK is enough.

I guess there is also the question of transferring existing data out of GA. But there is some anecdotal evidence (here, here) that enough people filter GA to have a huge impact, so I suspect we can’t just compare the numbers before and after.

1 Like

Cannot agree more.

1 Like

I’m +1 to FOSS, and to simple-stats only. One question that the telemetry group has come up with for evaluating any data-gathering suggestion is “what change would we make in response to this metric?” If we can’t think of one - don’t gather the data.

As @misc mentioned, device-stats aren’t useful unless we have a UXD expert that’s going to work on multiple views of the site based on device. Even “what geos” isn’t very useful if we don’t have any bandwidth for localization efforts. (A danger here is that while we don’t plan on those things, if we gathered data and it showed that 90% of our site-views came from people in Tokyo on Android phones, maybe we should be finding bandwidth. You don’t know what you don’t know, basically. But I’m not convinced that possibility makes it worth the gathering…)

If the data is simple/abstract enough, the answer to “who can view it” becomes easy - anyone. The stats should be available to anyone who is curious. If that answer makes us nervous, we need to step back and think about why we’re gathering data for which that’s inappropriate.

I will admit to being a data junkie, who would absolutely love to have All The Info, just out of curiosity. But I don’t think scratching that itch “just because” is useful for our project :slight_smile:

2 Likes

I see the information around geos as useful for more than localization efforts. For example, it would cover meetups, or the impact of presentations during events, etc. If there is a presentation on Pulp during a specific event in France and we see an uptick in visits from that country, we can say the event was a good idea.

In fact, now that I think about it, there is also the question of referrers: would it be useful to see where the incoming traffic comes from, to later decide where to communicate (or where more effort should be placed)?

There are a few different ways to do this without using personally identifiable information.
For example, for a survey, you can generate a unique link for each location where you post the survey. No user data is captured beyond where in our ecosystem someone might have found the link.
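
The per-location link idea could look something like the sketch below. The query parameter name `src`, the helper name `tagged_link`, and the example.org URL are all placeholders for illustration, not something we use today.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tagged_link(base_url: str, source: str, param: str = "src") -> str:
    """Return base_url with a query parameter naming where the link was posted.

    The tag identifies the posting location (e.g. "discourse"), not the
    person who clicks, so no per-user data is involved.
    """
    parts = urlsplit(base_url)
    # Keep any query parameters the URL already has
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, source))
    return urlunsplit(parts._replace(query=urlencode(query)))


# One link per place the survey is posted (hypothetical locations)
links = {
    place: tagged_link("https://example.org/survey", place)
    for place in ("discourse", "mailing-list", "matrix")
}
```

Counting responses per `src` value then shows which channel people came from, without recording anything about the individual respondent.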

If you click share on Discourse, it generates a link with ?u=mcorr appended, and Discourse notifies you if the link receives more than twenty clicks, etc.

I wrote about Pulp a few times for opensource.com and the traffic on the website spiked after that. We’re so small that if there’s a sudden spike, it’s generally easy to understand why or how.

+1 to FOSS preferred to Google Analytics
+1 collection of page views

I’d appreciate selecting a tool that could provide future options for data. However, I am 100% OK with approaching that once we decide exactly what data is useful. For now, it might be best to go with the tool that is easiest, or one that other communities use successfully. I think we already know of enough higher-priority efforts that deserve attention to support community growth, and additional data won’t change or influence those priorities.

1 Like