Data Collection Part 1 - Front-End

Context

Our customers often ask what data is collected by Kameleoon, and for which purposes. Developers are also sometimes wondering where and how this data is stored or retrieved. This article serves both as a reference for listing all collected data, and as a guide providing technical details about the underlying implementation. We focus here on the client side, where data are mostly used for activation (targeting segments of audience, etc). Obviously these data are also usually sent to backend servers, for analytical and reporting purposes.

The choice of Local Storage as a data storage mechanism

Kameleoon needs to collect a lot of data, as does any web analytics solution. We're talking about URLs visited, flavor and version of the browser, time spent on your website... The main difference is that as a CRO platform, we also need to have permanent, real-time access to those data. A web analytics tool can just send all this data to a backend server, and use them only much later, for reporting purposes on a different browser (when a customer decides to consult the analytics platform). But a personalization or testing solution needs to be able to access them while the visit is in progress, and on the visitor's browser. So basically, we need to write and read large quantities of data on every different web page (URL). Since the JavaScript runtime environment is reset on each page change, it is impossible to just keep these data in memory.

Of course, one solution would be to always make a call at the beginning of each page load to a backend server to refetch those data. This is extremely costly in terms of performance: every page load would be delayed by this additional server call, and personalizations or experiments could only be triggered once the call is over. In addition, it scales very badly. So it is natural to try to write such data locally on the browser. Historically, on the early days of the web, there was only one method for this: cookies. Cookies are small pieces of data that are stored in the browser, and they can be written by JavaScript code. The problem is that they are not designed to hold the large quantities of data Kameleoon needs. Also, they are sent to every server call made on their associated domain (including static resources such as images), so even a few KBs in a cookie is bad for performance and bandwidth.

For this reason, we turned to the LocalStorage, and in some cases to the SessionStorage. The SessionStorage is a minor variation of the LocalStorage which has a limited lifespan: it is destroyed when the user closes the browser tab. In addition, it is scoped to a single tab (a tab on the same website cannot access the data written by another one).

Local Storage to the rescue

Local Storage, a standard Web technology supported by virtually all browsers in 2020, has some common characteristics with cookies. It's basically a data store in the browser, so it can handle data in the same way that cookies do. It can actually store much more data than cookies: the exact amount is browser dependent, but is usually around a few MBs (compared to a few KBs for cookies). Contrary to cookies, Local Storage content can be written only through JavaScript access, servers cannot store anything in this space. Local Storage data can only be retrieved via JavaScript as well; LS content is never sent out to remote HTTP servers, which makes it inherently more secure / private. Like cookies, LS data is tied to a domain, the domain of the web page where the JS code is executed. Once written, only code located in the exact same domain will be able to access this content (which is logical from a security point of view).

Given all of its characteristics, Local Storage seems to be an ideal candidate to replace cookies. Actually, several testing platforms already implement LS based storage instead of cookie based storage. However, there is a huge problem with Local Storage: it is limited to a single exact subdomain.

New technology, but new limitations: back to where we started

Unfortunately, naively implementing Local Storage support is far from being enough. This is because of the subdomain / protocol partitioning scheme used by Local Storage. Unlike cookies, where JavaScript code running on http://www.example.com can create a cookie associated with the domain *.example.com (which means that you can access this cookie from buy.example.com, or even buy.now.really.example.com), Local Storage is partitioned by exact subdomain AND protocol. If you write data on the LS on the subdomain http://www.example.com, you cannot access it later from a page athttp://buy.example.com, or even from https://www.example.com. Both protocol and domain have to match exactly.

For this very reason, unless you're lucky enough to have a single site and a single subdomain throughout this entire site, you will still suffer from serious issues if you use a solution relying on an easy Local Storage implementation. These issues would manifest themselves in a way quite similar to those caused by ITP, although the root cause is totally different. As an example, let's imagine that most of your e-commerce website is hosted on https://www.randomshop.com, but the conversion funnel is on https://transaction.randomshop.com. If your visitor ID is persisted on the LS, your analytics solution will report two different visitors when a complete purchase occurs, one seen on www.randomshop.com and the other on transaction.randomshop.com. Data does not transition consistently from one domain to another, and such a solution would see the two as completely separate sessions, rather than a single visitor session.

For A/B testing, this is - again - even worse. If you run an experiment modifying the navigation menu, with two variations, one user could very well be exposed to the first variation on the main site, but to the second one on the funnel! This is of course disastrous in terms of user experience, while also voiding the results of the experiment.

So we went further and implemented a full cross-domain Local Storage solution, as even a small percentage of pages on a different subdomain on your website can mean unreliable and even false results for A/B experiments. If you're curious about the technical details of the implementation, read this section.

List of collected data

We break this section in two subsections: the first one lists the data collected and stored by Kameleoon on all your visitors (except those that opted out of Kameleoon). The second lists all the data used only for internal use by Kameleoon, when you operate the Kameleoon back-office and editor applications. It is important to understand the difference: while the first keys are the "real" ones representing the data gathered for all your end users / customers, the second ones are only used by a few people at your organization (the ones configuring, implementing and QA'ing experiments and personalizations on Kameleoon). Usually, you can take into consideration only the first list for data privacy issues, as the second one concerns a few employees only.

Local Storage example for the kameleoonLegalConsent key:

kameleoonLegalConsent: {"value": {"AB_TESTING": true, "PERSONALIZATION": true},"expirationDate":1545297630228}

It's also worth noting that unlike cookies, LocalStorage does not have a built-in expiration mechanism. Thus our JavaScript engine implements an "emulation" of lifespan for local storage keys. Every key mentionned below has a JSON content with two keys, "value" and "expirationDate". "value" contains the actual information to be persisted, while "expirationDate" corresponds to a (virtual) expiration date when this LS key would not been read at all.

Data collected and stored for all the visitors of your website

Data stored in the LocalStorage

kameleoonVisitorCode: Contains the unique Kameleoon visitorCode identifier. Lifespan of 380 days.
kameleoonData: Contains a lot of data about the visitor and his visits (mostly browsing related data). This data is encoded on the LS via a simple encoding scheme. You can see the full list of data present on this key here. Lifespan of 380 days.
kameleoonExperiment-${experimentId}: Saves the experiment variation allocation, the ID of the variation (either “Reference” or the variation ID or “none” if a traffic exclusion has been configured) and the date on which the variation was assigned. Lifespan of 30 days by default.
kameleoonPersonalization-${personalizationId}: Saves the personalization variation allocation. Lifespan of 30 days.
kameleoonGlobalPersonalizationExposition: Represents the visitor's global exposition status for all personalizations. If the value is "false", this visitor won't be exposed to any personalization (and if it is set to "true", he will be if he triggers the required targeting conditions). By default, 10% of the global traffic is excluded from any personalization. This value is configurable on Kameleoon's back-office. Lifespan of 380 days.
kameleoonLegalConsent: Legal consent (usually RGPD or similar regulation) obtained for the use of Kameleoon. The value can be for instance {"AB_TESTING": true, "PERSONALIZATION": false} or {"PERSONALIZATION": true}. Lifespan of 380 days.
kameleoonOpenTabs: Contains ids of opened tabs on the same website. Lifespan of 380 days.

Data stored in the SessionStorage

kameleoonDisabledForVisit: Disables Kameleoon for the current visit. This is needed if you selected the "Disable Kameleoon for the entire visit" option in case of timeout (for your website settings on Kameleoon's back-office). By default, it will not be used.
kameleoonAnalyticsTrackingTimes: Used to optimize (reduce) the number of 3rd party analytics tracking calls made by Kameleoon when using a web analytics integration.

Data stored in cookies

Kameleoon only uses one cookie, and it uses it for implementing a Back-end / Front-end Bridge (important for Intelligent Tracking Prevention as well as other use cases). The same identifier is also saved in the LocalStorage.
In addition, one additional cookie may be set by the default Kameleoon CDN (Content Delivery Network) when delivering the application file, if you don't self host this file. Note that Kameleoon does not use this cookie at all. Deleting it has no impact on Kameleoon operations.

kameleoonVisitorCode: Contains the unique Kameleoon visitorCode identifier. Lifespan of 380 days.
__cfduid: This cookie is set by the default Kameleoon CDN. Lifespan of 30 days.

Temporary data stored only for Kameleoon internal use

Data stored in the LocalStorage

kameleoonSimulation: Data needed for simulation purposes. Lifespan of 1 hour.
kameleoonSimulationShortURL: Short URL of the simulation. Lifespan of 1 hour.
kameleoonSimulationVisitorData: Contains all the (virtual) data about the visitor and his visits for simulation. Lifespan of 1 hour.

Data stored in the SessionStorage

kameleoonFullApplicationCode: Contains Kameleoon entire codebase, which is needed for simulation purposes (as the production version of the application file is optimized for size and does not contain all the code).
kameleoonVariation-${variationId}: Variation code and data for preview or simulation purposes.

Details about the kameleoonData LocalStorage key

The kameleoonData key contains (in encoded form) the following data for each visits made by this particular visitor (on this device only, or ALL visits if you use cross-device history reconciliation).

custom data;
device type (mobile, tablet or desktop);
operating system;
name and version of the browser;
screen size;
window size;
time zone of the browser;
language of the browser;
original referrer (acquisition channel);
number of pages viewed;
title and URL of pages visited;
time spent on the website;
time of the beginning and end of the visit;
number of opened tabs;
adblocker activated;
list of conversions (clicks, transactions, etc);
list of personalizations and A/B experiments seen by the visitor;
current weather conditions (if the corresponding targeting condition is activated): temperature, wind, rain...;
time of sunset (present if a weather targeting condition has been activated, as it is required for some weather targeting criteria);
weather forecast (if the corresponding targeting condition is activated): temperature, wind, rain...;
geolocation (if the corresponding targeting condition is activated or if a weather targeting condition has been activated, as it requires geolocation data);
IP (if the corresponding targeting condition is activated);
external segmentation data (obtained from a third party DMP or CRM);
internal search history (if activated);
products seen (if activated).

How our engineering team implemented a cross-domain Local Storage for Kameleoon

The genesis of this project

We started the implementation of this project in 2014, and this progressively became a cornerstone of our whole Kameleoon architecture. At the time, we just needed a large storage space on the browser. We wanted to store all data gathered by Kameleoon on the front-end in order to run our predictive machine learning algorithms on that data (Yes, you read that correctly: we run neural network algorithms directly on the browser, in JavaScript. It works very well, but that's a topic for a different article). So of course, we investigated the Local Storage approach, which was correctly supported on most browsers even back in 2014. We quickly understood that it was ideal for what we wanted to achieve, but that the domain partitioning scheme would be a real problem that would make it unusable for many of our customers. So we decided to implement a cross-domain Local Storage.

Initially, we estimated the amount of development time required by our JavaScript engine team to be around 3 months. We were completely wrong: it actually took about 3 years (!) to finish this work and have a reliable, production-ready cross-domain LS - even with a stellar team of JS experts. We clearly underestimated the amount of effort required, because the base idea behind cross-domain LS is not that complex (or just seems that way). This is for example (correctly) mentionned in Simo Ahava's blog. Instead of fetching data from the Local Storage directly via JavaScript code running in the context, you load an iFrame (hosted on a different domain) and send a data query to this iFrame via the postMessage() method. Then within the iFrame, with JavaScript code running on a different domain / context, you retrieve data from the LS associated and then again use the postMessage() method to send back the data to the "host" page / domain. To store data, the principle is exactly the same, you send a message to the iFrame containing the data to be written.

This works well, because whatever the URL / domain of your host pages, you always load the same iFrame hosted on the same domain, and data is always read and written in this domain. Thus the full visitor journey, as seen by the data store, always occurs in a unique domain. In addition, it's worth noting that the iFrame file is cached on the browser (it's a static file), so this doesn't hurt performance too much.

As is often the case with software development, the main difficulty of the implementation was not with the base mechanism, but rather with some unforeseen challenges.

The 3 main challenges we faced

Without going into too much details, because we want to keep some in-house implementation features secret, these are the three nightmares we encountered on our path to a reliable cross-domain Local Storage:

Asynchronous I/O. This was clearly the most painful point. Since reading and writing data has to happen via postMessage(), all data reads and writes happen asynchronously. This creates a very difficult programming environment for our engine, especially in JavaScript where developers are not used to this kind of hindrance. All the code of our engine had to be rewritten within this new paradigm, and carefully checked.
Multiple open tabs on the same site / customer. Associated with the first point, this makes for an even more unstable environment where data can be updated randomly and at any time. Indeed, several tabs can be reading and writing at the same time on the LS iFrame, and all of that happens asynchronously! In addition, debugging these kind of issues is extremely difficult, as it happens only on our customer's visitor's browsers, where it's impossible to add any development or logging tool.
Performance issues. One of our main focuses at Kameleoon has always been performance: we consider our platform to be the best - regarding performance - among front-end optimization tools. Using an additional external iFrame added additional hurdles, and we had to implement several tweaks to continue supporting our Anti-Flicker technology within this context.

From a timeline perspective, the initial base implementation was done in about 3 months as initially planned. However, it took an additional 18 months to refactor and reimplement our entire engine to fix the 3 issues mentionned above. Thus we deployed the first implementations - in beta - to selected customers almost two years after. Following this, it took about one more year to fix all remaining issues, as some were quite difficult to reproduce due to their volatile nature. We still remember fixing one particularly nasty bug after months of tracking and debugging! In the end, we consider that our implementation reached production maturity in 2017, 3 years after the first lines of code were initially written. Our first successful pilots in Predictive Targeting / A.I. Personalization, which relied heavily on this feature, were launched around this time as well.

Cross-domain Local Storage and ITP

When the first versions of ITP (Intelligent Tracking Prevention, on Safari) came out, we actually had to perform some minor changes to our system. Strangely, external LS domains, in the same way as cookies, are actually treated as third-party context by all ITP versions. We initially hosted the common iFrame file on our own (Kameleoon) domain, not on the customer domain (mostly to facilitate setup), so the common LS data was considered to be third-party and was blocked by Safari. To fix this, we just changed the hosting of the iFrame so that it be hosted on the customer domain.

This has the added benefit that loading the iFrame is not even needed if you are currently on the main customer domain, as you can directly write to the LS without having to use postMessage(). Our implementation thus autodetects if the visitor is on the main domain from the current URL. If it is the case, the iFrame file is not loaded and LocalStorage is accessed directly. If it's not the case, the iFrame is loaded and LocalStorage is accessed via the iFrame / postMessage().