This is a text-only exported version of the thesis. See also the pdf version, as well as the thesis site which contains the open source code, open data, presentation video, presentation slides, and further papers based on this research.


Master’s thesis
Swedes Online: You Are More Tracked Than You Think

Joel Purra
mig@joelpurra.se, joepu444
http://joelpurra.com/
+46 70 352 1212
Abstract
When you are browsing websites, third-party resources record your online habits; such tracking can be considered an invasion of privacy. It was previously unknown how many third-party resources, trackers and tracker companies are present in the different classes of websites chosen: globally popular websites, random samples of .se/.dk/.com/.net domains and curated lists of websites of public interest in Sweden. The in-browser HTTP/HTTPS traffic was recorded while downloading over websites, allowing comparison of HTTPS adoption and third-party tracking within and across the different classes of websites.
The data shows that known third-party resources, including known trackers, are present on over 90% of the websites in most classes, that third-party hosted content such as video, scripts and fonts makes up a large portion of the known trackers seen on a typical website, and that tracking is just as prevalent on secure as on insecure sites.
Observations include that Google is by far the most widespread tracker organization, that the serving of content by known trackers may suggest that trackers are moving towards providing services to the end user in order to avoid being blocked by privacy tools and ad blockers, and that the small difference in tracking between HTTP and HTTPS connections may suggest that users are given a false sense of privacy when using HTTPS.

Acknowledgments

The Internet Infrastructure Foundation (.SE)

This thesis was written in the office of – and in collaboration with – .SE, also known as Stiftelsen för internetinfrastruktur (IIS), who graciously supported me with domain data and internet knowledge. Part of .SE’s research efforts include continuously analyzing internet infrastructure and usage in Sweden. .SE is an independent organization for the benefit of the public, responsible for the Swedish top level domain, that promotes the positive development of the internet in Sweden.

Thesis supervision

Niklas Carlsson
Associate Professor (Swedish: docent and universitetslektor) at the Division for Database and Information Techniques (ADIT), Department of Computer and Information Science (IDA), Linköping University, Sweden. Thank you for being my examiner!
Patrik Wallström
Project Manager within R&D, .SE (The Internet Infrastructure Foundation), Sweden. Thank you for being my technical supervisor!
Staffan Hagnell
Head of New Businesses, .SE (The Internet Infrastructure Foundation), Sweden. Thank you for being my company supervisor!
Anton Nilsson
Master's student in Information Technology, Linköping University, Sweden. Thank you for being my opponent!

Domains, data and software

.SE (Richard Isberg, Tobbe Carlsson, Anne-Marie Eklund-Löwinder, Erika Lund), DK Hostmaster A/S (Steen Vincentz Jensen, Lise Fuhr), Reach50/Webmie (Mika Wenell, Jyry Suvilehto), Alexa, Verisign. Disconnect.me, Mozilla. PhantomJS, jq, GNU Parallel, LyX. Thank you!

Tips, feedback, inspiration and help

Dwight Hunter, Peter Forsman, Linus Nordberg, Pamela Davidsson, Lennart Bonnevier, Isabelle Edlund, Amar Andersson, Per-Ola Mjömark, Elisabeth Nilsson, Mats Dufberg, Ana Rodriguez Garcia, Stanley Greenstein, Markus Bylund. Thank you!
And of course everyone I forgot to mention – sorry and thank you!

Table of Contents

Acknowledgments
Chapter 1 Introduction
Chapter 2 Background
Chapter 3 Methodology
Chapter 4 Results
Chapter 5 Discussion
Chapter 6 Related work
Chapter 7 Conclusions and future work
Appendix A Methodology details
Appendix B Software
Appendix C Detailed results
The end
Table 3.1:Domain lists in use
Table 3.2:Disconnect's categories
Table 3.3:Organizations in more than one category
Table 3.4:TLDs in dataset in use
Table 3.5:.SE Health Status domain categories
Table 5.1:.SE Health Status HTTPS coverage 2008-2013
Table 5.2:Cat and Mouse ad coverage
Table 5.3:Cat and Mouse ad server match distribution
Table 5.4:Follow the Money aggregators' revenue, users and publisher coverage fractions plus this thesis' Disconnect organization coverage
Table 5.5:Top P3P values
Table A.1:Machine specifications
Table A.2:Google Tag Manager versus Google Analytics and DoubleClick
Table A.3:Output file size
Table A.4:Dataset variations
Table A.5:System load
Table A.6:Mime-type grouping
Figure 3.1:Domains per organization
Figure 4.1:Selection of HTTP-www and HTTPS-www variations from Figure #, # and #
Figure 4.2:Selection of HTTP-www and HTTPS-www variations from Figure #, # and #
Figure 4.3:Small versions of Figure C.3, C.5 and C.8 showing a selection of HTTP-www and HTTPS-www variations
Figure C.1:Distribution of HTTP status codes
Figure C.2:Distribution of domains with strictly internal, mixed or strictly external resources
Figure C.3:Cumulative distribution of the ratio of internal resources per domain
Figure C.4:Distribution of domains with strictly secure, mixed or strictly insecure resources
Figure C.5:Cumulative distribution of the ratio of secure resources per domain
Figure C.6:Distribution of domains with strictly secure, mixed or strictly insecure redirects
Figure C.7:Cumulative distribution of the ratio of Disconnect's blocking list matches for all resources per domain
Figure C.8:Cumulative distribution of the number of organizations per domain
Figure C.9:Ratio of domains with requests to Disconnect's categories
Figure C.10:Ratio of domains with requests to Google, Facebook and Twitter
Figure C.11:Distribution of external primary domains detected/undetected by Disconnect

Nomenclature

.SE
The Internet Infrastructure Foundation. An independent organization for the benefit of the public that promotes the positive development of the internet in Sweden. .SE is responsible for the .se country code top level domain.
.com
A generic top level domain. It has the greatest number of registered domains of all TLDs.
.dk
The country code top level domain name for Denmark.
.net
A generic top level domain.
.se
The country code top level domain name for Sweden.
Alexa
A web traffic statistics service, owned by Amazon.
CDF
Cumulative distribution function.
CDN
Content delivery network
Content
Information and data that is presented to the user. Includes text, images, video and sound.
Content delivery network
(CDN) The speed at which data can be delivered is dependent on the distance between the user and the server. To reduce latency and download times, a content delivery network places multiple servers with the same content in strategic locations, both geographically and in terms of network topology, closer to groups of users. For example, a CDN could deploy servers in Europe, the US and Australia, and reduce loading times by setting up the system to automatically use the closest location.
Cumulative distribution function
(CDF) In this thesis usually a graph which shows the ratio of a property as seen per domain on the x axis, with the cumulative ratio of domains which show this property on the y axis. The steeper the curve is above an x value range, the higher the ratio of domains which fall within the range.
DNT
Do Not Track
Do Not Track
(DNT) An HTTP header used to indicate that the server should not record and track the client's traffic and other data.
Domain name
A human-readable way to navigate to a service on the internet: example.com. Often implicitly meaning FQDN. Domains are also used, for example, as logical entities in regards to security and privacy scopes on the web, often implemented as same-origin policies. As an example, HTTP cookies are bound to the domain that set them.
External resource
A resource downloaded from a domain other than the page that requested it was served from.
External service
A third party service that delivers some kind of resource to the user's browser. The service itself can vary from showing additional information and content, to ads and hidden trackers. External services include file hosting services, CDNs, advertising networks, statistics and analytics collectors, and third party content.
FQDN
Fully qualified domain name
Fully qualified domain name
(FQDN) A domain name specific enough to be used on the internet. Has at least a TLD and a second-level domain name - but oftentimes more depending on TLD rules and organizational units.
GOCS
Government-owned corporations
Government-owned corporations
(GOCS) State-owned corporations.
HAR
HTTP Archive (HAR) format, used to store recorded HTTP metadata from a web page visit. See the software chapter.
HTTP
Hypertext Transfer Protocol
HTTPS
Secure HTTP, where data is transferred encrypted.
Hypertext Transfer Protocol
(HTTP) A protocol to transfer HTML and other web page resources across the internet.
JSON
JavaScript Object Notation
JavaScript Object Notation
(JSON) A data format based on JavaScript objects. Often used on the internet for data transfer. Used in this thesis as the basis for all data transformation.
P3P
Platform for Privacy Preferences (P3P) Project
Parked domain
A domain that has been purchased from a domain name retailer, but only shows a placeholder message – usually an advertisement for the domain name retailer itself.
Platform for Privacy Preferences Project
(P3P) A W3C standard for HTTP where server responses are annotated with an encoded privacy policy, so the client can display it to the user. Work has been discontinued since 2006.
Primary domain
For the thesis, the first non-public suffix part of a domain name has been labeled the primary domain. For example, company-abc.com.br has been labeled the primary domain for www.company-abc.com.br, as .com.br is a public suffix.
Public suffix
The part of a domain name that is unavailable for registrations, used for grouping. All TLDs are public suffixes, but some have one or more levels of public suffixes, such as .com.br for commercial domains in Brazil or .pp.se for privately owned personal domains (a public suffix which has been deprecated, but still exists).
Resource
An entity external to the HTML page that requested it. Types of resources include images, video, audio, CSS, JavaScript and Flash animations.
SLD
Second-level domain
Second-level domain
(SLD) A domain that is directly below a TLD. Can be a domain registerable to the public, or a ccSLD.
Subdomain
A domain name that belongs to another domain name zone. For example service.example.net is a subdomain to example.net.
Superdomain
For the thesis, domains in parent zones have been labeled superdomains to their subdomains, such as example.se being a superdomain to www.example.se.
TLD
Top level domain.
Third-party content
Content served by another organization than the organization serving the explicitly requested web page. Also see external resource.
Third-party service
A service provided by an organization other than the explicitly requested service. Also see external service.
Top level domain
(TLD) The last part of a domain name, such as .se or .com. Registration of TLDs is handled by ICANN.
Tracker
A resource external to the visited page, which upon access receives information about the user's system and the page that requested it. Basic information in the HTTP request to the resource URL includes the user agent (browser vendor, type and version down to the patch level, operating system, sometimes hardware type), the referer (the full URL of the page that requested the resource), an etag (unique string identifying the data from a previous request to the same resource URL) and cookies (previously set by the same tracker).
URL
Uniform Resource Locator
Uniform Resource Locator
(URL) A standard to define the address to resources, mostly on the internet, for example http://joelpurra.com/projects/masters-thesis/
Web browser
Or browser. Software a user utilizes to retrieve, present and traverse information from the web.
Web service
A function performed on the internet, and in this document specifically web sites with a specific purpose directed towards human users. This includes search engines, social networks, online messaging and email as well as content sites such as news sites and blogs.
Web site
A collection of web pages under the same organization or topic. Often all web pages on a domain are considered a site, but a single domain can also contain multiple sites.
Zone
A technical as well as administrative part of DNS. Each dot in a domain name represents another zone, from the implicit root zone to TLDs and privately owned zones – which in turn can contain more privately controlled zones.
ccSLD
Country-code second-level domain. An SLD that belongs to a country code TLD. A ccSLD is not itself open for registration by the public; registrants are instead required to register their domains on the third domain level.
ccTLD
A top level domain based on a country code, such as .se or .dk.
gTLD
Generic top level domain such as .com or .net.
jq
A tool and domain specific programming language to read and transform JSON data. See the software chapter.
phantomjs
Browser software used for automated web site browsing. See the software chapter.

Chapter 1
Introduction

How many companies are recording your online trail, and how much information does the average Swede leak while using popular .se websites? Many, and a lot – more than you may think. Large organizations like Google, Facebook and Amazon are able to connect the dots you leave behind during everyday usage, and construct a persona that reflects you from their perspective. Have you told your family, friends or colleagues about your gambling addiction, your sex toy purchases, or your alcoholism? Even if you did not tell anyone your deepest secrets, these companies might conclude that they can put labels on you by looking at everything you do online. And now they are selling it as hard facts behind the scenes.
While browsing the web, users are both actively and passively being tracked by multiple companies, for the purpose of building a persona for targeted advertising. Sometimes the data collection is visible, as in social network sites and questionnaires, but it is most common in the form of different kinds of external resources which may or may not serve a purpose other than keeping track of your every click. Secure connections between server and client help against passive data collection along the network path, but not against site owners allowing in-page trackers. Tracking code is installed on web pages that have adverts as well as those that do not – the spread and reach of tracking across web pages and domains of different kinds increases the quality of the user data collected and inferred, making it more valuable for advertising purposes. With the extent of the use of trackers and other external resources largely unknown and ever evolving, what is already known raises privacy concerns – data considered personal leaks without the user's knowledge or explicit permission and ends up in privately owned databases for further distribution. Data collection is the new wild west, and you are the new cattle.
This thesis uses large-scale measurements to characterize how different kinds of domains in Sweden and internationally use website resources. Front pages of approximately random .se, .dk, .com and .net domains and Swedish, Danish and Alexa's top domains were visited and their resources, including those dynamically loaded, recorded. Each domain was accessed both with insecure HTTP and secure HTTPS connections to provide a comparison. Resources were grouped by mime type, URL protocol and domain, classified by whether they match the domain the request originated from, and compared to lists of known trackers and organizations. The thesis makes three primary contributions:
  1. Software for automated, repeatable retrieval and analysis of large numbers of websites has been developed, and released as open source (see Appendix B). Datasets based on publicly available domain lists have been released for scientific scrutiny. The data allows analysis of websites' HTTP/HTTPS requests, including the use of resources internal versus external to the entry domain, which the most common confirmed tracker organizations are, what spread they have and how much the average internet user can expect to be tracked by visiting some of the most important and popular sites in Sweden, Denmark and worldwide. Downloading and analyzing additional/custom datasets is very easy.
  2. HTTPS usage for different domains has been characterized from a Swedish perspective; adoption rates are compared between classes of domains within Sweden as well as against popular international domains (see Section 4.1). HTTPS adoption among globally popular websites (10-30%, 50% for the very top) and curated lists of Swedish websites (15-50%) is much higher than for random domains (less than 1%). This means that most websites in the world are susceptible to passive eavesdropping anywhere along the network path between the client and the server. But even with HTTPS enabled, traffic data and personally identifiable information is leaked through external resources and third-party trackers, which are just as prevalent on insecure HTTP as on secure HTTPS enabled websites (see Sections 4.2 and 4.3). This means that a secure, encrypted connection protecting against eavesdropping doesn't automatically lead to privacy – something which users might be led to believe when it is called a “secure connection” as well as through the use of “security symbols” such as padlocks.
  3. The use of known or recognized third-party trackers and other third-party (external) services for different classes of domains has been analyzed. Using public lists of recognized tracker domains, we analyzed and compared the widespread adoption of these services across domains within Sweden, as well as internationally. The use of external resources is high among all classes of domains (see Section 4.2). Websites using strictly internal resources are relatively few; less than 7% of top sites, even fewer in most categories of curated lists of Swedish websites, but more common among random domains at 10-30%. This means most websites around the world have made an active choice to install external resources from third-party services, which means that users' traffic data and personal information is leaked (see Section 4.3). Most websites also have at least one known tracker present; 53-72% of random domains, 88-98% of top websites and 78-100% of websites in the Swedish curated lists.
    The number of known tracker organizations present is interesting to look at, as a higher number means users have less control over where leaked data ends up (4.3.2). Around 55% of random Swedish domains have 1-3 trackers, and about 5% have more than 3. Nearly 50% of global top sites load resources from 3 or more tracker organizations, while about 5% load from more than 20 organizations. Half of the Swedish media websites load more than 6 known trackers; a single visit to the front page of each of the 27 investigated sites would leak information in over external requests (C.5) to at least 57 organizations (C.11.1). This means that any guesswork about what types of articles these individuals would have read in a printed newspaper is gone – and with it, probably, the guesswork about exactly what kind of personal opinions these individuals hold.
    It is clear that Google has the widest coverage by far – Google trackers alone are present on over 90% of websites in over half of the datasets (4.3.3). That being said, it is also hard to tell how many trackers are missed – Disconnect's blocking list only detects 10% of external primary domains as trackers for top website datasets (4.3.4).

Chapter 2
Background

In everyday web browsing, browsers routinely access a lot of material from other domains or services than the one visited [11]. These external resources vary from content that the user explicitly wants to obtain, to implicitly loaded third-party services, ads, and non-visible resources with the sole purpose of collecting user data and statistical material [22]. All are downloaded on behalf of the user with no or few limitations, and oftentimes without the user's need for, understanding of, or explicit consent to them. These external resources can all be seen as browsing habit trackers, whose knowledge and power increase with any additional visits to other domains or services loading the same resources [35]. While privacy is both hard to define as well as relative to perspective and context, there is a correlation between trackers and online privacy; more trackers means it becomes harder to control the flow of personal information and get an overview of where data ends up [41, 7].

Trackers are a commercial choice

While online privacy has been in the spotlight due to recently uncovered mass surveillance operations, the focus has been on national government intelligence agencies collecting information around the globe. Public worry regarding surveillance in Sweden is low. Only 9% of adult Swedish internet users say they worry to some degree about government surveillance, but at 20% twice as many worry about companies' surveillance – a number that has been steadily rising from 11% in 2011 [2, 14]. Governments are able to intercept traffic data and metadata by, among several techniques, covertly hooking into the internet infrastructure and passively listening. Basic connection metadata can always be collected, but without secure connections between client and server, any detail in the contents of each request can be extracted.
In contrast, external resources are approved by and actively installed by site and service owners, and presented openly to users with basic technical skills and tools. Reasons can be technical, for example because distributing resources among systems improves performance [22, 21]. Other times it is because there are positive network effects in using a third-party online social network (OSN) to promote content and products. Ads are installed as a source of income. More and more commonly, allowing a non-visible tracker to be installed can also become a source of income – data aggregation companies pay for access to users' data on the right site with the right quantity and quality of visitors. Because these external resources are used on behalf of the service, they are also loaded when end-to-end encryption with HTTPS is enabled for enhanced privacy and security. This encryption bypass gives these private trackers more information than possible with large-scale passive traffic interception, even when there is a security-nullifying mixture of encrypted and unencrypted connections.

What is known by trackers?

Depending on what activities a user performs online, different things can be inferred by trackers on sites where they are installed. For example, a tracker on a news site can draw conclusions about interests from content a user reads (or chooses not to) by tagging articles with refined keywords and creating an interest graph [24]. The range of taggable interests of course depends on the content of the news site. Private and sensitive information leaked to third-party sites during typical interaction with some of the most popular sites in the world includes personal identification (full name, date of birth, email, ip address, geolocation) and sensitive information (sexual orientation, religious beliefs, health issues) [35].
Social buttons, allowing users to share links with a simple click, are tracking users whether they are registered or not, logged in or not [39]. They are especially powerful when the user is registered and logged in, combining the full self-provided details of the user with their browsing habits – all within the bounds of the services' privacy policies agreed to by the user. Once a user has provided their personal information, it is no longer just the individual browser or device being tracked, but the actual person using it – even after logging out [19, 23]. This direct association, as opposed to inferred, to the person also allows for tracking across devices where there is an overlap of services used.

What is the information used for?

Publishers reserve areas of their web pages for displaying different kinds and sizes of advertisements alongside content. Ads chosen for the site may be aligned with the content, but ad space becomes more valuable the more is known about the visitors. Combining and aggregating information from past visitors means that more information can be assumed about future visitors, on a statistical basis, which will define the general audience of the site. To generate even more revenue per displayed ad, individual users are targeted with personalized ads depending on their specific personal data and browsing history [16].
Indicators such as geographic location and hardware platform/browser combinations have been shown to result in price steering and price discrimination on some e-commerce websites [18, 36]. While the effects of web-wide user tracking have not been broadly measured with regards to pricing in e-commerce, using a larger and broader portion of a user's internet history and contributions would be a logical step for online shopping, as it has been used to personalize web search results and social network update feeds [9, 38].
Social networks can use website tracking data about their users to increase per-user advertising income through personalization, but they will try to keep most of the information to themselves [40, 3, 46]. There are also companies that only collect information for resale – data brokers or data aggregators – which thrive on combining data sources and packaging them as targeted information for other companies to consume. The market for tracking data resale is expected to grow, as the amount of data increases and quality improves. The Wall Street Journal investigated some of these companies and their offerings:
Some brokers categorize consumers as "Getting By," "Compulsive Online Gamblers" and "Zero Mobility" and advertise the lists as ideal leads for banks, credit-card issuers and payday and subprime lenders, according to a review of data brokers' websites. One company offers lists of "Underbanked Prime Prospects" broken down by race. Others include "Kaching! Let it Ride Compulsive Online Gamblers" and "Speedy Dinero," described as Hispanics in need of fast cash receptive to subprime credit offers.

Chapter 3
Methodology

Emphasis for the thesis is on a technical analysis, producing aggregate numbers regarding domains and external resources. Social aspects and privacy concerns are considered out of scope.

High level overview

Based on a list of domains, the front page of each domain is downloaded and parsed the way a user's browser would. The URL of each requested resource is extracted, and associated with the domain it was loaded from. This data is then classified in a number of ways, before being boiled down to statistics about the entire dataset. Lastly, these aggregates are compared between datasets. In the following sections we describe each of these steps in more detail. For yet more details of the methodology, we refer to Appendix A. (To help the reader, explicit references of the form A.1 are used to refer to Section A.1 of Appendix A.) The software developed is described in Appendix B and the details of the results are presented in Appendix C.
The thesis is primarily written from a Swedish perspective. This is in part because .SE has access to the full list of Swedish .se domains, and in part because of their previous work with the .SE Health Status reports (6.1). The reports focus on analyzing government, media, financial institutions and other nation-wide publicly relevant organization groups' domains, as they have been deemed important to Sweden and Swedes. This thesis incorporates those lists, but focuses only on the associated websites.

Domain categories

Curated lists
The .SE Health Status reports use lists of approximately domains in the categories counties, domain registrars, financial services, government-owned corporations (GOCS), higher education, ISPs, media, municipalities, and public authorities (A.1.1). The domains are deemed important to Swedes and internet operations/usage in Sweden.
Top lists
Alexa's Top sites (A.1.5) and Reach50 (A.1.6) are compiled from internet usage, internationally and in Sweden respectively. The Alexa top list is freely available and used in other research; four selections of the domains were used – top , random , all .se and all .dk domains.
Random zone lists
To get a snapshot of the status of general sites on the web, random selections directly from the .se (A.1.2), .dk (A.1.3), .com and .net (A.1.4) TLD zones were used. The largest set was .se domains; domains each from .dk, .com and .net were also used.
Table 3.1 summarizes the domain lists and samples from each of these lists used in the thesis. More details on each sublist are provided in Appendix A. However, at a high level we categorize the lists in three main categories. In total there are more than domains considered.
We note that it is incorrect to assume that domain ownership is always based on second-level domains, such as iis.se or joelpurra.com. Not all TLDs' second-level domains are open for registration to the public; examples include the Brazilian top level domain .br, which only allows commercial registrations under .com.br. There is a set of such public suffixes used by browser vendors to implement domain-dependent security measures, such as preventing super-cookies (A.2). The list has been incorporated into this thesis as a way to classify (A.5.4) and group domains such as company-abc.com.br and def-company.com.br as separate entities, instead of incorrectly seeing them as simple subdomains of the public suffix .com.br – technically a second-level domain.
For the thesis, the shortest non-public suffix part of a domain has been labeled the primary domain. The domain example.com.br is the primary domain for machine100.services.example.com.br, as .com.br is a public suffix. The term superdomain has also been used for the opposite of subdomain; example.org is a superdomain of www.example.org.
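As an illustration of this classification, a minimal sketch in Python follows; the thesis tooling is implemented with jq scripts and the full public suffix list, so the truncated suffix set and the function name below are assumptions for illustration only:
    # Minimal sketch of primary domain extraction against a (truncated)
    # public suffix set; the thesis uses the full public suffix list.
    PUBLIC_SUFFIXES = {"com", "net", "se", "dk", "br", "com.br", "pp.se"}

    def primary_domain(fqdn: str) -> str:
        """Return the shortest non-public-suffix domain, e.g.
        'example.com.br' for 'machine100.services.example.com.br'."""
        labels = fqdn.lower().rstrip(".").split(".")
        # Walk from the TLD towards the left until the suffix is no longer public.
        for i in range(len(labels) - 1, -1, -1):
            candidate = ".".join(labels[i:])
            if candidate not in PUBLIC_SUFFIXES:
                return candidate
        return fqdn

    assert primary_domain("machine100.services.example.com.br") == "example.com.br"
    assert primary_domain("www.iis.se") == "iis.se"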

Capturing tracker requests

One assumption is that all resources external to the initially requested (origin) domain can act as trackers – even static (non-script, non-executable) resources with no capability to dynamically survey the user's browser can collect data and track users across domains using, for example, the referer (sic) HTTP header [22]. While there are lists of known trackers, used by browser privacy tools, they are not 100% effective [35, 22] as they are not complete, always up to date or accurate. Lists are instead used to emphasize those external resources as confirmed and recognized trackers.
Resources have not been blocked in the browser during website retrieval, but have been matched by URL against a third-party list in the classification step (A.5.4) of the data analysis. This way trackers dynamically triggering additional requests have also been recorded, which can make a difference if they access another domain or another organization's trackers in the process.
The tracker list of choice is the one used in the privacy tool Disconnect.me, where it is used to block external requests to (most) known tracker domains (A.3). It consists of domains, each belonging to one of 980 organizations and one of five categories – see Table 3.2 for the number of domains and organizations per category. The domain level blocking fits well with the thesis' internal versus external resource reasoning. Because domains are linked to organizations as well as broadly categorized, blocking aggregate counts and coverage can form a bigger picture.
Not all domains in the list are treated the same by Disconnect.me; despite being listed as known trackers, the content category (A.3.6) is not blocked by default in order to not disturb the normal user experience too much. Most organizations are only associated with one domain, but some organizations have more than one domain (A.3.3). Figure 3.1 shows the number of organizations (out of the 980 organizations) that have a certain number of tracker domains (x axis). We see that 47% (459 of 980) have at least two domains listed by Disconnect.me. Google (rightmost point) alone has 271 domains and Yahoo has 71. Some organizations have their domains categorized in more than one category, as shown in detail in Table 3.3. Due to the relaxed blocking of the content category this can provide a way to track users despite being labeled a tracker organization.
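As a rough sketch of how a request's domain can be matched against a Disconnect-style list – where a domain or any of its subdomains maps to an organization and a category – consider the following Python illustration; the entries shown are placeholders rather than the actual list contents, and the real matching is performed with jq in the classification step (A.5.4):
    # Sketch of Disconnect-style matching: a request counts as a known tracker
    # if its domain, or any parent domain, appears in the list. The entries
    # below are placeholders, not the actual Disconnect.me list.
    TRACKER_DOMAINS = {
        "google-analytics.com": ("Google", "Analytics"),
        "doubleclick.net": ("Google", "Advertising"),
        "facebook.net": ("Facebook", "Social"),
    }

    def match_tracker(request_domain: str):
        """Return (organization, category) if the domain or a parent domain is
        listed, otherwise None. Matching is by domain only, not URL path."""
        labels = request_domain.lower().split(".")
        for i in range(len(labels)):
            candidate = ".".join(labels[i:])
            if candidate in TRACKER_DOMAINS:
                return TRACKER_DOMAINS[candidate]
        return None

    print(match_tracker("www.google-analytics.com"))  # ('Google', 'Analytics')
    print(match_tracker("static.example.org"))        # None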
While cookies used for tracking have been a concern for many, they are not necessary in order to identify most users upon return, even uniquely on a global level [10]. Cookies have not been considered to be an indicator of tracking, as it can be assumed that a combination of other server and client side techniques can achieve the same goal as a normal tracking cookie [1].

Data collection

The lists of domains have been used as input to har-heedless, a tool specifically written for this thesis (B.2.2). Using the headless browser phantomjs, the front page of each domain has been accessed and processed the way a normal browser would (A.4.3). HTTP/HTTPS traffic metadata such as requested URLs and their HTTP request/response headers have been recorded in the HTTP Archive (HAR) data format (B.1.1).
In order to make comparisons between insecure HTTP and secure HTTPS, domains have been accessed using both protocols. As websites traditionally have been hosted on the www subdomain, not all domains have been configured to respond to HTTP requests to the primary domain – thus both the added www prefix and no added prefix have been accessed. This means four variations for each domain in the domain lists, quadrupling the number of accesses (A.4.4) to over . List variations have been kept separate; downloaded and analyzed as different datasets (3.6).
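The four variations can be illustrated with a small Python sketch; the labels used here are illustrative, not necessarily the exact dataset naming used by the retrieval scripts:
    # Each domain is accessed in four variations: HTTP/HTTPS, with and
    # without an added www prefix. Labels are illustrative only.
    def domain_variations(domain: str) -> dict:
        return {
            "http": "http://" + domain + "/",
            "http-www": "http://www." + domain + "/",
            "https": "https://" + domain + "/",
            "https-www": "https://www." + domain + "/",
        }

    print(domain_variations("example.se"))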
Multiple domains have been retrieved in parallel (A.4.3), with adjustable parallelism to fit the machine's capacity (A.4.5). To reduce the risk of intermittent errors – either in software, on the network or in the remote system – each failed access has been retried up to two times (A.4.6).
Details about website and resource retrieval can be found in A.4.

Data analysis and validation

With HAR data in place, each domain list variation is analyzed as a single dataset by the purpose-built har-dulcify (B.2.3). It uses the command line JSON processor jq (B.1.3) to transform the JSON-based HAR data to formats tailored to analyze specific parts of each domain and their HTTP requests/responses.
Data extracted includes URL, HTTP status, mime-type, referer and redirect values – both for the origin domain's front page and any resources requested by it (A.5.2). Each piece of data is then expanded, to simplify further classification and extraction of individual bits of information; URLs are split into components such as scheme (protocol) and host (domain), the status is labeled by status group and the mime-type is split into type and encoding (A.5.3).
Once data has been extracted and expanded, there are three classification steps. The first loads the public suffix list and matches domains against it, in order to separate the FQDN into public suffixes and private prefixes, and to extract the primary domain (A.5.4). The primary domain, which is the first non-public suffix match, or the shortest private suffix, is used as one of the basic classifications; is an HTTP request made to a domain with the same primary domain as the origin domain's primary domain? Other basic classifications (A.5.4) compare the origin domain with each requested resource's URL, to see if they are made to the same domain, a subdomain or a superdomain. Same domain, subdomain, superdomain and same primary domain requests often overlap in their classification – collectively they are called internal requests. Any request not considered an internal request is an external request – which is one of the fundamental ideas behind the thesis' result grouping (C.4). Mime-types are counted and grouped, to show differences in resource usage (C.9). To get an overview of domain groups, their primary domains and public suffixes (C.10) are also kept. Another fundamental distinction is whether a request is secure – using the HTTPS protocol – or insecure. Finally, Disconnect's blocking list (3.3) is mixed in, to be able to potentially classify each request's domain as a known tracker (A.5.4), which includes a mapping to categories and organizations (C.11).
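A simplified Python sketch of the internal/external and secure/insecure classification is given below; it assumes a primary_domain() helper like the one sketched earlier, and is only an approximation of the jq implementation:
    from urllib.parse import urlsplit

    # Sketch of per-request classification: internal vs external and
    # secure vs insecure. A primary_domain() helper is assumed.
    def classify_request(origin_domain, request_url, primary_domain=lambda d: d):
        parts = urlsplit(request_url)
        host = (parts.hostname or "").lower()
        internal = (
            host == origin_domain                     # same domain
            or host.endswith("." + origin_domain)     # subdomain
            or origin_domain.endswith("." + host)     # superdomain
            or primary_domain(host) == primary_domain(origin_domain)
        )
        return {
            "internal": internal,
            "external": not internal,
            "secure": parts.scheme == "https",
        }

    print(classify_request("www.example.se", "https://static.example.se/a.js",
                           primary_domain=lambda d: ".".join(d.split(".")[-2:])))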
After classification has completed, numbers are collected across the dataset's domains (A.5.5). Counts are summed up per domain (C.5), but also reduced to boolean values indicating if a request matches a certain classification property or primary/tracker domain, so that a single domain making an excessive number of requests would not skew numbers aggregated across domains. This allows domain coverage calculations, meaning on what proportion of domains a certain value is present.
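A sketch of the coverage computation, reducing each domain's requests to a boolean before averaging, could look as follows; the names are illustrative only:
    # Coverage sketch: count a domain once if at least one of its requests
    # matches, so request-heavy domains do not skew the aggregate.
    def coverage(per_domain_requests, key):
        matched = sum(
            1 for requests in per_domain_requests.values()
            if any(r.get(key) for r in requests)
        )
        return matched / len(per_domain_requests) if per_domain_requests else 0.0

    dataset = {
        "a.se": [{"tracker": True}, {"tracker": True}, {"tracker": True}],
        "b.se": [{"tracker": False}],
    }
    print(coverage(dataset, "tracker"))  # 0.5 -- half the domains have a tracker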
Most of the results presented in the thesis report are based on non-failed origin domains. Non-failed means that the initial HTTP request to the domain's front page returned a proper HTTP status code, even if it was not indicative of a complete success (C.2). Subsequent requests made while loading the front pages were grouped into unfiltered, only internal and only external requests (C.4). The analysis is therefore split into six versions (B.2.3), although not all of them are interesting for the end results.
Apart from these general aggregates, har-dulcify allows specific questions/queries to be executed against any of the steps from HAR data to the complete aggregates. This way very specific questions (A.5.6), including Google Tag Manager implications (A.4.3) and redirect chain statistics (C.8), can be answered using the input which fits best. There are also multiset scripts, collecting values from several or all 72 datasets at once. Their output is the basis for most of the detailed results' tables and graphs; see Appendix C.
See also analysis methodology details in Appendix A.5.

High level summary of datasets

Domain lists chosen for this thesis come in three major categories – top lists, curated lists and random selections from zone files (Section 3.2 and Table 3.1, Section A.1). While the top lists and curated lists are assumed to primarily contain sites with staff or enthusiasts to take care of them and make sure they are available and functioning, the domain lists randomly extracted from TLD zones might not. Results (Chapter 4, Appendix C) seem to fall into groups of non-random and randomly selected domains – and result discussions often group them as such.
Table 3.4 shows the top TLDs in the list of unique domains; while random TLD samples of course come from a single TLD, top lists are mixed. Looking at the complete dataset selection, the gTLD .org and the ccTLDs .ru and .de are about the same size. This list can be compared to the per-TLD (or technically public suffix) results in Table C.10, which shows the coverage of TLDs for external requests per dataset.
The curated .SE Health Status domain categories in Table 3.5 show that the number of domains per category is significantly lower than in the top lists and random domain lists. This puts a limit on the certainty with which conclusions can be drawn, but the categories still serve a purpose in that they often show different characteristics.
The most interesting category is the media category, as it is the most extreme example in terms of requests per domain and tracking (C.6). While the thesis is limited to the front pages of each domain (3.7), it would be interesting to see if users are still tracked after logging in to financial websites (C.4). It is also interesting to see how public authorities, government, county and municipality websites include trackers from foreign countries (C.11.1).

Limitations

With lists of domains as input, this thesis only looks at the front page of domains. While others have spidered entire websites from the root to find, for example, a specific group of external services [44], this thesis gives an overview of all resources. The front page is assumed to contain many, if not most, of the different types used on a domain. Analysis has mostly been performed on each predefined domain list as a whole, but dynamic – and perhaps iterative – re-grouping of domains based on results could improve accuracy, understanding and crystallize details. It would also be of interest to build a graph of domains and interconnected services, to visualize potential information sharing between them [1].

Chapter 4
Results

This chapter presents the main results, exemplified primarily by four datasets in their HTTP-www and HTTPS-www variations: Alexa's top 10k websites, Alexa's top .se websites, .se 100k random domains and Swedish municipalities. Some result highlights from Swedish top/curated datasets, random .se domains and findings from other domains in different categories are presented below. Supplementary results and additional details are provided in Appendix C.
Figure 4.3:
Small versions of Figure C.3, C.5 and C.8 showing a selection of HTTP-www and HTTPS-www variations

HTTP, HTTPS and redirects

Figure 4.1(a) shows the ratio of domains' HTTP response status code on the x axis (C.3). There were few 1xx, 4xx and 5xx responses; the figure focuses on 2xx and 3xx; no response is shown as null. In general, HTTPS usage is very low at less than 0.6% among random domains – see random .se (null) responses for the se.r.100k-sw dataset. Reach50 top sites are leading the way with a 53% response rate (C.2).
Sites which implement HTTPS sometimes take advantage of redirects to direct the user from an insecure to a secure connection, for example when the user didn't type https:// into the browser's address bar. Surprisingly, this is not very common – while many redirect to a preferred variant of their domain name, usually the www subdomain, only a few percent elect to redirect to a secure URL (C.8). The average number of redirects for domains with redirects is , but some domains have multiple, chained redirects; a few even to a mixture of HTTP and HTTPS URLs.
Figure 4.1(a) shows the ratio of domains responding with redirects (x axis' 3xx responses), and the effect of these redirects are detailed in Figure 4.1(b) as ratio of domains (x axis) which are strictly secure, have mixed HTTP/HTTPS redirects, are strictly insecure or which could not be determined because of recorded URL mismatches. It is surprising to see that redirects more often point to insecure than secure URLs – even if the origin request was made to a secure URL. The secure random .se domains (se.r.100k-sw) have a higher secure redirect ratio, but due to the very low response rate of 0.3% when using HTTPS – and even fewer which use redirects – it is hard to draw solid conclusions.
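The redirect labeling in Figure 4.1(b) can be read roughly as in the sketch below, where a domain's redirect chain is labeled by the protocols of its target URLs; this is a simplified reading, not the exact implementation:
    # Sketch: label a redirect chain by the protocols of its target URLs.
    def classify_redirects(redirect_urls):
        if not redirect_urls:
            return "no redirects"
        secure = [u.startswith("https://") for u in redirect_urls]
        if all(secure):
            return "strictly secure"
        if any(secure):
            return "mixed"
        return "strictly insecure"

    print(classify_redirects(["http://example.se/", "https://www.example.se/"]))  # mixed
    print(classify_redirects(["https://www.example.se/"]))  # strictly secure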
It seems that Swedish media shun secure connections – not one of them presents a fully secured domain, serving mixed content when responding to secure requests. At the same time, they use the highest count of both internal and external resources – with numbers several times higher than other domain lists – and more than 20% of requests go to known trackers.

Internal and external requests

With each request classified as either internal or external to the originating domain, it is easy to see how sites divide their resources (C.4). Less than 10% of top sites (for example alexa.top.10k-hw) use strictly internal resources, meaning up to 93% of top sites are composed using at least a portion of external resources. See the percentage of domains (x axis) using strictly internal, mixed and strictly external resources in Figure 4.2(a) for a selection of datasets, and Figure C.2 for all datasets. This means external resources – in this thesis seen as trackers – have actively been installed, be it as a commercial choice or for technical reasons (2.1). The difference between HTTP and HTTPS datasets is generally small, showing that users are as tracked on secure as on insecure sites.
Figure 4.3(a) shows the cumulative distribution function (CDF) of the ratio of internal resources used by each domain, with 0% and 99% internal resources marked. In particular, we show the ratio of domains (y axis) as a function of the ratio of internal resources seen by each domain (x axis). This maps to the bar graph in Figure 4.2(a); 0% is all external, over 99% is all internal – in between means a mix of internal and external resources.
Similar to the HTTPS adoption, we observe significant differences between randomly selected domains and the most popular (top ranked) domains. See how dataset HTTP/HTTPS variation lines follow each other for most datasets, again pointing towards “secure” HTTPS serving as many trackers as insecure HTTP. This means that a secure, encrypted connection protecting against eavesdropping doesn't automatically lead to privacy – something which users might be led to believe when it is called a “secure connection” as well as through the use of “security symbols” such as padlocks.
For the HTTP variation of random .se domains (se.r.100k-hw) 40% use strictly external resources; this seems to be connected with the fact that many domains are parked and load all their resources from an external domain which serves the domain name retailer's resources for all parked domains. The same domains seem to not have HTTPS enabled, as can be seen in 4.1(a), and the remaining HTTPS domains show the same internal resource ratio characteristics as top domains. There is a wide variety of parked page styles, as well as other front pages without actual content, but they have not yet been fully investigated and separately analyzed (7.5.4).

Tracker detection

While looking at the number of requests made to trackers can give a hint of how much tracking is installed on a website, it can be argued that one or two carefully composed requests can contain the same information as several requests. The difference is merely technical, as not all types of resources can be efficiently bundled and transferred in a single request, but require more than one – therefore it is more interesting to look at the number of organizations which resources are loaded from (C.11.1). Looking at categories can also be interesting – especially for the content category, which isn't blocked by default by Disconnect.me.

4.3.1 Categories

Figure 4.2(b) shows coverage of the five categories in Disconnect.me's blocking list (3.3, A.3.1), as well as the grey “any” bar showing the union of known tracker coverage (x axis). The special Disconnect category (A.3.7) is predominant in most datasets, showing coverage almost as large as the union of all categories. The second largest category is content – which is not blocked by default by Disconnect.me, as these requests have been deemed desirable even to privacy-aware users. This means that even when running Disconnect.me's software, users are still tracked on 60-70% of websites (C.11.3).

4.3.2 Organizations per domain

Figure 4.3(c) shows the CDF of the ratio of domains (y axis) with the number of organizations detected per domain (x axis) for a selection of datasets. The random .se domain HTTP variation (se.r.100k-hw) has a different characteristic than the others, with 40% of domains having no detected third party organizations; this can be due to domain registrars who serve parked domain pages not being listed as trackers. Around 55% of random Swedish HTTP domains (se.r.100k-hw) have 1-3 trackers, and about 5% have more than 3.
Once again it can be seen that the amount of tracking is the same in the HTTP-www variations as in their respective HTTPS-www variations – as the figure shows, the lines follow each other. Most websites also have at least one known tracker present; 53-72% of random domains have at least one tracker installed, while 88-98% of top websites and 78-100% of websites in the Swedish curated lists have trackers. In the larger Alexa global top dataset (alexa.top.10k-hw and alexa.top.10k-sw), 70% of sites allow more than one external organization, 10% allow 13 or more and 1% even allow more than 48 trackers – and that is looking only at the front page of the domain.
Out of the Swedish media domains, 50% share information with more than seven tracker organizations – and one of them is sharing information with 38 organizations. Half of the Swedish media websites load more than 6 known trackers; a single visit to the front page of each of the 27 investigated sites would leak information in over external requests (C.5) to at least 57 organizations (C.11.1). This means that any guesswork about what types of articles these individuals would have read in a printed newspaper is gone – and with it, probably, the guesswork about exactly what kind of personal opinions these individuals hold. While it is already known that commercial media outlets make their money through advertising, this level of tracking might be surprising – it seems to indicate that what news users read online is very well known.

4.3.3 Google's coverage is impressive

Figure 4.2(c) shows Google, Facebook and Twitter's coverage. It also shows the grey “any” bar showing the union of known tracker coverage and an x marking the coverage of the entire Disconnect category of Disconnect.me's blocking list, which they are part of (A.3.7). The organization with the most spread, by far, is Google. Google alone has higher coverage than the Disconnect category, meaning that a portion of websites use resources from Google domains in the content category (A.3.6).
The runners-up with broad domain class coverage are Facebook and Twitter, but in terms of domain coverage they are still far behind – see Section C.11.4. Google is very popular: all top domains and most Swedish curated datasets have a coverage above 80% – and many closer to 90%. Random domains have a lower reliance on Google at 47-62% – still about half of all domains. Apart from the .SE Health Status list of Swedish media domains, Facebook doesn't reach 40% in top or curated domains. Facebook coverage on random zone domains is 6-10%, which is also much lower than Google's numbers. Twitter has even lower coverage, at about half of that of Facebook on average. As can be seen in Figure 4.2(c), Google alone oftentimes has a coverage higher than the domains in the Disconnect category – it shows that Google's content domains are in use (A.3.3). While Disconnect's blocking list contains very many Google domains (A.3.2), the coverage is not explained by the number of domains they own, but by the popularity of their services (C.11.2). In fact, at around 90% of the total tracker coverage, Google's coverage approaches that of the union of all known trackers.

4.3.4 Tracker detection effectiveness

While all external resources are considered trackers, parts of this thesis have concentrated on using Disconnect.me's blocking list for tracker verification. But how effective is that list of known and recognized tracker domains across the datasets? See Section C.12 and Figure C.11 for the ratio of detected/undetected domains. While some of the domains which have not been matched by Disconnect are private/internal CDNs, the fact that less than 10% of external domains are blocked in top website HTTP datasets (such as alexa.top.10k-hw) is notable. The blocking results are also around 10% or lower for random domain HTTP datasets, but this seems to be connected to the number of domains in the dataset. Only 3% of the external primary domains in the .se 100k random domain HTTP dataset (se.r.100k-hw) were detected. Smaller datasets, including HTTPS datasets with few reachable websites, have a higher detection rate at 30% and more.

Chapter 5
Discussion

Two previously investigated pieces of data that this thesis' subject was based upon were .SE's statistics regarding the use of Google Analytics and the adoption rates for HTTPS on Swedish websites. Both have been thoroughly investigated and expanded upon. Below is a comparison with the .SE Health Status reports as well as a few other reports, in terms of results and methodology. After that follows a summary of the software developed and the open source contributions.

.SE Health Status comparison

5.1.1 Google Analytics

One of the reasons this thesis subject was chosen was the inclusion of a Google Analytics coverage analysis in previous reports. The reports show overall Google Analytics usage in the curated dataset of 44% in 2010, 58% in 2011 and 62% in 2012 [31, 32, 33].
Thesis data from filtered HTTP-www .SE Health Status domain lists shows that usage in the category with the least coverage (financial services) is 59%, while the rest are above 74% (C.11.2); the average is 81%. The highest coverage category (government owned corporations) is even above 94%. Since Google Analytics can now be used from the DoubleClick domain, and Google offers several other services, looking only at the Google Analytics domain makes little sense – instead it might make more sense to look at the organization Google as a whole. The coverage then jumps quite a bit, with most categories landing above 90% (C.11.4), which is also the .SE Health Status average.
This means that traffic data from at least 90% of web pages considered important to the Swedish general public end up in Google's hands. In a broader scope considering all known trackers, 95% of websites have them installed.
It is possible to extract the exact coverage for both Google Analytics and DoubleClick from the current dataset. Google Analytics already uses a domain of its own, and by writing a custom question differentiating DoubleClick's ad specific resource URLs from analytics specific resource URLs, analytics on doubleclick.net can be analyzed separately as well.
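A sketch of such a custom question is given below: requests are first separated by domain, and doubleclick.net requests are then split by a caller-supplied path predicate. The predicate is a placeholder, since the actual ad versus analytics URL patterns are not reproduced here:
    from urllib.parse import urlsplit

    # Sketch: separate Google Analytics requests from DoubleClick requests,
    # and split doubleclick.net traffic using a caller-supplied path predicate.
    # The predicate is a placeholder for the real ad vs. analytics patterns.
    def classify_google_request(url, is_analytics_path=lambda path: False):
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host.endswith("google-analytics.com"):
            return "google-analytics"
        if host.endswith("doubleclick.net"):
            return "doubleclick-analytics" if is_analytics_path(parts.path) else "doubleclick-ads"
        return "other"

    print(classify_google_request("https://www.google-analytics.com/collect"))
    print(classify_google_request("https://ad.doubleclick.net/some/path"))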

5.1.2 Reachability

The random zone domain lists (.se, .dk, .com, .net) have download failures for 22-28% of all domains for the HTTP without www and HTTP-www variations, where www has fewer failures (C.2). The HTTP result for .se is consistent with results from the .SE Health Status reports, according to Patrik Wallström, where they only download www variations. Curated .SE Health Status lists have fewer failures for both HTTP variations, generally below 10% for the http://www. variation – perhaps explained by the thesis software and network setup (A.4.6). Several prominent media sites with the same parent company respond as expected when accessed with a normal desktop browser – but not to automated requests, suggesting that they detect and block some types of traffic.

5.1.3 HTTPS usage

.SE have measured HTTPS coverage among curated health status domains since at least 2007 [28, 29, 31, 32, 33, 30]. The reports are a bit unclear about some numbers as measurement methodology and focus have shifted over the years, but the general results seem to line up with the results in this thesis. The quality of the HTTPS certificates is also considered, by looking at for example the expiry date and deeming them correct or not. Data comparable to this thesis wasn't published in the report from 2007, and HTTPS measurements were not performed in 2012. Also, measurements changed in 2009, so they might not be fully comparable with earlier reports.
Table 5.1 shows values extracted from reports 2008-2013 as well as numbers from this thesis. Thesis results show a 24% HTTPS response rate (C.2) while the report shows 28%. The 2013 report also finds that 24% of HTTPS sites redirect from HTTPS back to HTTP. In this thesis it is shown that 22% of .SE Health Status HTTPS domains have non-secure redirects (C.8) – meaning insecure or mixed security redirects – which is close to the report findings.

Cat and Mouse

Similar to the methodology used in this thesis, the Cat and mouse paper by Krishnamurthy and Wills [22] use the Firefox browser plugin AdBlock to detecting third-party resources – or in their case advertisements. The ad blocking list Filterset.G from 2005-10-10 contains 108 domains as well as 55 regular expressions. Matching was done after collecting requested URLs using a local proxy server, which means that HTTPS requests were lost.
As this thesis uses in-browser URL capturing, HTTPS requests have been captured – a definite improvement and increase in the reliability of result. On the other hand, not performing URL path matching (7.5.1) and instead only using domains (the way Disconnect does it) might lead to fewer detected trackers, as the paper shows that only 38% of their ad matches were domain matches, plus 10% which matched both domain and path rules. Their matching found 872 unique servers from 108 domain rules – the domains ( in the advertisement category) in the currently used Disconnect dataset might be enough, as subdomains are matched as well.
The paper also discusses serving content alongside advertisements as a way to avoid blocking of trackers (C.11.3), as well as obfuscating tracker URLs by changing domains or paths, perhaps by using CDNs (C.11.2). While this thesis has not confirmed that this is the case, it seems likely that some easily blockable tracking is being replaced with a less fragile business model where the tracker also adds value to the end user. There are two ways to look at this behavior – do service providers add tracking to an existing service, or do they build a service to support tracking? For Google, the most prevalent tracker organization, it might be a mixture of both. In the case of AddThis, a widespread (C.11.2) social sharing service (A.3.8), it seems the service is provided as a way to track users. The company is operated as a marketing firm selling audience information to online advertisers, targeting social influencers.
1
The paper looks at the top 100 English-language sites from 12 categories in Alexa's top list, plus another 100 sites from a political list. These sites come from domains. A total of pages were downloaded from that set, plus 457 pages from Alexa's top 500 in a secondary list. The overlap was 180 pages. See Table 5.2 for ad coverage per dataset and Table 5.3 for ad domain match percentage.
The paper's top ad server (determined by counting URLs) was doubleclick.net at 8.8%. While thesis data hasn't been determined in the same way, it seems that DoubleClick has strengthened its position since then: comparing the coverage of doubleclick.net (C.11.2) to other advertisement domains, organizations or even the whole advertisement category seems to indicate that DoubleClick alone has more coverage than the other advertisers combined for several datasets. Advertisement coverage was 58% for Alexa's top 500, while this thesis detects 54% advertisement coverage plus an additional 52% doubleclick.net coverage – the union has unfortunately not been calculated.

Follow the Money

Gill, Erramilli et al. [16] have explored some of the economic motivations behind tracking users across the web. Using HTTP traffic recorded from different networks, the paper looks at the presence of different aggregators (trackers) on sites which are publishers of content. To distinguish publishers from aggregators, the domains in each session are grouped by the autonomous system (AS) number of the network of the originating request's domain's IP address – requests to another AS number are counted as third parties/aggregators. In some cases looking at AS numbers leads to confusion, for example when multiple publishers are hosted on a CDN service; publishers and aggregators are then separated by domain names instead.
The largest dataset, a mobile network with users and sessions, presumably excludes HTTPS traffic, as it would be unavailable to the network operators. The paper's Figure 3 shows that aggregator coverage is higher for the very top publishers; over 70% for the top 10.
The coverage of top aggregators on top publishers is shown in Table 5.4, alongside numbers from this thesis. This thesis doesn't use recorded HTTP traffic in the same way, and downloads each domain only once per dataset, but looking at publisher coverage should allow a comparison.
Google again shows a significantly greater coverage at 80%, compared to Facebook's 23% in second place. It looks like the paper has grouped AdMob under Global Crossing, which was a large network provider connecting 70 countries before being acquired in 2011. AdMob was acquired by Google already in 2009, so it's unclear why it's listed separately; one reason might be that the dataset is mobile-centric and that AS number is still labeled Global Crossing. Thesis results show even higher numbers for Disconnect's matching of Google and Facebook – 4 and 14 percentage points higher, respectively – even when looking at Alexa's top sites. Microsoft doesn't seem to have as much coverage in the top sites, but the other organizations show about the same coverage.

Trackers which deliver content

In Disconnect's blocking list, there is a category called content (A.3.6). While all other categories are blocked by default, this one is not, as it represents external resources deemed desirable to Disconnect's users. So while they are known tracker domains, they are allowed to pass “by popular demand” – similar to CDNs (A.2, 5.2). This brings an advantage to companies that can deliver content, as they can use content usage data just as well as pure web bug/tracker usage data when analyzing patterns.
Google has several popular embeddable services in the content category (coverage from large datasets in C.11.2 in parentheses), including Google Maps
2
(2-5%), Google Translate
3
and last but not least YouTube
4
(3-7%). Lesser-known examples include Recaptcha
5
which is an embeddable service to block/disallow web crawlers/bots from accessing web page features. Those are visible examples, which users interact with; Google Fonts
6
which serves modern web fonts for easy embedding, is still visible but not branded. Google Hosted Libraries
7
is another very popular yet unbranded service. It hosts popular JavaScript libraries on Google's extensive CDN network instead of the local server, for site speed/performance gains, and is not visible as a component – but the libraries cannot be removed without affecting functionality. Especially the two latter, served from the googleapis.com (30-56%) domain, are prevalent in several of the datasets – and they are usually loaded on every single page of a website, and thus gain full insight into users' click paths and web history. The content tracking is passive and on the HTTP level, as opposed to scripts executing and collecting data, such as Google Analytics (30-80%) and DoubleClick (11-53%).
As Disconnect's blocking blacklist is shown to cover only a fraction of the external domains in use, a whitelist could be an alternative. Since Disconnect has already whitelisted the content category, it can be considered a preview of what shared whitelisting might look like. It is already the second largest category in terms of coverage, with over 50% of domains in most datasets having matches (C.11.3). While whitelisting might reduce the number of organizations able to track users, the problem seems to need more research (7.5.1).

Automated, scalable data collection and repeatable analysis

One of the prerequisites for the type of analysis performed in this thesis was that all collection should be automated, repeatable and able to handle tens of thousands of domains at a time. This goal has been achieved, and a specialized framework for analyzing web pages' HTTP requests has been built. While most of the code has been tailored to answer questions posed in this thesis, it is also built to be extendable, both in and between all data processing steps. More data can be included, additional datasets can be mixed in, and separate questions can be written to query data from any stage in the data preparation or analysis. Tools have been written to easily download and compare separate lists of domains, and by default data is kept in its original downloaded form so that historical analysis can be performed.
It might be hard to convince other researchers to use the code, as it might not fulfill all of their wishes at once, on top of any “not invented here” mentality. Fortunately, the code is easy to run, and with proper documentation other groups should be able to at least test simple theories regarding websites. Some of the lists of domains used as input are publicly available, and thus results can also be shared. This should encourage other groups, as looking at example data might spark interest.

Tracking and the media

Swedish media websites have been shown to have the highest number of trackers per site among the datasets – both in general and for the advertisement and analytics categories. Media and press try to be independent from and uninfluenced by their advertisers, despite advertising being a source of income.
Advertisement choices have historically been based on audience size and demographics, determined by readership questionnaires. Even the publishers themselves couldn't know which pages and articles readers actually read once the paper had left the press. Current tracker technology allows both publishers and advertisers to see exactly what users are reading online – down to time spent per paragraph if they wanted. This kind of intimate knowledge of which news items are popular, combined with the click traffic advertisers are seeing, could also give advertisers a financial incentive to control exactly what the media should write – as opposed to should not write. If bad news brings more advertisement clicks (or other measurable targets, such as product purchases, as opposed to newspaper readership) than good news does, advertisers can spend more money advertising on articles about bad news [45]. This will eventually affect the publisher's advertisement income. Advertisers could also separately investigate “relatedness” [27] but use it to value advertisements and provide their own expanded article categorization with fine-grained details for further refinement.

Privacy tool reliability

Can a privacy tool using a fixed blacklist of domains to block be trusted – or can it only be trusted to be 10% effective (C.12)? Regular-expression-based blocking, such as EasyList used by AdBlock, might be more effective, as it can block resources by URL path separately from the URL domain name (7.5.1) – but it's no cure-all. It does seem as if the blacklist model needs to be improved – perhaps by using whitelisting instead of blacklisting. The question then becomes an issue of weighing a game of cat and mouse (5.2) – if the whitelist is shared by many users – against convenience – if each user maintains their own whitelist. At the moment it seems convenience and blacklists are winning, at the cost of playing cat and mouse with third parties who end up being blocked.

Open source contributions

During the development of code for this thesis, other open source projects have been utilized. In good open source manner, those projects should be improved upon when possible.

5.8.1 The HAR specification

After working with further processing of the data, some improvements to the HAR specification can be suggested.
One such suggestion is to add an absolute/resolved version of response.redirectURL, as specification 1.2 seems to be unclear on whether it should be kept as-is from the HTTP Location header or the browser's redirectURL value – both of which may be relative. Subsequent HTTP requests are hard to refer to without relying either on exact request ordering (the executed redirect always coming as the very next entry) or on having the URL resolved (preferably by the browser) before writing it to the HAR data. Current efforts in netsniff.js (B.2.2) to resolve relative URLs using a separate JavaScript library have proven inexact when it comes to matching against the browser's executed URL, differing for example in whether trailing slashes are kept for domain root requests or not. What would be even better is a way to refer to the reason for the HTTP request, be it an HTML tag, a script call or an HTTP redirect – but that could be highly implementation-dependent per browser.
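As a rough sketch of the kind of resolution that would be useful to have in the HAR data itself, the following jq filter handles only the common cases – already absolute, protocol-relative and root-relative values – and passes other relative forms through unresolved:

  # emit one resolved redirect URL per entry with a redirect
  .log.entries[]
  | .request.url as $base
  | .response.redirectURL
  | select(. != null and . != "")
  | if startswith("http://") or startswith("https://") then .
    elif startswith("//") then ($base | split("/")[0]) + .
    elif startswith("/") then ($base | split("/")[0:3] | join("/")) + .
    else .
    end

A proper implementation would also need to handle path-relative values, query strings and fragments, which is exactly why having the browser resolve the URL would be preferable.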

5.8.2 phantomjs

While netsniff.js (B.2.2) from the phantomjs example library has been improved in several ways, patches have not yet been submitted. Since it is only an example from their side, a more developed version might no longer serve the same purpose – educating new users on the possibilities of phantomjs. An attempt to break the code down and separate pure bug fixes from other improvements might help. The version written for this thesis is released under the same license as the original, so reuse should not be a problem for those interested.

5.8.3 jq

Using jq as the main program for data transformation and aggregation has given me a fair amount of knowledge of real-world usage of the jq domain-specific language (DSL). Bugs and inconsistencies have been reported, and input has been given regarding, for example, code sharing through a package management system and (semantic) versioning. Some of the reusable jq code and helper scripts written for the thesis have been packaged for easy reuse, and more is on the way.

5.8.4 Disconnect

Disconnect relies heavily on their blocking list (A.3), as it is the basis both for the service of blocking external resources and for presenting statistics to the user. While preparing (B.2.3) and analyzing (B.2.3) the blocking list, a number of errors and inconsistencies were found. Unfortunately, the maintainers do not seem very active in the project, and even trivial data encoding errors had not been patched over a month after submission. According to Disconnect's Eason Goodale in an email conversation 2014-08-13, the team has been concentrating on a second version of Disconnect as well as other projects. While patches can be submitted through Disconnect's Github project pages, Goodale's reply seems to indicate that they will not be accepted in a timely fashion and may be irrelevant by the time the next generation is released to the public.

5.8.5 Public Suffix

A tool that parses the public suffix list from its original format to a JSON lookup object format has been written. Using that tool, an inconsistency in the data was detected – the TLD .engineering being included twice instead of .engineer and .engineering separately. This had already been detected and reported by others, but the tool can be used to detect future inconsistencies in an automated manner.

Platform for Privacy Preferences (P3P) Project
8
HTTP header analysis

P3P is a way for websites to declare their policies and intentions for data collected from web users. It is declared in a machine-readable format, as an XML file and in a compact encoding as an HTTP header. W3C's work started in 1997 and P3P 1.0 became a W3C recommendation in 2002. It never gained enough momentum and the work with P3P 1.1 was suspended in 2006. P3P is still implemented by many websites, even though it may not follow the originally intended usage.
In conversations with Dwight Hunter, privacy policy researcher, he mentioned that P3P policies are seen as a good technical solution to policy problems in the research he had read. Thesis data shows that this is not always true; there are policy-wise useless P3P headers being sent from some webpages, most probably to bypass Internet Explorer's strict cookie rules (in some versions) for third-party sites without a P3P HTTP header. This was highlighted by Microsoft in 2012, pointing at Google's P3P use.
By default, IE blocks third-party cookies unless the site presents a P3P Compact Policy Statement indicating how the site will use the cookie and that the site’s use does not include tracking the user. Google’s P3P policy causes Internet Explorer to accept Google’s cookies even though the policy does not state Google’s intent.
Looking at collected HAR data there are many examples of P3P headers. In the dataset “se.2014-07-10.random.100000-http” from 2014-09-01 with about recorded requests, about present a P3P policy. There are about unique values, including example values shown in Table 5.5.
“Potato” comes from an example in a discussion regarding Internet Explorer and cookie blocking.
10
Other examples include CP="This is not a P3P policy! It is used to bypass IEs problematic handling of cookies", CP="This is not a P3P policy. Work on P3P has been suspended since 2006: ", CP="This is not a P3P policy. P3P is outdated.", CP=\"Thanks IE8\ (which is a malformed value), CP="No P3P policy because it has been deprecated".
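The header values can be extracted and counted with a query along the following lines – a simplified sketch, assuming HAR data as input and a jq version with ascii_downcase and group_by:

  # collect P3P response header values and count each distinct policy string
  [ .log.entries[].response.headers[]
    | select((.name | ascii_downcase) == "p3p")
    | .value ]
  | group_by(.)
  | map({ value: .[0], count: length })
  | sort_by(-.count)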
This is but one example of where quantitative analysis of real-world web pages shows differences between technical, intended and perceived usage. While P3P may be an outdated example that has already been researched [25], it shows how automated, generic tooling can help researchers a great deal in their understanding of usage in the wild.

Chapter 6
Related work

Privacy research, tracker research and internet measurement can be challenging, as has been shown by others. As .SE has in-house experts, their knowledge was very valuable at an early stage – several pitfalls may have been avoided. This chapter discusses the results in comparison to others' experience, discusses methodology limitations and puts the work in context.

.SE Health Status

While .SE themselves have written reports analyzing the technical state of services connected to .se domains, .SE Health Status [28, 29, 31, 32, 33, 30], the focus has not been on exploring the web services connected to these domains. That research is focused on statistics about usage and security in DNS, IP, web and e-mail; the target audience is IT strategists, executives and directors. Data for the reports is analyzed and summarized by Anne-Marie Eklund Löwinder, a world-renowned DNS and security expert
1
, while the technical aspects and tools are under the supervision of Patrik Wallström, a well known DNSSEC expert and free and open source software advocate
2
.
The thesis subject was selected to be in line with the .SE reports, but focusing on web issues; code may be reused and results may be included in future reports. The .SE Health Status reports do offer some groundwork in terms of selecting and grouping Swedish domains, HTTPS usage and Google Analytics coverage [31, 32, 33], which has been discussed in Section 5.1. The report is based on data collected from around .se domain names deemed of importance to Swedish society as a whole, as well as a random selection of 1% of the registered .se domain names.
During meetings with Wallström and Eklund Löwinder, results for the .se zone and the curated lists were reported to be reasonable where comparable, such as for reachability, HTTPS adoption and Google Analytics coverage.
Thesis input (domain list) preparation was automated based on .SE's internally used data formats. As the thesis is more detailed in analyzing web content than previous reports, there is not yet enough historic data to show change over time.

Characterizing Organizational use of Web-based Services

Gill and Arlitt et al. [15] analyze several HTTP datasets collected in HTTP proxies at different times and in different organizations. Two datasets are from 2008; one enterprise and one university. They are contrasted with a dataset collected at a residential cable modem ISP in 1997.
The paper introduces methods to identify and categorize the collected traffic into service providers (organizations) and service instances (a service with a specific use, possibly accessed through different domain names) by looking at over unique HTTP Host header values in total. First, domain components are grouped in different ways, such as by brand (the first part after the public suffix); then top results are manually consolidated into service providers. Further consolidation is done by looking at the domain name system (DNS) and organization identifiers from Regional Internet Registry (RIR) entries. Services are then grouped into service classes; some automated grouping is also possible by looking at HTML metadata. This thesis chose to use the public suffix list for automatic domain part identification, down to the brand (primary domain) level (A.2). In combination with Disconnect's blocking list, organizations and a simple categorization are obtained (A.3). It is a less generic way and possibly not fully effective on historical data, but accurate for the amount of work put in, as manual grouping and classification work is avoided – and possibly improved – by using crowdsourcing.
The single domain which is easiest to compare is doubleclick.net; listed as a separate brand in the paper, it is shown to have 19% of transactions in the 2008 datasets. Thesis numbers (11-53%) are higher for most datasets (C.11.2), but the paper's datasets also contain repeated and continuous use of services – such as Facebook and client-side applications – which may lower the relative numbers for other services.
While a comparison between paper and thesis numbers would be possible for HTTP methods, HTTP status codes and content types, it would require additional, slightly modified analysis of the existing requests.

Challenges in Measuring Online Advertising Systems

The paper Challenges in Measuring Online Advertising Systems [17] shows that identifying ads, and how data collected from trackers affects ads, involves several challenges. This thesis does not look at which ads are shown to a user, but rather at where ads are served from. Potentially relevant to this thesis would be, for example, DNS load-balancing by ad networks, cookies differing between browser instances and local proxies affecting the HTTP request. This is how they were considered and dealt with:

.SE Domain Check

In order to facilitate repeatable and improvable analysis for this thesis, tools have been developed to perform the collection and aggregation steps automatically. .SE already has a set of tools that run monthly; integration and interoperability would smooth the process and enable continuous usage. There is also a public .SE tool that allows website owners to test their own sites, Domänkollen
3
, which might benefit from some of the code developed within the scope of this thesis.

Cookie syncing

A recent large-scale study by Acar et al. [1] included a cookie syncing privacy analysis. It was shown that unique user identifiers were shared between different third parties. IDs can be shared in different ways. If both third parties exist on the same page, they can be shared through scripts or by looking for IDs in the location URL. They can also be shared by one third party sending requests to a second third party (known as a fourth party), either by leaking the location URL as an HTTP referrer or by embedding it in the request URL. In crawls of Alexa's top domains, one third-party script in particular sent requests with synced IDs to 25 domains; the IDs were eventually shared with 43 domains. They also showed that the reconstruction rate of a user's browsing history rose from 1.4% to 11% when backend/server-to-server overlaps were modeled.

HTTP Archive
4

Initially developed in October 2010 as an effort to measure web page speed on the internet, the HTTP Archive collects HAR data and runs analyses on it. Unfortunately, their official data dumps are in a custom format, not the original HAR files, but there are some direct comparisons to be made with their Interesting stats
5
aggregate data.
It seems there are unofficial, not yet fully developed, exports of HAR data. Unfortunately they weren't made available until late in the thesis process, and could not be used for software validation and comparison.

Chapter 7
Conclusions and future work

In this chapter we first summarize our main conclusions and list unanswered questions, and then we discuss promising directions for future work. There are many potential improvements which could help other researchers, as well as refine the analysis of data already collected for this thesis.

Conclusions

The use of external resources and known trackers is very high. While it has been a trend to outsource resource hosting and to use third-party services, it was previously unknown to what extent. It has now been shown that most websites use external resources in some form – almost 80% of the responding domains for the most common variation, HTTP-www (C.5). This broad non-governmental tracking should be as much of a concern for privacy-minded individuals as government-controlled surveillance is.
This concern should be even higher for HTTPS-enabled websites. Such sites have made an active choice to install encryption to avoid passive surveillance and stave off potential attacks – yet 94% of HTTPS-www variation domains use external resources (C.5).
It seems using a blacklist to stop trackers is the wrong way to go about it. Even with the crowdsourced list used by popular privacy tool Disconnect, the blocking list only detects 10% of external primary domains as trackers for top website datasets (4.3.4). Looking at the sheer number of external domains in use, it is easy to understand why blocking high-profile targets seems like a good option – but if 90% of external domains aren't listed even as known, desirable content, the blacklisting effort seems futile. Further research could use other blacklists and compare effectiveness (7.5.1).

Open questions

There are further questions in line with current results which could potentially be answered with additional research.

Improving information sharing

7.3.1 Creating an information website

Despite thesis code being open source, much of the thesis data is hard to retrieve, analyze and process for individuals. A separate tool performing the work for anyone should be created. Apart from presenting data already collected as a part of the thesis, it could accept user input to analyze individual domains. With several domains as input, any overlap can be detected and presented to the user as an information sharing graph. One of the inspirations for this thesis was Collusion
1
, which is a tool to dynamically display, right in the browser, from which external domains a page retrieves resources. A version of the same tool could be built where, instead of letting the user's browser retrieve sites, the server would do it. This way a non-technical user does not have to “risk” anything by visiting the web pages, and their relationships could still be displayed. Collecting data server-side also allows for cached lookups and a grander scope, where further relationships apart from the first-hand ones could be suggested: “If you frequently visit these sites, you might also be visiting these sites – click to display their relationships as well.”
Over time and with user input, the dataset collected on the server would increase, and a historical graph relating to both results shown in this thesis and the relationships between sites can be created. This is similar to what both the HTTP Archive (6.6) is doing on a large scale but with slightly different focus, and what the .SE Health Status is doing but on a less continuous basis and with a shifting focus.

7.3.2 Package publicly available data

Datasets based on publicly available data can be packaged for other researchers to analyze. While fresh data would be better, a larger dataset can take time to download on a slow connection or computer, and all software may not be available to the researcher. Packaged data lowers the barrier to entry for others who might be interested in the same kind of research, which might in turn lead to the software used in this thesis being improved.

7.3.3 Code documentation

With some 75 scripts written and released as open source, the need for documentation has gradually increased. The reason for not writing proper documentation – not having direct collaborators writing code – now makes it harder for future collaborators or users to get started. While code documentation has not been an explicit part of the thesis plan, it can be seen as an important step for future usage. The code is not magic in any way, but if understanding the functionality of a file requires reading over a hundred lines of code instead of two or three lines of comments, it means a rather steep learning curve for something that is supposed to be simple.

Improving domain retrieval

7.4.1 Automated testing

So far all testing of har-heedless and phantomjs has been done manually. It has proven to be a working setup, as the thesis results are based on these tools, but the features are to be considered fragile as there are no regression tests. Automated tests of the different levels (shell scripts, netsniff.js (B.2.2), screenshots, error handling) might help achieve stability in case of, for example, future improvements to phantomjs. Tests might include setting up a web server with test pages serving different kinds of content, as well as different kinds of errors. During mass downloading of domains, phantomjs has been observed outputting error messages, such as failed JPEG image decoding and unspecified crashes. The extent of these errors has so far not been examined, as they have ended up being clumped together with external errors such as network or remote server failures.

7.4.2 Investigating failed domains

There are many reasons domain retrieval could fail, but for top or curated domain lists the chances of the site being down are considerably lower than for randomly selected domains. Each website has been requested up to three times, in order to avoid intermittent problems (A.4.6). Despite this, certain sites do not respond to requests from the automated software. There are several ways for a remote system to detect requests from automation software, with the simplest one being looking at the HTTP User-Agent browser make/model identifier string.
Automated downloading of webpages, especially downloading several in short succession, can be seen by site and service owners as disruptive, as it uses system resources and skews statistical data. Traversing different pages on a single website can also be detected by looking at, for example, navigational patterns [42, 26]. By only downloading the domain root page and associated resources, this tool might not fall into that category of detection.
As some sites respond to desktop browser requests, but not to har-heedless' requests, it is believed they have implemented certain “protection” against this kind of software. Out of respect for their wish not to serve automated requests, har-heedless' browser has not been modified – for example by using a different User-Agent string – to try to circumvent these measures.

7.4.3 Browser plugins

If possible, a set of common browser plugins could be installed into phantomjs. The first that comes to mind is Adobe Flash, which is sometimes used to display dynamic ads. Flash also has the ability to request resources from other domains, so not rendering Flash content might affect results. An additional problem might be that Flash has its own cookie system, which uses storage external to the browser. This brings a new set of potential problems, as Flash cookies are a big part of evercookies and cookie respawning [1]. This means that a headless browser without persistent storage might still end up having identifier cookies set in Flash storage, thus being easily and uniquely identified on subsequent visits. While this might not affect this thesis much, as external plugins have not been installed, it might affect other kinds of research being conducted based on the same tools.

7.4.4 System language

Tests were run on English language systems, without making any customizations to phantomjs' settings or HTTP Accept-Language headers. While sites have been downloaded from around the world, localized domains might behave differently depending on user language. Google has a recommendation saying that they will prioritize TLDs specific to a region with a certain language (such as .se and Sweden) for users sending Accept-Language prioritizing Swedish
2
. This stems from them seeing that localized results have a higher usage rate.

7.4.5 System fonts

Some of the difference between site screenshots and manually browsing to a site lies in the fonts displayed. Most of the domains have been downloaded on a headless server, where fonts have not mattered to the system owner. Installing additional fonts commonly available on average user systems might reduce the perceived difference.

7.4.6 Do Not Track

While the HTTP header Do Not Track (DNT) has not been set, it would have been interesting to look at the difference in response from remote services. Detecting usage of the server-response header Tracking Status Value (TSV) would be a good start
3
. As cookie headers can be analyzed, the difference could have been detected both per origin domain and per connected service. See also the P3P analysis (5.9) for a related header.

7.4.7 Using more scalable software

While invoking phantomjs on a single machine is often enough (A.4.1, A.4.5), that level of computing power is not always enough to download large domain lists, or to continuously monitor them, in a timely manner. While downloaded HAR file output is easy enough to combine from different machines, it might be worth investigating already parallelized software, such as spookystuff
4
. Built to spider and extract data from websites using multiple – even thousands of – machines on Amazon's cloud computing platform, it could enable analysis of, for example, entire TLD zones.

7.4.8 Domain lists

There are other domain lists that might have been suitable in this thesis. One curated top list is the KIA index, a list of top sites in Sweden aggregating statistics from different curated sites' underlying analytics tools.
5
Other TLD zone files, both from other countries and more generic ones, could be used as well. For example the new generic TLDs could be compared to older ones.

Improved data analysis and accuracy

Improvements to the data transformation and analysis steps.

7.5.1 Refined ad and privacy blocking lists

There are several lists of known ads and privacy-invading trackers in use in blocking software other than Disconnect (A.3). One of the most popular ones is EasyList
6
, which exists in several varieties – privacy blocking, country specific, social media content blocking and others. They were considered, but in the end not incorporated because of the filter list format. It is a mixture of HTML element and URL blocking, and it lacks the connection between blocking rules and the corresponding organization
7
. There is also Ghostery
8
, which uses a proprietary list that also contains organizations, but it has not been used because of licensing issues
9
.
In another approach, future research could use whitelisting to try to determine challenges in detecting and recording desirable and functional external resources. An example of whitelisting already exists in Disconnect's content category, which might be a good start (5.4).
On a technical level, some blocking rule formats have also posed a problem in terms of implementation into the current data processing framework, har-dulcify. It relies heavily on jq (B.1.3), which does not yet have a public release that implements regular expression support, a major part of some blocking lists. The idea is that ads and related resources are filtered by matching requests' complete URLs against the blocking rules, which are a mixture of rules both more coarse and more fine-grained than Disconnect's domain-based rules. One example is /ads/, a folder name suggesting that all URLs containing this particular path substring are serving advertisements. Using a single path substring to block advertisements served from any domain is more coarse than pinpointing a single domain, but it is also more specific, as it would not block legitimate content from another subfolder on the same domain. This way general ad serving systems can be blocked, while domains that serve both ads and content are still allowed to serve the content without interfering with the blocking of ads.
At the time of writing, jq is released as version 1.4. Support for regular expressions is planned for version 1.5.
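Until then, a coarse regex-free approximation is possible with plain substring tests; a minimal sketch, applying the /ads/ example above to HAR input:

  # count requests whose full URL contains the path substring "/ads/"
  [ .log.entries[].request.url | select(contains("/ads/")) ] | length

This only covers the simplest rules, and would not cover the HTML element blocking rules that are also part of the format.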

7.5.2 Automated testing

Data transformations have been written in a semi-structured manner, with separate files for most separate tasks, often executed in serial stages. Each task accepts a certain kind of data as input for the transformation to work correctly – but as both input and output from separate stages look very similar, it is hard to tell which kind of data each stage accepts and what the expected output is – and whether a change in one stage will affect later stages. Writing automated tests for each stage would have helped both when adding functionality and when refactoring the structure. At times, there have been rather time-consuming problems with unexpected or illegal input from real-world sites – extracting that kind of input to create a test suite would have sped up fixes and raised confidence that the input would be handled appropriately and the output would still be correct. So far that has not been done, and much of the opportunity to gain from tests has been lost as work has progressed past each problem.
One solution to validating both input and output would have been to create JSON schemas
10
for each transformation. This kind of verification can easily be automated, and it will help any future changes.

7.5.3 Code reuse

Much of the code written in shell scripts, both Bash and jq code, is duplicated between files. While common functionality suitable for pipelining has been broken out, shared functions have not. Bash provides the source command for sharing functionality. Code sharing in jq through the use of modules and packages is still under development, but there is a way to load a single external command file. This file can be precomposed externally by concatenating files with function definitions first and the actual usage of those functions second. The improvement was postponed due to the relatively little reuse in early scripting and the bright outlook for modules and packages support. As the number of scripts grew, code sharing/composition possibilities grew as well – and with them possible improvements in development speed, consistency and correctness. At this stage, software stability is more important for the final dataset download and analysis, so code refactoring has been postponed. Foreseeing a greater reuse of JSON and jq tools, a separate open source project has been started – jq-hopkok
11
– where some scripts have been collected. Many functions and utilities local to har-dulcify are project-agnostic, and thus suitable candidates to move to jq-hopkok for easier composition.
At the time of writing, jq is released as version 1.4. Support for modules is planned for a version after 1.5. Packages/package managers are external to the jq core, and do not follow the same planning.

7.5.4 Ignore domains without content

Many domains do not contain any actual content. Examples include web server placeholder pages (“Installation succeeded”), domain listings (“Index of /”), parked domains (“This domain was purchased from ...”) and advertisement domains (such as Google Adsense for Domains
12
, now retired, or similar Adsense usage). There is a Sweden-centric list of site titles for recognized non-content pages available internally at .SE, but it has not been incorporated.

7.5.5 Implement public suffix rules, use non-ICANN suffixes

As datasets have been analyzed, public suffix rules have proven to work in general, with co.uk and similar second level domains being properly grouped and primary domains extracted. There are still traces of the wildcard rules (A.2) in the data though, which means that while numbers are low, there are domains for which the public suffix rules have not been properly applied.
Another potential improvement would be to implement the non-ICANN, private suffixes. This would, for example, lower the aggregate numbers for cloudfront.net and amazonaws.com as primary domains, focusing on the fact that their subdomains belong to different organizations. Disconnect's dataset, which lists cloudfront.net as a single tracker entity, would still present Amazon as the single tracking organization behind the domain, which might be a bit misleading. It is true that Amazon can read traffic data in the same way other web hosting and cloud service providers can read their customers' traffic data, but the customers' common domain name suffix has little to do with it.

7.5.6 Separating analysis steps

As some of the analysis relies on aggregate numbers, such as domain and request counts, it expects the entire dataset to be available at the start of the analysis. Saving these numbers in an intermediate step would allow further dataset refinement without having to perform the same filtering multiple times, and would thus ease second-level filtering. Custom data questions (A.5.6) are one example, as they need to carry the number of (non-failed) domains throughout the calculations, even though only a small subset of the data is interesting, in order to present them as part of the output.
Another example is the current analysis split into unfiltered and non-failed domains, which then is further split into unfiltered, internal and external resources (B.2.3). If the first step had been made as a separate filtering step instead of an integrated part, further analysis would have been clearer as well as easier to modularize.
Saving the intermediate filtering results would simplify selecting for example non-failed domains which only use secure resources (C.7), to look at their usage of internal/external/tracker usage compared to insecure domains (C.4). While redirects have been analyzed to some extent (C.8), another interesting idea would be to select domains with redirects (C.3) and perhaps consume the redirects – especially from secure domain variations looking at secure redirects – before performing further analysis.
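A minimal sketch of such a separate filtering step, assuming a hypothetical intermediate format where each domain is an object with a failed flag and an array of requests carrying a secure boolean:

  # keep only non-failed domains where every recorded request used HTTPS,
  # saving the result as an intermediate dataset for later analysis steps
  map(select(.failed | not))
  | map(select([ .requests[] | select(.secure | not) ] | length == 0))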

7.5.7 Dynamic grouping, domain selection

The har-heedless software was built to download domains based on a simple list and put them in folders per list and list variation (A.4.4); har-dulcify then uses the most recent HAR data to perform the aggregate analysis for the entire dataset. The overlap between different domain lists has been downloaded multiple times, and each round of downloads has started from zero to be sure that the domain list results represent a specific point in time. It would be beneficial to download each unique domain once and then assign domains to a specific analysis group, currently represented by domain lists, after they have been downloaded. This would allow a more dynamic grouping, and possibly re-arranging, of domains to enable more interesting second-level analysis.

7.5.8 Incremental adding and updating of domains

Another improvement would be to ensure that all data mapping steps are built in such a way that a single domain can be excluded or included from the results. This would enable single domains to be updated, perhaps as part of continuous analysis or from user input (7.3.1), without having to recalculate all steps for the entire dataset results. While most incremental updates are a matter of easy addition and subtraction, some late analysis steps introduce coverage calculations and other arithmetical divisions, which may cause some data/precision loss if reversed. If these data reductions can be deferred to a separate mapping step, computing time might be acceptable.
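A sketch of the idea, using hypothetical raw counts that remain reversible until the coverage ratio is computed in a final, separate mapping step:

  # hypothetical counts; addition/subtraction stays exact,
  # the division is deferred to the last step
  { domains: 1000, withtracker: 612 }
  | .domains += 1 | .withtracker += 1
  | .coverage = (.withtracker / .domains)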

References

1Acar, Gunes and Eubank, Christian and Englehardt, Steven and Juarez, Marc and Narayanan, Arvind and Diaz, Claudia, "The Web Never Forgets: Persistent Tracking Mechanisms in the Wild", in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA: ACM, 2014), pp. 674--689.
2Ahlgren, Marianne and Davidsson, Pamela, "Svenskarna och politiken på internet -- delaktighet, påverkan och övervakning", .SE The Internet Infrastructure Foundation (2014).
3Aïmeur, Esma and Lafond, Manuel, "The Scourge of Internet Personal Data Collection", in Proceedings of the International Conference on Availability, Reliability and Security (Washington, DC, USA: IEEE Computer Society, 2013), pp. 821--828.
4T. Berners-Lee and R. Fielding and H. Frystyk, "Hypertext Transfer Protocol -- HTTP/1.0", IETF (1996).
5T. Berners-Lee and R. Fielding and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", IETF (2005).
6T. Bray, "The JavaScript Object Notation (JSON) Data Interchange Format", IETF (2014).
7Bylund, Markus, Personlig integritet på nätet 1 (FORES, 2013).
8D. Cooper and S. Santesson and S. Farrell and S. Boeyen and R. Housley and W. Polk, "Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile", IETF (2008).
9Dou, Zhicheng and Song, Ruihua and Wen, Ji-Rong, "A Large-scale Evaluation and Analysis of Personalized Search Strategies", in Proceedings of International World Wide Web Conference (New York, NY, USA: ACM, 2007), pp. 581--590.
10Eckersley, Peter, "How Unique Is Your Web Browser?", Electronic Frontier Foundation (2009).
11Feldmann, Anja and Kammenhuber, Nils and Maennel, Olaf and Maggs, Bruce and De Prisco, Roberto and Sundaram, Ravi, "A Methodology for Estimating Interdomain Web Traffic Demand", in Proceedings of ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA: ACM, 2004), pp. 322--335.
12R. Fielding and J. Reschke, "Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests", IETF (2014).
13R. Fielding and J. Reschke, "Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content", IETF (2014).
14Findahl, Olle and Davidsson, Pamela, "Svenskarna och internet 2014", .SE The Internet Infrastructure Foundation (2014).
15Gill, Phillipa and Arlitt, Martin and Carlsson, Niklas and Mahanti, Anirban and Williamson, Carey, "Characterizing Organizational Use of Web-Based Services: Methodology, Challenges, Observations, and Insights", ACM Trans. Web 5, 4 (2011), pp. 19:1--19:23.
16Gill, Phillipa and Erramilli, Vijay and Chaintreau, Augustin and Krishnamurthy, Balachander and Papagiannaki, Konstantina and..., "Best Paper -- Follow the Money: Understanding Economics of Online Aggregation and Advertising", in Proceedings of ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA: ACM, 2013), pp. 141--148.
17Guha, Saikat and Cheng, Bin and Francis, Paul, "Challenges in Measuring Online Advertising Systems", in Proceedings of ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA: ACM, 2010), pp. 81--87.
18Hannak, Aniko and Soeller, Gary and Lazer, David and Mislove, Alan and Wilson, Christo, "Measuring Price Discrimination and Steering on E-commerce Web Sites", in Proceedings of ACM SIGCOMM Conference on Internet Measurement (New York, NY, USA: ACM, 2014), pp. 305--318.
19Kontaxis, Georgios and Polychronakis, Michalis and Keromytis, Angelos D. and Markatos, Evangelos P., "Privacy-preserving Social Plugins", in Proceedings of USENIX Conference on Security Symposium (Berkeley, CA, USA: USENIX Association, 2012), pp. 30--30.
20Krishnamurthy, Balachander, "Privacy and Online Social Networks: Can Colorless Green Ideas Sleep Furiously?", IEEE Security and Privacy 11, 3 (2013), pp. 14--20.
21Krishnamurthy, Balachander and Wills, Craig E., "Analyzing Factors That Influence End-to-end Web Performance", in Proceedings of International World Wide Web Conference (Amsterdam, The Netherlands, The Netherlands: North-Holland Publishing Co., 2000), pp. 17--32.
22Krishnamurthy, Balachander and Wills, Craig E., "Cat and Mouse: Content Delivery Tradeoffs in Web Access", in Proceedings of International World Wide Web Conference (New York, NY, USA: ACM, 2006), pp. 337--346.
23Krishnamurthy, Balachander and Wills, Craig E., "Characterizing Privacy in Online Social Networks", in Proceedings of Workshop on Online Social Networks (New York, NY, USA: ACM, 2008), pp. 37--42.
24Kumar, Saurabh and Kulkarni, Mayank, "Graph Based Techniques for User Personalization of News Streams", in Proceedings of ACM India Computing Convention (New York, NY, USA: ACM, 2013), pp. 12:1--12:7.
25Leon, Pedro Giovanni and Cranor, Lorrie Faith and McDonald, Aleecia M. and McGuire, Robert, "Token Attempt: The Misrepresentation of Website Privacy Policies Through the Misuse of P3P Compact Policy Tokens", in Proceedings of ACM Workshop on Privacy in the Electronic Society (New York, NY, USA: ACM, 2010), pp. 93--104.
26Lourenço, Anália G. and Belo, Orlando O., "Catching Web Crawlers in the Act", in Proceedings of International Conference on Web Engineering (New York, NY, USA: ACM, 2006), pp. 265--272.
27Lv, Yuanhua and Moon, Taesup and Kolari, Pranam and Zheng, Zhaohui and Wang, Xuanhui and Chang, Yi, "Learning to Model Relatedness for News Recommendation", in Proceedings of International World Wide Web Conference (New York, NY, USA: ACM, 2011), pp. 57--66.
28Löwinder, Anne-Marie Eklund, "Health Status 2008", .SE The Internet Infrastructure Foundation (2008).
29Löwinder, Anne-Marie Eklund, "Health Status 2009", .SE The Internet Infrastructure Foundation (2009).
30Löwinder, Anne-Marie Eklund, "Health Status 2013", .SE The Internet Infrastructure Foundation (2013).
31Löwinder, Anne-Marie Eklund and Wallström, Patrik, "Health Status 2010", .SE The Internet Infrastructure Foundation (2010).
32Löwinder, Anne-Marie Eklund and Wallström, Patrik, "Health Status 2011", .SE The Internet Infrastructure Foundation (2011).
33Löwinder, Anne-Marie Eklund and Wallström, Patrik, "Health Status 2012", .SE The Internet Infrastructure Foundation (2012).
34M. Ayenson and D. J. Wambach and A. Soltani and N. Good and C. J. Hoofnagle, "Flash cookies and privacy II: Now with HTML5 and ETag respawning", in Social Science Research Network Working Paper Series (2011).
35Malandrino, Delfina and Petta, Andrea and Scarano, Vittorio and Serra, Luigi and Spinelli, Raffaele and Krishnamurthy, Balach..., "Privacy Awareness About Information Leakage: Who Knows What About Me?", in Proceedings of ACM Workshop on Workshop on Privacy in the Electronic Society (New York, NY, USA: ACM, 2013), pp. 279--284.
36Mikians, Jakub and Gyarmati, László and Erramilli, Vijay and Laoutaris, Nikolaos, "Crowd-assisted Search for Price Discrimination in e-Commerce: First Results", in Proceedings of ACM Conference on Emerging Networking Experiments and Technologies (New York, NY, USA: ACM, 2013), pp. 1--6.
37Naylor, David and Finamore, Alessandro and Leontiadis, Ilias and Grunenberger, Yan and Mellia, Marco and Munafò, Maurizio..., "The Cost of the "S" in HTTPS", in Proceedings of ACM International Conference on Emerging Networking Experiments and Technologies (New York, NY, USA: ACM, 2014), pp. 133--140.
38Pariser, Eli, The filter bubble : what the Internet is hiding from you (New York: Penguin Press, 2011).
39Roosendaal, Arnold, "Facebook Tracks and Traces Everyone: Like This!", Tilburg Law School Legal Studies (November 30, 2010).
40Saez-Trumper, Diego and Liu, Yabing and Baeza-Yates, Ricardo and Krishnamurthy, Balachander and Mislove, Alan, "Beyond CPM and CPC: Determining the Value of Users on OSNs", in Proceedings of ACM Conference on Online Social Networks (New York, NY, USA: ACM, 2014), pp. 161--168.
41Smith, H. Jeff and Dinev, Tamara and Xu, Heng, "Information Privacy Research: An Interdisciplinary Review", in Markus, M. Lynne and Pavlou, Paul, ed., MIS Quarterly vol. 35, no. 4 (, 2011), pp. 989-1015.
42Tan, Pang-Ning and Kumar, Vipin, "Discovery of Web Robot Sessions Based on Their Navigational Patterns", Data Min. Knowl. Discov. 6, 1 (2002), pp. 9--35.
43Tange, O., "GNU Parallel - The Command-Line Power Tool", ;login: The USENIX Magazine 36, 1 (2011), pp. 42-47.
44Vapen, Anna and Carlsson, Niklas and Mahanti, Anirban and Shahmehri, Nahid, "Third-Party Identity Management Usage on the Web", in Faloutsos, Michalis and Kuzmanovic, Aleksandar, ed., Passive and Active Measurement vol. 8362, (Springer International Publishing, 2014), pp. 151-162.
45Weber, Ingmar and Garimella, Venkata Rama Kiran and Borra, Erik, "Mining Web Query Logs to Analyze Political Issues", in Proceedings of ACM Web Science Conference (New York, NY, USA: ACM, 2012), pp. 330--334.
46Yuan, Nicholas Jing and Zhang, Fuzheng and Lian, Defu and Zheng, Kai and Yu, Siyu and Xie, Xing, "We Know How You Live: Exploring the Spectrum of Urban Lifestyles", in Proceedings of ACM Conference on Online Social Networks (New York, NY, USA: ACM, 2013), pp. 3--14.

Appendix A
Methodology details

Domains

Table 3.1 has the details of the final domain lists in use, including full dataset size
1
2
and selection method. Table 3.4 shows the top TLDs in the list of unique domains; while random TLD samples of course come from a single TLD, top lists are mixed. This list can be compared to the per-TLD (or technically public suffix) results in Table C.10, which shows the coverage of TLDs for external requests per dataset.

A.1.1 .SE Health Status domains

When .SE performs their annual .SE Health Status report measurements, they use an in-house curated list of domains of national interest. These domains are mostly from the .se zone and cover government, county, municipality, higher education, government-owned corporations, financial service, internet service provider (ISP), domain registrar, and media domains – see Table 3.5 for category domain counts and descriptions. Some domains overlap both within and between categories; domains have been deduplicated.

A.1.2 Random .se domains

The thesis was written in collaboration with .SE, which runs the .se TLD, with the work focusing on the state of Swedish domains. Early script development was done using a sample of random domains, most often tested in smaller groups. A final sample of domains was also provided. The .se TLD is to be considered Sweden-centric.

A.1.3 Random .dk domains

The Danish .dk TLD organization, DK Hostmaster A/S
3
, helped out with a sample of domains, chosen at random from the database of active domains in the zone. The .dk TLD is to be considered Denmark-centric.

A.1.4 Random .com, .net domains

The maintainers of the .com, .net and .name TLDs, Verisign, allow downloading of the complete zone file under an agreement. The .com zone is the largest one by far, and the .net zone is in the top 4.
4
This allows for a random selection of sites from around the world, even though the domains are not geographically uniform – both in terms of registrations and actual usage.

A.1.5 Alexa Top sites
5

Alexa, owned by Amazon, is a well-known source of top sites in the world. It is used in many research papers, and can be seen as the standard dataset. Their daily 1-month average traffic rank top list is freely available for download.
6
As Alexa distinguishes between a site and a domain, some domains with several popular sites are listed more than once. URL paths have been stripped and domains have been deduplicated before downloading.

A.1.6 Reach50 domains
7

The top 50 sites in Sweden are presented by Webmie
8
, who base their list on data from a user panel. The panelists have installed an extension into their browser, tracking their browsing habits by automated means. They also have results grouped by panelists categories: women, men, age 16-34, 35-54, 55+ but only the unfiltered top list is publicly available.

Public suffix list
9
10

In the domain name system, it is not always obvious which parts of a domain name are a public suffix and which are open for registration by internet users. The main example is example.co.uk, where the public suffix .co.uk is different from the TLD .uk. Because HTTP cookies are based on domain names, it is important for browser vendors to be able to recognize which parts are public suffixes, in order to protect users against supercookies
11
; cookies which are scoped to a public suffix, and therefore readable across all websites under that public suffix. It should be noted that subdomains do not all have to point to servers owned by the same organization – this can be used as a way to allow tracking cookies for server-to-server tracking behind the scenes (external linkage) [20].
The same dataset is also useful for grouping domains without improperly counting example.co.uk as a user-owned subdomain of .co.uk, which would then render .co.uk as the most popular domain under the .uk TLD. Swedish examples include second level domains .pp.se for privately owned domains and .tm.se for trademarks
12
. These second level domains were more important before April 2003
13
, when first level domain registration rules restricted registration to nation-wide companies, associations and authorities.
The public suffix list (2014-07-24) contains rules, against which domains are checked in one of the classification steps (B.2.3). It becomes the basis for the domain's division into public suffix and primary domain (first non-public suffix match), and subsequent grouping.
There is also an algorithm for wildcard rules which can have exceptions; this thesis has not implemented wildcards and exceptions in the classification step. There are 24 TLDs with wildcard public suffixes, and 8 non-TLD wildcards. Out of these 8 non-TLD wildcards, 1 is *.sch.uk and 7 are Japanese geographic areas. The 24 wildcards have 10 exception rules; 7 of them are Japanese cities grouped by the previously mentioned geographic areas and the remaining 3 seem to belong to ccTLD owner organizations.
Apart from ICANN domains, which have been implemented, there are also private domains considered public suffixes listed as rules. They are domains which have subdomains controlled by users/customers, for example joelpurra.github.io which is controlled by me but hosted by the code hosting service github.com. Other examples include cloud hosting/CDN services such as cloudfront.net, amazonaws.com, azurewebsites.net, fastly.net, herokuapp.com, blogs from several blogspot.TLD domains and dyndns.com's wide choice of dynamic domains. One example that looks like a technical choice in order to hinder accidental or malicious setting of cookies is googleapis.com, which is listed despite being (presumably) completely under Google's control.
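As a simplified sketch of the division into public suffix and primary domain performed in the classification step – using a tiny hard-coded suffix set as a stand-in for the full list, and ignoring wildcard and exception rules – primary domain extraction can be expressed in jq as:

  # pick the longest matching public suffix and keep one extra label;
  # hostnames without a matching suffix are returned unchanged
  def primarydomain(suffixes):
    . as $host
    | (suffixes
       | map(select(. as $s | $host | endswith("." + $s)))
       | sort_by(length)
       | .[-1]) as $suffix
    | if $suffix == null then $host
      else ($host | rtrimstr("." + $suffix) | split(".") | .[-1]) + "." + $suffix
      end;

  "www.example.co.uk" | primarydomain(["uk", "co.uk", "com"])
  # outputs "example.co.uk"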

A.3 Disconnect.me's blocking list

One of the most popular privacy tools is Disconnect.me, which blocks tracking sites by running as a browser plugin. Disconnect was started by ex-Google engineers, and still seems to have close ties to Google, as their own domain disconnect.me is listed as a Google content domain in the blocking list.
The Disconnect software lets users block/unblock loading resources from specific third-party domains and domain categories. The dataset (2014-09-08) has a list of domains used as the basis for the blocking. Each entry belongs to an organization, which comes with a link to its webpage. There is also a grouping into categories – see descriptions and examples later in this chapter. Worth noting is that the content category is not blocked by default.
There are other open source alternatives to Disconnect's blocking list, but they use data formats that are not as easy to parse. The most popular ones also do not contain information about which organization each blocking rule belongs to. See Section 7.5.1.

A.3.1 Categories

Most domains and organizations by far are in the advertisement category. The reason the Disconnect category has so few organizations is that it is treated as a special category (A.3.7) with only Google, Facebook and Twitter. See Table 3.2 for domain and organization count per category.

A.3.2 Domains per organization

The dataset shows that a share of the organizations have more than one domain. One organization stands out by far: Google. The biggest reason is that they have registered their domain under a large number of TLDs, such as google.se and google.ch. Yahoo comes in second; many of their domains are service-specific subdomains of yahoo.com, such as finance.yahoo.com and travel.yahoo.com. See Figure 3.1 for the distribution of organizations (y axis) with a certain number of associated domains (x axis), where Google is the datapoint to the far right on the x axis.

A.3.3 Organizations in more than one category

Some organizations are represented in more than one of the five Disconnect categories. Organizations represented in the content category may be blocked in part – but by serving content, they can achieve at least partial tracking. Yahoo has several ad services, several social services, several content services and a single analytics service, putting them in four categories. At least one organization, Google, is misrepresented in the categories; the special Disconnect category contains both their advertisement and analytics service domains (A.3.7). See Table 3.3 for organizations in more than one category, and which categories they are represented in.

A.3.4 Advertising

While this category has the most domains and organizations by far (see Table 3.2), many of the actors are unknown to the general public, making it harder to know how information is collected and used. Several recognizable companies – such as AT&T, Deutsche Post DHL, eBay, Forbes, HP, IDG, Match.com, Monster, Opera, Salesforce.com, Telstra and Tinder – are listed with their primary domains. This suggests that they can follow their own customers across sites where their trackers are installed, without the use of more advanced techniques, such as cookie-sharing [1].
amazon-adsystem.com
Amazon's ad delivery network. Several amazon.tld domains, such as ca, co.uk and de are also listed here – but amazon.com is not.
appnexus.com
The AppNexus ad network.
imiclk.com
Akamai's ad network Adroit.
overture.com
Yahoo's ad network.
omniture.com
Adobe's ad network.
tradedoubler.com
The TradeDoubler ad network.

A.3.5 Analytics

Analytics services offer a simple way for website owners to gather data about their visitors. The service is often completely hosted on external servers, and the only connection is a javascript file loaded by the website. The script collects data, sends it back to the service and then presents aggregate numbers and graphs to the website owner.
alexa.com
Amazon's web statistics service, considered an authority in web measurement. Alexa's statistics, in the form of their global top list, is also used as input for this thesis (A.1.5).
comscore.com
Analytics service that also publishes statistics.
gaug.es
GitHub's analytics service.
coremetrics.com
Part of IBM's enterprise marketing services.
newrelic.com
A suite of systems monitoring and analytics software, up to and including browsers.
nielsen.com
Consumer studies.
statcounter.com
Web statistics tool.
webtrends.com
Digital marketing analytics and optimization across channels.

A.3.6 Content

Sites that deliver content. There is a wide variety of content, from images and videos to A/B testing, comment and help desk services. This category is not blocked by default.
apis.google.com
One of Google's API domains.
brightcove.com
Video hosting/monetization service.
disqus.com
A third-party comment service.
flickr.com
Flickr is a photo/video hosting site, owned by Yahoo.
googleapis.com
One of Google's API domains, hosting third-party files/services such as Google Fonts and Google Hosted Libraries.
instagram.com
Facebook's photo/video sharing site.
office.com
Microsoft's Office suite online.
optimizely.com
An A/B testing service.
truste.com
Provides certification and tools for privacy policies in order to gain users' trust; “enabling businesses to safely collect and use customer data across web, mobile, cloud and advertising channels.” This includes ways to selectively opt-out from cookies by feature level; required, functional or advertising.
tumblr.com
A popular blogging platform.
uservoice.com
A customer support service.
vimeo.com
A video site.
www.google.com
Google's main domain, which also hosts services such as search.
youtube.com
One of Google's video sites.

A.3.7 Disconnect

A special category for non-content resources from Facebook, Google and Twitter. It seems initially to have been designed to block their respective like/+1/tweet buttons, which seem to belong in the social category. As the category now contains many other known tracking domains from the same organizations, unblocking the social buttons also lets many other types of resources through.
It is worth noting that this category includes google-analytics.com plus Google ad networks such as adwords.google.com, doubleclick.net and admob.com. It might have been more appropriate to have them in the analytics and advertisement categories respectively.

A.3.8 Social

Sites with an emphasis on social aspects. They often have buttons to vote for, recommend or share with others.
addthis.com
A link sharing service aggregator.
digg.com
News aggregator.
linkedin.com
Professional social network.
reddit.com
Social news and link sharing, and discussion.

A.4 Retrieving websites and resources

Websites were downloaded, based on lists of domains, using har-heedless (B.2.2).

A.4.1 Computer machines

Two computers were used to download web pages during development – one laptop machine and one server machine – see Table A.1 for specifications. The server is significantly more powerful than the laptop, and they downloaded a different number of web pages at a time. The final datasets were downloaded on the server.

A.4.2 Network connection

The laptop machine was connected by ethernet to the .SE office network, which is shared with employees' computers. The server machine was connected to a server co-location network, which is shared with other servers. The .SE network technicians said load was kept very low, and only a few percent of the dedicated 100 Mbps per location was used. Both locations are in Stockholm city, and should therefore be well placed in regard to websites hosted in Sweden.

A.4.3 Software considerations

To expedite an automated and repeatable process, a custom set of scripts was written as the project har-heedless. The scripts are written using standard tools, available as open source and on multiple platforms.

Cookies

Cookies stored by a website may affect the content of subsequent resource requests, and are one of the primary means of keeping track of a browser. Each browser instance has been started without any cookies, and while cookie usage has not been turned off, none have been stored after finalizing the web page request. A cookie analysis is still possible by looking at HTTP headers, but cookies have not been considered an indicator of tracking, as other techniques can serve the same purpose [1, 10].

Dynamic web pages

Previous efforts by .SE to download and analyze web pages used a static approach, analyzing the HTML by means of simple searches for http:// and https:// strings in HTML and CSS. That software had proven hard to maintain, and the project was abandoned before this thesis was started, but had not yet been replaced. In order to better handle the dynamic nature of modern web pages, the headless browser phantomjs (B.1.2) was chosen, as it also downloads and executes javascript – a major component in user interfaces as well as in active trackers and ads.

Cached content

Many external resources overlap between websites and domains, and downloading them multiple times could be avoided by caching each file the first time it is seen in a run. Keeping cached content would, depending on per-response cache settings and timeout, result in a different HTTP request and potentially a different response. A file that has not changed on the server would generate an HTTP response status of 304 with no data, saving bandwidth and lowering transfer delays, whereas a changed file would generate a status 200 response with the latest version.
One of the techniques in determining if a locally cached file is the correct/latest version includes the HTTP Etag header [12], which is a string representation of a URL/file at a certain version. When content is transferred it may have an Etag attached; if the file is cacheable, the Etag is saved. Subsequent requests for the same, cached URL contain the Etag – and the server uses it to determine if a compact 304 response is enough or a full 200 response is necessary. It has been found that the Etag header can be used for cookieless cross-site tracking by using an arbitrarily chosen per-browser value instead of a file-dependent value [34]. This means that keeping a local file cache might affect how trackers respond; a file cache has not been implemented in har-heedless, making the browser amnesiac.
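As a small illustration of the mechanism (the URL and tag value below are hypothetical), a conditional request can be made with curl: the first response carries an ETag header, and sending that value back in If-None-Match lets the server answer with a compact 304 if the file is unchanged.

    # First request: note the ETag header in the response (hypothetical URL and value).
    curl -sI http://example.com/script.js | grep -i '^etag'
    # Second request: send the cached ETag back; an unchanged file yields "HTTP/1.1 304 Not Modified".
    curl -sI -H 'If-None-Match: "abc123"' http://example.com/script.js | head -n 1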

Flash files

Flash is a scriptable, proprietary, cross-platform, vector-based web technology owned by Adobe. Several kinds of content, including video players, games and ads, use Flash because it has historically been better suited than javascript for in-browser moving graphics and video. Flash usage has not been considered for this thesis, as the technology is not available on all popular web browsing platforms, notably Apple's iPad, and is being phased out by HTML 5 features such as the <canvas> and <video> elements.

Combined javascript

A common technique for speeding up websites is to reduce the number of resources the browser needs to download, by combining or concatenating them in different ways depending on the file format. Javascript is a good example where there are potential benefits, since functionality often is spread across several files, especially after the plugin style of frameworks such as jQuery emerged. One concern is whether or not script concatenation on a web page would affect script analysis at a later stage, by reducing the number of third-party requests. While it is hard to analyze all scripts, their widespread use suggests that third-party scripts stay on their respective home servers, delivered as software as a service (SaaS) to enable faster update cycles and tracking of HTTP requests.

Google Tag Manager

One of the concerns was Google Tag Manager (GTM), which is a script aggregation service with asynchronous loading, aimed specifically at marketers. Google provides built-in support for their AdWords, Analytics (GA) and DoubleClick (DC) services. While GTM simplifies management with only one <script> tag, each part should download separately and perform the same duties, including “calling home” to the usual addresses. In order to confirm this, a query was run on one of the datasets, se.2014-07-10.random.100000-http-www – see Table A.2.
The numbers suggest that every domain that uses Google Tag Manager also uses at least one of Google Analytics and DoubleClick, so GTM will not obscure information regarding which services are called when further analyzed.
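As a rough, hedged sketch of this kind of sanity check (not the actual query, which runs on the extracted jq data; the har/ directory name is a placeholder), the same question can be approximated directly on the HAR files:

    # List HAR files referencing Google Tag Manager, then count how many of those
    # also reference Google Analytics or DoubleClick.
    grep -rl 'googletagmanager\.com' har/ > gtm-domains.txt
    wc -l < gtm-domains.txt
    xargs grep -lE 'google-analytics\.com|doubleclick\.net' < gtm-domains.txt | wc -l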

robots.txt and <meta name="robots" />

Automated web spider/bot software can put a significant load on a web server, as the spidering speed can exceed user browsing speed by far – a single web spider can potentially request thousands of pages in the span of minutes, effectively becoming a denial of service attack on an underpowered server. Some sites also have information they consider too sensitive to expose to spiders, or to certain kinds of spiders such as image spiders. The choice not to serve spiders can stem from technical reasons (bandwidth, server load), privacy (do not allow information to be indexed in search engines) or business reasons (do not allow data to be collected and aggregated). In order to instruct web spiders not to fetch certain kinds of material, there are two basic mechanisms – the special robots.txt file and the HTML header tag <meta name="robots" />. Both can contain instructions for certain bots not to index certain paths or pages, or not to follow further links stemming from a page.
Commercial software from search engines, information collectors and other software vendors takes these explicit wishes from webmasters into consideration, but har-heedless does not. While it is automated software, it is not spidering software requesting many pages and following links to explore a site – it only accesses the front page of a domain, and the resources explicitly requested by that one page. Information is not retrieved for indexing as such, as only HTTP request metadata is recorded. While some information requested not to be indexed might end up in the screenshots, these are kept as non-public thesis data used for verification, in a format that is hard for machines to re-parse.
Future versions of har-heedless might implement logic that checks robots.txt in a domain list preprocessing step, to determine whether or not the request should be made. The same list could be used to filter further resource requests made by phantomjs, perhaps also considering other domains' robots.txt files. The tag <meta name="robots" /> could also be respected, perhaps by not saving screenshots for noindex values.
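A minimal sketch of such a preprocessing step is shown below, assuming a plain text domain list; it is a crude check that only skips domains whose robots.txt disallows everything, and it ignores user-agent scoping, so a real implementation would need a proper robots.txt parser.

    # Keep only domains whose robots.txt does not disallow everything.
    while read -r domain; do
      if curl -s --max-time 10 "http://$domain/robots.txt" | tr -d '\r' \
          | grep -qiE '^disallow:[[:space:]]*/[[:space:]]*$'; then
        echo "skipping $domain" >&2
      else
        echo "$domain"
      fi
    done < domains.txt > allowed-domains.txt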

Parallelizing downloads

A lot of the time spent downloading a web page is spent waiting for network resources, especially during timeouts. To speed up downloading large numbers of web pages, parallelization was initially employed using simple scripting techniques, starting multiple processes at once in batches and waiting for each batch to finish before starting the next. This was deemed inefficient, as the download and rendering speed of the slowest web page in a batch would be a bottleneck. In later script versions, GNU parallel was used, with a setting to not start more jobs if the current CPU/system load was too high.
During initial parallelizing tests, the laptop machine was shown to be able to handle 100 domain downloads in parallel, but this was later scaled down to ensure system overload would not affect results, at the cost of significantly longer download times.
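An illustrative GNU parallel invocation of this kind is shown below; the job count, load limit and the download-one.sh wrapper are hypothetical, not the exact har-heedless command.

    # Run at most 50 downloads at once, but hold back new jobs while the
    # system load average is above 4.
    parallel --jobs 50 --load 4 ./download-one.sh {} :::: domains.txt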

Screen size

When phantomjs is running, it emulates having a browser window by keeping an internal graphics buffer. Even if the web page is not rendered and shown on screen, it still has a screen size, which affects layout. With a bit of javascript or responsive CSS, the screen size can affect which resources are downloaded. Javascript can be used to delay downloading of images and other resources that are below the fold, meaning outside of the initial view the user has without scrolling in any direction, as a page speed improvement. Responsive CSS adapts the page style to the screen size in order to increase usability for mobiles and other handheld devices, and might optimize the quality of downloaded images to match the screen size.
By default, phantomjs uses the physical computer's primary screen size. In order to reduce differences between the laptop and server machines, a fixed emulated window size of 1024x768 pixels was chosen. The basis for this is that 1024x768 pixels has long been the recommended screen size to design for, and it is still a common screen size.
Scripted browser scrolling (vertically) through the page has not been performed; thus javascript scroll events triggering, for example, downloading of images below the fold are not guaranteed to fire.

Screenshots

In order to visually confirm that web pages have been downloaded correctly, a PNG screenshot can optionally be taken when the page has finished downloading. There is a processing cost associated with taking screenshots: it takes time for phantomjs to render the internal buffer as an image and convert it to base64 encoding for the augmented HAR data, for jq to extract the image data, and for the system to convert the data back to binary and finally write it to disk. Extra processing is also needed to remove the screenshot from the augmented HAR file.
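A hypothetical version of the extraction step is sketched below; the field name ._screenshot is an assumption, not necessarily the one used in the augmented HAR data.

    # Pull the base64-encoded screenshot out of an augmented HAR file and write it as PNG.
    jq -r '.log.pages[0]._screenshot' example.com.har | base64 --decode > example.com.png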
The emulated browser window size also sets a limit on the screenshot size when rendered, but the entire browser canvas is captured. This means that screenshots that are saved to disk in most cases extend beyond the viewport size, most often vertically; this corresponds to scrolling through the entire page.
The ratio of PNG to HAR data size on disk points to the compressed PNG files being up to 10 times the size of the uncompressed HAR files for certain types of pages, for example media domains in the .SE Health Status report dataset. See Table A.3.

A.4.4 Dataset variations

Each dataset has been used in four variations, effectively quadrupling the number of website access attempts. Each of these variations has been downloaded and analyzed separately, then used for both intra- and inter-dataset comparisons.
Each dataset in Table A.4 has the variation appended to the name in the detailed results, Chapter C.

A.4.5 System load during downloads

System load can affect the end results if network timeouts occur during downloading and processing of domains' front pages. Apart from CPU and memory limitations on the machines themselves, the tests should not affect other users of the .SE network.
System load on *NIX systems can be found using the uptime command. Other processes were running at the time of these tests, so the numbers are not exclusive to the downloading. The complexity of the front pages of the currently processed domains affects the load, as does whether screenshot generation is enabled. Final downloads were done with screenshots enabled. Table A.5 shows loads sampled at random points in time.
System load has been shown to vary greatly based on the type of request, mainly differing by HTTP and HTTPS response rates. High failure rates mean a lot of time is spent waiting until a set time limit/timeout has been reached. Increasing the upper bound on parallelism and instead dynamically adjusting the number of concurrent requests based on system load should decrease the time needed to download a set of domains with a large failure rate (C.2). Time limits can also be adjusted based on actual reply timings found in previous datasets' HAR data, instead of setting a high and “safe” upper bound, both for page timeouts and individual resources.

A.4.6 Failed domains

Some websites are not downloaded successfully, for different reasons. The DNS settings might not be correct, the server may be shut down, there might have been a temporary network timeout, there might have been a software error – or the server has been programmed to not respond to automated requests from phantomjs (B.1.2) and similar tools. Unfortunately, outside of local software errors which may result in parseable error messages, sources of the errors are hard to detect without an extensive external analysis of DNS settings and network connectivity – and even so, an automated analysis may include false negatives because of remote system automated request countermeasures.
Each HTTP request has its HTTP status response recorded if it is available; absence, or numbers outside the RFC7231 [13] range (100-599), indicates failure. Any error output the web page itself has produced, mostly because of javascript errors, has also been recorded in the HAR log or individual entry/request comment fields.
A distinction is made between failed and unsuccessful domains – unsuccessful domains rendered a complete response with an HTTP status that indicated that the request was not successful. Domains that failed have been re-downloaded; this resolved some, but not all, failures.
The first round of retries recovered the most domains, and subsequent retries were less successful. This seems to indicate that some intermittent failures are recoverable, while some domains simply will not respond. Due to diminishing returns in the number of additional successful domains in each retry cycle, the number of retries was limited to two (B.2.1).

A.5 Analyzing resources

After downloading HAR files, they are processed using har-dulcify (B.2.3).

A.5.1 Screenshots

Screenshots were mainly used for verification during development, to see that the pages were loaded properly. While they have been retained, the manual inspection necessary makes it infeasible as a way to verify each and every domain's result.

A.5.2 Extracting HAR format parts

The HAR format specification includes fields that concern, for example, request/response timings and data sizes. Those, and other fields, are not analyzed in this thesis, so the first step is to extract the relevant information. The properties below are enough to see what kinds of resources are requested, whether requests are successful and where the requests are made to; a small extraction sketch follows the list.
  1. URL The request's URL. Recorded whether the request is successful or not. Although almost all requests are made to http:// or https://, a negligible number of other, non-standard (sometimes misspelled) protocols have been recorded. URLs starting with data: have been ignored, as they are page-internal.
  2. Status The HTTP status code found in the server's response. Defined as a 3-digit integer result, 100-599, grouped into classes by the first digit.
  3. Mime-type The HTTP content-type header, which is the body's internet media type (previously known as Multipurpose Internet Mail Extensions (MIME) type) combined with optional parameters, such as character encoding.
  4. Referer The URL of the page that requested the resource. It can be used to build a tree of requests, but is limited by the fact that it only differs from the origin page when HTML <frame> or <iframe> elements are involved. The word referrer was misspelled referer in the original proposal [4].
  5. Redirect The URL of the new location of the requested resource, if defined by a 3xx (A.5.3) HTTP status.
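A minimal jq sketch of this extraction is shown below (not the exact har-dulcify code); the HAR 1.2 field names are standard, but header-name casing can vary in practice.

    # Extract the five properties from each recorded request in a HAR file.
    jq '.log.entries[] | {
          url: .request.url,
          status: .response.status,
          mimeType: .response.content.mimeType,
          referer: ((.request.headers // []) | map(select(.name == "Referer")) | .[0].value),
          redirect: .response.redirectURL
        }' example.com.har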

A.5.3 Expanding parts

URL, referer, redirect

The URL format has several components [5]; the ones interesting for standard web requests are listed below and/or split up further to get a more fine-grained analysis. A simplified parsing sketch follows the list.
  1. Scheme In this case, http and https protocols have been the most interesting to look at. Other interesting examples that can be found in the wild include ftp (for file downloads), data (for resources encoded into the URI) and about (mostly used for blank placeholder pages).
  2. Domain For this thesis, the domain part has been of much interest, as it signifies the difference between internal and external resources.
  3. Port While custom ports can be used, they usually implicitly default to 80 for HTTP and 443 for HTTPS.
  4. Path The path specifies a folder or file on the server.
  5. Querystring Most parameters sent back to servers are defined in an RFC compliant way, but there are other variants building on for example `/` as a pseudo-path separator.
  6. Fragment In the web context, the fragment is a client-only component and is not sent back to the server as part of a request. The usage affects browsers' presentation, historically only by scrolling to a matching named element, but modern usage includes keeping browser state using javascript, for example following the web spider crawlable hash-bang syntax.
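A simplified parsing sketch in bash is shown below; it is not a full RFC 3986 parser, and the example URL is made up.

    url='https://www.example.co.uk:443/path/page?x=1#top'
    scheme=${url%%://*}
    rest=${url#*://}
    hostport=${rest%%/*}
    domain=${hostport%%:*}
    port=${hostport#*:}; [ "$port" = "$hostport" ] && port=""
    pathqf="/${rest#*/}"; [ "$pathqf" = "/$rest" ] && pathqf="/"
    fragment=${pathqf#*#}; [ "$fragment" = "$pathqf" ] && fragment=""
    pathq=${pathqf%%#*}
    query=${pathq#*\?}; [ "$query" = "$pathq" ] && query=""
    path=${pathq%%\?*}
    # Prints: https | www.example.co.uk | 443 | /path/page | x=1 | top
    echo "$scheme | $domain | $port | $path | $query | $fragment"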

Status

The status value groups are defined by their first digit [13]. Status values outside of the defined range 100-599 are grouped as null.
  1. 1xx Informational
  2. 2xx Successful
  3. 3xx Redirection
  4. 4xx Client error
  5. 5xx Server error

Mime-type

Mime-types are grouped by their usage, which is usually the first part of the type. Table A.6 shows a selection of common types grouped together.

A.5.4 Classification

Public Suffix List

The public suffix list is prepared for lookups per domain component. Each request's domain (including shorter domain components) is checked against it, and any matching public suffixes are kept in an array. A list of private suffixes, as in domain components not in the public suffix list, is also kept. The primary domain (first non-public suffix match, or the shortest private suffix) is extracted.
In terms of grouping, it might be good to keep the primary (longest matching) public suffix separately. The public suffix list also contains special wildcard/exception rules and private suffixes (A.2). They have not been used in the thesis (7.5.5).
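A simplified lookup sketch is shown below, using the effective_tld_names.dat file mentioned in B.2.3 and ignoring wildcard and exception rules as well as the private section, as the thesis classification step does; the example domain is made up.

    domain="sub.example.co.uk"
    # Strip comments and blank lines from the public suffix list.
    rules=$(grep -vE '^(//|$)' effective_tld_names.dat)
    candidate="$domain"
    public_suffix=""
    while [ -n "$candidate" ]; do
      if printf '%s\n' "$rules" | grep -qxF "$candidate"; then
        public_suffix="$candidate"   # longest candidates are checked first, so stop at the first hit
        break
      fi
      [ "${candidate#*.}" = "$candidate" ] && break
      candidate="${candidate#*.}"
    done
    # The primary domain is the public suffix plus the label just in front of it.
    prefix="${domain%.$public_suffix}"
    primary_domain="${prefix##*.}.$public_suffix"
    echo "public suffix: $public_suffix"     # co.uk
    echo "primary domain: $primary_domain"   # example.co.uk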

Basic

Simple properties of the request are checked, and their values saved as classification properties. These properties are used for grouping and for further analysis; a small sketch follows the list of classifications.
Successful request
The status of the request is 200-299 or 304.
Unsuccessful request
The status is 100-199 or 300-303 or 305-599.
Failed request
The log file format is incomplete or the status is null, below 100 or above 599.
Same domain
The request is to the same domain that was first visited.
Subdomain
The request is to a subdomain of the domain that was first visited.
Superdomain
The request is to a domain to which the origin domain is a subdomain. This basic superdomain classification is currently not checked against the public suffix list for invalid superdomains.
Same primary domain
The request shares the same primary domain with the origin.
Internal domain
The request is to the same domain, a subdomain or a superdomain of the domain that was first visited.
External domain
The request is not to an internal domain.
Secure request
The request is using HTTPS.
Insecure request
The request is not using HTTPS.
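A small jq sketch of a few of these classifications is shown below; it is not the actual classification script, and the field names .status, .url.domain and .url.scheme are assumptions about the expanded per-request objects.

    jq --arg origin "example.se" '{
      successfulRequest: ((.status >= 200 and .status <= 299) or .status == 304),
      sameDomain: ((.url.domain // "") == $origin),
      subdomain: ((.url.domain // "") | endswith("." + $origin)),
      secureRequest: (.url.scheme == "https")
    }' requests.json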

Disconnect.me

The URLs in the extended data contain lists of domain components. As the Disconnect list of blocked domains is prepared for lookups by domain, each matching domain (including shorter domain components) is extracted together with its organization, organization URL and domain category.
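A hedged sketch of the lookup is shown below, assuming a prepared file disconnect-prepared.json where each top-level key is a blocked domain mapping to an object with organization and category fields (the exact structure produced by prepare-service-list.sh may differ); the check of shorter domain components is omitted.

    jq --argfile disconnect disconnect-prepared.json \
       '.url.domain as $d
        | ($disconnect[$d] // empty)
        | {domain: $d, organization, category}' requests.json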

A.5.5 Analysis

An analysis, where request classifications are counted, summed and the coverage calculated, is performed as an automated step.

Origin

The origin domains are grouped separately from the requests that stemmed from them.

Requested URL counts

All requests are represented with their domain, and other classifications.

Distinct requested URLs

Requested URL counts can skew aggregate results if a single domain makes an excessive number of requests to a certain URL/domain/tracker or in a certain classification. Reducing counts to boolean values, indicating that at least one request matched the classification, makes it possible to calculate coverage per domain later on.

Request/domain counts

All the numbers from all domains added together.

Request/domain coverage

The summed-up counts are divided by either the total request count (for requested URL counts) or the number of domains in the current group (for distinct requested URLs). This gives a coverage percentage – either the percentage of requests, or the percentage of domains, that have the value.
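For illustration, with made-up numbers: if 600 out of 1,000 non-failed domains in a group made at least one request matching a classification, the domain coverage for that classification is 600/1000 = 60%; if 5,000 out of 25,000 requests matched it, the request coverage is 20%.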

Grouping

In order to not make too broad assumptions, some grouping was performed. The analysis was performed the same way on each of these groups (B.2.3). The origin domain's download status was checked, and grouped into both unfiltered and non-failed groups. The list of requested URLs was grouped into unfiltered, internal and external URLs.

A.5.6 Questions

Where the aggregate analysis is not enough, there are custom questions. These questions/queries can be executed against any previous intermediate step in the process, as the intermediate results are saved to disk.

Google Tag Manager

One of the questions posed beforehand was whether Google Tag Manager would have an impact on the results; it has been answered with the help of this data (A.4.3).

Origins with redirects

Looking at preliminary results, a large portion of domains yielded a redirect as the initial response. In order to look at these redirects specifically, and determine if they redirect to an internal or external domain, a specific question was written.

A.5.7 Multiset queries

After downloading several datasets, it is often interesting to compare them side by side. The multiset queries extract pieces of data from several datasets, and combine them into a single file.

Appendix B
Software

Development was performed in the Mac OS X operating system while the server machine performing most of the downloading and analysis was running the Linux distribution Debian (A.4.1). The software is thought to be runnable on other Unix-like platforms with relative ease.

B.1 Third-party tools

In order to download and analyze tens of thousands of webpages in an automated fashion, a set of suitable tools was sought. Tools released as free and open source software have been preferred where available; code written specifically for the thesis has also been released as such.

B.1.1 HTTP Archive (HAR) format

In an effort to record and analyze network traffic as seen by individual browsers, the data/file format HTTP Archive (HAR) was developed. Browsers such as Google Chrome implement it as a complement to the network graph shown in the Developer Console, from where a HAR file can be exported. While constructed to analyze, for example, web performance, it also contains data suitable for this thesis: requested URLs and HTTP request/response headers such as referrer and content type. HAR files are based upon the JSON standard [6], a Javascript-object-compatible data format commonly used to communicate dynamic data between client-side scripts in browsers and web servers. The most recent specification at the time of writing was HAR 1.2.

B.1.2 phantomjs

Accessing webpages is normally done by users in a graphical browser; the browser downloads then displays images, executes scripts, and plays videos. A browser is user friendly but not optimal for batch usage, due to the overhead of constantly drawing results on screen and the lack of automation without external tools such as Selenium Webdriver. While Webdriver can be used to control several kinds of browsers, such as Microsoft Internet Explorer, Mozilla Firefox and Google Chrome, such browsers are not suitable for usage on a rack server that was not set up with “normal” browser usage in mind – that is, with desktop software functionality. A good alternative for such servers is phantomjs, which is built as a command line tool without any graphical user interface. It acts like a browser internally, including rendering the webpage to an image buffer that is not displayed, and is controllable through the use of scripts. One such example script included in the default installation generates HAR files from a webpage visit. phantomjs is implemented on top of the WebKit web page rendering library, also used in Apple Safari, Opera and previously Google Chrome.
There are alternatives to phantomjs, but they have not been tested within the scope of the thesis. Future versions could try alternative automated browsers, such as SlimerJS, or both together with CasperJS, to verify phantomjs' results.

B.1.3 jq

While there are command line tools to transform data in, for example, plain text, CSV and XML files, tools to work with JSON files are not as prevalent. One such tool gaining momentum is jq, which implements a domain specific language (DSL) suitable for extracting or transforming data. The DSL is based around a set of filters, similar to pipes in the Unix world, transforming the input and passing it on to the next stage. jq performs well with large datasets, as it treats data as a stream where each top-level object is treated separately.
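A minimal example of the filter style, counting the number of recorded requests in a single HAR file:

    jq '.log.entries | length' example.com.har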
At the time of writing, jq is released as version 1.4. Support for regular expressions is planned for version 1.5, which has been in the making for the duration of the thesis. As the thesis code is run on multiple machines/systems and expected to deliver the same results, using standardized packages has been a part of ensuring that.

B.1.4 GNU parallel

To parallelize task execution, GNU parallel [43] has been used. It allows an input file to be distributed among several processes/CPU cores, and the results to be combined into a single file. It helps speed up downloading websites to create HAR files, and processing of the JSON data through jq, which is single threaded.
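An illustrative combination of the two tools is shown below; the paths and the jq filter are placeholders, not the thesis pipeline.

    # Run a small jq filter over every HAR file, one job per CPU core,
    # keeping input order and collecting the output into a single file.
    find har/ -name '*.har' | parallel --keep-order --jobs +0 "jq '.log.entries | length' {}" > request-counts.txt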

B.2 Code

In order to efficiently and repeatably download and analyze web pages, special tools have been written. Most of the code is written in bash scripts utilizing external commands when possible, such as jq. The code for the jq commands has been embedded in the bash scripts.
The source code for the respective projects has been released to the public under the GNU General Public License version 3.0 (GPL-3.0), so other projects can make use of them as well.

B.2.1 har-portent

A set of scripts that downloads websites, retries failed downloads and analyzes the results in a single run – see har-heedless (B.2.2) and har-dulcify (B.2.3).

domains/download-and-analyze-https-www-combos.sh <parallelism> <domainlists>

Uses domains/download-and-analyze.sh to download four variations (A.4.4) of the same domains, so any differences between secure/insecure and www-prefixed domains can be observed. This is the true fire-and-forget script you want to run to download and analyze multiple large sets.
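An example invocation might look as follows; the parallelism value and the domain list file name are illustrative.

    domains/download-and-analyze-https-www-combos.sh 25 se-random.txt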

domains/download-and-analyze.sh <prefix> <parallelism> <domainlists>

Downloads a list of domains in parallel, with automatic creation of per-prefix/input-file folders and log files. It also performs automatic retries for failed domains, two times per dataset, with increased parallelism as retries are mostly expected to yield another network timeout or error.

B.2.2 har-heedless

Scriptable batch downloading of webpages to generate HTTP Archive (HAR) files, using phantomjs. With a simple text file with one domain name per line as input, har-heedless downloads all of their front pages. Downloads can be made either in serial or in parallel. The resulting HAR data is written in a folder structure per domain, with a timestamp in the file name. The script that extracts HAR data, netsniff.js, is based on example code shipped with phantomjs, but modified to be more stable.

get/netsniff.js

A modified version of netsniff.js from the phantomjs project. Some stability fixes have been applied, but these patches have not yet been submitted to the phantomjs project. They should be split up into separate parts for the convenience of the project maintainers. A complete refactoring is also an alternative, but it might be less likely to be accepted.

get/har.sh <url>

Contains logic to run netsniff.js through phantomjs. If phantomjs crashes or otherwise encounters an error, a fallback HAR file is generated with a dummy response explaining that an error occurred.

url/single.sh <domain> [--screenshot <true|false>]

Downloads a URL of a single domain, and takes care of writing the HAR output to the correct folder and file. If a screenshot has been requested, it is extracted (and removed) from the extended HAR data and written to a separate file parallel to the resulting HAR file.

url/parallel.sh [parallel-processes [--screenshot <true|false>]]

Uses GNU parallel to download multiple webpages at a time. The number of separate processes running is adjusted per machine, depending on capacity, with parallel-processes.

domain/parallel.sh <prefix> [parallel-processes [--screenshot <true|false>]]

Download the front pages of a list of domains, in parallel, using a specific prefix, such as https://www.. See url/parallel.sh.

B.2.3 har-dulcify

Extracts data from HTTP Archive (HAR) files for an aggregate analysis. HAR files by themselves contain too much data, so the relevant parts need to be extracted. The extracted parts are then broken down into smaller parts that are easier to group and analyze, and added to the data alongside the original. With the expanded data in place, requests are classified by basic measures and matched against external datasets. Scripts are written to perform only a limited task and instead be chained together by piping the data between them. As the scripts generally connect in only one way, the convenience scripts in the one-shot/ folder are used the most. These convenience scripts also leave the files from partial executions, so they can be used for other kinds of analysis.
At the time of writing, there are 67 scripts in har-dulcify. Here is a selection with explanations.

one-shot/all.sh [har-folder-path]

Runs preparations.sh, data.sh, aggregate.sh and questions.sh based on data in the har-folder-path (defaulting to the current folder) outputting the results to the current folder.

one-shot/preparations.sh

Downloads, prepares and analyses third-party datasets, and puts them in the current folder for use by subsequent scripts.

one-shot/data.sh [har-folder-path]

Processes all HAR files in the har-folder-path (defaulting to the current folder), and puts the output in the same folder.

one-shot/aggregate.sh

Prepares data for aggregation by counting occurrences in each domain's data, then adds them together to a single file containing an aggregate analysis.

domains/latest/all.sh

Finds and lists the paths to the most recent HAR files, per domain.

extract/request/parts.sh

Extracts url, status, content-type (mime-type) and other individual pieces of data from the originating domain front page request and the subsequent requests.

extract/request/expand-parts.sh

Keeps the original data, but also expands the url and mime-type into their respective parts, and adds simple grouping to status and mime-type.

classification/basic.sh

Adds basic classifications, such as whether a request is internal or external to the originating domain, and whether the request is secured with HTTPS.

classification/disconnect/prepare-service-list.sh

Prepares Disconnect's blocking list from the original format where blocked domains are stored deep in the structure, to one where domains are top level map keys, prepared for fast lookups.

classification/disconnect/add.sh <prepared-disconnect-dataset-path>

Matches each requested domain against Disconnect's list of domains to block, and adds the results to the output. Disconnect's original service.json (or disconnect-plaintext.json) needs to be prepared through classification/disconnect/prepare-service-list.sh before being used.

classification/disconnect/analysis.sh

Analyses Disconnect's blocking list and collects some aggregate numbers.

classification/effective-tld/add.sh <prepared-disconnect-dataset-path>

Matches each requested domain against Mozilla's Public Suffix list of effective top level domain names, and adds the results to the output. The original effective_tld_names.dat needs to be prepared through classification/effective-tld/prepare-list.sh before being used.

aggregate/prepare*.sh

Running the aggregation before the analysis is currently not possible in a single step, as jq requires all data for a reduce step to be in memory. The solution is to first map the data to a suitable format, and then reduce it in chunks repeatedly.
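An illustrative chunked reduction is sketched below; it is not the exact prepare script logic, and it assumes one JSON object of numeric counters per line in a hypothetical mapped-counts.json file.

    # Sum per-domain counter objects 1000 lines at a time, then sum the partial
    # results, to keep jq's memory usage bounded.
    split --lines 1000 mapped-counts.json chunk-
    for chunk in chunk-*; do
      jq -s 'reduce .[] as $c ({}; reduce ($c | keys[]) as $k (.; .[$k] = (.[$k] // 0) + $c[$k]))' "$chunk"
    done | jq -s 'reduce .[] as $c ({}; reduce ($c | keys[]) as $k (.; .[$k] = (.[$k] // 0) + $c[$k]))' > totals.json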

aggregate/analysis.sh

Takes counts and lists of values, and reduces them to easy-to-present values, percentages and top lists. Results are also grouped in order to draw separate conclusions regarding non-failed domains and internal/external resources. The output after grouping can be represented as a tree, where the origin represents the original request of the subsequently requested URLs.

questions/google-gtm-ga-dc.sh

Analyzes the impact of Google Tag Manager on coverage for other Google services, specifically Google Analytics and DoubleClick.

questions/origin-redirects.sh

Analyzes requests to see if there are redirects from the origin page initially requested. One of the most interesting things to look at is whether or not domains redirect to or from secure HTTPS domains.

questions/ratio-buckets.sh

Collects occurrences of arbitrary things and ratios of other things per domain, puts them into 100+ counter buckets and calculates normalized and cumulative versions. Used to get the number of Disconnect organizations and ratios of internal and secure resources.

multiset/*.sh

A set of scripts that perform tasks on multiple datasets at once – an aggregate of aggregates usually. The scripts were developed late in the process, to extract small pieces of data per dataset for the report. The small pieces of data are combined to files with tab-separated values, which are the source of most data tables and figures in the report.

Appendix C
Detailed results

C.1 Differences between datasets

Domain lists chosen for this thesis come in three major categories – top lists, curated lists and random selection from zone files. While the top lists and curated lists are assumed to primarily contain sites with staff or enthusiasts to take care of them and make sure they are available and functioning, the domain lists randomly extracted from TLD zones might not. Results below seem to fall into groups of non-random and randomly selected domains – and result discussions often group them as such.

C.2 Failed versus non-failed

HAR data that does not have a parseable HTTP status outcome number (shown as (null) in C.3) is considered a failed request. In order to reduce temporary or intermittent problems, all domains that failed were retried up to two times (B.2.1).
Figure C.1 visualizes the percentage of requests in each HTTP status category, and null for no response, on the x axis. Non-random domains have a failure rate of below 15% for HTTP, and 70-90% for HTTPS, meaning that 10-30% implement HTTPS. Random zone domains have a failure rate of above 20% for HTTP and above 99% for HTTPS.
The very low HTTPS adoption rate among random sites is both surprising and not surprising – while larger sites might have felt the pressure to implement it, a non-professional site owner might see it as both an unnecessary technical challenge and an unnecessary additional cost. Most X.509 public key infrastructure (PKI) [8] certificates cost money to buy and install. With public IPv4 addresses running out and legacy browsers requiring one IP address per HTTPS certificate, it can also lead to an additional fee for renting an exclusive IP address. The random zone domains that do respond to HTTPS requests exhibit behavior more similar to top and curated sites – see how dataset variation lines follow each other in Figure C.3 and Figure C.8 – suggesting that a similar kind of effort has been put into developing them.
During analysis har-dulcify splits results into unfiltered and filtered, non-failed origin domains. Unless otherwise mentioned, further results are presented based only on non-failed domains in each dataset, as failed origin requests add nothing to the further resource analysis.
There might be an overlap between failed HTTP/HTTPS and non-www/www dataset variations, but at the moment they are treated separately. If domains that failed to respond to HTTP requests are removed from the HTTPS set in a filter step, HTTPS failure statistics might be lower – and the same goes for www variations.
The difference between the number of domains in a dataset and the number of analyzed domains – for example the number of domains in “se.2014-07-10.random.100000-http-www” versus the number of HAR files – is due to software crashes. A higher domain response rate means a higher risk of a crash; larger datasets have crashes in 0.25-2.5% of domains. Most crashes were observed to be caused by malformed image data triggering an uncaught software exception in the JPEG image decoder used by phantomjs. Looking at the resulting HAR files means that both local software errors and remote errors are considered failures. The analysis can be improved with additional testing and improvements to phantomjs; missing HAR files have been ignored in further analyses.

C.3 HTTP status codes

Choosing the term non-failed instead of successful when dividing and focusing the result discussions has its basis in the HTTP standard, which defines the status codes. Successful requests are generally shown with an HTTP status code of 200 (actually the entire 2xx group), or a 304 which means that a previously cached (presumably successful) result is still valid. Many sites respond with a 3xx status, which is not exactly successful as it does not contain actual content, but cannot be considered a failure as it will most likely lead to another resource that is successful. While a status response of 4xx or 5xx shows there is a problem of some kind, for the purpose of this thesis a response that contains any HTTP status number is still considered a non-failure, as the remote system has responded with a proper HTTP response parseable by phantomjs and har-heedless.
The table below details the percentage of domains in each response code group, especially for 3xx (redirect) responses, as well as null for no response. Figure C.1 also visualizes the results as percentages (x axis) per dataset.
The majority of domains respond to HTTP requests during domain scanning, but many do not, whether by configuration or by chance. While most non-failed domains produce 2xx responses, there is a high ratio of 3xx redirect class responses. With some 20% of random domains and around 50% of top sites redirecting their visitors, it invites further research. Many of them are 301 (permanent) redirects; they indicate that the information supposedly found at the domain is actually found somewhere else. The difference between http and http-www datasets seems to suggest that redirects lead to the other variation (above 90%), or to a secure variant; this thesis has looked at redirects within the same domain and to HTTPS variations (C.8).
A further analysis might disregard 4xx and 5xx responses as well, but current numbers suggest the difference would not be significant. While such responses can be caused by software problems, they do occur in practice, and recording them indicates that the software works as intended.

C.4 Internal versus external resources

The table shows non-failed origin domains having requests strictly to the same domain, subdomains, superdomains or same primary domain – jointly known as internal domains – or external (non-internal) domains. Origin domains that are not exclusively loading from either internal or external resources are loading from mixed domains. See also Figure C.2 visualizing the proportions (x axis) per dataset.
More detailed examples of distributions of resources are shown in Figure C.3, with ratio of internal resources per domain (as opposed to external resources) on the x axis, and what cumulative ratio of domains exhibit this property on the y axis. The leftmost marker in each dataset shows the ratio of domains (y axis) with 0% internal (strictly external) resources and the right hand marker 99% internal resources. To the right of the 99% marker are domains using strictly internal resources; the vertical difference (y axis) between the two markers shows the ratio of mixed resource usage.
Mixing resources from both internal and external domains is the most common way to compose a web page for datasets not randomly chosen from zones, although it is quite common for random domains as well. Random domains show relatively high tendencies to either extreme, with only internal or only external resources; in addition, the usage of external resources is lower in HTTPS variations. As can be noted in both Figure C.2 and C.3, the top domains are very similar in terms of resource distribution between HTTP and HTTPS datasets. This means that active tracking on top sites is as prevalent when surfing a website over a secure, encrypted connection as over an insecure one. Random .dk sites have the highest ratio of internal resources, but even so more than two thirds of those domains load external resources. Strictly external plus mixed resource usage is above 80% in most and 90% in many datasets. The same is true for HTTPS, confirming the thesis' hypothesis that tracking is installed through the use of external resources, deliberately or not, on HTTPS sites as well (2.1).
During analysis har-dulcify splits results into unfiltered, internal and external resources, each treated the same way in terms of classifications, allowing separate conclusions to be drawn. Further analysis showing internal/external ratios are generally calculated per domain which has at least one internal/external request. This makes a rather large difference for random zone domains, which have a rather high ratio of domains with no internal or no external resources (C.6).
Looking at internal versus external domains, with regard to the origin domain, is easy to do as it is only a matter of string comparison. The next step would be to use organization grouping, as seen in the Disconnect classifications (C.11). Private CDN domain detection requires more extensive mapping of requests' domain usage, plus manual classification work. While the work put in might only pay off for top organizations with many services, it helps in seeing legal entities as information receivers, rather than relying on the merely technical domain partitioning. It also adds more questions – could, for example, a private CDN domain hosted in another organization's datacenter be seen as both internal and external?

C.5 Domain and request counts

The table shows counts of domains and of domains which have at least one internal/external request, and to how many external domains, primary domains or Disconnect domains the requests were sent. Following that, counts of all requests, internal requests and external requests. External requests matching Disconnect's blocking list are also shown. This table is mostly interesting for showing the scale of the data collection, in terms of the number of requests made and analyzed.
Counting all dataset variations, the table shows how many domains responded to the request, how many of those made at least one internal and at least one external request, and the total number of requests analyzed – split into internal requests, external requests and requests matching Disconnect's blocking list.
See also (C.6) and (C.12).

C.6 Requests per domain and ratios

Shown in the table below are ratios based on the request counts in Section C.5. First comes the ratio of domains with at least one internal/external request. Requests are then shown as an average count per domain, internal requests per domain with internal requests, and external requests per domain with external requests. External Disconnect requests are shown per domain with external requests. After that come ratios of requests: internal/external/Disconnect out of all requests, external over internal, and the Disconnect request ratio out of the external requests.
It is interesting to look at the number of requests, as it differs between datasets. Random TLD domains have a low average of 26-29 requests per domain, random Alexa top sites an average of 71-76 – but the very top of the Alexa top list comes in at 83-105. The smaller Reach50 dataset reaches 87, and the even smaller .SE Health Status media dataset sets the record at 197 requests per domain!

C.7 Insecure versus secure resources

Using HTTPS to secure the connection between a site and its users is considered an effective way to avoid prying eyes on the otherwise technically quite open and insecure internet. Sites which handle sensitive information, such as e-commerce shops, online payment providers and of course banks, often tout being secure to use – and they have strong financial incentives to provide a service that is (or at least comes across as) trustworthy. As browsers will warn users if a site secured with HTTPS loads resources over non-HTTPS connections, site developers have to make sure each and every request is secure to avoid being labeled untrustworthy. This also applies to third-party services, which have to provide HTTPS in order to be able to continue serving sites making the switch to a fully secured experience.
One of the concerns with mixing in HTTP on an HTTPS site is that an attacker can use traffic sniffers to get hold of sensitive information leaking out through HTTP, or mount man-in-the-middle attacks on several kinds of resources to insert malicious code, even though the site is supposed to be protected.
The following table shows to what extent sites manage to take full advantage of HTTPS, and to which extent they fail in requesting either internal or external resources. Note that HTTP domains that redirect to HTTPS (C.8) right away and/or only load HTTPS resources are shown as fully secure, as the analysis excludes the origin request – although in general an initial request to HTTP can potentially nullify all subsequent security measures.
Figure C.4 shows an overview of ratio of domains with strictly secure requests, mixed security or strictly insecure requests (x axis) per dataset.
While the technology has been around for a long time, it does not seem as if very many sites actually use HTTPS. Even origin sites that respond to HTTPS requests seem to either redirect to an HTTP site (C.8), or load at least some of their resources over non-HTTPS connections. Typing an HTTPS address into the browser's address bar will actually only give full HTTPS security on 27-58% of the domains – a number where the random domains surprisingly beat the non-random ones.
The cumulative distribution of domains (y axis) with a certain ratio of secure resources (out of all requested resources) per domain (x axis) is shown in Figure C.5. The first marker in each dataset shows the ratio of domains with no secure resources at all. The second marker shows 99% secure resources, which marks the start of fully secure domains. The vertical difference between the two markers for each dataset shows the range of sites with mixed security.
We can see that HTTPS datasets have much better security than their HTTP counterparts. se.2014-07-10.random.100000-http-www has over 60% completely insecure domains, 30% mixed security and less than 10% domains with only secure resources. Comparing it with se.2014-07-10.random.100000-https-www, we see that the latter has less than 10% completely insecure domains, a bit more than 30% mixed security and over 55% domains with only secure resources. The ratios of completely insecure and completely secure domains have almost been reversed. We can also see that many domains in the HTTPS datasets have between 90% and 99% secure resources – around 25% of municipalities for example – which seems like a relatively small gap to close to get a completely secure site.
Why is adoption lower for top sites? As high-traffic sites they might have a high system load, and since HTTPS requires some extra processing and data exchange, they might have deferred it until the security is really needed – such as when passwords or financial information are entered. Strict HTTPS performance concerns were dismissed by Google engineers in 2010 – and Google has since implemented HTTPS as an alternative for most and the default for some services. HTTPS is also a positive “signal” in Google's PageRank algorithm, meaning the use of HTTPS will lead to a better position in Google's search results. There are other effects on network services though, such as a reduced ability for ISPs to cache results closer to network edges, or for companies to easily inspect and filter traffic [37].
Another concern is that curated domain lists seem to exhibit even lower HTTPS adoption than both random and top domains – even though the domains have been selected because they are deemed important to the public in some way.

C.8 HTTP, HTTPS and redirects

The table shows the number of domains, the number of domains with redirect responses to the origin request, the ratio of domains with redirects and the average length of the redirect chain per domain with redirects. Shown next are the ratios of redirected domains with strictly internal redirect URLs, with a mix of internal and external redirect URLs, and with strictly external redirect URLs. The same goes for insecure, mixed security and strictly secure redirects – plus a column with the ratio of domains where the final redirect is to a secure URL. The last column shows the ratio of domains with mismatched redirect URLs, without a subsequently requested URL.
Figure C.6 shows the distribution of strictly secure, mixed security and strictly insecure redirects (x axis) per dataset. The percentage of mismatched URLs is also shown. An additional x mark shows the percentage of final redirects to secure URLs, as not all mixed redirects lead to secure sites – sometimes they lead back to an insecure site.
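The classification of redirect chains can be illustrated with a small Python sketch. It assumes the redirect chain is available as a list of fully resolved, absolute URLs following the origin request, uses the third-party tldextract library as an assumed helper, and approximates "internal" as staying on the same primary domain; the real analysis works on HAR data and handles mismatched URLs separately.

    # Sketch: classify a redirect chain (the ordered redirect target URLs that
    # follow the origin request). Assumes fully resolved, absolute URLs, and
    # approximates "internal" as staying on the same primary domain.

    import tldextract
    from urllib.parse import urlparse

    def primary_domain(url):
        ext = tldextract.extract(urlparse(url).hostname or "")
        return ext.domain + "." + ext.suffix

    def classify_chain(origin_url, redirect_urls):
        origin = primary_domain(origin_url)
        internal = [primary_domain(u) == origin for u in redirect_urls]
        secure = [urlparse(u).scheme == "https" for u in redirect_urls]

        def bucket(flags, all_true, all_false):
            if all(flags):
                return all_true
            if not any(flags):
                return all_false
            return "mixed"

        return {
            "redirects": len(redirect_urls),
            "scope": bucket(internal, "strictly internal", "strictly external"),
            "security": bucket(secure, "strictly secure", "strictly insecure"),
            "final redirect secure": bool(secure) and secure[-1],
        }

    # Hypothetical example: an HTTP origin redirecting via HTTP to an HTTPS www URL.
    print(classify_chain("http://example.se/",
                         ["http://www.example.se/", "https://www.example.se/"]))
    # -> strictly internal, mixed security, final redirect secure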
Most domains that redirect make a single redirect, but a few sites make more than one; the average is around 1.23 redirects. With 49-71% of top sites in the HTTP variations redirecting, but only 30-36% of the HTTP-www variations, it seems that the www subdomain is still in use, rather than no subdomain at all. For random domains the numbers are a bit more even, with 24-31% of both HTTP and HTTP-www variations redirecting. More detailed data than in the below table clearly shows that domains generally pick either www or no subdomain, and redirect from one to the other; the www subdomain is most common as the final destination, especially for the HTTPS variations.
A difference between top and random domains is that top domains keep their redirects mostly internal, while random domains redirect elsewhere. Curated sites are in between, with a portion of the HTTP-www variations redirecting externally. This seems to suggest that top sites have content of their own, while random domains to a larger extent do not – 39-60% of redirected domains end up being aliases for a domain other than the one the user would have typed in.
When it comes to security, it is no surprise to see HTTP variations redirect mostly to other insecure URLs. The extent of domains implementing HTTPS but then redirecting to HTTP is more surprising – only 23-35% of top sites will let you stay on a fully secure connection – and that number excludes mixed resource security in later stages of the browsing experience (C.7). An additional few percent of HTTPS domains even mix security during redirects and take a detour over an insecure URL in the process of redirecting the user to the final, secure destination – meaning that even if you type in a secure address and end up on a secure address, you may have passed through something completely insecure along the way. While this might be considered a corner case, it nullifies some of the security measures put in place by HTTPS, and could leak for example a carelessly set session identifier.
It is a surprise to see that financial institutions in the .SE Health Status domain lists do not take advantage of HTTPS by redirecting users who enter through HTTP – at less than 20% fully secure redirects and a surprising ratio of mixed redirects, they are at about the same level as general Swedish top sites. Even more surprising is that they elect to redirect users away from HTTPS-enabled pages to insecure variants for 60% of the HTTPS-www domains. Counties, higher education and media are worse yet – none of them make any effort to redirect users from HTTP to HTTPS.
Analyzing domains redirecting to secure URLs would be a good candidate for a refined selection (7.5.6). Do the redirects help, or is the net result that insecure resources are loaded anyway?
The mismatched redirect and request URLs are in part due to the HAR standard not defining recorded redirect URLs as strictly absolute, and phantomjs returning unparsed/unresolved URLs when a redirect is initially detected in an HTTP response (5.8.1). Resolving redirect URLs outside of the browser means not all contexts and rules are considered, thus leading to errors. The thesis code, the phantomjs software and the HAR standard could all be improved upon.
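The following minimal sketch illustrates the kind of out-of-browser URL resolution involved, resolving a possibly relative Location value against the request URL; as noted above, such post-processing cannot account for every in-browser rule. The example URLs are hypothetical.

    # Sketch: resolve a possibly relative Location header against the request URL,
    # the kind of post-processing needed when the HAR contains unresolved redirect URLs.

    from urllib.parse import urljoin

    def resolve_redirect(request_url, location_header):
        # urljoin handles absolute, host-relative and path-relative Location values.
        return urljoin(request_url, location_header)

    # Hypothetical examples.
    print(resolve_redirect("http://example.se/old/page", "/new/page"))
    # -> http://example.se/new/page
    print(resolve_redirect("http://example.se/old/page", "https://www.example.se/"))
    # -> https://www.example.se/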

Content type group coverage

Bytes sent from the server to the browser generally have an associated type, so the browser can parse and use them properly. The types can be grouped (A.5.3); incorrect or unknown types that did not match a group are shown as (null) below. The differences in what different types of resources can do make the distribution interesting. Images
4
and text
5
loaded by a browser generally provide no additional way to load further resources, while html, scripts and styles do. While data resources can trigger the download of additional resources based on the logic that consumes the data, they still require another type of resource to be present to do so.
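As an illustration of the grouping, the sketch below maps a reported MIME type to a coarse group. The patterns are illustrative assumptions only; the actual group definitions are given in A.5.3.

    # Illustrative sketch: map a reported MIME type to a coarse content type group.
    # The actual group definitions live in the thesis code (A.5.3); the patterns
    # below are assumptions for the sake of the example.

    import re

    GROUP_PATTERNS = [
        ("html", r"^text/html|^application/xhtml"),
        ("script", r"javascript|ecmascript"),
        ("style", r"^text/css"),
        ("image", r"^image/"),
        ("font", r"font|^application/vnd\.ms-fontobject"),
        ("text", r"^text/plain"),
        ("data", r"^application/json|xml"),
        ("object", r"flash|shockwave|^application/x-silverlight"),
    ]

    def content_type_group(mime_type):
        if not mime_type:
            return "(null)"
        for group, pattern in GROUP_PATTERNS:
            if re.search(pattern, mime_type, re.IGNORECASE):
                return group
        return "(null)"

    print(content_type_group("text/html; charset=utf-8"))   # html
    print(content_type_group("application/x-unknown"))      # (null)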
Objects and external documents can also load additional resources, but the use of those types of resources has been very low in the extracted data. There might be several reasons, but the fact that the tests were run in a headless browser without additional plugins installed is probably the biggest one in this case. An additional reason might be the adoption of HTML5 and client side javascript instead of Flash for visual, dynamic material and animations. This evolution has been fueled by Apple's resistance to supporting Flash on their handheld devices
6
.
Note that web fonts have fairly low numbers here, but that they can also be served as styles which dynamically load additional font URLs. This is how Google Fonts does it, using the fonts.googleapis.com domain (5.4) to serve styles and gstatic.com to serve fonts dynamically selected to match the browser's web font compatibility level, which could then be factored into these numbers (C.11.2).

C.9.1 Origin

Practically all successful origin requests result in an html response. The range is 84-100% html, with the difference being seemingly misconfigured responses, part of which are redirects without actual content (C.8).

C.9.2 Internal

The table below shows internal request ratios for domains with at least one internal request, excluding the origin request.

C.9.3 External

The table below shows that external resources from each group enjoy almost the same coverage as their internal counterparts. Among non-zone datasets, scripts often reach above 90% coverage, showing that active and popular web pages contain a lot of external dynamic material. Images, while not dynamic, as well as styles and html, are also popular to load externally.
As some external file types can trigger further HTTP requests, it might be possible to build a hierarchy of requests. The easiest way is to look at resources loaded in HTML <iframe> (or the now less popular <frame>) elements, as they will have an HTTP referer header set to the URL of the frame. Scripts and styles requesting other resources directly, without the use of frames, cannot be detected as a strict hierarchy. With the large number of requests made from different sites, URLs can be cleaned up (for example by removing unique identifiers or looking at domain parts only), connected in a graph and analyzed for similarities.
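A first step towards such a hierarchy is sketched below in Python: grouping recorded HAR requests under the URL found in their referer header. The file name is hypothetical, and only frame-style relationships are captured.

    # Sketch: group requests under the page or frame that referred them, using the
    # HTTP referer header recorded in a HAR file. Only frame-style relationships
    # are captured; scripts and styles requesting resources directly are not.

    import json
    from collections import defaultdict

    def referer_tree(har_path):
        with open(har_path) as f:
            har = json.load(f)

        children = defaultdict(list)
        for entry in har["log"]["entries"]:
            url = entry["request"]["url"]
            referer = next((h["value"] for h in entry["request"]["headers"]
                            if h["name"].lower() == "referer"), None)
            children[referer].append(url)
        return children

    # Hypothetical usage: requests with no referer end up under the None key.
    # for parent, urls in referer_tree("example.se.har").items():
    #     print(parent, "->", len(urls), "requests")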

Public suffix coverage

Resources served from external URLs may well come from other public suffixes; here they have been grouped by TLD. The connections between datasets and TLDs are interesting; .se datasets load more from .se domains than others do, and the equivalent holds for .dk datasets. We can also see that, despite Alexa's top list being international, nearly 19% of its domains load resources from .se domains. This points towards those sites being aware of the country of origin of the request, leading to localized content being served. It is also evident that the .com TLD is the most widespread for external resources – it beats same-TLD coverage.
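A sketch of the grouping step is shown below. It uses the tldextract library as an assumed stand-in for the thesis code's own public suffix handling; the example URLs are hypothetical.

    # Sketch: count external resource URLs per public suffix (TLD).
    # Uses the tldextract library as an assumption; the thesis code has its own
    # public suffix handling.

    from collections import Counter
    from urllib.parse import urlparse

    import tldextract

    def suffix_counts(external_urls):
        counts = Counter()
        for url in external_urls:
            suffix = tldextract.extract(urlparse(url).hostname or "").suffix
            counts[suffix or "(unknown)"] += 1
        return counts

    # Hypothetical example.
    print(suffix_counts([
        "https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js",
        "http://static.example.se/logo.png",
        "http://cdn.example.co.uk/style.css",
    ]))
    # -> Counter({'com': 1, 'se': 1, 'co.uk': 1})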
Future downloads could be performed from other countries to measure location-dependent service coverage.

Disconnect's blocking list matches

The table below shows coverage for requested URLs' domains matching Disconnect's blocking list (B.2.3), for internal and external resources separately as well as for all resources together. Coverage per domain for top domains is shown later, in section C.11.2. As the blocking list contains details about which category and organization each domain belongs to (A.3), these have been used to display aggregates per category (C.11.3) and organization (C.11.4) as well.
Figure C.7 shows the CDF of the percentage of domains (y axis) which have a certain ratio of all requests matching Disconnect's blocking list (x axis). The leftmost marker for each dataset shows 0% Disconnect matches, and the rightmost marker shows 99% matches.
About 20% of Alexa's top sites make more than 50% of their requests to known tracker domains; similarly, 50% of domains from the same datasets load more than 20% of their resources from known tracker domains. Other datasets rely less on this kind of external resource, with around 10% or less using 50% or more tracker resources – but again, the number of requests is less interesting than the number of organizations potentially collecting the leaked user traffic data.
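The matching itself is straightforward once the blocking list has been flattened. The sketch below assumes a pre-flattened set of tracker (primary) domains and pre-extracted primary domains for the requests, and computes the per-domain ratio of matching requests plotted in Figure C.7.

    # Sketch: per-domain ratio of requests matching Disconnect's blocking list.
    # Assumes the blocking list has been flattened to a set of tracker (primary)
    # domains, and that each requested URL's primary domain has been extracted.

    def disconnect_match_ratio(requested_primary_domains, tracker_domains):
        """Ratio of requests whose primary domain is in the blocking list."""
        if not requested_primary_domains:
            return 0.0
        matches = sum(1 for d in requested_primary_domains if d in tracker_domains)
        return matches / len(requested_primary_domains)

    # Hypothetical example.
    tracker_domains = {"google-analytics.com", "doubleclick.net", "facebook.com"}
    requests = ["example.se", "google-analytics.com", "cdn.example.com", "doubleclick.net"]
    print(disconnect_match_ratio(requests, tracker_domains))  # 0.5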

C.11.1 Domain and organization counts

The below table shows the number of domains in the dataset and the number of requests classified as trackers by Disconnect's blocking list (see also C.5 for other types of requests), followed by aggregate counts of domains, organizations and categories for these requests. Next we see Disconnect requests per domain in the dataset, and the same number per Disconnect organization. The last section contains the ratios out of the total number of domains and organizations (see A.3).
Figure C.8 shows the cumulative distribution of domains (y axis) with requests to between 0 and 99 organizations (x axis). Domains with 100 or more organizations are shown in the rightmost segment of the graph.
The .SE Health Status' media category has over 40 tracker requests on average – the highest number of tracker requests per domain, much higher than for example public authorities and random zone domains at 5-7, with top domains having 17-32 requests (C.5). What is more interesting than request counts is the number of tracker organizations per domain – while more information may be leaked as the number of requests to an organization increases, the same amount could potentially leak with just one or two carefully composed requests to each organization.
While it is hard to compare the coverage of Disconnect's organizations across datasets, as a larger number of domains increases the chance of additional organizations being represented, it seems global top sites use a broad range of trackers, covering a large share of the known organizations. Looking at Figure C.8 we see that top sites make requests to more organizations than random domains do. Over 40% of random .se HTTP-www domains have no known trackers, about as many have one tracker, and the top 1% have six or more. This is compared to 32% of Alexa's top sites sharing information with more than five organizations, 10% with 13 or more, and 1% sharing information with at least 48 organizations. There are even a couple of domains among the Alexa sites which have more than 75 recognized tracker organizations just on the front page – a clear example of how impossible it is to tell where your browsing habits can end up, and even more so how they are used in a second stage. It is also clear that the non-zone domains have as much tracking when using a secure connection as when using an insecure one, while the difference for zone domains can be explained by their very low HTTPS usage (C.2).

C.11.2 Top domains

A selection of domains and their coverage across different datasets. Worth noting is that many recognizable domains belong to organizations which have multiple domains (A.3.3), so their total organization coverage is higher. This first table excludes top Google domains, which dominate the top list for all datasets.

General

Facebook and Twitter have their like and tweet buttons, which are often talked about in terms of social sharing, but it seems that their coverage is comparatively low except for the .SE Health Status' media category. AddThis' service includes some of the functionality of Facebook's and Twitter's buttons, among other social sharing sites, but does not have quite the same coverage. Unfortunately, Facebook and Twitter are both in Disconnect's special Disconnect category (A.3.7), and are not reflected in the social category aggregates (C.11.3).
The domain cloudfront.net, belonging to Amazon, hosts services, files and data on behalf of other organizations on subdomains (A.2). While it is clear that Amazon can analyze and use traffic information from hosted services, the data itself can be assumed to belong to the hosted organizations. This is a strong reason why the domain is listed as content in Disconnect's blocking list – because of the variety of services hosted on subdomains, it cannot be blocked as, for example, advertisement even if advertisement is being hosted there. This can be seen as a flaw in Disconnect's way of blocking, even though listing individual subdomains in other categories might be a way to override the content bypass – but as these non-branded domains can be seen as throw-away domains, it can become a game of cat and mouse (5.2).
One thing to look closer at is cloudfront.net's subdomains, to analyze their content and which sites use them. Are they used as CDNs local to a single domain, or do third-party services host content there? If so, what kind of content? Are there similar patterns for Amazon AWS and other cloud services? See also public suffixes owned by private companies (A.2).

Google

Google has many domains in Disconnect's blocking list (A.3.2), and many in the top results. One of the reasons is that they have several popular services with content (5.4), but the top Disconnect-recognized domain in most datasets is google-analytics.com, with googleapis.com and www.google.com as the top result in the others.
Here we see the coverage of DoubleClick, one of Google's ad services. Unfortunately it has not been included in Disconnect's advertising category, which skews the numbers for that category (C.11.3); neither has Google Analytics been put in the analytics category. Google Analytics is served from its own domain, but Google is making a push to move site owners to the DoubleClick domain, where they have replicated the Google Analytics engine
7
. The reason given to site owners using Google Analytics is that the DoubleClick tracker offers implementors additional rich audience information such as age, gender and interests on top of their current technical analysis from Google Analytics. The underlying reason is possibly that Google wants DoubleClick, which brings a lot of income, to have greater coverage across websites – greater coverage meaning more and higher quality visitor information. While Google owns both services, site owner/visitor/usage policies
8
might prevent Google from cross-matching information between them without site owner consent. Cookies set for the doubleclick.net domain should be more valuable to Google than those set for google-analytics.com, as they translate into more targeted ads [16].

C.11.3 Tracker categories

Disconnect's categories and their coverage across different datasets are shown in the table below, as well as the coverage of domains where any (at least one) external resource matches Disconnect's blocking list. As mentioned earlier, the special Disconnect category contains major Facebook, Google and Twitter domains – domains which could also have been listed as advertising, analytics or social domains (A.3.7). The content category, which bypasses Disconnect's blocking by default, can for this reason be seen as the most accurate in terms of coverage, as domains have presumably been added as content in a manual process of whitelisting (A.3.6).
Figure C.9 shows each category's coverage (x axis) per dataset. The grey bar in the background shows the ratio of domains with any (at least one) external request matching Disconnect's blocking list for each dataset; it effectively shows the union of the coverage of all categories per domain. In some cases a single category has the same coverage as the union.
The fact that the highest coverage is connected with the Disconnect category explains the low coverage of advertising, analytics and social. If, for example, the two domains facebook.com and twitter.com were included in the social category, coverage would be 35-56 percentage points higher for top domains and 9-11 percentage points higher for random domains (C.11.2) – and more accurate. The same goes for advertising and doubleclick.net (30-50 and 7-36 percentage points) and for analytics and google-analytics.com (63-76 and 24-32 percentage points) (C.11.2).
What is surprising is the high coverage of content from known trackers. While the Disconnect category has the highest coverage overall, the content category is the second largest – significantly larger than the advertising, analytics and social categories in most datasets. While a large portion of this is due to extensive usage of Google's hosted services (C.11.2), all organizations with only content domains, as well as those with “mixed” domains (Table 3.3), are let through to 67-78% of top domains and 38-56% of random domains. Mixing advertisement, or in this case tracking in general, with content has previously been discussed as a way for organizations to avoid in-browser blocking (5.2) – and it seems prevalent.
The current analysis performed for this thesis is built in such a way that the Disconnect blocking list used for matching can easily be replaced with an updated version. This also opens up the possibility of using a locally modified blocking list, re-categorizing each of the Disconnect category's domains as either advertising, analytics or social. The per-organization aggregate analysis would still produce the same numbers (C.11.4).
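A minimal sketch of such a local re-categorization is shown below. It assumes the blocking list has already been flattened to a mapping from domain to category, which differs from the nested format of the actual list, and the chosen domains and target categories are only examples.

    # Sketch: locally re-categorize domains from the special Disconnect category
    # before running the per-category aggregation. Works on an assumed flat
    # mapping from tracker domain to (lowercase) category name.

    RECATEGORIZE = {
        "facebook.com": "social",
        "twitter.com": "social",
        "doubleclick.net": "advertising",
        "google-analytics.com": "analytics",
    }

    def recategorize(domain_to_category):
        updated = dict(domain_to_category)
        for domain, new_category in RECATEGORIZE.items():
            if updated.get(domain) == "disconnect":
                updated[domain] = new_category
        return updated

    # Hypothetical example.
    print(recategorize({"facebook.com": "disconnect", "adnetwork.example": "advertising"}))
    # -> {'facebook.com': 'social', 'adnetwork.example': 'advertising'}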

C.11.4 Top organizations

A selection of organizations and their coverage across different datasets. Facebook and Twitter are often touted as the big social network sites, but in terms of domain coverage they are far behind Google.
Figure C.10 shows the coverage of these three organizations (x axis) per dataset. As these organizations, and most of their domains, are included in the Disconnect category of Disconnect's blocking list (A.3.7), an x has been added to show their collective Disconnect category coverage (C.11.3).
Google is very popular: all Alexa and most Swedish curated datasets have a coverage above 80%, many closer to 90%. Random domains have a lower reliance on Google at 47-62% – still about half of all domains. Apart from the .SE Health Status list of Swedish media domains, Facebook does not reach 40% in top or curated domains. Facebook coverage on random zone domains is 6-10%, which is also much lower than Google's numbers. Twitter generally has even lower coverage, at about half of Facebook's on average. As can be seen, Google alone oftentimes has higher coverage than the domains in the Disconnect category – which shows that Google's content domains are in use (A.3.3). In fact, at around 90% of the total tracker coverage, Google's coverage approaches that of the union of all other known trackers.

Undetected external domains

While all external resources are considered trackers, parts of this thesis have concentrated on using Disconnect.me's blocking list for tracker verification. But how effective is that list of known and recognized tracker domains across the datasets? Below is a comparison with the unique external primary domain count in each dataset; while this count is lower than the total number of external domains, as it excludes subdomains (A.2), it matches Disconnect's style of blocking better.
The table below shows the number of unique external domains requested in each dataset, the number of unique external primary domains, and the number of unique domains marked as trackers in Disconnect's blocking list. The differences between the number of tracker domains detected by Disconnect and the number of external and primary domains, respectively, are shown next. The next column group shows the ratio of detected Disconnect domains over all external domains. Lastly, the ratios of domains detected and undetected by Disconnect over the number of primary domains are shown.
Figure C.11 shows the ratio of detected and undetected primary domains (x axis) per dataset.
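The quantities in the table can be computed with a few set operations, as sketched below; primary domain extraction is assumed to have been done beforehand, and the example input is hypothetical.

    # Sketch: share of unique external primary domains detected by Disconnect's
    # blocking list, per dataset. Assumes primary domains are already extracted.

    def detection_ratios(external_primary_domains, tracker_domains):
        unique = set(external_primary_domains)
        detected = unique & set(tracker_domains)
        undetected = unique - detected
        return {
            "unique primary domains": len(unique),
            "detected": len(detected),
            "undetected": len(undetected),
            "detected ratio": len(detected) / len(unique) if unique else 0.0,
        }

    # Hypothetical example.
    print(detection_ratios(
        ["cdn.example", "google-analytics.com", "doubleclick.net", "img.example"],
        {"google-analytics.com", "doubleclick.net"},
    ))  # detected ratio 0.5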
While some of the domains which have not been matched by Disconnect are private/internal CDNs, the fact that less than 10% of external domains are blocked in top website HTTP datasets is notable. The blocking results are also around 10% or lower for random domain HTTP datasets, but this seems to be connected to the number of domains in the dataset. Only 3% of the external primary domains in the .se 100k random domain HTTP dataset were detected. Smaller datasets, including HTTPS datasets with few reachable websites, have a higher detection rate at 30% or more.
Can a privacy tool using a fixed blacklist of domains to block be trusted – or can it only be trusted to be 10% effective? Regular expression based blocking, such as the EasyList rules used by AdBlock, might be more effective, as it can block resources by URL path separately from the URL domain name (7.5.1) – but it is no cure-all. It does seem as if the blacklist model needs to be improved – perhaps by using whitelisting instead of blacklisting. The question then becomes an issue of either cat and mouse (5.2) – if the whitelist is shared by many users – or convenience – if each user maintains their own whitelist. At the moment it seems convenience and blacklists are winning, at the cost of playing cat and mouse with third parties who end up being blocked.
While only aggregate numbers per dataset have been presented here, it would be interesting to use the full list of undetected primary domains to improve Disconnect's blocking list. While it is an endless endeavor, sorting by number of occurrences would at least give a hint as to which domains might be useful to block. The same list could be used to classify some of the domains as private/internal CDNs.

The end

You can find updated document versions, as well as source code and datasets online
9
. Thank you for reading this far – feedback would be very much appreciated!