Tracking the Trackers

We provide the data used in our recent large-scale analysis of third-party trackers on the web. We created an extractor that finds embedded third-party resources in HTML pages and ran it on the 3.5 billion webpages contained in the Common Crawl 2012 web crawl. We provide the resulting datasets here and refer the interested reader to our ICWSM paper for details about the extraction process.

The extracted data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus. For further questions, feel free to contact me by email at sebastian.schelter(at)tu-berlin.de or on Twitter as @sscdotopen.

Datasets

Data	Size	Description	Line Schema
Pay-Level Domain Index	330M	Index of 41,192,060 domains used in the analysis	domain[TAB]domain_id
Bi-partite Third-Party Network	647M	140,613,762 embeddings of 12,756,244 third-party domains in 41,192,060 domains	domain_id[TAB]thirdparty_domain_id
Bi-partite Tracking Network	138M	36,982,655 embeddings of 355 tracking domains in 41,192,060 domains	domain_id[TAB]tracking_domain_id
Labeled Third-Parties	80KB	1375 hand-labeled third-parties	domain[TAB]registration_org[TAB]registration_country[TAB]num_embeddings[TAB]num_embeddings_javascript[TAB]num_embeddings_iframe[TAB]num_embeddings_image[TAB]num_embeddings_link[TAB]category[TAB]company
DMOZ tags	18KB	450 categorized DMOZ tags	tag
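
The line schemas above can be read with a few lines of Python. The following is a minimal sketch, not part of the official tooling: the function names are hypothetical, ids are kept as plain strings because the table does not state their type, and the sketch assumes that the files are uncompressed, tab-separated text with one record per line.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_domain_index(lines: Iterable[str]) -> Dict[str, str]:
    """Parse `domain[TAB]domain_id` lines into a domain_id -> domain mapping."""
    index: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        domain, domain_id = line.split("\t")
        index[domain_id] = domain
    return index


def read_edges(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Parse `domain_id[TAB]other_id` lines of a bi-partite network file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        domain_id, other_id = line.split("\t")
        yield domain_id, other_id


def count_embeddings(edges: Iterable[Tuple[str, str]]) -> Counter:
    """Count how many embedding edges point at each third-party or tracking id."""
    return Counter(other_id for _, other_id in edges)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real files (made-up values).
    sample_index = ["example.com\t0\n", "news.example.org\t1\n"]
    sample_edges = ["0\t7\n", "1\t7\n", "1\t9\n"]

    index = read_domain_index(sample_index)
    counts = count_embeddings(read_edges(sample_edges))
    for other_id, n in counts.most_common():
        print(other_id, n)
```

For the real data, an open file handle can be passed in place of the sample lists, for example `read_edges(open(path, encoding="utf-8"))`, where `path` points to the downloaded network file. Because `read_edges` is a generator, the larger network files can be streamed without loading them into memory at once.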

Please cite the following publication if you work with the dataset:

@inproceedings{Schelter2016tracking,
  title={Tracking the Trackers: A Large-Scale Analysis of Embedded Web Trackers},
  author={Schelter, Sebastian and Kunegis, J{\'e}r{\^o}me},
  booktitle={Tenth International AAAI Conference on Web and Social Media},
  year={2016}
}