Tracking the Trackers

We provide the data used in our recent large-scale analysis of third-party trackers on the web. We created an extractor that finds embedded third-party resources in HTML pages and ran it on the 3.5 billion web pages contained in the Common Crawl 2012 web crawl. We provide the resulting datasets here and refer the interested reader to our ICWSM paper for details about the extraction process.

The extracted data is provided under the same terms of use, disclaimer of warranties, and limitation of liabilities that apply to the Common Crawl corpus. For further questions, feel free to contact me by email at sebastian.schelter(at)tu-berlin.de or on Twitter as @sscdotopen.

Datasets

Data | Size | Description | Line Schema
Pay-Level Domain Index | 330M | Index of 41,192,060 domains used in the analysis | domain[TAB]domain_id
Bi-partite Third-Party Network | 647M | 140,613,762 embeddings of 12,756,244 third-party domains in 41,192,060 domains | domain_id[TAB]thirdparty_domain_id
Bi-partite Tracking Network | 138M | 36,982,655 embeddings of 355 tracking domains in 41,192,060 domains | domain_id[TAB]tracking_domain_id
Labeled Third-Parties | 80KB | 1,375 hand-labeled third-parties | domain[TAB]registration_org[TAB]registration_country[TAB]num_embeddings[TAB]num_embeddings_javascript[TAB]num_embeddings_iframe[TAB]num_embeddings_image[TAB]num_embeddings_link[TAB]category[TAB]company
DMOZ tags | 18KB | 450 categorized DMOZ tags | tag
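
The line schemas listed above can be read with a few lines of Python. The following is a minimal sketch, not the code used in the paper. It assumes the files are plain, uncompressed, tab-separated text, and that the ids in the second column of the network files can be resolved through the Pay-Level Domain Index; neither point is confirmed here, so ids that cannot be resolved are printed as-is. The function names and the file names in the closing comment are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Column names of the Labeled Third-Parties file, in schema order.
LABELED_FIELDS = (
    "domain", "registration_org", "registration_country",
    "num_embeddings", "num_embeddings_javascript",
    "num_embeddings_iframe", "num_embeddings_image",
    "num_embeddings_link", "category", "company",
)


def load_domain_index(lines: Iterable[str]) -> Dict[int, str]:
    """Build a domain_id -> domain map from domain[TAB]domain_id lines."""
    index: Dict[int, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        domain, domain_id = line.split("\t")
        index[int(domain_id)] = domain
    return index


def count_embeddings(lines: Iterable[str]) -> Counter:
    """Count, per embedded domain id, how many domains embed it.

    Works for both network files, whose lines have the form
    domain_id[TAB]thirdparty_domain_id or domain_id[TAB]tracking_domain_id.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _domain_id, embedded_id = line.split("\t")
        counts[int(embedded_id)] += 1
    return counts


def top_embedded(index: Dict[int, str], counts: Counter,
                 n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequently embedded domains with their counts.

    Assumption: embedded ids share the id space of the domain index.
    Ids missing from the index are returned as strings instead.
    """
    return [(index.get(i, str(i)), c) for i, c in counts.most_common(n)]


def parse_labeled_line(line: str) -> Dict[str, str]:
    """Split one Labeled Third-Parties line into a field -> value dict.

    All values are kept as strings; no numeric conversion is attempted.
    """
    values = line.rstrip("\n").split("\t")
    return dict(zip(LABELED_FIELDS, values))


# Hypothetical usage (file names are placeholders, not the real ones):
#   with open("domain_index.tsv", encoding="utf-8") as f:
#       index = load_domain_index(f)
#   with open("tracking_network.tsv", encoding="utf-8") as f:
#       counts = count_embeddings(f)
#   print(top_embedded(index, counts, n=10))
```

Both loaders stream the input line by line, but the index itself is held in memory as a dictionary; with about 41 million entries this may need several gigabytes of RAM, so a more compact structure could be preferable on small machines.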

Please cite the following publication if you work with the dataset:
@inproceedings{Schelter2016tracking,
  title={Tracking the Trackers: A Large-Scale Analysis of Embedded Web Trackers},
  author={Schelter, Sebastian and Kunegis, J{\'e}r{\^o}me},
  booktitle={Tenth International AAAI Conference on Web and Social Media},
  year={2016}
}