We provide the data used in our recent large-scale analysis of third-party trackers on the web. We created an extractor that finds embedded third-party resources in HTML pages and ran it on the 3.5 billion webpages contained in the Common Crawl 2012 web crawl. We provide the resulting datasets here and refer the interested reader to our ICWSM paper for details about the extraction process.
The extracted data is provided under the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus. For further questions, feel free to contact me by email at sebastian.schelter(at)tu-berlin.de or on Twitter as @sscdotopen.
Data | Size | Description | Line Schema |
---|---|---|---|
Pay-Level Domain Index | 330M | Index of 41,192,060 domains used in the analysis | domain[TAB]domain_id |
Bi-partite Third-Party Network | 647M | 140,613,762 embeddings of 12,756,244 third-party domains in 41,192,060 domains | domain_id[TAB]thirdparty_domain_id |
Bi-partite Tracking Network | 138M | 36,982,655 embeddings of 355 tracking domains in 41,192,060 domains | domain_id[TAB]tracking_domain_id |
Labeled Third-Parties | 80KB | 1,375 hand-labeled third-parties | domain[TAB]registration_org[TAB]registration_country[TAB]num_embeddings[TAB]num_embeddings_javascript[TAB]num_embeddings_iframe[TAB]num_embeddings_image[TAB]num_embeddings_link[TAB]category[TAB]company |
DMOZ tags | 18KB | 450 categorized DMOZ tags | tag |
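
The line schemas above are plain tab-separated records, so the files can be processed as streams. The following is a minimal sketch, not part of the original tooling, that counts how many domains embed each tracking domain and then resolves the most frequent IDs to names. It makes several assumptions: the files are uncompressed text, `tracking_domain_id` refers to IDs in the Pay-Level Domain Index, and the function names (`count_embeddings`, `resolve_names`, `top_trackers`) are hypothetical. IDs are kept as strings so that nothing is assumed about their numeric format.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_embeddings(network_lines: Iterable[str]) -> Counter:
    """Count embeddings per tracker from `domain_id[TAB]tracking_domain_id` lines."""
    counts: Counter = Counter()
    for line in network_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        _domain_id, tracker_id = line.split("\t")
        counts[tracker_id] += 1
    return counts


def resolve_names(index_lines: Iterable[str], wanted_ids: Iterable[str]) -> Dict[str, str]:
    """Scan `domain[TAB]domain_id` lines and keep only the requested IDs.

    Streaming the index avoids holding all domains in memory at once.
    """
    wanted = set(wanted_ids)
    names: Dict[str, str] = {}
    for line in index_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        domain, domain_id = line.split("\t")
        if domain_id in wanted:
            names[domain_id] = domain
    return names


def top_trackers(
    network_lines: Iterable[str], index_lines: Iterable[str], n: int = 10
) -> List[Tuple[str, int]]:
    """Return the n most frequently embedded trackers as (name, count) pairs.

    Assumption: tracker IDs share the ID space of the domain index.
    IDs missing from the index are reported as the raw ID string.
    """
    most_common = count_embeddings(network_lines).most_common(n)
    names = resolve_names(index_lines, (tid for tid, _ in most_common))
    return [(names.get(tid, tid), count) for tid, count in most_common]


# Tiny in-memory example with made-up records in the documented schemas.
network = ["0\t1\n", "2\t1\n", "0\t3\n"]
index = ["example.com\t0\n", "tracker.net\t1\n", "other.org\t2\n"]
print(top_trackers(network, index, n=1))  # prints [('tracker.net', 2)]
```

With the real data, the two iterables would be open file handles, for example `open(path, encoding="utf-8")` for each downloaded file; the actual file names are not given here, so any path is a placeholder.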
@inproceedings{Schelter2016tracking,
  title={Tracking the Trackers: A Large-Scale Analysis of Embedded Web Trackers},
  author={Schelter, Sebastian and Kunegis, J{\'e}r{\^o}me},
  booktitle={Tenth International AAAI Conference on Web and Social Media},
  year={2016}
}