2022 • In 2022 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), (The 1st International Workshop on Ethics in Computer Security (EthiCS 2022)), p. 554-561
Challenges of Protecting Confidentiality in Social Media Data and Their Ethical Import.pdf
Keywords :
GDPR compliance; Named Entity Recognition; Pseudonymization; Research ethics; Security measures; Cyber security; Open source package; Personal and confidential information; Research purpose; Social media data; Computer Networks and Communications; Hardware and Architecture; Information Systems; Information Systems and Management; Safety, Risk, Reliability and Quality
Abstract :
[en] This article discusses the challenges of pseudonymizing unstructured, noisy social media data for cybersecurity research purposes and presents an open-source package developed to pseudonymize personal and confidential information (i.e., personal names, companies, and locations) contained in such data. Its goal is to facilitate compliance with EU data protection obligations and the upholding of research ethics principles like the respect for the autonomy, privacy and dignity of research participants, the social responsibility of researchers, and scientific integrity. We discuss the limitations of the pseudonymizer package, their ethical import, and the additional security measures that should be adopted to protect the confidentiality of the data.
Precision for document type :
Review article
Disciplines :
Computer science
Author, co-author :
Rossi, Arianna; SnT, University of Luxembourg, Luxembourg, Luxembourg
Arenas, Monica P.; SnT, University of Luxembourg, Luxembourg, Luxembourg
Kocyigit, Emre; SnT, University of Luxembourg, Luxembourg, Luxembourg
This work has been partially supported by the Luxembourg National Research Fund (FNR): “Deceptive Patterns Online (Decepticon)” IS/14717072 and “No more Fakes (NOFAKES)” PoC20/15299666/NOFAKES-PoC.
Commentary :
"We developed a Pseudonymizer Python package that works on English textual data and released it under a GPL v2 license. This library works with structured and unstructured data, but in the case of unstructured data, and especially highly noisy data such as social media data, the challenge is greater and thus the performance is knowingly less accurate. This software has three independent functionalities applied to different kinds of data: Companies, Geolocations, and Personal Names."
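The commentary above does not show the package's interface, so the following is not its API. It is only a minimal sketch of what consistent pseudonymization of the three entity types could look like, assuming the entities have already been detected (for example by a named entity recognition step) and are supplied as a dictionary. The names `SECRET_KEY`, `pseudonym`, and `pseudonymize`, the labels, and the keyed-hash scheme are all illustrative assumptions.

```python
import hashlib
import hmac
import re

# Hypothetical secret key; in practice it would be stored apart from the data,
# since anyone holding it can recompute pseudonyms for guessed names.
SECRET_KEY = b"example-key-not-for-production"


def pseudonym(entity: str, label: str, key: bytes = SECRET_KEY) -> str:
    """Map an entity to a stable keyed pseudonym, e.g. 'PERSON_' + 8 hex digits."""
    digest = hmac.new(key, entity.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{label}_{digest[:8]}"


def pseudonymize(text: str, entities: dict[str, str], key: bytes = SECRET_KEY) -> str:
    """Replace every known entity (case-insensitive, whole word) with its pseudonym.

    `entities` maps a surface form to a label (assumed here to be PERSON,
    COMPANY, or LOCATION). Longer surface forms are tried first so that a
    short entry does not split a longer one.
    """
    labels = {surface.lower(): label for surface, label in entities.items()}
    if not labels:
        return text
    alternation = "|".join(
        re.escape(surface) for surface in sorted(labels, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

    def replace(match: re.Match) -> str:
        surface = match.group(0).lower()
        return pseudonym(surface, labels[surface], key)

    return pattern.sub(replace, text)


if __name__ == "__main__":
    detected = {"Alice Smith": "PERSON", "Acme": "COMPANY", "Luxembourg": "LOCATION"}
    print(pseudonymize("alice smith left Acme and moved to Luxembourg.", detected))
```

Because the hash is keyed and deterministic, the same name always maps to the same pseudonym within a dataset, which keeps the text usable for analysis. It also means the output is pseudonymized rather than anonymized, so the key and the data still need the kind of additional security measures the abstract mentions.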