Knowledge Graphs (KGs) are attracting large technology companies because they represent users’ information flexibly, as both attributes and relationships. This lets modern applications exploit not only a user’s attributes but also those of the user’s neighbors to improve application quality. Such applications span a wide range of categories, including search engines (e.g., Google Search Engine, Microsoft Bing Search, Facebook Graph Search), recommendation systems (e.g., Amazon and eBay Product Knowledge Graph), and drug discovery (e.g., AstraZeneca). These uses require companies to build huge KGs, which demands substantial resources and financial investment.

Companies can reduce this cost by sharing their KGs with one another. Since the KGs may contain sensitive information about users, they must be anonymized before sharing to protect users’ privacy. However, an inadequately anonymized KG still leaks users’ sensitive information. For instance, a naive anonymization technique that only removes users’ explicit identifiers (e.g., name, email, address) cannot protect them from state-of-the-art privacy attacks that cause identity and attribute leakage. The former re-identifies users in anonymized KGs, whereas the latter infers users’ sensitive values (e.g., disease, salary).

In this project, we have analyzed privacy attacks across various uses of KGs and designed privacy-preserving techniques to protect user privacy. The techniques aim for three properties: (1) privacy, (2) quality, and (3) flexibility.

The privacy property protects users from popular privacy attacks: identity and attribute leakage. We assume that the information adversaries can exploit includes the well-known background knowledge used in attacks on relational data (i.e., attribute values) and on graphs (i.e., relationship out-/in-degrees). The protection holds even when adversaries exploit all published KGs.
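To make the assumed background knowledge concrete, the following Python sketch groups users by the information an adversary is assumed to know (attribute values plus out-/in-degree) and reports the size of the smallest group. A group of size 1 means that user can be re-identified even though explicit identifiers were removed. This is only an illustration of identity leakage under simplifying assumptions, not the project's anonymization algorithm: the names `signature` and `anonymity_level` are hypothetical, relationship types are ignored, and attribute leakage is not covered.

```python
from collections import Counter

def signature(user, edges):
    """Background knowledge an adversary is assumed to hold about one user:
    attribute values plus out-degree and in-degree (hypothetical helper)."""
    out_deg = sum(1 for src, _ in edges if src == user["id"])
    in_deg = sum(1 for _, dst in edges if dst == user["id"])
    attrs = tuple(sorted(user["attrs"].items()))
    return (attrs, out_deg, in_deg)

def anonymity_level(users, edges):
    """Size of the smallest group of users sharing one signature.
    A value of 1 means at least one user is uniquely identifiable."""
    counts = Counter(signature(u, edges) for u in users)
    return min(counts.values())

# Toy KG with explicit identifiers already removed (ids are pseudonyms).
users = [
    {"id": 0, "attrs": {"age": "30-39"}},
    {"id": 1, "attrs": {"age": "30-39"}},
    {"id": 2, "attrs": {"age": "40-49"}},
    {"id": 3, "attrs": {"age": "40-49"}},
]
edges = [(0, 2), (1, 3), (2, 0), (3, 1)]  # directed (source, target) pairs

level_before = anonymity_level(users, edges)
# One extra relationship makes users 0 and 1 distinguishable by degree.
level_after = anonymity_level(users, edges + [(0, 1)])
```

In this toy graph every signature is initially shared by two users, and adding the single edge from user 0 to user 1 leaves both with a unique degree pair, which is why removing identifiers alone is not enough.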

The quality property aims to maximize the quality of anonymized KGs by minimizing information loss according to two metrics: the Attribute and Degree Information Loss Metric (ADM) and the Truthfulness Attribute and Degree Information Loss Metric (ATDM). While ADM measures the information loss of attributes and relationship degrees, ATDM evaluates the truthfulness of attribute values and degrees. An attribute value is truthful if associating it with another value creates a truthful association. We extended node2vec to train a classifier from associations of “truth” KGs and used the classifier to decide whether an association is truthful.

The flexibility property allows data providers to choose clustering algorithms to anonymize their KGs. Moreover, we allow them to freely update their KGs (inserting/deleting/updating/re-inserting users) and publish new anonymized versions without compromising users’ sensitive information.

We have conducted experiments on six real-life datasets: Yago, Freebase, Email-Eu-core, Email-temp, Google+, and DBLP. The experimental results showed that the anonymized KGs are good enough for both general and deep learning usage. General usage was assessed by measuring the information loss of the anonymized KGs, while deep learning usage was assessed by measuring the classification accuracy of modern knowledge graph embedding models (i.e., Relational Graph Convolutional Network).

We are extending our project in two directions. We plan to design differential privacy mechanisms to generate differentially private statistics that can be used in various scenarios (e.g., drug discovery). In addition, we are designing federated learning approaches to train knowledge graph embeddings in decentralized networks.
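As a point of reference for the first direction, the sketch below shows the textbook Laplace mechanism applied to a single counting query (for example, the number of users with a given attribute value). It is not the mechanism under design in this project. The function names, the default sensitivity of 1, and the choice of query are assumptions made for illustration; graph statistics such as degree counts generally have a larger sensitivity that must be analyzed separately.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return an epsilon-differentially private version of a count.
    Assumes adding or removing one user changes the count by at most
    `sensitivity` (1 for a simple per-user counting query)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Example: release a noisy count of users sharing some attribute value.
rng = random.Random(0)  # seeded only to make the example reproducible
noisy = private_count(100, epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives stronger privacy but noisier statistics, since the noise scale is `sensitivity / epsilon`.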

This project is funded by the EU H2020 CONCORDIA project, and its results have been published in several high-quality conferences and journals:

  1. Anh-Tu Hoang, Barbara Carminati, Elena Ferrari. Cluster-Based Anonymization of Directed Graphs. CIC. 2019. 91-100.
  2. Anh-Tu Hoang, Barbara Carminati, Elena Ferrari. Cluster-Based Anonymization of Knowledge Graphs. ACNS (2). 2020. 104-123.
  3. Anh-Tu Hoang, Barbara Carminati, Elena Ferrari. Privacy-Preserving Sequential Publishing of Knowledge Graphs. ICDE. 2021. 2021-2026.
  4. Anh-Tu Hoang, Barbara Carminati, Elena Ferrari. Time-Aware Anonymization of Knowledge Graphs. TOPS. 2022.