
Graph Anonymization: A Comparison between Dynamic Programming and the Greedy-Swap Algorithm

Sanchayan Bhunia $(4849650)$, Sahitya Reddy Bollavaram $(4849759)$
Date: $12^{th}$ March $2021$


1. Introduction

1.1. What is a Graph?

A Graph is a structure to represent a set of objects where some pairs of objects are "related". Each object can be represented by a node and the relation between pairs can be represented by an edge connecting them.

figure: A Simple Graph

Direction of a Graph

Based upon the relationship between pairs of nodes, a graph can be categorized as either a Directed Graph or an Undirected Graph.

Directed Graph

Directed Graphs are the graphs where the relationship between pairs is directional.

figure: A Simple Directed Graph

1.2. Undirected Graph

Undirected graphs are graphs where the relationship between pairs does not have a direction. Each node can be uniquely represented by its index and the number of edges (relationships) it has with other nodes.

figure: A Simple Undirected Graph

If $V$ is the set of all vertices and $E$ is the set of all edges, then the whole graph $G$ can be represented as the pair of these two sets,

$$ \begin{equation} G = \big(V, E\big), \quad V = \big\{v_{1}, v_{2}, \, ... \, , v_{n}\big\}, \quad E = \big\{e_{1}, e_{2}, \, ... \, , e_{m}\big\} \end{equation} $$

where each edge $e_{i} \in E$ connects a pair of vertices in $V$.

1.3. Degree Sequence

The degree of a vertex $(v)$ is the number of edges $(e)$ incident to it. We can collect the degrees of all the vertices into a set. E.g. such a set for our example graph is $\big\{2,3,2,3,3,1\big\}$. The Degree Sequence $(d_{G})$ of a Graph $(G)$ is simply this set re-ordered into a non-increasing sequence. E.g. the degree sequence of our example graph is $\big\{3,3,3,2,2,1\big\}$.

As we can see, the degree sequence has the same number of elements as the number of vertices in the graph, i.e. if there are $n$ nodes in the graph, the degree sequence will be a vector of length $n$.
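
To make the definition concrete, a degree sequence can also be computed directly from an edge list. The edge list below is a minimal sketch (it is illustrative, chosen so that its degrees match the example graph, and is not read from the figure):

from collections import Counter

# Illustrative edge list of a small undirected graph with degrees {2,3,2,3,3,1}
edges = [(1, 2), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5), (5, 6)]

# For an undirected graph, each edge contributes one degree to both of its endpoints
degree_of = Counter()
for u, v in edges:
    degree_of[u] += 1
    degree_of[v] += 1

# The degree sequence is the multiset of degrees sorted in non-increasing order
degree_sequence = sorted(degree_of.values(), reverse=True)
print(degree_sequence)   # [3, 3, 3, 2, 2, 1]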

2. Graph Anonymization

In order for an organization to sell or share graph data, the data needs to be anonymized so that its privacy policies are respected. In a typical graph database there are three types of anonymization that need to be performed, namely Identity Protection, Content Protection and Link Protection.
$k$-Degree anonymization is a graph modification technique used to protect the identities of nodes based on their degrees.

2.1. $k$-Degree Anonymity

The main goal of $k$-degree anonymization is to make sure that, for each node $(v)$ in the graph $G$, there exist at least $k-1$ other nodes with the same degree as $v$.
If the graph $G$ has $n$ nodes, the degree sequence $(d_{G})$ of $G$ is an $n$-dimensional vector with $d_{G}(i) \ge d_{G}(j) \; \forall \; j > i$, where $i$ and $j$ are indices of the vector $d_{G}$.

The vector $d_{G}$ is $k$-anonymous if every distinct value in $d_{G}$ appears at least $k$ times. The graph $G$ corresponding to $d_{G}$ is called $k$-Degree anonymous iff $d_{G}$ is $k$-anonymous.

For our example graph to be $3$-Degree anonymous, we have to modify the degree sequence, $\big\{3,3,3,2,2,1\big\}$ to $\big\{3,3,3,2,2,2\big\}$.
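
As a quick check, the $k$-anonymity of a degree sequence can be verified by counting how often each distinct degree value appears. The helper below is a minimal sketch (is_k_anonymous is a name introduced here for illustration, not part of the modules used later):

from collections import Counter

def is_k_anonymous(degree_sequence, k):
    # Every distinct degree value must appear at least k times
    return all(count >= k for count in Counter(degree_sequence).values())

print(is_k_anonymous([3, 3, 3, 2, 2, 1], 3))   # False: the values 2 and 1 appear fewer than 3 times
print(is_k_anonymous([3, 3, 3, 2, 2, 2], 3))   # True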

2.1.1. Cost function

One of the ways to quantify the anonymization cost is by using a cost function $L_{1}$. Given a graph $G$ and its anonymized form $\hat{G}$, the cost function $L_{1}$ returns the total cost of turning $d_{G}$ into $d_{\hat{G}}$.

$$ \begin{equation} L_{1}\big(d_{G}, \, d_{\hat{G}}\big) = \sum_{i=1}^{n} \Bigg| \, d_{\hat{G}}(i) - d_{G}(i) \, \Bigg| \end{equation} $$

For our example graph, the cost of anonymization, $$L_{1} = |3-3| + |3-3| + |3-3| + |2-2| + |2-2| + |2-1| = 1$$
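
The same computation as a minimal code sketch (both sequences are assumed to be sorted consistently and of equal length; l1_cost is an illustrative helper, not the module function used later):

def l1_cost(original_sequence, anonymized_sequence):
    # Sum of absolute element-wise differences between the two degree sequences
    return sum(abs(a - d) for d, a in zip(original_sequence, anonymized_sequence))

print(l1_cost([3, 3, 3, 2, 2, 1], [3, 3, 3, 2, 2, 2]))   # 1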

2.2. $k$-Degree Anonymization Algorithms

The job of a $k$-Degree Anonymization algorithm is to take an input graph $G$ and an integer $k$ and output a $k$-Degree anonymized graph $\hat{G}$ while keeping the cost $L_{1}$ minimal.
$k$-Degree Anonymity can be achieved using both dynamic programming and greedy-swap algorithms, but the anonymization cost $L_{1}$ might differ from one method to another.

2.2.1. Dynamic Programming

Dynamic programming is implemented by breaking the problem into smaller recurring sub-problems.

  1. Suppose we have a degree sequence $d$ of dimension $n$, which by definition is sorted, i.e. $$ \begin{equation} d(1) \, \ge \, d(2) \, \ge \, ... \, \ge \, d(n) \end{equation} $$

  2. Let $Da\big(d[1,i]\big)$ be the degree anonymization cost of the sub-sequence $d[1,i]$.

  3. Also let $I\big(d[i,j]\big)$ be the anonymization cost when all the elements $d(i), d(i+1), ..., d(j)$ are assigned the same value as $d(i)$, i.e. $$ \begin{equation} I(d[i,j]) = \sum_{l=i}^{j}\big(d(i) - d(l)\big) \end{equation} $$


    Given these definitions, the dynamic programming recursion can be formed as follows.

    Condition 1:

    When $k \gt \frac{i}{2}$, it is impossible to split $d[1,i]$ into two groups each of size at least $k$, so all $i$ elements are put in the same group. E.g. the $3$-Degree anonymization of the sub-sequence $d[1,4] = \big\{3,3,2,2\big\}$ has to be $\big\{3,3,3,3\big\}$. The cost of anonymization is then $$ \begin{equation} Da\big(d[1,i]\big) = I\big(d[1,i]\big) \end{equation} $$

    Condition 2:

    When $k \le \frac{i}{2}$, the degree sequence can be broken down into smaller sub-sequences recursively until the size of a sub-sequence satisfies Condition 1. At every recursive step the last chunk of the degree sequence, of length at least $k$, is put into one group, and the same method is applied recursively to the remaining prefix. The total cost of that step is the sum of the cost of putting the last chunk into one group and the cost of anonymizing the remaining prefix recursively. We then vary the split point $t$, take the anonymized sequence with minimum cost, and return it to the previous level of the recursion.
    $$ \begin{equation} Da\big(d[1, i]\big) = \min\bigg\{ \min_{k \, \le \, t \, \le \, i-k}\Big\{Da\big(d[1, t]\big)+I\big(d[t+1, \, i]\big)\Big\}, \; I\big(d[1, i]\big)\bigg\} \end{equation} $$

The Dynamic Programming Algorithm

figure: Recursion Tree

In [1]:
# This function takes a full or partial sequence and converts all of its values to the highest value in the sequence,
# then it returns the cost of the operation and the converted sequence.

# To eliminate any delay due to searching in a list, we will rather store (cost, sequence) results as key,value pairs in a dictionary

def putInSameGroup(sequence): 
    cost = 0
    for i in range(len(sequence)):
        cost +=  sequence[0] - sequence[i]
    
    return cost, [sequence[0]]*len(sequence)
    

# Input: degree sequence and the value of k as input
# Output: the cost of anonymization, anonymized sequence
def dynamicPrograming(k, degree_sequence): 

    number_of_nodes = len(degree_sequence)
    cost_vs_degree_sequences = dict()
    
    # Condition 1
    if k > number_of_nodes / 2:
        return putInSameGroup(degree_sequence)

    # Condition 2
    else:
        cost_vs_degree_sequences.clear()
        lower_range = k
        upper_range = number_of_nodes - k+1
        
        for chunk in range(lower_range, upper_range):
            # This is the sub-sequence which will be Anonymized
            to_anonymize = degree_sequence[0: chunk]     
            
            # Rest of the sequence to put in the same group
            in_same_group = degree_sequence[chunk: number_of_nodes] 
            
            # The recursion step
            cost_anonymize, sequence_anonymized = dynamicPrograming(k, to_anonymize)
            
            # part where we put them in same group
            cost_same_group, same_group_squence = putInSameGroup(in_same_group)
            
            # Total cost is calculated for this level in the stack
            chunk_cost = cost_anonymize + cost_same_group
            
            # And the anonymized sequence in this level
            anonymized_chunk = sequence_anonymized + same_group_squence
            
            # Cost and corresponding sequence is stored as (key, value) pair
            cost_vs_degree_sequences[chunk_cost] = anonymized_chunk 
            
            #print((chunk, number_of_nodes), anonymized_chunk)
        
        # Now we will search for the minimum cost in key-space and return corresponding (cost, anonymized sequence) pair 
        return min(cost_vs_degree_sequences.items(), key = lambda x: x[0])
Artificial Data Generator

Now we will define a function to generate a degree sequence of a desired length, lowest and highest degree value. As opposed to real data, this artificial data will help us find out the run-time of the algorithm for degree lists of different sizes, which is useful for quantifying the effects of additional modifications to the algorithm.

In [2]:
import numpy as np
from numpy import random

# Given a length this function will output a degree sequence. 
# Additionally, it is also possible to specify the highest and the lowest degree
def generateDegreeSequence(length, lowest_degree = 1, highest_degree = None):
    
    if highest_degree is None:
        highest_degree = length
        
    degree_list = random.randint(high = highest_degree+1, low = lowest_degree, size=length)
    degree_sequence = np.sort(degree_list)[::-1]
    return degree_sequence.tolist()
Checking if our algorithm actually works

Now let's generate a degree sequence and check our dynamic programming graph anonymization algorithm.

In [4]:
k = random.randint(low=3, high=10) 
degree_sequence = generateDegreeSequence(20, lowest_degree = 1, highest_degree = 15)
print("Not Anonymized Sequence: {}".format(degree_sequence))
anonymization_cost, anonymized_sequence = dynamicPrograming(k, degree_sequence)
print("{}-Anonymized Sequence:{}".format(k, anonymized_sequence))
print("Anonymization Cost = {}".format(anonymization_cost))
Not Anonymized Sequence: [14, 14, 13, 12, 12, 11, 11, 9, 8, 8, 6, 6, 5, 5, 5, 5, 3, 3, 2, 1]
3-Anonymized Sequence:[14, 14, 14, 12, 12, 12, 12, 9, 9, 9, 6, 6, 6, 5, 5, 5, 3, 3, 3, 3]
Anonymization Cost = 9
Time Complexity

Now let us focus on the time complexity of this algorithm. We will take a bunch of degree lists of varying size and plot the computation time to anonymize them.

In [5]:
import time
from matplotlib import pyplot as plt

for k in range(3, 7):
    size_of_sequence = list() 
    performance = list()
    
    for size in range(10, 51):
        degree_sequence = generateDegreeSequence(size) # Generates the degree sequence
        start_timer = time.time()               # start timer
        dynamicPrograming(k, degree_sequence)  # Algorithm called for observation
        end_timer = time.time() # stop timer
        time_taken = end_timer - start_timer
        
        size_of_sequence.append(size)        # stores the size of the degree sequence
        performance.append(time_taken)       # stores the time taken by the algorithm for that degree sequence 
        
        # Shell Message
        print('{}-Degree Anonymizing Sequence of length {}. Please wait ...'.format(k, size), end='\r', flush=True)
    
    plt.plot(size_of_sequence, performance, label='k = {}'.format(k))


print('{}\r'.format('Done'+' '*100)) # Clearing Shell Message  
plt.xlabel("Size of the Degree Sequence ($n$)")
plt.ylabel("Time taken to Anonymize (s)")
plt.legend()
plt.show()
Done                                                                                                    

Observation

The lower the degree anonymization factor $k$, the longer it takes to anonymize the sequence. With this small amount of data it cannot be verified that the time complexity is $O(n^{2})$. It took around $100$ seconds to $3$-degree anonymize a sequence of length $50$. There are multiple ways to bring down the anonymization time; one of them is optimizing the algorithm.

Optimization

It can be shown that no anonymous group needs to be larger than $2k-1$: if any group has size $2k$ or more, it can be broken into two subgroups with equal or lower overall degree-anonymization cost. This essentially means that the preprocessing step $I\big(d[i,j]\big)$ does not need to consider all combinations of $\big(i,j\big)$ but only those with $k \, \le \, j-i+1 \, \le \, 2k-1$.
Similarly, for every $i$, we do not have to consider all $t$'s in the range $k \le t \le i-k$ in the recursion, but only $t$'s in the range $\max\big\{k, \, i-2k+1\big\} \le t \le i-k$.

$$ \begin{equation} Da\big(d[1, i]\big) = \min\bigg\{ \min_{\max\{k, \, i-2k+1\} \, \le \, t \, \le \, i-k}\Big\{Da\big(d[1, t]\big)+I\big(d[t+1, \, i]\big)\Big\}, \; I\big(d[1, i]\big)\bigg\} \end{equation} $$

Python Implementation

So, the optimization essentially amounts to changing the lower limit of the range of values that $t$ (i.e. chunk in the code) iterates over.

In [6]:
# Input: degree sequence and the value of k as input
# Output: the cost of anonymization, anonymized sequence
def optimizedDynamicPrograming(k, degree_sequence): 

    number_of_nodes = len(degree_sequence)
    cost_vs_degree_sequences = dict()
    
    if k > number_of_nodes / 2:
        return putInSameGroup(degree_sequence)

    else:
        cost_vs_degree_sequences.clear()
        lower_range = max(k, (number_of_nodes - 2*k + 1)) #The lower range is replaced with optimized range
        upper_range = number_of_nodes - k+1
        for chunk in range(lower_range, upper_range):
            to_anonymize = degree_sequence[0: chunk]                
            in_same_group = degree_sequence[chunk: number_of_nodes] 
            
            cost_anonymize, sequence_anonymized = optimizedDynamicPrograming(k, to_anonymize)
            cost_same_group, same_group_squence = putInSameGroup(in_same_group)
        
            chunk_cost = cost_anonymize + cost_same_group
            anonymized_chunk = sequence_anonymized + same_group_squence
            cost_vs_degree_sequences[chunk_cost] = anonymized_chunk
        
        return min(cost_vs_degree_sequences.items(), key = lambda x: x[0])
In [7]:
for k in range(3, 7):
    size_of_sequence = list()
    performance = list()
    
    for size in range(10, 51):
        degree_sequence = generateDegreeSequence(size)
        start_timer = time.time()
        optimizedDynamicPrograming(k, degree_sequence)
        end_timer = time.time()
        time_taken = end_timer - start_timer
        
        size_of_sequence.append(size)
        performance.append(time_taken)
        
        # Shell Message
        print('{}-Degree Anonymizing Sequence of length {}. Please wait ...'.format(k, size), end='\r', flush=True)
    
    
    plt.plot(size_of_sequence, performance, label='k = {}'.format(k))

    
print('{}\r'.format('Done' + ' '*100)) # Clearing Shell Message

plt.xlabel("Size of the Degree Sequence ($n$)")
plt.ylabel("Time taken to Anonymize (s)")
plt.legend()
plt.show()
Done                                                                                                    

Observation

Again, the lower the value of $k$, the longer it takes to anonymize the sequence. For this small set of degree sequences it cannot be proved that the complexity is $O(kn)$, but it is clear that the range optimization improves performance. It took around $1.5$ seconds to $3$-degree anonymize a sequence of the same length, which is approximately $100$ times faster than the unoptimized algorithm.

2.2.2. Memoized Dynamic Programming

There is one more way to bring down the run-time of this algorithm: memoization. But in order to find out what to memoize, we have to look at the dynamic programming algorithm closely.

In [8]:
def dynamicPrograming_observe(k, degree_sequence):  #As we are making some changes in the code, we want to give a new name 

    number_of_nodes = len(degree_sequence)
    cost_vs_degree_sequences = dict()

    if k > number_of_nodes / 2:
        return putInSameGroup(degree_sequence)

    else:
        cost_vs_degree_sequences.clear()
        for chunk in range(k, number_of_nodes - k+1):
            to_anonymize = degree_sequence[0: chunk]     
            in_same_group = degree_sequence[chunk: number_of_nodes]
            
            cost_anonymize, sequence_anonymized = dynamicPrograming_observe(k, to_anonymize)
            cost_same_group, same_group_squence = putInSameGroup(in_same_group)
            
            chunk_cost = cost_anonymize + cost_same_group
            anonymized_chunk = sequence_anonymized + same_group_squence
            cost_vs_degree_sequences[chunk_cost] = anonymized_chunk 
            
            #print((chunk, number_of_nodes), anonymized_chunk) #Print this for observation
            
        return min(cost_vs_degree_sequences.items(), key = lambda x: x[0])
    
    
# Generating a degree sequence of length 20 for testing
degree_sequence = generateDegreeSequence(20)
k = 3
dynamicPrograming_observe(k, degree_sequence)
Out[8]:
(14, [20, 20, 20, 18, 18, 18, 15, 15, 15, 9, 9, 9, 9, 5, 5, 5, 3, 3, 3, 3])
Analysis

In the whole recursion tree, some anonymized sub-sequences (i.e. anonymized_chunk in the code) are computed multiple times, to be more specific, every time the same tuple (chunk, number_of_nodes) appears, yet we spend computational resources recomputing them each time.
Instead, we can memoize the anonymization cost and the anonymized chunk, and recall them every time we encounter that specific tuple throughout the entire recursion. We will use (chunk, number_of_nodes) as the key to store (anonymization cost, anonymized sub-sequence) as its value in a dictionary called cache. A Python dictionary stores its keys as hash values, so looking a value up by its key takes constant time, which, as opposed to storing the results in lists, does not affect the time complexity of the algorithm.


As the optimization discussed earlier just modifies the range of iteration, the memoized algorithm can also benefit from it.
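
To make the idea concrete, here is a minimal sketch of how such memoization could look. It builds on the putInSameGroup helper defined above and is only an illustration of the idea, not the actual code inside the dynamic module. Because every recursive call works on a prefix of the original sorted sequence, the prefix length is enough to identify a sub-problem (it plays the role of the (chunk, number_of_nodes) tuples discussed above).

# A minimal, assumed sketch of memoization for the recursion above (not the module's implementation)
def memoizedDynamicProgramming(k, degree_sequence, cache=None):
    if cache is None:
        cache = dict()

    number_of_nodes = len(degree_sequence)

    # Recall a previously computed result for a prefix of this length
    if number_of_nodes in cache:
        return cache[number_of_nodes]

    if k > number_of_nodes / 2:
        result = putInSameGroup(degree_sequence)
    else:
        cost_vs_degree_sequences = dict()
        for chunk in range(k, number_of_nodes - k + 1):
            cost_anonymize, sequence_anonymized = memoizedDynamicProgramming(k, degree_sequence[0: chunk], cache)
            cost_same_group, same_group_sequence = putInSameGroup(degree_sequence[chunk: number_of_nodes])
            cost_vs_degree_sequences[cost_anonymize + cost_same_group] = sequence_anonymized + same_group_sequence
        result = min(cost_vs_degree_sequences.items(), key=lambda x: x[0])

    # Memoize the result before returning it
    cache[number_of_nodes] = result
    return result

With this cache, each prefix is anonymized at most once, which is what brings the running time down in the measurements below.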

2.3. The "Dynamic Module" and its Usage

We have created a Python module called dynamic which includes both the unoptimized and optimized dynamic programming algorithms as well as the memoized dynamic algorithm. As opposed to the code written in this notebook, the module is flexible in terms of the value of k, accepts degree lists/tuples/numpy arrays that may be unordered, and provides easy debugging features like Exceptions.

  • a. Using the module

    import dynamic

  • b. dynamic.calculateCost(original_list, modified_list)

    Calculates the cost of transforming one list into another list.

    • Input:

      • original_list-- The list to compare against. Accepted data types: list/tuple/numpy array.
      • modified_list-- The list to compare with original_list. Accepted data types: list/tuple/numpy array.
    • Output:

      • Cost of transforming one into another. Datatype is int/float.

  • c. dpGraphAnonymization(k, degree_sequence, optimization)

    This method implements both the optimized and the unoptimized dynamic graph anonymization algorithm.

    • Input:

      • k-- A value of $k$, the degree anonymization factor, of type int.
      • degree_sequence-- Degree sequence (ordered/unordered) of data type lists/tuples/numpy arrays.
      • optimization-- This argument decides whether the optimized or the unoptimized dynamic algorithm is used. It is optional, takes a boolean input and is False by default, i.e. by default dpGraphAnonymization() runs the unoptimized algorithm.
    • Output:

      • The method returns two outputs. The first output is the cost of anonymization which is of type int and the second output is the anonymized degree sequence, which is a list.


  • d. memorizedDPGraphAnonymization(k, degree_sequence, optimization)

    This method implements both the optimized and the unoptimized memoized dynamic graph anonymization algorithm.

    • Input:

      • k-- A value of $k$, the degree anonymization factor, of type int.
      • degree_sequence-- Degree sequence (ordered/unordered) of data type lists/tuples/numpy arrays.
      • optimization-- This argument decides whether the optimized or the unoptimized memoized dynamic algorithm is used. It is optional, takes a boolean input and is False by default, i.e. by default memorizedDPGraphAnonymization() runs the unoptimized algorithm.
    • Output:

      • The method returns two outputs. The first output is the cost of anonymization which is of type int and the second output is the anonymized degree sequence, which is a list.
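
Putting the interface together, a typical call could look like the sketch below (based on the signatures documented above; the expected values in the comments follow from the worked example in section 2.1):

import dynamic

# The degree sequence from the worked example above
example_sequence = [3, 3, 3, 2, 2, 1]

# Optimized, memoized dynamic programming anonymization for k = 3
cost, anonymized = dynamic.memorizedDPGraphAnonymization(3, example_sequence, optimization=True)
print(cost, anonymized)   # expected: 1 [3, 3, 3, 2, 2, 2]

# Recomputing the cost from the original and the anonymized sequence
print(dynamic.calculateCost(example_sequence, anonymized))   # expected: 1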

Conformance Checking

From now on we will use the algorithms from the dynamic module. With that in mind, let's check whether the memoized dynamic algorithm produces outcomes identical to the original algorithm and, if so, what the running time of the process is.

In [9]:
# Comparing the memoized vs the optimized dynamic programming from the paper
import dynamic

for k in range(3, 11):
    match_list = list()
    
    for size in range(10, 51):
        
        degree_sequence = generateDegreeSequence(size)
        cost_dy, anonymized_seq_dy = dynamic.dpGraphAnonymization(k, degree_sequence, optimization=True)
        cost_mem, anonymized_seq_mem = dynamic.memorizedDPGraphAnonymization(k, degree_sequence, optimization=True)
        
        if cost_dy == cost_mem and anonymized_seq_dy == anonymized_seq_mem:
            match_list.append("match")
        
        else:
            match_list.append("mis_match")
            
    total_matches = match_list.count("match")
    match_percentage = total_matches/len(match_list)*100
    
    print("for k = {} outcomes from dynamic and memorized matches {}%".format(k, match_percentage))
    
for k = 3 outcomes from dynamic and memorized matches 100.0%
for k = 4 outcomes from dynamic and memorized matches 100.0%
for k = 5 outcomes from dynamic and memorized matches 100.0%
for k = 6 outcomes from dynamic and memorized matches 100.0%
for k = 7 outcomes from dynamic and memorized matches 100.0%
for k = 8 outcomes from dynamic and memorized matches 100.0%
for k = 9 outcomes from dynamic and memorized matches 100.0%
for k = 10 outcomes from dynamic and memorized matches 100.0%

Performance Checking

Now that we have checked that the results are identical for both algorithms, let's check the running time of the memoized dynamic graph anonymization algorithm. We will check both the optimized and the unoptimized memoized versions.

In [10]:
import time
# Just the memoized algorithm, with and without optimization

figure,(not_opt, opt) = plt.subplots(1, 2,figsize=(15,5))
figure.subplots_adjust(hspace=0.4, wspace=0.4)

for k in range(3, 7): #Range of value of k
    size_of_sequence = list()
    opt_performance = list()
    not_opt_performance = list()
    
    for size in range(10, 151, 20): #Size of the degree Sequence 
        degree_sequence = generateDegreeSequence(size)
        size_of_sequence.append(size) # store the size for plotting
        
        # For a degree sequence of this size, call the unoptimized algorithm and measure the time
        not_opt_start_timer = time.time()   
        dynamic.memorizedDPGraphAnonymization(k, degree_sequence, optimization = False)
        not_opt_end_timer = time.time()
        not_opt_time_taken = not_opt_end_timer - not_opt_start_timer 
        not_opt_performance.append(not_opt_time_taken)
        
        # For a degree sequence of this size, call the optimized algorithm and measure the time
        opt_start_timer = time.time()
        dynamic.memorizedDPGraphAnonymization(k, degree_sequence, optimization = True)
        opt_end_timer = time.time()
        opt_time_taken = opt_end_timer - opt_start_timer
        opt_performance.append(opt_time_taken)
    
    # Plotting size vs performance for the unoptimized memoized algorithm
    not_opt.plot(size_of_sequence, not_opt_performance, label='k = {}'.format(k)) 
    not_opt.legend()                                                 
    
    # Plotting size vs performance for the optimized memoized algorithm
    opt.plot(size_of_sequence, opt_performance, label='k = {}'.format(k))       
    opt.legend()                                                               
    
    
not_opt.set_xlabel("Size of the Degree Sequence ($n$)")
not_opt.set_ylabel("Time taken to Anonymize (s)")
not_opt.set_title("Unoptimized Memorized DP")

opt.set_xlabel("Size of the Degree Sequence ($n$)")
opt.set_ylabel("Time taken to Anonymize (s)")
opt.set_title("Optimized Memorized DP")


plt.show()

Conclusion

First of all, it is very clear that Memoized Dynamic Programming performs far better than any non-memoized algorithm, even an optimized one.

figure: Performance for $3$-Degree Anonymization


From the diagram above it can also be seen that the Optimized Memoized Dynamic Programming algorithm is the best in terms of performance (almost $10 \times$ faster than the unoptimized memoized version). **So, from now onwards**,

Dynamic Programming algorithm $\Longrightarrow$ Optimized Memoized Dynamic Programming Algorithm

3. Comparison between Greedy Swap and Dynamic Programming

The Greedy swap algorithm is implemented in the "greedy" module. The module can be used by import greedy. The algorithm can be put to use by calling greedy.greedy_rec_algorithm(k_degree, array_degrees, pos_init, extension).

  • Input:

    Of its four arguments, the two used in this notebook are:

    • k_degree-- Degree anonymization factor of data type int.
    • array_degrees-- A sorted degree sequence of data type list.
  • Output:

    • It returns the anonymized degree sequence.
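
A minimal usage sketch based on the signature above (the degree sequence is illustrative and sorted in non-increasing order; pos_init and extension are not passed, as in the calls later in this notebook):

import greedy

anonymized = greedy.greedy_rec_algorithm(3, [3, 3, 3, 2, 2, 1])
print(anonymized)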

3.1. Running-time Performance Check

We are not going to discuss the formulation of the algorithm, but we will test it against the dynamic algorithm. Let's first compare the running-time performance of the Greedy Algorithm with the Dynamic Algorithm on synthetic data, and then move to real data to test accuracy-wise performance.

In [11]:
import dynamic
import greedy
In [15]:
figure,(figure_dynamic, figure_greedy) = plt.subplots(1, 2,figsize=(15,5))
figure.subplots_adjust(hspace=0.4, wspace=0.4)

for k in range(3, 7): #Range of value of k
    size_of_sequence = list()
    dynamic_performance = list()
    greedy_performance = list()
    
    for size in range(100, 401, 50): #Size of the degree Sequence 
        degree_sequence = generateDegreeSequence(size)
        size_of_sequence.append(size) # store the size for plotting
        
        # For Degree sequence of this size Call dynamic algorithm and measure time 
        dynamic_start_timer = time.time()   
        dynamic.memorizedDPGraphAnonymization(k, degree_sequence, optimization = True)
        dynamic_end_timer = time.time()
        dynamic_time_taken = dynamic_end_timer - dynamic_start_timer 
        dynamic_performance.append(dynamic_time_taken)
        
        # For Degree sequence of this size Call greedy algorithm and measure time 
        greedy_start_timer = time.time()
        greedy.greedy_rec_algorithm(k, degree_sequence)
        greedy_end_timer = time.time()
        greedy_time_taken = greedy_end_timer - greedy_start_timer
        greedy_performance.append(greedy_time_taken)
    
    # Plotting size vs performance for the Dynamic Algorithm
    figure_dynamic.plot(size_of_sequence, dynamic_performance, label='k = {}'.format(k)) 
    figure_dynamic.legend()                                                 
    
    # Plotting size vs performance for the Greedy Algorithm
    figure_greedy.plot(size_of_sequence, greedy_performance, label='k = {}'.format(k))       
    figure_greedy.legend()                                                               
    
    
figure_dynamic.set_xlabel("Size of the Degree Sequence ($n$)")
figure_dynamic.set_ylabel("Time taken to Anonymize (s)")
figure_dynamic.set_title("Dynamic Algorithm")

figure_greedy.set_xlabel("Size of the Degree Sequence ($n$)")
figure_greedy.set_ylabel("Time taken to Anonymize (s)")
figure_greedy.set_title("Greedy Algorithm")


plt.show()

With degree lists of size $100 \to 400$ we can see that memoized dynamic programming scales roughly as $O(n)$, while the greedy algorithm appears nearly constant in time over this range. So, the running-time performance of the Greedy Algorithm is better than that of any Dynamic Programming Algorithm.

3.2. Accuracy Performance Check

Performance can be measured in terms of the accuracy of the outcome as well. In that case, the loss (cost) function is the key to measuring such performance. Here we define an indicator called the Performance Ratio $(R)$,
$$ \begin{equation} R = \frac{L_{1}\big(d, \, \hat{d}_{Greedy}\big)}{L_{1}\big(d, \, \hat{d}_{Dynamic}\big)} \end{equation} $$
The dynamic programming algorithm in the module already returns the cost of anonymization, so we do not have to calculate that. The Greedy algorithm, however, only returns an anonymized sequence, so we will use the calculateCost() method from the "dynamic" module to calculate its anonymization cost.

Experiment

For the experiments, we will switch from synthetic data to real datasets ("quakers" and "netscience") and calculate $R$ for different degree anonymization factors $(k)$.

a) Getting the Degree Sequence and corresponding Index Sequence from the "quakers" data
In [16]:
import numpy as np
import networkx as nx
import csv

G = nx.Graph()
    
with open('Datasets/quakers_nodelist.csv' , 'r') as nodecsv:  # Open the file
    nodereader = csv.reader(nodecsv)  # Read the csv
    # Retrieve the data (using a Python list comprehension and list slicing to remove the header row)
    nodes = [ n for n in nodereader ][ 1: ]

node_names = [ n[ 0 ] for n in nodes ]  # Get a list of only the node names

with open('Datasets/quakers_edgelist.csv' , 'r') as edgecsv:  # Open the file
    edgereader = csv.reader(edgecsv)  # Read the csv
    edges = [ tuple(e) for e in edgereader ][ 1: ]  # Retrieve the data
G.add_nodes_from(node_names)
G.add_edges_from(edges)

degree_array_quaker = np.array(list(map(lambda x:x[1], G.degree())))

degree_sequence_quaker = np.sort(degree_array_quaker)[::-1].tolist() 
index_sequence_quaker = np.argsort(degree_array_quaker)[::-1].tolist()
b) Getting the Degree Sequence and corresponding Index Sequence from "netscience.gml"
In [17]:
G = nx.read_gml('Datasets/netscience.gml')
degree_array_netscience = np.array(list(map(lambda x:x[1], G.degree())))

degree_sequence_netscience = np.sort(degree_array_netscience)[::-1].tolist() 
index_sequence_netscience = np.argsort(degree_array_netscience)[::-1].tolist()
A Function to Compute the value of R for a $k$ and a Degree List
In [18]:
def computeRvalue(k, degree_sequence):
    
    # Calculating cost of dynamic anonymization for k
    dynamic_cost, dynamic_anonymized_seq = dynamic.memorizedDPGraphAnonymization(k, degree_sequence, optimization=True)

    # Calculating cost of greedy anonymization for k
    greedy_anonymized_seq = greedy.greedy_rec_algorithm(k, degree_sequence)
    greedy_cost = dynamic.calculateCost(degree_sequence, greedy_anonymized_seq)
    
    r_value = greedy_cost/dynamic_cost    
    
    return r_value
Plotting R vs $k$ Barchart
In [19]:
import matplotlib
import matplotlib.pyplot as plt
import time

start_timer = time.time() # starting timer

list_of_k = [k for k in range(3, 21)]

list_of_R_quaker = [computeRvalue(k, degree_sequence_quaker) for k in list_of_k]
list_of_R_netscience = [computeRvalue(k, degree_sequence_netscience) for k in list_of_k]

end_timer = time.time() # stopping timer

#Plotting using subplots
bar_width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar([x - bar_width/2 for x in list_of_k], list_of_R_quaker, 
                width = bar_width, label='quaker, Size = {}'.format(len(degree_sequence_quaker)))

rects2 = ax.bar([x + bar_width/2 for x in list_of_k], list_of_R_netscience, 
                width = bar_width, label='netscience, size = {}'.format(len(degree_sequence_netscience)))

ax.set_ylabel('Performance Ratio (R)')
ax.set_xlabel('Degree Anonymity $k$')
ax.set_xticks(list_of_k)
ax.legend()

print('Time taken: {}s'.format(end_timer - start_timer))
plt.show()
Time taken: 575.0818371772766s
Observation

We can see that in most of the cases, $$ \begin{equation} R \gt 1 \implies \frac{L_{1}\big(d, \, \hat{d}_{Greedy}\big)}{L_{1}\big(d, \, \hat{d}_{Dynamic}\big)} \gt 1 \implies L_{1}\big(d, \, \hat{d}_{Greedy}\big) \gt L_{1}\big(d, \, \hat{d}_{Dynamic}\big) \end{equation} $$
i.e. the anonymization cost of the Greedy Algorithm is higher than that of the Dynamic Algorithm. As we increase the size of the degree sequence, this difference becomes clearer.

4. Graph Reconstruction

Not all degree sequences that we get after anonymization represent a valid graph. So, after we have anonymized the degree sequence, it is necessary to try to reconstruct a graph from it.
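
As an aside, a standard way to test whether a degree sequence is graphical (i.e. realizable by some simple graph) is the Erdős–Gallai condition: the degree sum must be even and, for every prefix of the sorted sequence, a simple inequality must hold. The helper below is an illustrative sketch and is not used by the reconstruction code that follows.

def is_graphical(degree_sequence):
    # Erdős–Gallai condition on a sequence sorted in non-increasing order
    d = sorted(degree_sequence, reverse=True)
    n = len(d)
    if sum(d) % 2 == 1:       # handshake lemma: the degree sum must be even
        return False
    for r in range(1, n + 1):
        left = sum(d[:r])
        right = r * (r - 1) + sum(min(di, r) for di in d[r:])
        if left > right:
            return False
    return True

print(is_graphical([3, 3, 2, 2, 2, 2]))   # True
print(is_graphical([4, 4, 4, 1, 1]))      # False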

In [20]:
# A function to reconstruct the graph
def construct_graph(tab_index, anonymized_degree):
    graph = nx.Graph()
    if sum(anonymized_degree) % 2 == 1:
        return None

    while True:
        if not all(di >= 0 for di in anonymized_degree):
            return None
        if all(di == 0 for di in anonymized_degree):
            return graph
        v = np.random.choice((np.where(np.array(anonymized_degree) > 0))[0])
        dv = anonymized_degree[v]
        anonymized_degree[v] = 0
        for index in np.argsort(anonymized_degree)[-dv:][::-1]:
            if index == v:
                return None
            if not graph.has_edge(tab_index[v], tab_index[index]):
                graph.add_edge(tab_index[v], tab_index[index])
                anonymized_degree[index] = anonymized_degree[index] - 1

4.1. Quaker dataset: Graph Reconstruction of Anonymized Sequences from Both of the Algorithms for various $k$

In [21]:
dict_of_graph_dynamic_quaker = dict()
dict_of_graph_greedy_quaker = dict()

for k in list_of_k:
    dynamic_cost, dynamic_anonymized_seq = dynamic.memorizedDPGraphAnonymization(k, degree_sequence_quaker,
                                                                                 optimization=True)
    
    greedy_anonymized_seq = greedy.greedy_rec_algorithm(k, degree_sequence_quaker)
    
    # Graphs are added in dictionaries with k as keys
    dict_of_graph_dynamic_quaker[k] = construct_graph(index_sequence_quaker, dynamic_anonymized_seq)
    dict_of_graph_greedy_quaker[k] = construct_graph(index_sequence_quaker, greedy_anonymized_seq)

Values of $k$ for which graphs cannot be reconstructed, for each of the two Algorithms.

In [22]:
print('Values of k For Dynamic where graphs cannot be reconstructed in Quaker Dataset')
for k, graph in dict_of_graph_dynamic_quaker.items():
    if graph is None:
        print(k, end=', ')
        
print('\ntotal:{}'.format(list(dict_of_graph_dynamic_quaker.values()).count(None)))       


print('\n\nValues of k For Greedy where graphs cannot be reconstructed in Quaker Dataset')
for k, graph in dict_of_graph_greedy_quaker.items():
    if graph is None:
        print(k, end=', ')

print('\ntotal:{}'.format(list(dict_of_graph_greedy_quaker.values()).count(None))) 


print('\n\nValues of k and R For Both Algorithms where graphs can be reconstructed in Quaker Dataset')

for k in list_of_k:
    if dict_of_graph_dynamic_quaker.get(k) is not None and dict_of_graph_greedy_quaker.get(k) is not None:
        print('(k:{}, R:{})'.format(k, list_of_R_quaker[list_of_k.index(k)]), end='; ')
Values of k For Dynamic where graphs cannot be reconstructed in Quaker Dataset
7, 10, 11, 12, 14, 15, 17, 18, 20, 
total:9


Values of k For Greedy where graphs cannot be reconstructed in Quaker Dataset
3, 4, 6, 7, 8, 10, 12, 14, 15, 16, 18, 20, 
total:12


Values of k and R For Both Algorithms where graphs can be reconstructed in Quaker Dataset
(k:5, R:1.05); (k:9, R:1.0); (k:13, R:1.0); (k:19, R:1.0289855072463767); 

4.2. Netscience dataset: Graph Reconstruction of Anonymized Sequences from Both of the Algorithms for various $k$

In [ ]:
dict_of_graph_dynamic_NS = dict()
dict_of_graph_greedy_NS = dict()

for k in list_of_k:
    dynamic_cost, dynamic_anonymized_seq = dynamic.memorizedDPGraphAnonymization(k, degree_sequence_netscience,
                                                                                optimization=True)
    greedy_anonymized_seq = greedy.greedy_rec_algorithm(k, degree_sequence_netscience)
    
    # Graphs are added in dictionaries with k as keys
    dict_of_graph_dynamic_NS[k] = construct_graph(index_sequence_netscience, dynamic_anonymized_seq)
    dict_of_graph_greedy_NS[k] = construct_graph(index_sequence_netscience, greedy_anonymized_seq)
In [76]:
print('Values of k For Dynamic where graphs cannot be reconstructed')
for k, graph in dict_of_graph_dynamic_NS.items():
    if graph is None:
        print(k, end=' ')
print('\ntotal:{}'.format(list(dict_of_graph_dynamic_NS.values()).count(None)))        

print('\n\nValues of k For Greedy where graphs cannot be reconstructed')
for k, graph in dict_of_graph_greedy_NS.items():
    if graph is None:
        print(k, end=' ')
print('\ntotal:{}'.format(list(dict_of_graph_greedy_NS.values()).count(None)))


print('\n\nValues of k and R For Both Algorithms where graphs can be reconstructed in Netscience Dataset')

for k in list_of_k:
    if dict_of_graph_dynamic_NS.get(k) is not None and dict_of_graph_greedy_NS.get(k) is not None:
        print('(k:{}, R:{})'.format(k, list_of_R_netscience[list_of_k.index(k)]), end='; ')
Values of k For Dynamic where graphs cannot be reconstructed
4 5 10 11 15 18 19 
total:7


Values of k For Greedy where graphs cannot be reconstructed
6 7 9 10 11 12 14 15 17 18 19 20 
total:12


Values of k and R For Both Algorithms where graphs can be reconstructed in Netscience Dataset
(k:3, R:1.4); (k:8, R:1.489795918367347); (k:13, R:1.1616161616161615); (k:16, R:1.3412698412698412); 
Observation

As we can see here, for both data sets there are more values of $k$ for which the Greedy algorithm's output cannot be turned into a graph. This means Dynamic Programming is better than the Greedy Algorithm in terms of producing reconstructable sequences.

Netscience Data: Example of a reconstructed Graph for $k = 8$ from both Algorithms
In [77]:
k=8
graph_from_dynamic = dict_of_graph_dynamic_NS.get(k)
graph_from_greedy = dict_of_graph_greedy_NS.get(k)

print('Anonymized With Dynamic')
print('\nNodes of Anonymized Graph: {}'.format(graph_from_dynamic.nodes))
print('\nEdges of Anonymized Graph: {}'.format(graph_from_dynamic.edges))

print('\n\nAnonymized With Greedy')
print('\nNodes of Anonymized Graph: {}'.format(graph_from_greedy.nodes))
print('\nEdges of Anonymized Graph: {}'.format(graph_from_greedy.edges))
Anonymized With Dynamic
Nodes of Anonymized Graph: [452, 33, 34, 78, 54, 1429, 1431, 1430, 628, 294, 1398, 417, 952, 465, 8, 1384, 1204, 1138, 4, 1292, 1182, 637, 1414, 308, ...]

Edges of Anonymized Graph: [(452, 33), (452, 34), (452, 78), (452, 54), (452, 1429), (452, 1431), (452, 1430), (33, 628), (33, 465), (33, 1204), (33, 4), (33, 1414), (33, 308), (33, 63), (33, 88), (33, 1112), (33, 123), (33, 14), ...]


Anonymized With Greedy

Nodes of Anonymized Graph: [23, 33, 34, 1280, 78, 54, 1429, 977, 1431, 1430, 294, 518, 1327, 678, 1299, 1046, 509, 175, 1027, 313, 1251, 567, 1320, 1231, ...]

Edges of Anonymized Graph: [(23, 33), (23, 34), (33, 977), (33, 509), (33, 1251), (33, 215), (33, 276), (33, 1154), (33, 1367), (33, 992), (33, 860), (33, 986), (33, 1322), (33, 1586), (33, 675), (33, 1549), (33, 1557), (33, 411), ...]
Quaker Data: Example of a reconstructed Graph for $k = 9$ from both Algorithms
In [78]:
k=9

graph_from_dynamic = dict_of_graph_dynamic_quaker.get(k)
graph_from_greedy = dict_of_graph_greedy_quaker.get(k)


figure,(figure_dynamic, figure_greedy) = plt.subplots(1, 2,figsize=(20,9))


for v, e in graph_from_dynamic.edges:
    graph_from_dynamic.add_edge(v, e)

nx.draw(graph_from_dynamic, with_labels = False, alpha=0.8, node_size = 40, ax=figure_dynamic )
figure_dynamic.set_title('Graph from dynamic')



for v, e in graph_from_greedy.edges:
    graph_from_greedy.add_edge(v, e)

nx.draw(graph_from_greedy, with_labels = False, alpha=0.8, node_size = 40, ax=figure_greedy, node_color='g')
figure_greedy.set_title('Graph from Greedy')

plt.show()

5. Conclusion

The Memoized Dynamic Programming algorithm is the only algorithm not taken from the article; we created it ourselves, and in practice it performs about $500$ times faster than the Optimized Dynamic Programming Algorithm mentioned in the paper, with $100\%$ identical results and a time complexity of roughly $O(n)$. It is, however, slower than the Greedy Algorithm.
The same algorithm has higher accuracy than the Greedy Algorithm and produces more reconstructable degree sequences.
The reason could be that dynamic programming digs deeper, checking more of the possible groupings of the degree sequence than the greedy algorithm, which, as the name suggests, always takes the locally cheapest choice and misses many deeper options. This explanation is supported by the fact that the greedy algorithm is computationally less expensive and takes less time to execute, but at the cost of lower accuracy.

6. References