꾸준히 써보는 공부 기록

AM-GCN 본문

Paper Review

AM-GCN

jisu1013 2020. 12. 20. 22:25

ABSTRACT

Weakness of GCN

capability of the state-of-the-art GCNs in fusing node features and topological structures is distant from optimal or even satisfactory

⇒ GCN은 topological structures과 node features 사이의 상관 정보를 adaptively 학습한다

Solution

GCN 성능은 유지하면서 node features와 topological structures을 같이 고려하는 AM-GCN

Adaptive Multi-channel Graph Convolutional Networks for semi-supervised classification

The cental idea is that we extract the specific and common embeddings from node features, topological structures, and their combinations simultaneously, and use the attention mechanism to learn adaptive importance weights of the embeddings.

중심 아이디어는 동시에 node feature와 topological structures 그리고 그들의 combinations로부터 명확하고 공통된 Embedding 값을 뽑아낸다는 것이다. 그리고 Attention mechanism을 사용해서 Embedding들의 Adaptive Importance Wieghts를 학습한다.

⇒ AM-GCN은 node features와 topological structures로부터 가장 correlated information을 뽑아내고 정확도를 향상시키게 된다.

1. INTRODUCTION

Typical GCN

usually follow a message-passing manner

A key step : feature aggregation - a node aggregates feature information from its topological neighbors in each convolutional layer

Q) In this way, feature information propagates over network topology to node embedding, and then node embedding learned as such is used in classification tasks.

Q) The whole process is supervised partially by the node labels.

우선 GCN에서도 fusion strategy를 사용한다. 그리고 이 fusion process는 end-to-end learning framework에 의해 supervised된다.

그러나,

최근 GCN 연구들에서 node feature와 topological structure을 fusing하는 것의 한계를 볼 수 있다.

1) Li et al.[15] shows that GCNs actually perform the Laplacian smoothing on node features, and make the node embedding in the whole network gradually converge.(모이다)

2) Nt and Maehara[20] and Wu et al.[30] prove that topological structure play the role of low-pass filtering on node features when the feature information propagates over network topological structure.

3) Gao et al. [8] design a Conditional Random Field (CRF) layer in GCN to explicitly preserve connectivity between nodes.

  • Hidden Markov Model (HMM)

    Markov Model은 어떠한 날씨, 주식 가격 등과 같은 어떠한 현상의 변화를 확률 모델로 표현한 것이다. Hidden Markov Model은 이러한 Markov model에 은닉된 state와 직접적으로 확인 가능한 observation을 추가하여 확장한 것이다.

    HMM은 observation을 이용하여 간접적으로 은닉된 state를 추론하기 위한 문제를 풀기 위해 사용된다.

    위 그림은 은닉된 state와 그에 따른 observation의 개념을 나타낸다.

    HMM을 이용해 우리가 풀고자 하는 문제는 관측 가능한 것은 오직 yty_t 뿐이며,

    yty_tqtq_t에 종속적으로 발생한다고 할 때, yty_t의 sequence를 통해 qtq_t의 sequence를 추

    론하는 것이다.

    → 현재는 필기체 인식, 동작 인식, 품사 태깅 등과 같이 이전 시간의 데이터에 영향을

    받는 순차 데이터에서 패턴을 인식하기 위해 이용되고 있다

    1. HMM의 구조

    HMM은 <Q,Y,π,T,E><Q,Y,\pi,T,E>라는 tuple로 정의. 각각의 요소가 의미하는 내용은 다음과 같다.

    • Q={q1,q2,...,qN}Q=\{q_1,q_2,...,q_N\} : 은닉된 state들의 집합
    • Y={y1,y2,...,yM}Y=\{y_1,y_2,...,y_M\}: 은닉된 state에서 발생할 수 있는 observation들의 집합
    • π\pi : RN\mathbb{R}^N- 초기 state가 qiq_i일 확률을 나타내는 initial probability p(qi)p(q_i)의 집합
    • AA : RN×N\mathbb{R}^{N\times N} - qiq_i에서 qjq_j로 이동할 확률을 나타내는 transition probability

      p(qjqi)p(q_j|q_i)의 집합

    • BB : RN×M\mathbb{R}^{N\times M} - qiq_i에서 yjy_j가 발생할 확률을 나타내는 emission probability

      p(yjqi)p(y_j|q_i)의 집합

    그래프에서 sis_i는 은닉된 state를 나타내며, QQ에 있는 원소 중 한

    가지 값을 갖는다. ojo_jsis_i에서 발생한 observation을 의미하며,

    Y의 원소 중 한 가지 값을 갖는다.

    예시)

    지출내역을 통해 과거의 행동을 유추하는 문제를 HMM으로 정

    의할 수 있다. 오늘 하려는 행동은 무엇을 했는지에 영향을 받으

    며, 과거에 했던 행동은 과거로 돌아가지 않는 한 다시는 관측할

    수 없기 때문에 과거의 행동은 은닉된 state에 해당한다.

    이 예제에서 Q={study,friend,game}Q=\{study,friend,game\}

    지출 내역은 카드명세서를 통해 과거의 내역이더라도 직접적으

    로 관찰할 수 있기 때문에 observation에 해당됨. 이 예제에서

    Y={1,2,3,4}Y=\{1,2,3,4\}이며, 값이 클 수록 돈을 많이 사용했다는 것을

    의미.

    → HMM은 이러한 확률적 사실들을 바탕으로 observation을 통

    해 은닉된 state를 추론.

    2. HMM의 동작

  • Conditional Random Field (CRF)

    하나의 sequence를 보고 판단하는 것이 아니라 이전 혹은 이후를 참고하여 지금 상태를 결정하는 것 ⇒ Conditional Random Field (CRF)


What information do GCNs really learn and fuse from topological structures and node features?

⇒ fundamental question인데 그 이유는 GCN이 end-to-end learning framework이기 때문이다.

GCN cannot adequately fuse node features and topological structures to extract the most correlated information.

*** 이 논문의 목적은 GCN의 성능은 그래도 유지하면서 동시에 topological structures와 node features를 fusing하는 능력을 향상 시키는 것이다 !!

Central Idea

we learn the node embedding based on node features, topological structures, and their combinations simultaneously !!

Technically,

in order to fully exploit the information in feature space, we derive k-nearest neighbor graph generated from node features as the feature structural graph.

1) With the feature graph and the topology graph, we propagate node features over both topology space and feature space, so as to extract two specific embeddings in these two spaces with two specific convolution modules.

2) Considering the common characteristics between two spaces, we disign a common convolution module with a parameter sharing strategy to extract the common embedding shared by them.

3) utilize the attention mechanism to automatically learn the importance weights for different embeddings, so as to adaptively fuse them.

4) design the consistency and disparity constraints to ensure the consistency and disparity of the learned embeddings.

2. FUSION CAPABILITY OF GCNS : AN EXPERIMENTAL INVESTIGATION

GCN이 topological structure와 node feature로부터 정보를 adaptively extract하지 못한다는 것을

보여주는 실험

The main idea is that we will clearly establish the high correlation between node label with network topology and node features, respectively.

⇒ GCN의 좋은 fusion 능력이란 correlated information을 adaptively extract하는 것이다.

2.1 Case 1: Random Topology and Correlated Node Features

  • For each class randomly select 20 nodes for training and another 200 nodes for test
  • Train : GCN and MLP (baseline)
  • The node features are highly correlated with the node labels, MLP shows excellent performance
  • GCN extracts information from both node features and topological structures but cannot adaptively fuse them to avoid the interference from topological structures.

    → Cannot match the high performace of MLP

2.2 Case 2: Correlated Topology and Random Node Features

  • Generate another network with 900 nodes
  • The node features, each of 50 dimensions, are randomly generated
  • For the topological structure, employ the Stochastic Blockmodel(SBM) to split nodes into 3 communities
  • Within each community, the probability of building an edge is set to 0.03, and the probability of building an edge between nodes in different communities is set to 0.0015.
  • In this data set, the node labels are highly correlated with the topological structures, but not the node features
  • For each class randomly select 20 nodes for training and another 200 nodes for test
  • Train : GCN and DeepWalk
  • DeepWalk performs well because it models network topological structures throughly
  • GCN extracts information from both the node features and the topological structures, but cannot adaptively fuse them to avoid the inference from node features.

    → Cannot match the high performance of DeepWalk

Summary

  • The current fusion mechanism of GCN is distant from optimal
  • Even the correlation between node label with network topology or node features is very high, the current GCN cannot make full use of the supervision by node label to adaptively extract the most correlated information
  • Rethink the current mechanism of GCN because it is hard to know whether the topology or the node features are more correlated with the final task

3. AM-GCN : THE PROPOSED MODEL

Problem Setting

semi-supervised node classification

in an attributed graph G=(A,X)G=(A,X),

where ARn×nA \in\mathbb{R}^{n\times n} is the symmetric adjacency matrix with n nodes,

XRn×dX\in \mathbb{R}^{n \times d} is the node feature matrix,

dd is the dimension of node features.

Suppose each node belongs to one out of C C\ classes

(위의 그림은 overall framework of AM-GCN)

Key Idea

  • Node features to propagate not only in topology space and in feature space
  • The most correlated information with node label should be extracted from both of these two spaces
  • Construct a feature graph based on node features XX
  • XX propagate over both of feature graph and topology graph to learn two specific embeddings ZFZ_F and ZTZ_T
  • Design common convolution module with parameter sharing strategy to learn the common embedding ZCFZ_{CF} and ZCTZ_{CT}
  • Consistency constraint Lc\mathcal{L}_c is employed to enhance the "common" property of ZCFZ_{CF} and ZCTZ_{CT}
  • Disparity constraint Ld\mathcal{L}_d is to ensure the independence between ZFZ_F and ZCF, Z_{CF},\ ZTZ_T and ZCTZ_{CT}
  • Utilizes an Attention Mechanism to adaptively fuse these embeddings and extract the most correlated information ZZ for the final classification task

3.1 Specific Convolution Module

In order to capture the underlying structure of nodes

in Feature Space

  • construct a kk-nearest neighbor(kNNkNN) graph Gf=(Af,X)G_f=(A_f, X) based on node feature matrix XX
  • AfA_f : The adjacency matrix of kNNkNN  graph
  • nn개의 node들 사이에서 similarity matrix SRn×nS\in \mathbb{R}^{n\times n}를 계산

SS를 연산하는데 많은 방법들이 있지만, 가장 많이 쓰이는 2가지

나열. (xix_i and xjx_j are feature vectors of nodes ii and jj)

1) Cosine Similarity (논문에서는 이거 사용함)

It uses the cosine value of the angle between two vectors to measure the similarity

Sij=xixjxixjS_{ij}=\frac{x_i\cdot x_j}{|x_i||x_j|}

2) Heat Kernel

tt is the time parameter in heat conduction equation and we set t=2t=2

Sij=exixj2tS_{ij}=e^{-\frac{|||x_i-x_j||^2}{t}}


결론

  • Choose the Consine Similarity to obtain the similarity matrix SS
  • Choose top kk similar node pairs for each node to set edges and

    finally get the adjacency matrix AfA_f

  • With the input graph (Af,XA_f,X) in feature graph, the ll-th layer

    output Zf(l)Z_f^{(l)} can be represented as :

    Zf(l)=ReLU(D~12Af~D~12Zf(l1)Wf(l))Z_f^{(l)}=ReLU(\tilde{D}^{-\frac{1}{2}}\tilde{A_f}\tilde{D}^{-\frac{1}{2}}Z_f^{(l-1)}W_f^{(l)})

    Wf(l)W_f^{(l)} : weight matrix of the ll-th layer in GCNGCN,

    ReLUReLU : Relu activation function,

    The initial Zf(0)=XZ_f^{(0)}=X,

    A~f=Af+If\tilde{A}_f=A_f+I_f,

    D~f\tilde{D}_f : the diagonal degree matrix of A~f\tilde{A}_f,

    ZFZ_F : the last layer output embedding

이 과정을 통해서, we can learn the node embedding which captures the specific information ZFZ_F in feature space

As for the topology space,

we have the original input graph Gt=(At,Xt)G_t=(A_t,X_t)

where At=AA_t=A and Xt=XX_t=X

→ Then, the learned output embedding ZTZ_T based on topology graph can be calculated in the same way as in feature space.


3.2 Common Convolution Module

Not only extract the node specific embedding in these two spaces, but also to extract the common information shared by the two spaces.

⇒ common한 것을 찾는 이유가 정보의 어느 부분이 가장 correlated한지 결정하는 task를 하는데

유용하다. 그리고 common information을 찾기 위해서 Common-GCN을 디자인함.

First,

1) Common-GCN to extract the node embedding Zct(l)Z_{ct}^{(l)}  from topology graph (At,X)(A_t,X) as follows

Zct(l)=ReLU(Dt~12At~Dt~12Zct(l1)Wc(l))Z_{ct}^{(l)}=ReLU(\tilde{D_t}^{-\frac{1}{2}}\tilde{A_t}\tilde{D_t}^{-\frac{1}{2}}Z_{ct}^{(l-1)}W_c^{(l)}) (4)

where Wc(l)W_c^{(l)} is the ll-th layer weight matrix of CommonGCNCommon-GCN and Zct(l1)Z_{ct}^{(l-1)} is the node embedding in the (l1)(l-1)th layer and Zct(0)=XZ_{ct}^{(0)}=X.

2) When utilizing CommonGCNCommon-GCN to learn the node embedding from feature graph (Af,X)(A_f,X),

in order to extract the shared information, we share the same weight matrix Wc(l)W_c^{(l)} for every layer of CommonGCNCommon-GCN as follows:

Zcf(l)=ReLU(D~f12A~fD~f12Zcf(l1)Wc(l))Z_{cf}^{(l)}=ReLU(\tilde{D}_f^{-\frac{1}{2}}\tilde{A}_f\tilde{D}_f^{-\frac{1}{2}}Z_{cf}^{(l-1)}W_c^{(l)})

where Zcf(l)Z_{cf}^{(l)} is the ll-layer output embedding and Zcf(0)=XZ_{cf}^{(0)}=X.

The shared weight matrix can filter out the shared characteristics from two spaces.

⇒ 이 부분 이해안됨 !!

According to different input graphs,

we can get two output embedding ZCTZ_{CT} and ZCFZ_{CF} and the common embedding ZCZ_C of the two spaces is :

ZC=(ZCT+ZCF)/2Z_C=(Z_{CT}+Z_{CF})/2

3.3 Attention Mechanism

Now we have two specific embeddings ZT,ZFZ_T, Z_F and one common embedding ZCZ_C

Considering the node label can be correlated with one of them or even their combinations,

we use the attention mechanism att(ZT,ZC,ZF)att(Z_T,Z_C,Z_F) to learn their corresponding importance

(αt,αc,αf)Rn×1(\alpha_t,\alpha_c,\alpha_f)\in \mathbb{R}^{n\times 1} indicate the attention values of nn nodes with embeddings ZT,ZC,ZFZ_T,Z_C,Z_F

Here we focus on node i,i, where its embedding in ZTZ_T is zTiR1×hz_T^i\in \mathbb{R}^{1\times h} (i.e., the ii-th row of ZTZ_T)

Firstly,

transform the embedding through a nonlinear transformation, and then use one shared attention vector qRh×1q\in \mathbb{R}^{h' \times 1} to get attention value ωTi\omega_T^i as follows:

ωTi=qTtanh(W(zTi)T+b)\omega_T^i=q^T\cdot tanh(W\cdot(z_T^i)^T+b)

WRh×hW\in \mathbb{R}^{h'\times h} is the weight matrix and bRh×1b\in \mathbb{R}^{h' \times 1} is the bias vector.

We can get the attention values ωCi\omega_C^i and ωFi\omega_F^i for node ii in embedding matrices ZCZ_C  and ZFZ_F, respectively.

Normalization

normalize the attention values ωTi,ωCi,ωFi\omega_T^i,\omega_C^i,\omega_F^i with softmax function to get the final weight:

αTi=softmax(ωTi)=exp(ωTi)exp(ωTi)+exp(ωCi)+exp(ωFi)\alpha_T^i=softmax(\omega_T^i)=\frac{exp(\omega_T^i)}{exp(\omega_T^i)+exp(\omega_C^i)+exp(\omega_F^i)}

⇒ Larger αTi\alpha_T^i implies the corresponding embedding is more important !!

Similarly, aCi=softmax(ωCi)a_C^i=softmax(\omega_C^i) and αFi=softmax(ωFi)\alpha_F^i=softmax(\omega_F^i)

For all the nn nodes, we have the learned weights αt=[αTi],αc=[αCi],αf=[αFi]Rn×1\alpha_t=[\alpha_T^i],\alpha_c=[\alpha_C^i], \alpha_f=[\alpha_F^i] \in \mathbb{R}^{n\times 1}

αT=diag(αt),αC=diag(αc),αF=diag(αf)\alpha_T=diag(\alpha_t), \alpha_C=diag(\alpha_c),\alpha_F=diag(\alpha_f)

⇒ Combine these three embeddings to obtain the final embedding ZZ :

Z=αTZT+αCZC+αFZFZ=\alpha_T\cdot Z_T+ \alpha_C\cdot Z_C+\alpha_F\cdot Z_F

3.4 Objective Function

3.4.1 Consistency Constraint

For the two output embeddings ZCTZ_{CT} and ZCFZ_{CF} of Common-GCN,

(despite the Common-GCN has the shared weight matrix)

we design a consistency constraint to further enhance their commonality.

Firstly,

we use L2-normalization to normalize the embedding matrix as ZCTnor,ZCFnorZ_{CTnor},Z_{CFnor}

Then, the two normalized matrix can be used to capture the similarity of nn nodes as STS_T  and SFS_F

ST=ZCTnorZCTnorTS_T=Z_{CTnor}\cdot Z_{CTnor}^T

SF=ZCFnorZCFnorTS_F=Z_{CFnor}\cdot Z_{CFnor}^T

The consistency implies that the two similarity matrices should be similar, which gives rise to following constraint :

Lc=STSFF2\mathcal{L}_c=||S_T-S_F||^2_F

3.4.2 Disparity Constraint

Because embedding ZTZ_T and ZCTZ_{CT} are learned from the same graph Gt=(At,Xt)G_t=(A_t,X_t), to ensure they can capture different information, We employ the Hilbert-Schmidt Independence Criterion (HSIC),

a simple but effective measure of independence, to enhance the disparity of these two embeddings.

'Paper Review' 카테고리의 다른 글

Neural Collaborative Filtering (WWW '17)  (0) 2022.05.14
Comments