`tutorial.Rmd`

This is a short tutorial to show how to employ the `ghypernet`

R package for the estimation of gHypEG. In the following, we use Zachary’s Karate Club dataset as baseline. The data is provided within this package and can be simply loaded. It corresponds to a weighted adjacency matrix whose entries report the number of interactions between individuals. The graph specified by the adjacency matrix is *undirected* and does not have *self-loops*. For this reason, we specify the two parameters `directed=FALSE`

and `selfloops=FALSE`

.

library(ghypernet) data("adj_karate") directed <- FALSE selfloops <- FALSE print(adj_karate[1:10,1:10]) #> Mr Hi Actor 2 Actor 3 Actor 4 Actor 5 Actor 6 Actor 7 Actor 8 Actor 9 #> Mr Hi 0 4 5 3 3 3 3 2 2 #> Actor 2 4 0 6 3 0 0 0 4 0 #> Actor 3 5 6 0 3 0 0 0 4 5 #> Actor 4 3 3 3 0 0 0 0 3 0 #> Actor 5 3 0 0 0 0 0 2 0 0 #> Actor 6 3 0 0 0 0 0 5 0 0 #> Actor 7 3 0 0 0 2 5 0 0 0 #> Actor 8 2 4 4 3 0 0 0 0 0 #> Actor 9 2 0 5 0 0 0 0 0 0 #> Actor 10 0 0 1 0 0 0 0 0 0 #> Actor 10 #> Mr Hi 0 #> Actor 2 0 #> Actor 3 1 #> Actor 4 0 #> Actor 5 0 #> Actor 6 0 #> Actor 7 0 #> Actor 8 0 #> Actor 9 0 #> Actor 10 0

GHypEG Models can be fitted using the general function `ghype`

. However, we provide some default functions to fit specific models. These functions take an adjacency matrix (not sparse) as main argument, or an `igraph`

graph object. They further allow to specify whether the model should be directed or not, and whether it should have selfloops or not.

The function `regularm`

allows to fit the simplest possible model that can be formulated with gHypEGs: gnp-like graph where only the average degree is preserved.

(regularmodel <- regularm(graph = adj_karate, directed = directed, selfloops = selfloops)) #> Call: #> ghype.matrix(graph = graph, directed = directed, selfloops = selfloops, #> unbiased = TRUE, regular = TRUE) #> ghype undirected , no selfloops #> 34 vertices, 231 edges #> Loglikelihood: #> -584.7139 #> df: 1

The function `scm`

allows to fit the soft-configuration model, where all degrees are preserved.

(confmodel <- scm(graph = adj_karate, directed = directed, selfloops = selfloops)) #> Call: #> ghype.matrix(graph = graph, directed = directed, selfloops = selfloops, #> unbiased = TRUE, regular = FALSE) #> ghype undirected , no selfloops #> 34 vertices, 231 edges #> Loglikelihood: #> -434.5449 #> df: 34

From the models, we can generate random realisations that can be used, e.g., as null-models to test hypothesis about the data. This can be achieved by means of the function `rghype`

, specifying the number of realisations to take.

random_graphs_rm <- rghype(nsamples = 10, model = regularmodel, seed = 123) random_graphs_scm <- rghype(nsamples = 10, model = confmodel, seed = 123)

Finally, we can perform model selection and hypothesis testing by comparing statistics of two models. We can follow two different approaches, comparing information scores such as AIC, or performing a likelihood ratio test. One simple question we can ask, for example, is whether the degree sequence of the Karate Club can be simply explained by a regular model, or there is the need to use a configuration model. AIC scores can be obtained using the standard R function `AIC`

. Likelihood-ratio tests are performed using the function `lr.test`

. However, we provide some wrappers for the function `lr.test`

for some commonly done tests like the one above.

Comparing AIC scores for the regular model and the configuration model yields the following result:

AIC(regularmodel) #> [1] 1171.428 AIC(confmodel) #> [1] 937.0897 # difference in AICs, high value means more complex model is better AIC(regularmodel) - AIC(confmodel) #> [1] 234.338

This shows that the there is strong evidence for the employing the configuration model, because the degree sequence of the graph strongly deviates from the regular one.

To increase confidence in this result, we can compare the result above with that obtained from random data generated from the regular model.

# Generate regular models and configuration models for random graphs regularmodels <- lapply(X = random_graphs_rm, FUN = regularm, directed=directed, selfloops=selfloops) confmodels <- lapply(X = random_graphs_rm, FUN = scm, directed=directed, selfloops=selfloops) # Compute AICs AIC_regularmodels <- sapply(X = regularmodels,FUN = AIC) AIC_confmodels <- sapply(X = confmodels,FUN = AIC) # differences in AIC, high value means more complex model is better AIC_regularmodels - AIC_confmodels #> [1] -46.55258 -32.36789 -20.92125 -28.61791 -34.33781 -39.78901 -31.21541 #> [8] -30.15379 -18.92063 -41.49301

As can be seen by the experiment above, in the case of random graphs generated from the regular model, the AIC of the configuration model is *higher* than that of the regular model, confirming that the added complexity is not justified.

The (empirical) likelihood-ratio test gives a similar result, with the benefit of providing a p-value. The function `conf.test`

provides a simple way to perform the test without the need of computing the models.

conf.test(graph = adj_karate, directed = directed, selfloops = selfloops, seed=123) #> #> LR test -- gnp vs CM #> #> data: #> lr = 299.79, df = 33, p-value < 2.2e-16 #> alternative hypothesis: one.sided #> 95 percent confidence interval: #> 19.63033 51.83806

Similarly to what done above, we can perform the test using the random graphs. Now we expect large p-values.

The next model that can be estimated is the block constrained configuration model. The Karate Club has a well-known partitioning into two communities that can be loaded from the package. We fit a bccm using the `bccm`

function, specifying the vertex labels.

data("vertexlabels") (blockmodel <- bccm(adj = adj_karate, labels = vertexlabels, directed = directed, selfloops = selfloops)) #> Call: #> bccm(adj = adj_karate, labels = vertexlabels, directed = directed, #> selfloops = selfloops) #> block ghype undirected , no selfloops #> 34 vertices, 231 edges #> Loglikelihood: #> -336.8385 #> df: 36 #> #> Coefficients: #> Estimate Std.Err t value Pr(>t) #> 1 1.00000000 0.00000000 Inf < 2.2e-16 *** #> 2 0.09044494 0.00021692 416.95 < 2.2e-16 *** #> 4 0.91049902 0.00040434 2251.79 < 2.2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 print(blockmodel$blockOmega) #> 1 2 #> 1 1.00000000 0.09044494 #> 2 0.09044494 0.91049902

By default, the function fits a ‘full’ block structure, where every parameter for in- out-blocks relations are different. In this case this corresponds to three parameters: one for block 1, one for block 2, one for the edges between the two blocks. However, in some cases we see that the parameters of in-blocks relations are similar to each other, as can be seen above looking at the diagonal entries of the block matrix. In this case, a full block structure may overfit the data, while a simpler model could be better. This can be fitted setting the parameter `homophily=TRUE`

. This corresponds to a model where there are only two parameters, one for in-block edges, one for out-block edges, irrespectively of the number of blocks. The result is shown below.

(blockmodel_2 <- bccm(adj = adj_karate, labels = vertexlabels, directed = directed, selfloops = selfloops, homophily = TRUE)) #> Call: #> bccm(adj = adj_karate, labels = vertexlabels, directed = directed, #> selfloops = selfloops, homophily = TRUE) #> block ghype undirected , no selfloops #> 34 vertices, 231 edges #> Loglikelihood: #> -337.0666 #> df: 35 #> #> Coefficients: #> Estimate Std.Err t value Pr(>t) #> 1 1.00000000 0.00000000 Inf < 2.2e-16 *** #> 2 0.09512422 0.00020628 461.14 < 2.2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first test we need to perform is whether either of the model just fitted significantly improve the fit to the data. We do so by performing a likelihood ratio test between the configuration model and the bccms.

lr.test(nullmodel = confmodel, altmodel = blockmodel, seed = 123) #> #> LR test #> #> data: #> lr = 195.41, df = 2, p-value < 2.2e-16 #> alternative hypothesis: one.sided #> 95 percent confidence interval: #> 1.306644e-08 8.230181e+00 lr.test(nullmodel = confmodel, altmodel = blockmodel_2, seed = 123) #> #> LR test #> #> data: #> lr = 194.96, df = 1, p-value < 2.2e-16 #> alternative hypothesis: one.sided #> 95 percent confidence interval: #> 1.474561e-08 8.031860e+00

Unsurprisingly, the test gives low p-values, confirming the presence of a block structure. Again, we can perform a similar analysis using AIC:

From this, we note that while specifying the full bccm gives an improvement in the score compared to the two parameters model, such improvement is relatively small, and appears to not justify the increased complexity. We can verify again this with a likelihood-ratio test:

lr.test(nullmodel = blockmodel_2, altmodel = blockmodel, seed=123) #> #> LR test #> #> data: #> lr = 0.45622, df = 1, p-value = 0.7645 #> alternative hypothesis: one.sided #> 95 percent confidence interval: #> 0.03722328 6.94467855

The result shows that the added complexity of the second model is not justified by the data. We investigate this further by comparing the results with what obtained from random variations.

# First generate random sample from blockmodel random_graphs_bccm2 <- rghype(nsamples = 100, model = blockmodel_2, seed = 123) # Generate the two models for random graphs blockmodels <- lapply(X = random_graphs_bccm2, FUN = bccm, labels = vertexlabels, directed=directed, selfloops=selfloops) blockmodel_2s <- lapply(X = random_graphs_bccm2, FUN = bccm, labels = vertexlabels, directed=directed, selfloops=selfloops, homophily = TRUE) # Compute AICs AIC_blockmodels <- sapply(X = blockmodels, FUN = AIC) AIC_blockmodel_2s <- sapply(X = blockmodel_2s, FUN = AIC) # mean difference in AIC, high value means more complex model is better summary(AIC_blockmodel_2s - AIC_blockmodels) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> -2.0000 -1.5366 -0.6637 -0.1704 0.6268 8.3348

This confirms the fact that the asymmetry in the two blocks can ascribed to random variations.

Finally, our framework allows to evaluate the goodness-of-fit of model to data. Similarly to a multinomial goodness-of-fit, we perform a test of the chosen against the *maximally complex model* that can be formulated. This model provides the baseline against which comparing simpler models in terms of their goodness-of-fit. The ‘full model’ is specified such that the observed graph is the expected one. It is fitted using the function `ghype`

with the flag `unbiased=FALSE`

. The likelihood ratio test can be performed either manually or using the function `gof.test`

.

fullmodel <- ghype(graph = adj_karate, directed = directed, selfloops = selfloops, unbiased = FALSE) lr.test(nullmodel = blockmodel_2, altmodel = fullmodel, seed = 123) #> #> LR test #> #> data: #> lr = 454.89, df = 559, p-value = 7.719e-14 #> alternative hypothesis: one.sided #> 95 percent confidence interval: #> 270.4443 340.7790 gof.test(model = blockmodel_2, seed = 123) #> #> LR test -- GOF #> #> data: #> lr = 454.89, df = 559, p-value = 7.719e-14 #> alternative hypothesis: one.sided #> 95 percent confidence interval: #> 270.4443 340.7790

The reason for this result is the fact that, although the bccm is a good fit for the empirical graph, the latter is characterised by some particular structures that are not encoded by the bccm. For example, the empirical graph is characterised by few ‘bridges’ between the two communities, managed by relatively low degree vertices. Instead, the bccm assumes that *all* vertices connect weakly the two communities, with an amount of edges proportional to the degree.