In Python available: St. Nicolas House Algorithm (SNHA) with bootstrap support for improved performance in dense networks

Tim Hake; Bernhard Bodenberger; Detlef Groth

doi:10.52905/hbph2023.1.63

Authors

Tim Hake University of Potsdam, Institute of Biochemistry and Biology, Bioinformatics Group, 14469 Potsdam, Germany
Bernhard Bodenberger University of Potsdam, Institute of Biochemistry and Biology, Bioinformatics Group, 14469 Potsdam, Germany
Detlef Groth University of Potsdam, Institute of Biochemistry and Biology, Bioinformatics Group, 14469 Potsdam, Germany https://orcid.org/0000-0002-9441-3978

Keywords:

Python, correlation, network reconstruction, bootstrap, St. Nicolas house algorithm

Abstract

The St. Nicolas House algorithm (SNHA) finds association chains of directly dependent variables in a data set. Dependency is measured by the correlation coefficient, and the resulting associations are visualized as an undirected graph. A bootstrap routine improves the network prediction and enables the computation of empirical p-values, which are used to evaluate the significance of the predicted edges. Synthetic data generated with the Monte Carlo method were used, first, to compare the Python package with the original R package and, second, to evaluate the predicted networks using sensitivity, specificity, balanced classification rate, and the Matthews correlation coefficient (MCC). The Python implementation yields the same results as the R package, indicating that the algorithm was ported correctly. The SNHA achieves high specificity for all tested graphs. For graphs with low edge densities, the algorithm achieves high scores on all evaluation metrics; for graphs with high edge densities, the other metrics decrease because of lower sensitivity, which could be partially improved by bootstrapping. The empirical p-values indicate that the predicted edges are significant.
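
The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions and compares a predicted undirected graph with a known one, each given as a symmetric 0/1 adjacency matrix. The function name `edge_metrics` and the toy graphs are hypothetical and chosen only for illustration; this is not code from the snha4py package, and the paper's exact evaluation procedure may differ.

```python
import math


def edge_metrics(true_adj, pred_adj):
    """Score a predicted undirected graph against the true graph.

    Both arguments are symmetric 0/1 adjacency matrices (lists of lists).
    Each unordered node pair is counted once (upper triangle only).
    Hypothetical helper for illustration, not part of snha4py.
    """
    n = len(true_adj)
    tp = fp = tn = fn = 0
    for i in range(n):
        for j in range(i + 1, n):
            t, p = true_adj[i][j], pred_adj[i][j]
            if t and p:
                tp += 1      # edge exists and was predicted
            elif p:
                fp += 1      # edge predicted but does not exist
            elif t:
                fn += 1      # edge exists but was missed
            else:
                tn += 1      # no edge, none predicted

    # Undefined ratios fall back to 0.0 (an assumption of this sketch).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    bcr = (sensitivity + specificity) / 2  # balanced classification rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "bcr": bcr,
        "mcc": mcc,
    }


# Toy example: the true graph is the chain 0-1-2-3,
# and the prediction misses the edge 2-3.
true_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
pred_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
scores = edge_metrics(true_adj, pred_adj)
print(scores)
```

In this toy case no false edge is predicted, so specificity is 1.0, while the one missed edge lowers sensitivity to 2/3 and pulls down both the balanced classification rate and the MCC. This mirrors the pattern the abstract reports for dense graphs, where specificity stays high and the other metrics drop with sensitivity.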
