gsnn.proc.bio
Functions
| build_uniprot_symbol_map | All (uniprot, gene_symbol, func_name) triples observed on any edge endpoint. |
| ensg2symbol | Map Ensembl gene IDs (ENSG) to HGNC gene symbols using PyPath. |
| get_bio_interactions | Retrieve and standardise directed biological interactions from the OmniPath knowledge base suite. |
| symbol2uniprot | Map gene symbols to UniProt accession IDs using PyPath. |
| uniprot2symbol | Map UniProt accession IDs to HGNC gene symbols using PyPath. |
- gsnn.proc.bio.build_uniprot_symbol_map(func_edges)
All (uniprot, gene_symbol, func_name) triples observed on any edge endpoint.
- Parameters:
func_edges (pd.DataFrame) – The second return value of get_bio_interactions(). Must contain columns src, dst, source_uniprot, target_uniprot.
- Returns:
A DataFrame with columns (in order): uniprot, gene_symbol, func_name, node_kind ('PROTEIN' or 'RNA'). One row per unique (uniprot, func_name) pair. The mapping is many-to-many: a single uniprot may appear with multiple func_names, and a single func_name may carry multiple uniprots.
- Return type:
pd.DataFrame
Notes
Excludes synthetic COMPLEX:... accessions emitted by the OmniPath complexes endpoint (they are not real UniProt IDs).
Excludes miRBase-style identifiers carried on miRNA edges (anything that does not look like a UniProt accession; see _UNIPROT_RE).
Pure pandas; no network I/O.
The unique uniprots here are a strict superset of the well-formed UniProt accessions in func_nodes['uniprot'] (i.e. those matching _UNIPROT_RE), because func_nodes keeps only one accession per node via .last() while this table retains every distinct (uniprot, func_name) pair seen on any edge endpoint.
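The behaviour described in these notes can be approximated with a few pandas operations. The following is a minimal sketch, not the module's implementation: the regular expression is the accession pattern published by UniProt and only stands in for the private _UNIPROT_RE, and the function name sketch_uniprot_symbol_map and the toy edge table are hypothetical.

```python
import re
import pandas as pd

# Assumption: stand-in for the module's private _UNIPROT_RE, based on the
# accession pattern published by UniProt.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def sketch_uniprot_symbol_map(func_edges: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration only."""
    # Stack source and target endpoints into one long (uniprot, func_name) table.
    src = func_edges[["source_uniprot", "src"]].rename(
        columns={"source_uniprot": "uniprot", "src": "func_name"}
    )
    dst = func_edges[["target_uniprot", "dst"]].rename(
        columns={"target_uniprot": "uniprot", "dst": "func_name"}
    )
    pairs = pd.concat([src, dst], ignore_index=True).dropna()
    # Keep well-formed accessions only; this drops COMPLEX:... and miRBase-style IDs.
    pairs = pairs[pairs["uniprot"].map(lambda acc: bool(UNIPROT_RE.match(acc)))]
    pairs = pairs.drop_duplicates(["uniprot", "func_name"]).reset_index(drop=True)
    # Split e.g. 'PROTEIN__TP53' into node_kind='PROTEIN' and gene_symbol='TP53'.
    kind_symbol = pairs["func_name"].str.split("__", n=1, expand=True)
    pairs["node_kind"] = kind_symbol[0]
    pairs["gene_symbol"] = kind_symbol[1]
    return pairs[["uniprot", "gene_symbol", "func_name", "node_kind"]]


# Toy edge table (hypothetical rows, including one complex and one miRNA endpoint).
edges = pd.DataFrame({
    "src": ["PROTEIN__TP53", "PROTEIN__A_B", "RNA__HSA-MIR-21", "PROTEIN__TP53"],
    "dst": ["RNA__MDM2", "PROTEIN__MDM2", "RNA__PTEN", "RNA__MDM2"],
    "source_uniprot": ["P04637", "COMPLEX:P1_P2", "MIMAT0000076", "P04637"],
    "target_uniprot": ["Q00987", "Q00987", "P60484", "Q00987"],
})
result = sketch_uniprot_symbol_map(edges)
```

In this toy table the accession Q00987 ends up on two rows (RNA__MDM2 and PROTEIN__MDM2), which illustrates the many-to-many nature of the mapping.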
- gsnn.proc.bio.ensg2symbol(ensg_ids, allow='1:m', drop_na=True)
Map Ensembl gene IDs (ENSG) to HGNC gene symbols using PyPath.
A convenience wrapper around pypath.utils.mapping.map_name that translates Ensembl gene IDs into their corresponding HGNC gene symbols.
Two mapping strategies are available (allow):
'1:m' - keep all gene symbols associated with an Ensembl ID (one-to-many, default).
'1:1' - keep only the first gene symbol returned by PyPath for each Ensembl ID (one-to-one).
- Parameters:
ensg_ids (Sequence[str] or pandas.Series) – Iterable of Ensembl gene IDs. Duplicate IDs are collapsed to the unique set for the lookup, but the returned DataFrame contains one row per combination of Ensembl ID and gene symbol.
allow (str, optional) – Mapping strategy; must be either '1:m' or '1:1'. Defaults to '1:m'.
drop_na (bool, optional) – If True, drop rows where the Ensembl ID could not be mapped to a gene symbol. Defaults to True.
- Returns:
A two-column DataFrame with
'ensg_id' - Ensembl gene ID (str)
'gene_symbol' - HGNC gene symbol (str) or None if the Ensembl ID could not be mapped.
- Return type:
pandas.DataFrame
Example
>>> import pandas as pd
>>> from gsnn.proc.bio import ensg2symbol
>>> ensgs = pd.Series(['ENSG00000100030', 'ENSG00000171862', 'INVALID'])
>>> ensg2symbol(ensgs, allow='1:m').head()
           ensg_id gene_symbol
0  ENSG00000100030       MAPK1
1  ENSG00000171862        PTEN
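The interplay of allow and drop_na can be illustrated without PyPath. The sketch below is an assumption-laden stand-in: TOY_LOOKUP and sketch_map are hypothetical, the identifiers are made up, and the real function queries PyPath rather than a dictionary.

```python
import pandas as pd

# Hypothetical lookup standing in for PyPath; the IDs and symbols are invented.
TOY_LOOKUP = {"ENSG_A": ["SYM1", "SYM2"], "ENSG_B": ["SYM3"]}


def sketch_map(ids, allow="1:m", drop_na=True):
    """Illustrate the documented '1:m' / '1:1' and drop_na semantics."""
    if allow not in ("1:m", "1:1"):
        raise ValueError("allow must be '1:m' or '1:1'")
    rows = []
    # Duplicate IDs are collapsed before the lookup.
    for ensg in pd.Series(ids).drop_duplicates():
        symbols = TOY_LOOKUP.get(ensg, [])
        if allow == "1:1":
            symbols = symbols[:1]  # keep only the first symbol returned
        if not symbols:
            rows.append((ensg, None))  # unmapped ID
        rows.extend((ensg, symbol) for symbol in symbols)
    out = pd.DataFrame(rows, columns=["ensg_id", "gene_symbol"])
    if drop_na:
        out = out.dropna(subset=["gene_symbol"]).reset_index(drop=True)
    return out


ids = ["ENSG_A", "ENSG_B", "ENSG_A", "INVALID"]
one_to_many = sketch_map(ids)                # ENSG_A contributes two rows
one_to_one = sketch_map(ids, allow="1:1")    # one row per mapped ID
keep_na = sketch_map(ids, drop_na=False)     # INVALID is kept with a missing symbol
```

With drop_na=True the unmapped ID disappears from the result, which is why the 'INVALID' entry does not show up in the example output above.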
- gsnn.proc.bio.get_bio_interactions(undirected=False, include_tf_mirna=False, include_pathway_extra=False, include_kinase_extra=False, include_ligrec_extra=False, include_collecTRI=False, include_dorothea=True, include_omnipath=True, dorothea_levels=['A', 'B'], gene_symbol=True, complex_handling='link', min_n_references=None, min_curation_effort=None, return_uniprot_map=False, verbose=True)
Retrieve and standardise directed biological interactions from the OmniPath knowledge base suite.
The function downloads, harmonises and concatenates several curated interaction resources that are exposed through the omnipath Python package and converts them into a single DataFrame with unified node identifiers. Each identifier is prefixed with the molecular entity type so that the downstream GSNN pipeline can easily distinguish between RNA and protein nodes:
PROTEIN__<gene_symbol>
RNA__<gene_symbol>
In addition, an explicit translation edge (RNA → PROTEIN) is created for every gene that is found in both the RNA and the protein namespace.
- Parameters:
undirected (bool, optional (default=False)) – If True, the graph is made undirected by adding a reverse edge for every existing interaction.
include_tf_mirna (bool, optional (default=False)) – Whether to augment the graph with TF-miRNA and miRNA-target interactions.
include_pathway_extra (bool, optional (default=False)) – Whether to include additional pathway interactions that lack direct literature support.
include_kinase_extra (bool, optional (default=False)) – Whether to include additional kinase-substrate interactions that lack direct literature support.
include_ligrec_extra (bool, optional (default=False)) – Whether to include additional ligand-receptor interactions that lack direct literature support.
include_collecTRI (bool, optional (default=False)) – Whether to include CollecTRI transcription-factor regulon interactions.
include_dorothea (bool, optional (default=True)) – Whether to include DoRothEA transcription-factor regulon interactions.
include_omnipath (bool, optional (default=True)) – Whether to include curated OmniPath protein-protein interactions.
dorothea_levels (list[str], optional (default=['A', 'B'])) – Confidence levels to retain from the DoRothEA transcription-factor regulon resource. Valid levels are ['A', 'B', 'C', 'D'].
gene_symbol (bool, optional (default=True)) – If True, the identifiers are returned as HGNC gene symbols. Otherwise UniProt accession IDs are used.
complex_handling ({'none', 'remove', 'expand', 'link'}, optional (default='link')) – How to deal with protein-complex entities (OmniPath encodes complexes as underscore-concatenated member gene symbols, e.g. PROTEIN__AEBP2_EED_EZH2_RBBP4_SUZ12 for PRC2):
'none' - leave complex nodes untouched (legacy behaviour, kept for backwards compatibility).
'remove' - drop every edge that involves a complex.
'expand' - replace each complex with its constituent members, fanning out one edge per member. This recovers gene-level coverage at the cost of introducing approximate member-level edges that were not literally curated.
'link' - rename complex nodes into a dedicated COMPLEX__ namespace and add explicit PROTEIN__<member> -> COMPLEX__<...> "assembly" edges, so the GSNN can learn complex activity from member activity while preserving the unit-level semantics of the curated interaction.
min_n_references (int or None, optional (default=None)) – If set, retain only edges supported by at least this many PubMed references (OmniPath n_references field). Datasets that do not expose the column are dropped entirely when this filter is active.
min_curation_effort (int or None, optional (default=None)) – If set, retain only edges whose OmniPath curation-effort score is at least this value. Datasets that do not expose the column are dropped entirely when this filter is active.
return_uniprot_map (bool, optional (default=False)) – If True, return a third element: the many-to-many UniProt mapping table produced by build_uniprot_symbol_map().
verbose (bool, optional (default=True)) – Whether to print progress updates.
- Returns:
pandas.DataFrame – One row per function-graph node with columns ['func_name', 'uniprot', 'gene_symbol']. func_name is the prefixed node id (e.g. PROTEIN__TP53); gene_symbol is the suffix after __; uniprot is the last-seen OmniPath accession for that node.
pandas.DataFrame – DataFrame with columns ['src', 'dst', 'edge_type', 'source_uniprot', 'target_uniprot'] describing the directed interaction graph. source_uniprot and target_uniprot are the OmniPath source/target accession columns (typically UniProt; complexes may use COMPLEX:...; miRNA resources may use miRBase IDs). A handful of synthetic assembly edges may lack a member-level source_uniprot.
pandas.DataFrame, optional – Returned only when return_uniprot_map=True. The many-to-many UniProt ↔ gene-symbol ↔ func_name mapping table from build_uniprot_symbol_map().
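To make the difference between the 'expand' and 'link' modes of complex_handling concrete, here is a small sketch on a single toy edge. It is an illustration under assumptions, not the library's code: the helpers members, expand and link are hypothetical, complexes are detected simply by an underscore in the symbol part, the target node PROTEIN__TARGET and the edge_type value 'ppi' are invented, and 'assembly' is assumed as the edge type of the added member edges.

```python
import pandas as pd

PREFIX = "PROTEIN__"


def members(node):
    """Return member symbols if the node looks like a complex, else an empty list."""
    if not node.startswith(PREFIX):
        return []
    parts = node[len(PREFIX):].split("_")
    return parts if len(parts) > 1 else []


def expand(edges):
    """'expand': replace each complex by its members, one edge per member."""
    out = edges.copy()
    for col in ("src", "dst"):
        out[col] = out[col].map(lambda n: [PREFIX + m for m in members(n)] or [n])
        out = out.explode(col, ignore_index=True)
    return out.drop_duplicates().reset_index(drop=True)


def link(edges):
    """'link': move complexes to COMPLEX__ and add member -> complex edges."""
    def rename(node):
        return "COMPLEX__" + node[len(PREFIX):] if members(node) else node

    nodes = pd.unique(edges[["src", "dst"]].to_numpy().ravel())
    assembly = pd.DataFrame(
        [(PREFIX + m, rename(n), "assembly") for n in nodes for m in members(n)],
        columns=["src", "dst", "edge_type"],
    )
    renamed = edges.assign(src=edges["src"].map(rename), dst=edges["dst"].map(rename))
    return pd.concat([renamed, assembly], ignore_index=True)


# One curated edge from the PRC2 complex to an invented target node.
edges = pd.DataFrame({
    "src": ["PROTEIN__AEBP2_EED_EZH2_RBBP4_SUZ12"],
    "dst": ["PROTEIN__TARGET"],
    "edge_type": ["ppi"],
})
expanded = expand(edges)  # five member-level edges
linked = link(edges)      # one complex-level edge plus five assembly edges
```

The 'expand' result contains only edges that were not literally curated, whereas 'link' keeps the curated edge at the complex level and expresses membership separately.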
Notes
The function prints the number of automatically generated translation edges. Depending on the local cache state, the first call may take a few seconds because the interaction tables are lazily downloaded from the OmniPath server.
Duplicate edges are collapsed on (src, dst, edge_type) only; if multiple accessions map to the same symbol after alias stripping, the first row is kept.
Several miRNA-related tables surface miRBase legacy alias strings of the form "HSA-MIR-675B;HSA-MIR-675*" – multiple historical names of the same mature miRNA joined by ';'. Only the first alias is retained as the canonical node identifier.
Examples
>>> from gsnn.proc.bio import get_bio_interactions
>>> nodes, edges = get_bio_interactions(undirected=True, include_tf_mirna=True)
>>> nodes.columns.tolist(), edges.shape
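The post-processing steps mentioned in the description and Notes (alias stripping, duplicate collapsing, and translation edges) can be sketched on a toy edge table. This is a rough illustration only: the node names, the edge_type values, and the 'translation' label are assumptions, and the order of the steps in the actual function may differ.

```python
import pandas as pd

# Toy edges: one exact duplicate and one miRNA node with a legacy alias string.
edges = pd.DataFrame({
    "src": ["PROTEIN__TF1", "PROTEIN__TF1",
            "RNA__HSA-MIR-675B;HSA-MIR-675*", "PROTEIN__GENE1"],
    "dst": ["RNA__GENE1", "RNA__GENE1", "RNA__GENE1", "PROTEIN__TF1"],
    "edge_type": ["tf", "tf", "mirna", "ppi"],
})

# 1. Keep only the first miRBase alias as the canonical node identifier.
for col in ("src", "dst"):
    edges[col] = edges[col].str.split(";").str[0]

# 2. Collapse duplicates on (src, dst, edge_type), keeping the first row.
edges = edges.drop_duplicates(subset=["src", "dst", "edge_type"], keep="first")


def symbols(nodes, prefix):
    """Strip the namespace prefix from every node that carries it."""
    return {n[len(prefix):] for n in nodes if n.startswith(prefix)}


# 3. Add an RNA -> PROTEIN translation edge for genes present in both namespaces.
nodes = set(edges["src"]) | set(edges["dst"])
shared = sorted(symbols(nodes, "RNA__") & symbols(nodes, "PROTEIN__"))
translation = pd.DataFrame({
    "src": [f"RNA__{g}" for g in shared],
    "dst": [f"PROTEIN__{g}" for g in shared],
    "edge_type": ["translation"] * len(shared),  # assumed label
})
result = pd.concat([edges, translation], ignore_index=True)
```

Here only GENE1 occurs in both namespaces, so a single translation edge RNA__GENE1 -> PROTEIN__GENE1 is appended.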
- gsnn.proc.bio.symbol2uniprot(gene_symbols, allow='1:m', drop_na=True)
Map gene symbols to UniProt accession IDs using PyPath.
A convenience wrapper around pypath.utils.mapping.map_name that translates gene symbols into their corresponding UniProt accession IDs.
Two mapping strategies are available (allow):
'1:m' - keep all UniProt IDs associated with a gene symbol (one-to-many, default).
'1:1' - keep only the first UniProt ID returned by PyPath for each gene symbol (one-to-one).
- Parameters:
gene_symbols (Sequence[str] or pandas.Series) – Iterable of gene symbols. Duplicate symbols are collapsed to the unique set for the lookup, but the returned DataFrame contains one row per combination of symbol and UniProt ID.
allow (str, optional) – Mapping strategy; must be either '1:m' or '1:1'. Defaults to '1:m'.
drop_na (bool, optional) – If True, drop rows where the gene symbol could not be mapped to a UniProt ID. Defaults to True.
- Returns:
A two-column DataFrame with
'gene_symbol' - Gene symbol (str)
'uniprot_id' - UniProt accession (str) or None if the symbol could not be mapped.
- Return type:
pandas.DataFrame
Example
>>> import pandas as pd
>>> from gsnn.proc.bio import symbol2uniprot
>>> symbols = pd.Series(['MAPK1', 'PTEN', 'INVALID'])
>>> symbol2uniprot(symbols, allow='1:m', drop_na=False).head()
  gene_symbol uniprot_id
0       MAPK1     P28482
1        PTEN     P60484
2     INVALID       None
- gsnn.proc.bio.uniprot2symbol(uniprot_ids, allow='1:m', drop_na=True)
Map UniProt accession IDs to HGNC gene symbols using PyPath.
A convenience wrapper around pypath.utils.mapping.map_name that translates protein accessions into their corresponding gene symbols.
Two mapping strategies are available (allow):
'1:m' - keep all gene symbols associated with a UniProt ID (one-to-many, default).
'1:1' - keep only the first gene symbol returned by PyPath for each UniProt ID (one-to-one).
- Parameters:
uniprot_ids (Sequence[str] or pandas.Series) – Iterable of UniProt accession IDs. Duplicate IDs are collapsed to the unique set for the lookup, but the returned DataFrame contains one row per combination of accession and gene symbol.
allow (str, optional) – Mapping strategy; must be either '1:m' or '1:1'. Defaults to '1:m'.
drop_na (bool, optional) – If True, drop rows where the accession could not be mapped to a gene symbol. Defaults to True.
- Returns:
A two-column DataFrame with
'uniprot_id' - UniProt accession (str)
'gene_symbol' - Gene symbol (str) or None if the accession could not be mapped.
- Return type:
pandas.DataFrame
Example
>>> import pandas as pd
>>> from gsnn.proc.bio import uniprot2symbol
>>> ids = pd.Series(['P28482', 'P60484', 'INVALID'])
>>> uniprot2symbol(ids, allow='1:m', drop_na=False).head()
  uniprot_id gene_symbol
0     P28482       MAPK1
1     P60484        PTEN
2    INVALID        None