pygenprop package¶
Subpackages¶
- pygenprop.testing package
- Submodules
- pygenprop.testing.test_assign module
- pygenprop.testing.test_database_reference module
- pygenprop.testing.test_evidence module
- pygenprop.testing.test_functional_element module
- pygenprop.testing.test_genome_property module
- pygenprop.testing.test_lib module
- pygenprop.testing.test_literature_reference module
- pygenprop.testing.test_parse module
- pygenprop.testing.test_parse_genome_properties_assignments module
- pygenprop.testing.test_parse_genome_properties_file module
- pygenprop.testing.test_results module
- pygenprop.testing.test_step module
- pygenprop.testing.test_tree module
- Module contents
Submodules¶
pygenprop.assign module¶
Created by: Lee Bergstrand (2017)
Description: Functions for assigning genome properties.
-
class
pygenprop.assign.
AssignmentCache
(interpro_member_database_identifiers: list = None, sample_name=None)[source]¶ Bases:
object
This class contains a representation of precomputed assignment results and InterPro member database matches.
-
cache_property_assignment
(genome_property_identifier: str, assignment: str)[source]¶ Stores cached assignment results for a genome property.
Parameters: - genome_property_identifier – The identifier of the genome property.
- assignment – An assignment of YES, NO or PARTIAL for the given genome property.
-
cache_step_assignment
(genome_property_identifier: str, step_number: int, assignment: str)[source]¶ Stores cached assignment results for a genome property step.
Parameters: - genome_property_identifier – The identifier of the genome property to which the step belongs.
- step_number – The step’s number.
- assignment – An assignment of YES or NO for the given step.
-
flush_property_from_cache
(genome_property_identifier)[source]¶ Remove a genome property from the cache using its identifier.
Parameters: genome_property_identifier – The identifier of the property to remove from the cache.
-
genome_property_identifiers
¶ Creates a set of identifiers belonging to the genome properties cached.
Returns: A set of genome property identifiers.
-
get_property_assignment
(genome_property_identifier)[source]¶ Retrieves cached assignment results for a genome property.
Parameters: genome_property_identifier – The identifier of the genome property.
Returns: An assignment of YES, NO or PARTIAL for the given genome property.
-
get_step_assignment
(genome_property_identifier: str, step_number: int)[source]¶ Retrieves cached assignment results for a genome property step.
Parameters: - genome_property_identifier – The identifier of the genome property to which the step belongs.
- step_number – The step’s number.
Returns: An assignment of YES or NO for the given step.
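The following is a minimal sketch of how the cache methods documented above could fit together. SketchAssignmentCache is a hypothetical stand-in written for illustration, not pygenprop's implementation; the two internal dictionaries and the identifier GenProp0001 are assumptions.

```python
class SketchAssignmentCache:
    """Hypothetical stand-in mirroring the documented method names only."""

    def __init__(self, interpro_member_database_identifiers=None, sample_name=None):
        self.interpro_member_database_identifiers = set(
            interpro_member_database_identifiers or [])
        self.sample_name = sample_name
        # Assumed storage layout: identifier -> 'YES' | 'NO' | 'PARTIAL'
        self.property_assignments = {}
        # Assumed storage layout: identifier -> {step_number -> 'YES' | 'NO'}
        self.step_assignments = {}

    def cache_property_assignment(self, genome_property_identifier, assignment):
        self.property_assignments[genome_property_identifier] = assignment

    def get_property_assignment(self, genome_property_identifier):
        # Returns None when nothing has been cached for the identifier.
        return self.property_assignments.get(genome_property_identifier)

    def cache_step_assignment(self, genome_property_identifier, step_number, assignment):
        steps = self.step_assignments.setdefault(genome_property_identifier, {})
        steps[step_number] = assignment

    def get_step_assignment(self, genome_property_identifier, step_number):
        return self.step_assignments.get(genome_property_identifier, {}).get(step_number)

    def flush_property_from_cache(self, genome_property_identifier):
        # Drop both the property result and its step results.
        self.property_assignments.pop(genome_property_identifier, None)
        self.step_assignments.pop(genome_property_identifier, None)

    @property
    def genome_property_identifiers(self):
        return set(self.property_assignments) | set(self.step_assignments)


example_cache = SketchAssignmentCache(sample_name='sample_a')
example_cache.cache_property_assignment('GenProp0001', 'PARTIAL')
example_cache.cache_step_assignment('GenProp0001', 1, 'YES')
```

Returning None for an uncached identifier is a design choice of this sketch; the real class may behave differently.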
-
-
pygenprop.assign.
assign_evidence
(assignment_cache: pygenprop.assign.AssignmentCache, current_evidence: pygenprop.evidence.Evidence)[source]¶ Assigns a result (YES, NO) to a piece of evidence based on the presence or absence of InterPro member identifiers or on the assignment of the evidence’s child genome properties.
Parameters: - assignment_cache – A cache containing step and property assignments and InterPro member database matches.
- current_evidence – The current evidence which needs assignment.
Returns: The assignment for the evidence.
-
pygenprop.assign.
assign_functional_element
(assignment_cache: pygenprop.assign.AssignmentCache, functional_element: pygenprop.functional_element.FunctionalElement)[source]¶ Assigns a result (YES, NO) to a functional element based on assignments of its evidences.
Parameters: - assignment_cache – A cache containing step and property assignments and InterPro member database matches.
- functional_element – The current functional_element which needs assignment.
Returns: The assignment for the functional element.
-
pygenprop.assign.
assign_genome_property
(assignment_cache: pygenprop.assign.AssignmentCache, genome_property: pygenprop.genome_property.GenomeProperty)[source]¶ Recursively assigns a result to a genome property and its children.
Parameters: - assignment_cache – A cache containing step and property assignments and InterPro member database matches.
- genome_property – The genome property to assign the results to.
Returns: The assignment results for the genome property.
-
pygenprop.assign.
assign_step
(assignment_cache: pygenprop.assign.AssignmentCache, step: pygenprop.step.Step)[source]¶ Assigns a result (YES, NO) to a step based on assignments of its functional elements.
Parameters: - assignment_cache – A cache containing step and property assignments and InterPro member database matches.
- step – The current step element which needs assignment.
Returns: The assignment for the step.
-
pygenprop.assign.
calculate_property_assignment_from_all_steps
(child_assignments: list)[source]¶ Takes the assignment results from all children and uses them to assign a result for the parent itself. This algorithm is used to assign results to a single step from child functional elements and for genome properties that have no required steps such as “category” type genome properties. This is a more generic version of the algorithm used in calculate_property_assignment_from_required_steps().
If all child assignments are NO, the parent should be NO. If all child assignments are YES, the parent should be YES. For anything in between, the parent should be PARTIAL.
Parameters: child_assignments – A list of assignment results for child steps or genome properties.
Returns: The parent’s assignment result.
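The all-YES / all-NO / otherwise-PARTIAL rule described above can be sketched as follows. The function name is hypothetical and this is a restatement of the documented rule, not pygenprop's source; treating an empty list as NO is an assumption.

```python
def assignment_from_all_children(child_assignments):
    """Hypothetical restatement of the documented rule."""
    total = len(child_assignments)
    yes_count = child_assignments.count('YES')
    no_count = child_assignments.count('NO')

    if total and yes_count == total:
        return 'YES'      # every child is YES
    if no_count == total:
        return 'NO'       # every child is NO (assumption: also covers an empty list)
    return 'PARTIAL'      # any mixture, including PARTIAL children
```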
-
pygenprop.assign.
calculate_property_assignment_from_required_steps
(required_step_assignments: list, threshold: int = 0)[source]¶ Takes the assignment results for each required step of a genome property and uses them to assign a result for the property itself. This is the classic algorithm used by EBI Genome Properties.
From: https://genome-properties.readthedocs.io/en/latest/calculating.html
To determine if the GP resolves to a YES (all required steps are present), NO (too few required steps are present) or PARTIAL (the number of required steps present is greater than the threshold, indicating that some evidence of the presence of the GP can be assumed).
Child steps must be present (‘YES’), not PARTIAL.
In the Perl code for Genome Properties:
Link: https://github.com/ebi-pf-team/genome-properties/blob/a76a5c0284f6c38cb8f43676618cf74f64634d33/code/pygenprop/GenomeProperties.pm#L646
    #Three possible results for the evaluation
    if($found == 0 or $found <= $def->threshold){
        $def->result('NO'); #No required steps found
    }elsif($missing){
        $def->result('PARTIAL'); #One or more required steps found, but one or more required steps missing
    }else{
        $def->result('YES'); #All steps found.
    }
If no required steps are found or the number found is less than or equal to the threshold –> NO
Else if any are missing –> PARTIAL
Else (none are missing) –> YES
So for problem space ALL_PRESENT > THRESHOLD > NONE_PRESENT:
YES when ALL_PRESENT = CHILD_YES_COUNT
PARTIAL when CHILD_YES_COUNT > THRESHOLD
NO when CHILD_YES_COUNT <= THRESHOLD
Parameters: - required_step_assignments – A list of assignment results for child steps or genome properties.
- threshold – The threshold of ‘YES’ assignments necessary for a ‘PARTIAL’ assignment.
Returns: The parent’s assignment result.
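The threshold rule quoted from the Perl code above can be sketched in Python as below. The function name is hypothetical and this is a restatement of the documented rule rather than pygenprop's actual implementation; only 'YES' children count as found, as stated above.

```python
def assignment_from_required_steps(required_step_assignments, threshold=0):
    """Hypothetical restatement of the documented threshold rule."""
    # Only 'YES' counts as a found step; 'PARTIAL' and 'NO' count as missing.
    found = required_step_assignments.count('YES')
    missing = len(required_step_assignments) - found

    if found == 0 or found <= threshold:
        return 'NO'       # no required steps found, or not more than the threshold
    if missing:
        return 'PARTIAL'  # above the threshold, but some required steps missing
    return 'YES'          # all required steps found
```

For example, with three required steps and a threshold of 1, one YES gives NO, two give PARTIAL and three give YES.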
-
pygenprop.assign.
calculate_step_or_functional_element_assignment
(child_assignments: list, sufficient_scheme=False)[source]¶ Assigns a step result or functional element result based on the assignments of its children. In the case of steps, these would be functional element assignments. In the case of functional elements, these would be evidence assignments.
For assignments from child genome properties YES or PARTIAL is considered YES.
See: https://github.com/ebi-pf-team/genome-properties/blob/a76a5c0284f6c38cb8f43676618cf74f64634d33/code/modules/GenomeProperties.pm#L686
    if($evObj->gp){
        if(defined($self->get_defs->{ $evObj->gp })){
            # For properties a PARTIAL or YES result is considered success
            if( $self->get_defs->{ $evObj->gp }->result eq 'YES' or
                $self->get_defs->{ $evObj->gp }->result eq 'PARTIAL' ){
                $succeed++;
            }elsif($self->get_defs->{ $evObj->gp }->result eq 'UNTESTED'){
                $step->evaluated(0);
Parameters: - sufficient_scheme – If false, any child NO means NO. If true, any child YES/PARTIAL means YES.
- child_assignments – A list containing strings of YES, NO or PARTIAL.
Returns: The assignment as either YES or NO.
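A minimal sketch of the two schemes described by sufficient_scheme follows. The function name is hypothetical. Counting PARTIAL as a success in both schemes and returning NO for an empty list are assumptions of this sketch, not confirmed behaviour of pygenprop.

```python
def step_or_element_assignment(child_assignments, sufficient_scheme=False):
    """Hypothetical restatement of the documented rule."""
    if not child_assignments:
        return 'NO'  # assumption: nothing to support the parent

    # YES or PARTIAL children are treated as successes.
    successes = [assignment in ('YES', 'PARTIAL') for assignment in child_assignments]

    if sufficient_scheme:
        satisfied = any(successes)  # one successful child is enough
    else:
        satisfied = all(successes)  # a single NO child means NO

    return 'YES' if satisfied else 'NO'
```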
pygenprop.assignment_file_parser module¶
Created by: Lee Bergstrand (2017)
Description: A parser for genome properties longform files.
pygenprop.database_file_parser module¶
Created by: Lee Bergstrand (2017)
Description: A parser for parsing genome properties flat files into a rooted DAG of genome properties.
-
pygenprop.database_file_parser.
create_marker_and_content
(genome_property_flat_file_line)[source]¶ Splits a line from a genome property flat file into a marker, content pair.
Parameters: genome_property_flat_file_line – A line from a genome property flat file.
Returns: A tuple containing a marker, content pair.
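As a rough illustration of the marker, content split, the sketch below assumes that each flat file line starts with a short marker (for example AC or //) followed by whitespace and the content. Both the function name and the sample lines are hypothetical.

```python
def split_marker_and_content(line):
    """Hypothetical sketch: split 'AC  GenProp0001' into ('AC', 'GenProp0001')."""
    # Everything before the first space is the marker; the rest is content.
    marker, _, content = line.strip().partition(' ')
    return marker, content.strip()
```

Lines that hold only a marker, such as a record terminator, come back with an empty content string.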
-
pygenprop.database_file_parser.
extract_identifiers
(identifier_string)[source]¶ Parse database or Genprop identifiers from an EV or TG tag content string.
Parameters: identifier_string – The content string from an EV or TG tag.
Returns: A list of identifiers.
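A sketch of identifier extraction is given below. It assumes that EV and TG content strings are semicolon-separated and that an EV string may carry a trailing sufficient flag; the function name and the sample strings are hypothetical, so check them against a real flat file before relying on them.

```python
def split_identifiers(identifier_string):
    """Hypothetical sketch: pull identifiers out of an EV or TG content string."""
    parts = [part.strip() for part in identifier_string.split(';')]
    # Drop empty fragments and the assumed 'sufficient' flag.
    return [part for part in parts if part and part != 'sufficient']
```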
-
pygenprop.database_file_parser.
parse_database_references
(genome_property_record)[source]¶ Parses database references from a genome properties record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A list of DatabaseReference objects.
-
pygenprop.database_file_parser.
parse_evidences
(genome_property_record)[source]¶ Parses evidences from a genome properties record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A list of evidence objects.
-
pygenprop.database_file_parser.
parse_functional_elements
(genome_property_record)[source]¶ Parses functional elements from a genome properties record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A list of functional element objects.
-
pygenprop.database_file_parser.
parse_genome_properties_flat_file
(genome_property_file)[source]¶ Parses a genome property flat file.
Parameters: genome_property_file – A genome property file handle object.
Returns: A GenomePropertiesTree object.
-
pygenprop.database_file_parser.
parse_genome_property
(genome_property_record)[source]¶ Parses a single genome property from a genome property record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A single genome property object.
-
pygenprop.database_file_parser.
parse_literature_references
(genome_property_record)[source]¶ Parses literature references from a genome properties record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A list of LiteratureReference objects.
-
pygenprop.database_file_parser.
parse_single_evidence
(current_evidence_dictionary)[source]¶ Creates an Evidence object from a pair of EV and TG tag content strings.
Parameters: current_evidence_dictionary – A dictionary containing EV and TG to content string mappings.
Returns: An Evidence object.
-
pygenprop.database_file_parser.
parse_steps
(genome_property_record)[source]¶ Parses steps from a genome properties record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A list of Step objects.
-
pygenprop.database_file_parser.
unwrap_genome_property_record
(genome_property_record)[source]¶ The standard genome property record wraps long lines at 80 characters. This function unwraps the record.
Parameters: genome_property_record – A list of marker, content tuples representing genome property flat file lines.
Returns: A list of reduced-redundancy marker, content tuples representing genome property flat file lines. Consecutive markers (often ‘CC’ and ‘**’) are collapsed to one tuple.
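The collapsing of consecutive markers can be sketched as below. The function name, the collapsible_markers parameter and the choice to join wrapped content with a single space are assumptions made for illustration.

```python
def collapse_consecutive_markers(record, collapsible_markers=('CC', '**')):
    """Hypothetical sketch: merge runs of the same marker into one tuple."""
    unwrapped = []
    for marker, content in record:
        same_as_previous = bool(unwrapped) and unwrapped[-1][0] == marker
        if same_as_previous and marker in collapsible_markers:
            # Append the wrapped content to the previous tuple.
            unwrapped[-1] = (marker, unwrapped[-1][1] + ' ' + content)
        else:
            unwrapped.append((marker, content))
    return unwrapped
```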
pygenprop.database_reference module¶
Created by: Lee Bergstrand (2017)
Description: The database reference class.
pygenprop.evidence module¶
Created by: Lee Bergstrand (2017)
Description: The evidence class.
-
class
pygenprop.evidence.
Evidence
(evidence_identifiers=None, gene_ontology_terms=None, sufficient=False, parent: pygenprop.functional_element.FunctionalElement = None)[source]¶ Bases:
object
A piece of evidence (ex. InterPro HMM hit or GenProp) that supports the existence of a functional element.
-
consortium_identifiers
¶ Gets the InterPro consortium signature identifiers (PFAM, TIGRFAM, etc.) representing a piece of evidence.
Returns: A set of consortium signature identifiers.
-
genome_properties
¶ Get genome properties that are used by this evidence.
Returns: A list of genome properties.
-
genome_property_identifiers
¶ Gets the genome property identifiers representing a piece of evidence.
Returns: A list of genome property identifiers.
-
has_genome_property
¶ Is the evidence a genome property?
Returns: True if the evidence is a genome property.
-
interpro_identifiers
¶ Gets the InterPro (IPRXXXXXX) identifiers representing a piece of evidence.
Returns: A list of InterPro identifiers.
-
pygenprop.functional_element module¶
Created by: Lee Bergstrand (2017)
Description: The functional element class.
pygenprop.genome_property module¶
Created by: Lee Bergstrand (2017)
Description: The genome property class.
-
class
pygenprop.genome_property.
GenomeProperty
(accession_id, name, property_type, threshold=0, parents=None, children=None, references=None, databases=None, steps=None, public=True, description=None, private_notes=None, tree=None)[source]¶ Bases:
object
Represents an EBI genome property. Each represents specific capabilities of an organism as proven by the presence of genes found in its genome.
-
child_genome_property_identifiers
¶ Collects the genome property identifiers of child genome properties.
Returns: A list of genome property identifiers.
-
required_steps
¶ Returns a list of all the required steps of the genome property.
Returns: All required steps as a list.
-
pygenprop.literature_reference module¶
Created by: Lee Bergstrand (2017)
Description: The literature reference class.
pygenprop.results module¶
Created by: Lee Bergstrand (2018)
Description: The genome properties results class.
-
class
pygenprop.results.
GenomePropertiesResults
(*genome_properties_results, properties_tree: pygenprop.tree.GenomePropertiesTree)[source]¶ Bases:
object
This class contains a representation of a table of results from one or more genome properties assignments.
-
differing_property_results
¶ Property results where properties differ in assignment in at least one sample.
Returns: A property result data frame where properties with the same value in every sample are filtered out.
-
differing_step_results
¶ Step results where steps differ in assignment in at least one sample.
Returns: A step result data frame where steps with the same value in every sample are filtered out.
-
generate_json_tree
(genome_properties_root)[source]¶ Creates a tree based representation of the genome properties assignment results.
Parameters: genome_properties_root – The root element of the genome properties tree.
Returns: A nested dict of assignment results.
-
get_property_result
(genome_property_id)[source]¶ Gets the assignment results for a given genome property.
Parameters: genome_property_id – The id of the genome property to get results for.
Returns: A list containing the assignment results for the genome property in question.
-
get_results
(*property_identifiers, steps=False, names=False)[source]¶ Creates a results dataframe for only a subset of genome properties.
Parameters: - property_identifiers – The id of one or more genome properties to get results for.
- steps – Add steps to the dataframe.
- names – Add property and or step names to the dataframe.
Returns: A dataframe with results for a specific set of genome properties.
-
get_results_summary
(*property_identifiers, steps=False, normalize=False)[source]¶ Creates a summary table of YES, NO and PARTIAL assignments for a given set of properties or property steps. Displays counts or percentages of YES, NO and PARTIAL assignments for the given properties or for the steps of the given properties.
Parameters: - property_identifiers – The id of one or more genome properties to get results for.
- steps – Summarize results for the steps of the input properties
- normalize – Display the summary as a percent rather than as counts.
Returns: A summary table dataframe
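The difference between counts and normalized output can be sketched with the standard library. The documented method returns a dataframe; this hypothetical helper returns a plain dict for a single sample and only illustrates the arithmetic.

```python
from collections import Counter


def summarize_assignments(assignments, normalize=False):
    """Hypothetical sketch: count YES, PARTIAL and NO, optionally as percentages."""
    counts = Counter(assignments)
    summary = {label: counts.get(label, 0) for label in ('YES', 'PARTIAL', 'NO')}

    if normalize:
        total = sum(summary.values())
        # Convert each count to a percentage of all assignments.
        summary = {label: (100.0 * count / total if total else 0.0)
                   for label, count in summary.items()}
    return summary
```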
-
get_step_name
(property_identifier, step_number)[source]¶ Helper function to quickly acquire a property step’s name.
Parameters: - property_identifier – The id of the genome property.
- step_number – The step number of the step.
Returns: The step’s name.
-
get_step_result
(genome_property_id, step_number)[source]¶ Gets the assignment results for a given step of a genome property.
Parameters: - genome_property_id – The id of the genome property that the step belongs to.
- step_number – The step number of the step.
Returns: A list containing the assignment results for the step in question.
Filter out results where all samples have the same value.
Parameters: - results – A step or property results data frame.
- only_drop_no_assignments – Only drop results where values are all NO.
Returns: A step or property data frame with certain properties filtered out.
-
supported_property_results
¶ Property results where properties which are not supported in any sample are removed.
Returns: A property result data frame where properties with all NO values are filtered out.
-
supported_step_results
¶ Step results where steps which are not supported in any sample are removed.
Returns: A step result data frame where steps with all NO values are filtered out.
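The two kinds of filtering described above (dropping rows that are identical across samples, or dropping only rows that are NO in every sample) can be sketched as follows. The real methods operate on data frames; this hypothetical helper uses a dict that maps an identifier to a list of per-sample assignments, and the identifiers are illustrative.

```python
def filter_results(results, only_drop_no_assignments=False):
    """Hypothetical sketch of the 'differing' and 'supported' filters."""
    kept = {}
    for identifier, sample_values in results.items():
        distinct = set(sample_values)
        if only_drop_no_assignments:
            drop = distinct == {'NO'}   # unsupported in every sample
        else:
            drop = len(distinct) <= 1   # same value in every sample
        if not drop:
            kept[identifier] = sample_values
    return kept
```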
-
-
pygenprop.results.
bootstrap_assignments
(assignment_cache, genome_properties_tree)[source]¶ Recursively fills in assignments for all genome properties in the genome properties tree based on existing cached assignments and InterPro member database identifiers.
Parameters: - assignment_cache – A cache containing step and property assignments and InterPro member database matches.
- genome_properties_tree – The global genome properties tree.
Returns:
-
pygenprop.results.
create_assignment_tables
(genome_properties_tree: pygenprop.tree.GenomePropertiesTree, assignment_cache: pygenprop.assign.AssignmentCache)[source]¶ Takes a results dictionary from the long form parser and creates two tables. One for property results and one for step results. The longform results file has only leaf assignment results. We have to bootstrap the rest.
Parameters: - genome_properties_tree – The global genome properties tree.
- assignment_cache – Per-sample genome properties results from the long form parser.
Returns: A tuple containing a property assignment table and a step assignment table.
-
pygenprop.results.
create_step_table_rows
(step_assignments)[source]¶ Unfolds a step result dict of dict and yields a step table row.
Parameters: step_assignments – A dict of dicts containing step assignment information ({gp_key -> {stp_key -> result}})
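Unfolding the nested {gp_key -> {stp_key -> result}} structure into rows can be sketched with a generator. The function name and the three-field row layout are assumptions; the real function may yield a different row shape.

```python
def unfold_step_assignments(step_assignments):
    """Hypothetical sketch: yield one (property, step, result) row per step."""
    for property_identifier, steps in step_assignments.items():
        for step_number, result in steps.items():
            yield property_identifier, step_number, result
```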
-
pygenprop.results.
create_synchronized_assignment_cache
(assignment_cache, genome_properties_tree)[source]¶ Remove genome properties from the assignment cache that are not found in both the genome properties tree and the assignment cache. This prevents situations where different versions of the cache and tree cannot find each other’s genome properties.
Parameters: - assignment_cache – A cache containing step and property assignments and InterPro member database matches.
- genome_properties_tree – The global genome properties tree.
Returns: An assignment cache containing data for genome properties shared between the tree and cache.
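The synchronization amounts to keeping only the identifiers shared by the cache and the tree. The sketch below shows that intersection on a plain dict of property assignments; the function name and the dict representation are hypothetical stand-ins for the real cache and tree objects.

```python
def synchronize_assignments(property_assignments, tree_identifiers):
    """Hypothetical sketch: keep assignments whose identifier is also in the tree."""
    shared = set(property_assignments) & set(tree_identifiers)
    return {identifier: assignment
            for identifier, assignment in property_assignments.items()
            if identifier in shared}
```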
pygenprop.step module¶
Created by: Lee Bergstrand (2017)
Description: The step class.
-
class
pygenprop.step.
Step
(number, functional_elements: list = None, parent: pygenprop.genome_property.GenomeProperty = None)[source]¶ Bases:
object
A class representing a step that supports the existence of a genome property.
-
genome_properties
¶ Collects all the child genome properties supporting a step.
Returns: A list of child genome properties for a step.
-
genome_property_identifiers
¶ Collects all the genome property identifiers supporting a step.
Returns: A list of the step’s child genome property identifiers.
-
name
¶ Gets the name for a step by combining the names of its functional elements.
Returns: The name of the step.
-
required
¶ Checks if the step is required by checking if any of the functional elements are required.
Returns: True if the step is required.
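The required check reduces to an any() over the step's functional elements. In the sketch below, SketchFunctionalElement and its required attribute are hypothetical stand-ins; the real FunctionalElement class may expose this differently.

```python
from dataclasses import dataclass


@dataclass
class SketchFunctionalElement:
    """Hypothetical stand-in with an assumed boolean 'required' attribute."""
    required: bool = False


def step_is_required(functional_elements):
    # A step is required as soon as one of its functional elements is required.
    return any(element.required for element in functional_elements)
```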
-
pygenprop.tree module¶
Created by: Lee Bergstrand (2018)
Description: The genome property tree class.
-
class
pygenprop.tree.
GenomePropertiesTree
(*genome_properties)[source]¶ Bases:
object
This class contains a representation of a set of nested genome properties. Internally, the instantiated object contains a rooted DAG of genome properties connected from root to leaf (parent to child). A dictionary is also included which points to each tree node for fast lookups by genome property identifier.
-
build_genome_property_connections
()[source]¶ Build connections between parent-child genome properties in the dictionary. This creates the rooted DAG.
-
consortium_identifiers
¶ All InterPro consortium signature identifiers (PFAM, TIGRFAM, etc.) used by the genome properties database.
Returns: A set of all unique consortium identifiers used in genome properties.
-
create_graph_links_json
(as_list=False)[source]¶ Creates a JSON representation of the genome property links.
Parameters: as_list – Return as a list instead of a JSON formatted string.
Returns: A JSON formatted string of a list of each property’s JSON representation.
-
create_graph_nodes_json
(as_list=False)[source]¶ Creates a JSON representation of a genome property dictionary.
Parameters: as_list – Return as a list instead of a JSON formatted string.
Returns: A JSON formatted string of a list of each property’s JSON representation.
-
create_metabolism_database_mapping_file
(file_handle)[source]¶ Writes a mapping file which maps each genome property to KEGG and MetaCyc.
Parameters: file_handle – A python file handle object.
-
create_nested_json
(current_property=None, as_dict=False)[source]¶ Converts the object to a nested JSON representation.
Parameters: - current_property – The current root genome property (for recursion)
- as_dict – Return a dictionary for incorporation into other JSON objects.
Returns: A JSON formatted string or dictionary representing the object.
-
genome_property_identifiers
¶ The identifiers of all genome properties in the database.
Returns: A set of all genome property identifiers.
-
get_evidence_identifiers
(consortium=False)[source]¶ Gets evidence identifiers from all genome properties in the database.
Parameters: consortium – If true, list the consortium signature identifiers (PFAM, TIGRFAM).
Returns: A set of all unique evidence identifiers used in genome properties.
-
interpro_identifiers
¶ All global InterPro identifiers (IPRXXXX, etc.) used by the genome properties database.
Returns: A set of all unique InterPro identifiers used in genome properties.
-
leafs
¶ Returns the leaf nodes of the rooted DAG.
Returns: A list of all genome property objects with no children.
-
root
¶ Gets the top level genome properties object in a genome properties tree.
Returns: The root genome property of the genome properties tree.
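The rooted DAG plus lookup dictionary described for this class can be sketched as below. SketchNode, its attributes and the helper functions are hypothetical; they only illustrate how parent-child connections, the root and the leafs could be derived from a dictionary keyed by identifier.

```python
class SketchNode:
    """Hypothetical stand-in for a genome property node."""

    def __init__(self, identifier, child_identifiers=()):
        self.identifier = identifier
        self.child_identifiers = list(child_identifiers)
        self.parents = []
        self.children = []


def connect(nodes_by_identifier):
    # Link each node to its children (and back) using the lookup dictionary.
    for node in nodes_by_identifier.values():
        for child_identifier in node.child_identifiers:
            child = nodes_by_identifier.get(child_identifier)
            if child is not None:
                node.children.append(child)
                child.parents.append(node)


def find_root(nodes_by_identifier):
    # The root is the first node without parents, or None for an empty tree.
    return next((n for n in nodes_by_identifier.values() if not n.parents), None)


def find_leafs(nodes_by_identifier):
    return [n for n in nodes_by_identifier.values() if not n.children]
```

Because a node may have more than one parent, the structure is a DAG rather than a strict tree.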
-