EDMLib Functions

Grade Data

class edmlib.gradeData(sourceFileOrDataFrame)

Class for manipulating grade datasets.

df

dataframe containing all grade data.

Type

pandas.DataFrame

sourceFile

Name of source .CSV file with grade data (optional).

Type

str

Initialization

edmlib.gradeData.__init__(self, sourceFileOrDataFrame)

Class constructor, creates an instance of the class given a .CSV file or pandas dataframe.

Used with gradeData('fileName.csv') or gradeData(dataFrameVariable).

Parameters

sourceFileOrDataFrame (object) – name of the .CSV file (extension included) in the same path or pandas dataframe variable. Dataframes are copied so as to not affect the original variable.

edmlib.gradeData.defineWorkingColumns(self, finalGrade, studentID, term, classID='classID', classDept='classDept', classNumber='classNumber', studentMajor='studentMajor', studentYear='studentYear', classCredits='classCredits', facultyID='facultyID', classCode='classCode')

Defines the column constants to target in the pandas dataframe for data manipulation. Required for proper functioning of the library.

Note

Either both classDept and classNumber variables need to be defined in the dataset’s columns, or classCode needs to be defined for the library to function. The opposite variable(s) are then optional. classDept needs to be defined for major related functions to work.

Parameters
  • finalGrade (str) – Name of the column with grades given to a student in the respective class. Grades are expected to be on a 1.0 - 4.0+ scale.

  • studentID (str) – Name of the column with student IDs, which does not need to follow any format.

  • term (str) – Name of the column with the term in which the class was taken, which does not need to follow any format.

  • classID (str, optional) – Name of the column with a number or string specific to a given class section, which does not need to follow any format. Defaults to ‘classID’.

  • classDept (str, optional) – Name of the column stating the department of the class, e.g. ‘Psych’. Defaults to ‘classDept’.

  • classNumber (str, optional) – Name of the column stating a number associated with the class, e.g. ‘1000’ in ‘Psych1000’ or ‘Intro to Psych’. Defaults to ‘classNumber’.

  • studentMajor (str, optional) – Name of the column stating the major of the student. Optional, but required for functions involving student majors.

  • classCredits (str, optional) – Name of the column stating the number of credits a class is worth. Optional, but can be used to make student GPA calculations more accurate.

  • facultyID (str, optional) – Name of the column with faculty IDs, which does not need to follow any format. This is the faculty that taught the class. Optional, but required for instructor effectiveness functions.

  • classCode (str, optional) – Name of the column defining a class specific name, e.g. ‘Psych1000’. Defaults to ‘classCode’.

Functions

Filtering / Getting Data

edmlib.gradeData.dropMissingValuesInColumn(self, column)

Removes rows in the dataset which have missing data in the given column.

Parameters

column (str) – Column to check for missing values in.

edmlib.gradeData.filterByGpaDeviationMoreThan(self, minimum, outputDropped=False, droppedCSVName='droppedData.csv')

Filters data to only include classes which have a standard deviation more than or equal to a given minimum (0.0 to 4.0 scale).

Parameters
  • minimum (float) – Minimum standard deviation of grades a class must have.

  • outputDropped (bool, optional) – Whether to output the dropped data to a file. Default is False.

  • droppedCSVName (str, optional) – Name of file to output dropped data to. Default is ‘droppedData.csv’.
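As a rough illustration of the filtering rule described above (not the library's own code), the sketch below keeps only rows whose class has a grade standard deviation at or above a minimum. The (class, grade) row layout, the helper name filter_by_deviation, and the choice of the sample standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import stdev


def filter_by_deviation(rows, minimum):
    """Split rows into (kept, dropped) by per-class grade deviation.

    rows: list of (class_id, grade) tuples (hypothetical layout).
    """
    grades_by_class = defaultdict(list)
    for class_id, grade in rows:
        grades_by_class[class_id].append(grade)

    # Assumption: sample standard deviation; a class with one grade counts as 0.0.
    deviation = {
        class_id: stdev(grades) if len(grades) > 1 else 0.0
        for class_id, grades in grades_by_class.items()
    }

    kept = [row for row in rows if deviation[row[0]] >= minimum]
    dropped = [row for row in rows if deviation[row[0]] < minimum]
    return kept, dropped


rows = [("A", 4.0), ("A", 2.0), ("B", 3.0), ("B", 3.0)]
kept, dropped = filter_by_deviation(rows, 0.5)
```

Here class A (grades 4.0 and 2.0) is kept, while class B, where every student received the same grade, is dropped. The dropped rows correspond to what outputDropped would write to a file.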

edmlib.gradeData.filterColumnToValues(self, col, values=[])

Filters dataset to only include rows that contain any of the given values in the given column.

Parameters
  • col (str) – Name of the column to filter.

  • values (list) – Values to filter to.

edmlib.gradeData.filterToMultipleMajorsOrClasses(self, majors=[], classes=[])

Reduces the dataset to only include entries of certain classes and/or classes in certain majors. This function is inclusive; if a class in ‘classes’ is not of a major defined in ‘majors’, the class will still be included, and vice-versa.

Note

The ‘classDept’ column as set by defineWorkingColumns must be defined in your dataset to filter by major.

Parameters
  • majors (list, optional) – List of majors to include. Filters by the ‘classDept’ column.

  • classes (list, optional) – List of classes to include. Filters by the ‘classCode’ column, or the conjoined version of ‘classDept’ and ‘classNumber’ columns.
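The inclusive behaviour described above amounts to a union of two conditions. The following sketch shows that logic on plain dictionaries; the helper name and the row layout are hypothetical, and the library itself operates on its pandas dataframe.

```python
def filter_majors_or_classes(rows, majors=(), classes=()):
    """Keep a row if its department is listed OR its class code is listed."""
    majors, classes = set(majors), set(classes)
    return [
        row for row in rows
        if row["classDept"] in majors or row["classCode"] in classes
    ]


rows = [
    {"classDept": "Psych", "classCode": "Psych1000"},
    {"classDept": "Math", "classCode": "Math2000"},
    {"classDept": "Bio", "classCode": "Bio1500"},
]
result = filter_majors_or_classes(rows, majors=["Psych"], classes=["Math2000"])
```

Math2000 is kept even though ‘Math’ is not among the requested majors, because the class itself was requested.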

edmlib.gradeData.filterStudentsByMajors(self, majors)

Filters the dataset down to students who were ever recorded as majoring in one of the given majors.

Parameters

majors (list, optional) – List of student majors to include when finding matching students. Filters by the ‘studentMajor’ column.

edmlib.gradeData.getCorrelationsWithMinNSharedStudents(self, nSharedStudents=20, directed=False, classDetails=False, sequenceDetails=False)

Returns a pandas dataframe with correlations between all available classes based on grades, after normalization.

Parameters
  • nSharedStudents (int, optional) – Minimum number of shared students a pair of classes must have to compute a correlation. Defaults to 20.

  • directed (bool, optional) – Whether or not to include data specific to students who took class A before B, vice versa, and concurrently. Defaults to ‘False’.

  • classDetails (bool, optional) – Whether or not to include means of student grades, normalized grades, and standard deviations used. Defaults to ‘False’.

Returns

Pandas dataframe with at least columns “course1”, “course2”, “corr”, “P-value”, and “#students”, which store class names, their correlation coefficient (0 least to 1 most), the P-value of this calculation, and the number of students shared between these two classes.

Return type

pandas.DataFrame
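To make the role of the shared-student threshold concrete, here is a minimal sketch of a Pearson correlation computed only over students present in both classes. It is an illustration under assumptions, not the library's implementation: the per-class dictionaries of student scores and both helper names are hypothetical, and the P-value is omitted.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # no variation, so the correlation is undefined
    return cov / sqrt(var_x * var_y)


def correlation_if_enough_shared(grades_a, grades_b, n_shared_students=20):
    """Correlate two classes only when enough students took both."""
    shared = sorted(set(grades_a) & set(grades_b))
    if len(shared) < n_shared_students:
        return None, len(shared)
    xs = [grades_a[student] for student in shared]
    ys = [grades_b[student] for student in shared]
    return pearson(xs, ys), len(shared)


# Hypothetical student -> score mappings for two classes.
course1 = {"s1": 1.0, "s2": 2.0, "s3": 3.0, "s4": 4.0}
course2 = {"s2": 2.5, "s3": 3.0, "s4": 3.5, "s5": 1.0}
corr, n = correlation_if_enough_shared(course1, course2, n_shared_students=3)
skipped, _ = correlation_if_enough_shared(course1, course2, n_shared_students=20)
```

With a threshold of 3 the pair qualifies (three shared students); with the default of 20 it is skipped.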

edmlib.gradeData.getColumn(self, column)

Returns a given column.

Parameters

column (str) – name of the column to return.

Returns

column contained in pandas dataframe.

Return type

pandas.Series

edmlib.gradeData.getDictOfStudentMajors(self)

Returns a dictionary of students and their latest respective declared majors. Student ID, Student Major, and Term columns are required.

Parameters

N/A

Returns

Dictionary of students and their latest respective declared majors.

Return type

dict(str: str)
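One way to picture "latest declared major" is to keep, for each student, the major from the row with the greatest term. The sketch below assumes terms are directly comparable values (for example, sortable numbers); how the library actually orders terms is not stated here, and the helper name is hypothetical.

```python
def latest_majors(rows):
    """Map each student to the major recorded in their latest term.

    rows: iterable of (student_id, major, term) tuples; terms are assumed
    to be comparable, with larger values meaning later terms.
    """
    latest = {}
    for student, major, term in rows:
        if student not in latest or term >= latest[student][0]:
            latest[student] = (term, major)
    return {student: major for student, (_, major) in latest.items()}


rows = [
    ("s1", "Undeclared", 201),
    ("s1", "Psych", 203),
    ("s2", "Math", 202),
]
majors = latest_majors(rows)
```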

edmlib.gradeData.getListOfClassCodes(self)

Returns a list of unique class codes currently in the dataset from the ‘classCode’ column, which is the conjoined ‘classDept’ and ‘classNumber’ columns by default (e.g. ‘Psych1000’ from ‘Psych’ and ‘1000’).

Returns

List of unique class codes in the dataset.

Return type

list

edmlib.gradeData.getUniqueIdentifiersForSectionsAcrossTerms(self)

Used internally. If a column ‘classCode’ is unavailable, a new column is made by combining ‘classDept’ and ‘classNumber’ columns. Also makes a new ‘classIdAndTerm’ column by combining the ‘classID’ and ‘term’ columns, to differentiate specific class sections in specific terms.
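The two derived columns can be pictured with a small sketch on a single row. Plain string concatenation with no separator is an assumption (it matches the ‘Psych1000’ example used elsewhere in this reference), and the helper name is hypothetical.

```python
def add_identifiers(row):
    """Return a copy of the row with 'classCode' and 'classIdAndTerm' filled in."""
    out = dict(row)
    if "classCode" not in out:
        # Assumption: department and number are joined with no separator.
        out["classCode"] = str(out["classDept"]) + str(out["classNumber"])
    # Assumption: section ID and term are joined with no separator.
    out["classIdAndTerm"] = str(out["classID"]) + str(out["term"])
    return out


row = {"classDept": "Psych", "classNumber": 1000, "classID": "42", "term": "Fall2019"}
result = add_identifiers(row)
```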

edmlib.gradeData.getNormalizationColumn(self)

Used internally. Creates a normalization column ‘norm’ that is a “normalization” of grades received in specific classes. This is equivalent to the grade given to a student minus the mean grade in that class, all divided by the standard deviation of grades in that class.
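The normalization described here is a per-class z-score. A minimal sketch follows, assuming a (class, student, grade) row layout and the sample standard deviation; neither detail is confirmed by this reference, and the helper name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize(rows):
    """Return {(class_id, student): (grade - class mean) / class std dev}."""
    grades_by_class = defaultdict(list)
    for class_id, _, grade in rows:
        grades_by_class[class_id].append(grade)

    # Assumption: sample standard deviation, which needs at least two grades.
    stats = {
        class_id: (mean(grades), stdev(grades))
        for class_id, grades in grades_by_class.items()
        if len(grades) > 1
    }

    norm = {}
    for class_id, student, grade in rows:
        if class_id not in stats:
            continue
        class_mean, class_std = stats[class_id]
        if class_std == 0:
            continue  # every grade identical, so the z-score is undefined
        norm[(class_id, student)] = (grade - class_mean) / class_std
    return norm


rows = [("A", "s1", 2.0), ("A", "s2", 3.0), ("A", "s3", 4.0)]
norm = normalize(rows)
```

For this class the mean is 3.0 and the sample standard deviation is 1.0, so the three students receive normalized grades of -1.0, 0.0, and 1.0.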

edmlib.gradeData.getGPADeviations(self)

Used internally. Makes a new column called ‘gpaStdDeviation’, the standard deviation of grades of the respective class for each entry.

edmlib.gradeData.getGPAMeans(self)

Used internally. Makes a new column called ‘gpaMean’, the mean of grades received in the respective class of each entry.

edmlib.gradeData.getPandasDataFrame(self)

Returns the pandas dataframe of the dataset.

Returns

Dataframe of the current dataset.

Return type

pandas.DataFrame

edmlib.gradeData.substituteSubStrInColumn(self, column, subString, substitute)

Replace a substring in a given column.

Parameters
  • column (str) – Column to replace substring in.

  • subString (str) – Substring to replace.

  • substitute (str) – Replacement of the substring.

Export

edmlib.gradeData.exportCSV(self, fileName='csvExport.csv')

Export the current state of the dataset to a CSV file.

Parameters

fileName (str, optional) – Name of the file to export. Defaults to ‘csvExport.csv’.

edmlib.gradeData.exportCorrelationsWithMinNSharedStudents(self, filename='CorrelationOutput_EDMLIB.csv', nStudents=20, directedCorr=False, detailed=False, sequenced=False)

Exports CSV file with all correlations between classes with the given minimum number of shared students. File format has columns ‘course1’, ‘course2’, ‘corr’, ‘P-value’, ‘#students’.

Parameters
  • fileName (str, optional) – Name of CSV to output. Default ‘CorrelationOutput_EDMLIB.csv’.

  • nStudents (int, optional) – Minimum number of shared students a pair of classes must have to compute a correlation. Defaults to 20.

  • directedCorr (bool, optional) – Whether or not to include data specific to students who took class A before B, vice versa, and concurrently. Defaults to ‘False’.

  • detailed (bool, optional) – Whether or not to include means of student grades, normalized grades, and standard deviations used. Defaults to ‘False’.

edmlib.gradeData.gradePredict(self, priorGrades, futureClasses, method='nearest', excludedStudents=None, normalized=False)

Predicts grades given a student’s past grades and classes to predict for. Still being developed.

Parameters
  • priorGrades (dict(str: float)) – Dictionary of past courses and the respective grade received.

  • futureClasses (list(str)) – List of courses to predict grades for.

  • method (str, optional) – Method to use to predict grades. Current methods include ‘nearest’, which gives the grade received by the most similar student on record, and ‘nearestThree’, which gives the grade closest to the mean of the grades received by the nearest three students on record. Set to ‘nearest’ by default.

  • excludedStudents (list, optional) – List of students to exclude when making the calculation. Used for accuracy testing purposes. Set to None by default.

  • normalized (bool, optional) – Whether or not normalized grades are given as input. Used for accuracy testing purposes and should generally be set to False. Set to False by default.

Returns

Dictionary of grade predictions, where the key is the class and the value is the grade predicted.

Return type

dict(str: float)
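The idea behind the ‘nearest’ method can be sketched as a one-nearest-neighbour lookup. How the library measures similarity between students is not described here, so the mean absolute grade difference over shared classes used below is purely an assumption, as are the helper name and the nested-dictionary record layout.

```python
def predict_nearest(prior_grades, future_classes, records):
    """Predict each future class from the most similar student who took it.

    records: {student_id: {class_code: grade}} (hypothetical layout).
    """
    predictions = {}
    for target in future_classes:
        best_distance, best_grade = None, None
        for grades in records.values():
            if target not in grades:
                continue
            shared = [c for c in prior_grades if c in grades]
            if not shared:
                continue
            # Assumed similarity measure: mean absolute difference on shared classes.
            distance = sum(abs(prior_grades[c] - grades[c]) for c in shared) / len(shared)
            if best_distance is None or distance < best_distance:
                best_distance, best_grade = distance, grades[target]
        if best_grade is not None:
            predictions[target] = best_grade
    return predictions


records = {
    "s1": {"Psych1000": 4.0, "Math1000": 3.5, "Psych2000": 3.7},
    "s2": {"Psych1000": 2.0, "Math1000": 2.5, "Psych2000": 2.3},
}
predicted = predict_nearest({"Psych1000": 3.8, "Math1000": 3.4}, ["Psych2000"], records)
```

The new student's prior grades are closest to those of s1, so the prediction for Psych2000 is s1's grade in that class.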

edmlib.gradeData.instructorRanks(self, firstClass, secondClass, fileName='instructorRanking', minStudents=1)

Create a table of instructors and their calculated benefit to students based on a class they taught and future performance in a given class taken later. Exports a CSV file and returns a pandas dataframe.

Parameters
  • firstClass (str) – Class to look at instructors / their students from.

  • secondClass (str) – Class to look at future performance of students who had relevant professors from the first class.

  • fileName (str, optional) – Name of CSV file to save. Set to ‘instructorRanking’ by default.

  • minStudents (int, optional) – Minimum number of students to get data from for an instructor to be included in the calculation. Set to 1 by default.

Returns

Pandas dataframe with columns indicating the instructor, the normalized benefit to students, the grade point benefit to students, and the number of students used to calculate for that instructor.

Return type

pandas.DataFrame

edmlib.gradeData.instructorRanksAllClasses(self, fileName='completeInstructorRanks', minStudents=20, directionality=0.8, outputSubjectAverages=False, subjectFileName='instructorAverages', otherRank=None)

Create a table of instructors and their calculated benefit to students based on all classes they taught and future performance in all classes taken later. Exports a CSV file and returns a pandas dataframe.

Parameters
  • fileName (str, optional) – Name of CSV file to save. Set to ‘completeInstructorRanks’ by default.

  • minStudents (int, optional) – Minimum number of students to get data from for an instructor’s entry to be included in the calculation. Set to 20 by default.

  • directionality (float, optional) – Minimum directionality (percentage of students who took one class before another). Range 0.0 to 1.0. Set to 0.8 by default.

  • outputSubjectAverages (bool, optional) – Output a file with averages of all the data in this file, by instructor, by subject. Set to False by default.

  • subjectFileName (str, optional) – File to output instructor/subject averages to. Set to ‘instructorAverages’ by default.

Returns

Pandas dataframe with columns indicating the instructor, the class taken, the future class, the normalized benefit to students, the grade point benefit to students, the number of students used to calculate for that instructor / class combination, as well as the number of students on the opposite side of that calculation (students in future class who did not take that instructor before).

Return type

pandas.DataFrame

edmlib.gradeData.outputGpaDistribution(self, makeHistogram=False, fileName='gpaHistogram', graphTitle='GPA Distribution', minClasses=36)

Prints to console an overview of student GPAs in increments of 0.1 between 1.0 and 4.0. Optionally, outputs a histogram as well.

Parameters
  • makeHistogram (bool, optional) – Whether or not to make a histogram graph. Default false.

  • fileName (str) – Name of histogram files to output. Default ‘gpaHistogram’.

  • graphTitle (str) – Title to display on graph. Default ‘GPA Distribution’.

  • minClasses (int) – Number of classes a student needs to have on record to count GPA. Default 36.
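A simplified picture of the GPA overview: compute each student's GPA, skip students with too few classes on record, and count how many students fall at each 0.1 step. The credit weighting, the rounding to the nearest 0.1, and the helper names below are assumptions made for the sketch.

```python
from collections import Counter


def gpa(courses):
    """Credit-weighted GPA from a list of (grade, credits) pairs."""
    total_credits = sum(credits for _, credits in courses)
    return sum(grade * credits for grade, credits in courses) / total_credits


def gpa_distribution(students, min_classes=36):
    """Count students per GPA value rounded to 0.1, skipping short records."""
    bins = Counter()
    for courses in students.values():
        if len(courses) < min_classes:
            continue
        bins[round(gpa(courses), 1)] += 1
    return dict(sorted(bins.items()))


students = {
    "s1": [(4.0, 3), (3.0, 3)],
    "s2": [(4.0, 4), (2.0, 4)],
    "s3": [(1.0, 3)],  # only one class on record, so excluded below
}
dist = gpa_distribution(students, min_classes=2)
```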

edmlib.gradeData.sankeyGraphByCourseTracks(self, courseGroups, graphTitle='Track Distribution', outputName='sankeyGraph', consecutive=True, minEdgeValue=None, termThreshold=None)

Exports a sankey graph according to a given course track. Input is organized as a jagged array (a list of lists): the first inner list holds the first set of classes a student can take, the second holds the classes a student can take next, and so on.

Parameters
  • courseGroups (list(list)) – List of course groups (also lists) to make the sankey graph with. Minimum two course groups.

  • graphTitle (str) – Title that goes on the sankey graph. Defaults to ‘Track Distribution’.

  • outputName (str) – Name of sankey files (.csv, .html) to output. Defaults to ‘sankeyGraph’.

  • consecutive (bool) – Whether students must complete the entire track consecutively from the designated first group, or may start at a later group. This mostly affects students who needed to retake a class. Defaults to True (students must complete the track from the beginning, as designated, for data to be recorded).

  • minEdgeValue (int, optional) – Minimum value for an edge to be included on the sankey graph. Defaults to None, or no minimum value needed.

  • termThreshold (float, optional) – If defined, attempts to use the ‘termOrder’ column where terms are given a numbered order and a given maximum threshold for what counts as a “consecutive” term.
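The data behind a sankey graph of course tracks is a count of students moving from a class in one group to a class in the next. The sketch below is a simplified illustration only: it requires a class from every group, ignores term ordering and the consecutive and termThreshold options, and uses hypothetical names throughout.

```python
from collections import Counter


def track_edges(histories, course_groups, min_edge_value=None):
    """Count (source class, target class) transitions between adjacent groups.

    histories: {student_id: list of class codes taken} (hypothetical layout).
    """
    edges = Counter()
    for taken in histories.values():
        path = []
        for group in course_groups:
            match = next((c for c in taken if c in group), None)
            if match is None:
                break  # student never took a class from this group
            path.append(match)
        else:
            # Reached only when the student matched every group.
            for source, target in zip(path, path[1:]):
                edges[(source, target)] += 1
    if min_edge_value is not None:
        edges = Counter({e: n for e, n in edges.items() if n >= min_edge_value})
    return dict(edges)


histories = {
    "s1": ["Math1000", "Math2000"],
    "s2": ["Math1100", "Math2000"],
    "s3": ["Math1000"],
    "s4": ["Math1000", "Math2000"],
}
groups = [["Math1000", "Math1100"], ["Math2000"]]
edges = track_edges(histories, groups)
strong_edges = track_edges(histories, groups, min_edge_value=2)
```

Student s3 never reaches the second group and is not counted. Raising min_edge_value to 2 removes the Math1100 to Math2000 edge, which only one student followed.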

edmlib.gradeData.sankeyGraphByCourseTracksOneGroup(self, courseGroup, requiredCourses=None, graphTitle='Track Distribution', outputName='sankeyGraph', minEdgeValue=None)

Exports a sankey graph according to a given course track. Input is organized as an array of classes included in the track, and optionally a subgroup of classes required for a student to be counted in the graph can be designated as well.

Parameters
  • courseGroup (list(str)) – List of courses to make the sankey graph with. Minimum two courses.

  • requiredCourses (list(str), optional) – List of courses required for a student to count towards the graph. All courses in ‘courseGroup’ by default.

  • graphTitle (str) – Title that goes on the sankey graph. Defaults to ‘Track Distribution’.

  • outputName (str) – Name of sankey files (.csv, .html) to output. Defaults to ‘sankeyGraph’.

  • minEdgeValue (int, optional) – Minimum value for an edge to be included on the sankey graph. Defaults to None, or no minimum value needed.

Logging / Troubleshooting

edmlib.gradeData.printColumn(self, column)

Prints the given column.

Parameters

column (str) – Column to print to console.

edmlib.gradeData.printUniqueValuesInColumn(self, column)

Prints to console the unique values in a given column.

Parameters

column (str) – Column to get unique values from.

edmlib.gradeData.printEntryCount(self)

Prints to console the number of entries (rows) in the current dataset.

edmlib.gradeData.printFirstXRows(self, rows)

Prints to console the first X number of rows from the dataset.

Parameters

rows (int) – Number of rows to print from the dataset.

Correlational Data

class edmlib.classCorrelationData(sourceFileOrDataFrame)

Class for manipulating and visualizing Pearson correlations generated by the gradeData class.

df

dataframe containing all correlational data.

Type

pandas.DataFrame

sourceFile

Name of source .CSV file with correlational data.

Type

str

Initialization

edmlib.classCorrelationData.__init__(self, sourceFileOrDataFrame)

Class constructor, creates an instance of the class given a .CSV file or pandas dataframe. Typically should only be used manually with correlation files outputted by the gradeData class.

Used with classCorrelationData('fileName.csv') or classCorrelationData(dataFrameVariable).

Parameters

sourceFileOrDataFrame (object) – name of the .CSV file (extension included) in the same path or pandas dataframe variable. Dataframes are copied so as to not affect the original variable.

Functions

Filtering / Getting Data

edmlib.classCorrelationData.dropMissingValuesInColumn(self, column)

Removes rows in the dataset which have missing data in the given column.

Parameters

column (str) – Column to check for missing values in.

edmlib.classCorrelationData.filterColumnToValues(self, col, values=[])

Filters dataset to only include rows that contain any of the given values in the given column.

Parameters
  • col (str) – Name of the column to filter.

  • values (list) – Values to filter to.

edmlib.classCorrelationData.filterToMultipleMajorsOrClasses(self, majors=[], classes=[], twoWay=True)

Reduces the dataset to only include entries of certain classes and/or classes in certain majors. This function is inclusive; if a class in ‘classes’ is not of a major defined in ‘majors’, the class will still be included, and vice-versa.

Note

The ‘classDept’ column as set by defineWorkingColumns must have been defined in your dataset to filter by major.

Parameters
  • majors (list, optional) – List of majors to include. Filters by the ‘classDept’ column in the original dataset.

  • classes (list, optional) – List of classes to include. Filters by the ‘classCode’ column in the original dataset, or the conjoined version of ‘classDept’ and ‘classNumber’ columns.

  • twoWay (bool, optional) – Whether both classes in the correlation must be in the given majors / classes, or only one of them. Set to True, or both classes, by default.
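The effect of twoWay can be shown on a list of correlated class pairs: with True, both classes of a pair must be wanted; with False, one is enough. The helper name and the tuple layout are hypothetical, and only filtering by class (not by major) is sketched.

```python
def filter_pairs(pairs, classes, two_way=True):
    """Filter (course1, course2) pairs by membership in the wanted classes."""
    wanted = set(classes)
    if two_way:
        return [p for p in pairs if p[0] in wanted and p[1] in wanted]
    return [p for p in pairs if p[0] in wanted or p[1] in wanted]


pairs = [
    ("Psych1000", "Psych2000"),
    ("Psych1000", "Math1000"),
    ("Bio1500", "Math1000"),
]
wanted = ["Psych1000", "Psych2000"]
both = filter_pairs(pairs, wanted, two_way=True)
either = filter_pairs(pairs, wanted, two_way=False)
```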

edmlib.classCorrelationData.getCliques(self, minCorr=None, minSize=2)

Returns a list of the cliques (each itself a list of classes) present in the correlational data. Cliques are sub-graphs of the larger overall graph in which every pair of nodes is connected by an edge.

Parameters
  • minCorr (None or float, optional) – Minimum correlation to consider a correlation an edge on the graph. ‘None’, or ignored, by default.

  • minSize (int, optional) – Minimum number of nodes to look for in a clique. Default is 2.
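For readers unfamiliar with cliques, the sketch below finds maximal cliques in a small correlation graph using the basic Bron-Kerbosch algorithm in plain Python. It is an illustration of the concept, not the library's code; the (course1, course2, corr) tuple layout and the helper name are assumptions.

```python
def maximal_cliques(correlations, min_corr=None, min_size=2):
    """Return sorted maximal cliques from (course1, course2, corr) tuples."""
    adjacency = {}
    for a, b, corr in correlations:
        if min_corr is not None and corr < min_corr:
            continue  # correlation too weak to count as an edge
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no node can extend it.
        if not candidates and not excluded:
            if len(current) >= min_size:
                cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


correlations = [
    ("A", "B", 0.9),
    ("B", "C", 0.8),
    ("A", "C", 0.7),
    ("C", "D", 0.6),
    ("D", "E", 0.2),
]
cliques = maximal_cliques(correlations, min_corr=0.5)
```

Classes A, B, and C are all pairwise correlated above the threshold and form one clique; C and D form another. The weak D to E correlation is filtered out by min_corr.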

edmlib.classCorrelationData.getNxGraph(self, minCorr=None)

Returns a NetworkX graph of the correlational data, where the nodes are classes and the weights are the correlations.

Parameters

minCorr (float, optional) – Minimum correlation between classes for an edge to be included on the graph. Should be in the 0.0-1.0 range. Defaults to None (or do not filter).

edmlib.classCorrelationData.substituteSubStrInColumn(self, column, subString, substitute)

Replace a substring in a given column.

Parameters
  • column (str) – Column to replace substring in.

  • subString (str) – Substring to replace.

  • substitute (str) – Replacement of the substring.

Export / Graphs

edmlib.classCorrelationData.exportCSV(self, fileName='csvExport.csv')

Export the current state of the dataset to a CSV file.

Parameters

fileName (str, optional) – Name of the file to export. Defaults to ‘csvExport.csv’.

edmlib.classCorrelationData.chordGraphByMajor(self, coefficient=0.5, pval=0.05, outputName='majorGraph', outputSize=200, imageSize=300, showGraph=True, outputImage=True)

Creates a chord graph between available majors through averaging and filtering both correlation coefficients and P-values. Outputs to an html file, PNG file, and saves the underlying data by default.

Note

The ‘classDept’ column as set by defineWorkingColumns must have been defined in your dataset to filter by major.

Parameters
  • coefficient (float, optional) – Minimum correlation coefficient to filter correlations by.

  • pval (float, optional) – Maximum P-value to filter correlations by.

  • outputName (str, optional) – First part of the outputted file names, e.g. fileName.csv, fileName.html, etc.

  • outputSize (int, optional) – Size (units unknown) of html graph to output. 200 by default.

  • imageSize (int, optional) – Size (units unknown) of image of the graph to output. 300 by default. Increase this if node labels are cut off.

  • showGraph (bool, optional) – Whether or not to open a browser and display the interactive graph that was created. Defaults to True.

  • outputImage (bool, optional) – Whether or not to export an image of the graph. Defaults to True.

edmlib.classCorrelationData.outputCliqueDistribution(self, minCorr=None, countDuplicates=False, makeHistogram=False, fileName='cliqueHistogram', graphTitle='Class Correlation Cliques', logScale=False)

Outputs the clique distribution from the given correlation data. Prints to console by default, but can also optionally export a histogram.

Parameters
  • minCorr (None or float, optional) – Minimum correlation to consider a correlation an edge on the graph. ‘None’, or ignored, by default.

  • countDuplicates (bool, optional) – Whether or not to count smaller sub-cliques of larger cliques as cliques themselves. False by default.

  • makeHistogram (bool, optional) – Whether or not to generate a histogram. False by default.

  • fileName (str, optional) – File name to give exported histogram files. ‘cliqueHistogram’ by default.

  • graphTitle (str, optional) – Title displayed on the histogram. ‘Class Correlation Cliques’ by default.

  • logScale (bool, optional) – Whether or not to output graph in Log 10 scale on the y-axis. Defaults to False.

Logging / Troubleshooting

edmlib.classCorrelationData.printClassesUsed(self)

Prints to console the classes included in the correlations.

edmlib.classCorrelationData.printMajors(self)

Prints to console the majors of the classes present in the correlational data.

Note

The ‘classDept’ column as set by defineWorkingColumns must have been defined in your dataset to print majors.

edmlib.classCorrelationData.printEntryCount(self)

Prints to console the number of entries (rows) in the current dataset.

edmlib.classCorrelationData.printFirstXRows(self, rows)

Prints to console the first X number of rows from the dataset.

Parameters

rows (int) – Number of rows to print from the dataset.