edX-datascrub is used for scrubbing edX data into a format that is easy to analyze. This repository is forked from HarvardX-Tools.
Output: The final processed data for each class is stored in a CSV file with the following fields:
- time
- seconds to next action
- actor: user
- verb: action
- object_name: in the format of chapter/sequential/vertical/item_name
- object_type
- result: correct or incorrect if verb is problem_check, empty otherwise
- meta
- ip
- event_type
- page
- agent
The output rows are partially sorted by time. More specifically, if you consider only the rows associated with one user, those rows are sorted by time.
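The per-user ordering described above can be checked with a short script. This is a minimal sketch, not part of the repository: the function name `is_sorted_per_user` is hypothetical, and it assumes you have already extracted the actor and time of each row into (actor, time) pairs whose time values compare correctly (for example numbers or `datetime` objects).

```python
def is_sorted_per_user(rows):
    """Return True if each actor's rows appear in non-decreasing time order.

    `rows` is an iterable of (actor, time) pairs in file order.
    Rows of different actors may be interleaved in any way.
    """
    last_seen = {}  # actor -> time of that actor's most recent row
    for actor, time in rows:
        if actor in last_seen and time < last_seen[actor]:
            return False
        last_seen[actor] = time
    return True


# Hypothetical rows: not sorted globally, but sorted for each user.
example_rows = [("alice", 10), ("bob", 5), ("alice", 12), ("bob", 7)]
print(is_sorted_per_user(example_rows))
```

Any analysis that needs a global time order therefore has to sort the rows itself; only the within-user order is described as given.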
This repository is forked from HarvardX-Tools and modified by Phitchaya Mangpo Phothilimthana.
- Add the following directories to your environment path
- edX-datascrub/src
- edX-datascrub/src/logs
- edX-datascrub/shellscripts
- edX-courseaxis
- Make sure all shell scripts and Python scripts inside these four directories are executable.
- Decrypt all files from edX and store them in the same directories.
Then, you're ready to go!
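If you prefer to set the executable bits from Python rather than by hand, something like the following could work. It is only a sketch: the helper name `make_scripts_executable` is hypothetical, and it assumes the scripts can be recognized by a `.py` or `.sh` extension, which may not hold for every script in these directories (extensionless scripts would have to be handled separately).

```python
import os
import stat


def make_scripts_executable(directories, extensions=(".py", ".sh")):
    """Add the executable bits to script files directly inside each directory.

    Returns the list of paths whose mode was updated.
    """
    changed = []
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # Assumption: scripts are identified by their file extension.
            if os.path.isfile(path) and name.endswith(extensions):
                mode = os.stat(path).st_mode
                os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
                changed.append(path)
    return changed


# Example call (paths taken from the list above; adjust to your checkout):
# make_scripts_executable([
#     "edX-datascrub/src",
#     "edX-datascrub/src/logs",
#     "edX-datascrub/shellscripts",
#     "edX-courseaxis",
# ])
```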
Every class comes with class information and content packaged in classXXX.xml.tar.gz. Run the generate_courseaxis script in the directory that contains classXXX.xml.tar.gz:
generate_courseaxis dir_contains_class_xml_targz
The script will generate a directory named csv_files containing many files including:
- info.csv, which collects the course names (e.g. BerkeleyX-CS191x-Spring_2013), start dates, and end dates of all classes
- one course_name_axis.csv for each class
- axis.error, which logs all errors that occurred while generating the course axes. Check the error messages in this file to investigate why the course axis of a particular class was not generated.
info.csv and the course axes will be useful in the next step. Note that if there is an error generating a course axis, or if there is no start or end date for a specific class in its xml.tar.gz, that class is excluded from info.csv.
You can choose to process activity logs of all classes at once or just one log of a specific class at a time.
In the directory that contains the prod-edx* directories, which hold the raw activity logs, run:
processLogData.py course_name1,course_name2 start_date end_date
You can get course_name, start_date, and end_date from info.csv.
The first argument to the script is a list of course names separated by commas (,). The list can be of any size. Most courses do not have exactly the same start and end dates. However, you can group the ones that have similar start and end dates together (e.g. the ones offered in the same semester) and specify start and end dates that cover all of the classes in the list. This makes the overall log processing run faster.
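As an illustration of the grouping idea, the sketch below computes one date range that covers every course in a group and builds the comma-separated course list. The course names and dates are hypothetical placeholders, and the helper `covering_range` is not part of this repository. The date format that processLogData.py expects is not stated here, so the dates are kept as `date` objects rather than formatted into a command line.

```python
from datetime import date


def covering_range(courses):
    """Return (start, end) spanning every course in `courses`.

    `courses` maps a course name to its (start_date, end_date) pair.
    """
    starts = [start for start, _ in courses.values()]
    ends = [end for _, end in courses.values()]
    return min(starts), max(ends)


# Hypothetical courses offered in the same semester.
same_semester = {
    "CourseA": (date(2013, 2, 11), date(2013, 5, 20)),
    "CourseB": (date(2013, 1, 28), date(2013, 5, 6)),
}

start, end = covering_range(same_semester)
course_list = ",".join(sorted(same_semester))
print(course_list, start, end)
```

The earliest start and the latest end are used so that no course in the group loses log entries at either edge of its run.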
processLogData.py will:
- generate a separate log file for each class inside each prod-edx* directory. The log file is named after the class name.
- combine the separated log files of the same class located in different prod-edx* directories into one log file and store the combined log in the directory in which the script is run.
- generate ClassList.csv to keep track of the dates between which each course has already been processed.
The combined log file course_name.log for each course and ClassList.csv will be generated in the directory in which the script is run.
If you have already processed courseA between date1 and date2 and you want to process more logs between date2 and date3, you have to use - as the start date, as follows:
processLogData.py courseA - date3
In this case, the script will append the new logs to the existing course_name.log.
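When these runs are scripted, the choice between an explicit start date and - can be captured in a small helper. This is a sketch under stated assumptions: the function name `process_log_args` is hypothetical, and dates are passed through as strings in whatever format processLogData.py accepts.

```python
def process_log_args(courses, end_date, start_date=None):
    """Build the argument list for one processLogData.py run.

    Leave `start_date` as None to continue a course that was already
    processed; the start date is then written as "-".
    """
    if not courses:
        raise ValueError("at least one course name is required")
    start = "-" if start_date is None else start_date
    return ["processLogData.py", ",".join(courses), start, end_date]


# First run over an explicit range, then an incremental follow-up run.
print(process_log_args(["courseA"], "date2", start_date="date1"))
print(process_log_args(["courseA"], "date3"))
```

The resulting list can be handed to `subprocess.run(args, check=True)` from the directory that contains the prod-edx* directories.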
After you obtain the combined log from processLogData.py, run:
transformOneLog.sh course_name.log path_to_course_axis.csv
transformOneLog.sh takes a course_name.log generated by processLogData.py and the corresponding course axis generated in the Obtaining Course Axis section as its inputs. It then transforms the combined log file into a nicely formatted CSV file for the class.
Note that only processLogData.py is incremental (it appends new logs to the ones that have already been processed). transformOneLog.sh is not incremental; it transforms the entire log file it is given.
Caution: You can move around or rename the directory that contains course axes and info.csv generated from the previous step, but make sure that all course axes and info.csv are still in the same directory.
Then, in the directory that contains the prod-edx* directories, which hold the raw activity logs, simply run:
processAll.py path_to_info.csv
The script will call processLogData.py and transformOneLog.sh for every course that appears in info.csv. Note that running this script is not as efficient as running processLogData.py and transformOneLog.sh manually, because it does not separate the logs of different courses at the same time, as processLogData.py does when it is given a list of courses.