Instructor's Note
Start with `01-files.jl`, which covers file handling in Julia. Begin by emphasizing the significance of working in the correct directory before reading or writing data and how omitting this consideration could lead to errors. Show how to use the `pwd` function to verify the present working directory and how to use `cd` to navigate to another directory if needed. Some users might find it more convenient to right-click on the file and use the Julia: Change to This Directory option, which automatically moves the Julia REPL to the directory containing the selected file. If there are participants who know how to use shell commands, you can mention how to enter the `shell>` mode in the REPL by typing `;`. Next, focus on the CSV format. Make sure to highlight the importance of this format and provide an in-depth explanation of how to read and write CSV files to the present working directory and to a different data folder. One of the examples provided involves using the `rename` function, so make sure to go over how it can be used to change column names in a `DataFrame`.
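
A minimal sketch of these steps is below; the file, folder, and column names are placeholders rather than the workshop's actual data, so adjust them to the files used in class:

```julia
using CSV, DataFrames

pwd()                            # check the present working directory
# cd("path/to/workshop/files")   # move there if needed (example path)

df = CSV.read("demographics.csv", DataFrame)   # read a CSV from the working directory
df = rename(df, :AGE => :age)                  # change a column name

# write to the working directory and to a separate (already existing) data folder
CSV.write("demographics_clean.csv", df)
CSV.write(joinpath("data", "demographics_clean.csv"), df)
```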
Next, go over the use of the `XLSX.jl` package to read Excel files. Start by explaining how to read an Excel file using `XLSX.readtable`, emphasizing that the sheet name must be provided as an argument and that, most of the time, you will want to convert the output of `XLSX.readtable` to a `DataFrame`.
There may be questions about what to do if the user doesn't know the sheet names, which you can address by showing how to use `XLSX.readxlsx` and `XLSX.sheetnames` to obtain a list of the sheet names in an Excel file. You might also find it useful to demonstrate how to open an Excel file inside VS Code (using the Office Viewer extension, which is installed by default in JuliaHub). Once you have covered how to read files, show how to write them. Make sure to mention that `XLSX.jl` will not overwrite an existing file the way `CSV.jl` does; instead, you will get an error if you try to create a file that already exists.
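
A short sketch of the reading and writing calls (file and sheet names are examples, and the exact return types and keyword options vary somewhat across XLSX.jl versions):

```julia
using XLSX, DataFrames

# Read one sheet into a DataFrame; the sheet name is required
df = DataFrame(XLSX.readtable("demographics.xlsx", "Sheet1"))
# (older XLSX.jl releases return a tuple here, which needs DataFrame(tbl...))

# If the sheet names are unknown, inspect the workbook first
xf = XLSX.readxlsx("demographics.xlsx")
XLSX.sheetnames(xf)

# Writing: unlike CSV.write, this errors if the file already exists
XLSX.writetable("demographics_out.xlsx", df)
```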
The last topic for `01-files.jl` is SAS files (`.sas7bdat` and `.xpt`), which can be read using the `readstat` function from the `ReadStatTables.jl` package. However, note that the current version of `ReadStatTables.jl` only supports reading files; write support is still experimental.
Next, go over the contents of `02-select_subset.jl`. First, discuss the `names` function, which returns a `Vector` containing all the column names of a `DataFrame` and is useful when working with `DataFrame`s that have a large number of columns. After that, show the different ways to retrieve the contents of a single column: dot syntax (such as `df.column_name`) and indexing. Participants might be curious about the difference between these two methods. If that is the case, you can explain that the dot syntax is simpler and more convenient to type, but indexing is more flexible and powerful. Additionally, some users could find the indexing syntax more intuitive, even if it is more verbose. When going over indexing, make sure to explain the difference between using `!` and using `:` to retrieve all rows from a column (`!` returns the column itself, while `:` returns a copy of it).
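
The contrast can be shown with a small throwaway table (the columns here are invented for illustration):

```julia
using DataFrames

df = DataFrame(id = 1:3, age = [24, 31, 45])

names(df)      # ["id", "age"]: Vector of column names (as strings)
df.age         # dot syntax
df[!, :age]    # indexing with ! returns the stored column itself (no copy)
df[:, :age]    # indexing with : returns a copy of that column
```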
Afterward, showcase how to select specific columns from a `DataFrame` using the `@select` macro provided by `DataFramesMeta.jl`. This is the first time in the workshop that attendees will use `DataFramesMeta.jl`, so you can take this opportunity to provide a brief overview of the package and its importance. Make sure to mention that `DataFramesMeta.jl` imports the contents of `DataFrames.jl`, so it's not necessary to import `DataFrames.jl` if `DataFramesMeta.jl` has already been imported. Lastly, demonstrate the use of the `Not` operator as a way to specify the columns that we don't want to select, which is useful when there is a large number of columns and we want to keep most of them.
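
A brief sketch with made-up columns; note that `select` (without the `@`) is the `DataFrames.jl` function re-exported by `DataFramesMeta.jl`:

```julia
using DataFramesMeta

df = DataFrame(id = 1:3, age = [24, 31, 45], weight = [70.0, 82.5, 64.3])

@select df :id :age          # keep only :id and :age
select(df, Not(:weight))     # the same result, specifying the column to exclude
```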
Finally, cover the `@[r]subset` macro, which enables us to filter rows in a `DataFrame` based on specific conditions. Go over the differences between `@subset` and `@rsubset` in detail, as this concept will be used in the scripts that follow. Finish this part of the lesson by going over the common use case of removing rows with `missing` observations in a specific column.
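
A sketch of the two macros and the missing-data pattern, again with invented data:

```julia
using DataFramesMeta

df = DataFrame(id = 1:4,
               age = [24, 31, 45, 17],
               weight = [70.0, missing, 64.3, 55.1])

@subset df :age .>= 18            # column version: the condition must broadcast
@rsubset df :age >= 18            # row version: written as if :age were a single value

@rsubset df !ismissing(:weight)   # drop rows with a missing weight
```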
The next script in the workshop is `03-transform.jl`, which focuses on using the `@[r]transform` macro to create a new column in a `DataFrame` or modify an existing one. Once again, it is important to explain the difference between the column and row versions of the macro (`@transform` and `@rtransform`, respectively) and demonstrate how the latter provides a more convenient way of specifying column transformations whenever possible.
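
For example (the columns are hypothetical):

```julia
using DataFramesMeta

df = DataFrame(weight = [70.0, 82.5, 64.3], height = [1.80, 1.75, 1.62])

@transform df :bmi = :weight ./ :height .^ 2   # whole columns: broadcasting required
@rtransform df :bmi = :weight / :height^2      # one row at a time: plain scalar code
```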
After that, introduce the `@astable` macro, which enables accessing intermediate calculations within a `DataFramesMeta.jl` macro call. This macro allows performing operations on multiple columns simultaneously, making it easier to apply complex transformations and computations that would otherwise be challenging to write and comprehend.
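
A small sketch of the idea, reusing the made-up df from the previous snippet; the intermediate variable never becomes a column:

```julia
@transform df @astable begin
    ex = extrema(:weight)                 # intermediate result, not stored in the DataFrame
    :dist_from_min = :weight .- ex[1]     # both new columns reuse the same calculation
    :dist_from_max = ex[2] .- :weight
end
```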
Lastly, cover the mutating versions of the macros, which allow direct modification of the original `DataFrame`. Make sure to explain that these macros can be accessed by appending an exclamation mark (`!`) to the macro name, such as `@[r]transform!` or `@select!`. This feature is particularly handy when there is a need to update or transform data in place, eliminating the need to create additional copies of the `DataFrame`.
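
A sketch of the in-place variants, with the same hypothetical columns as above:

```julia
using DataFramesMeta

df = DataFrame(id = 1:3, weight = [70.0, 82.5, 64.3], height = [1.80, 1.75, 1.62])

@rtransform! df :bmi = :weight / :height^2   # adds :bmi to df itself, no copy returned
@select! df :id :bmi                         # keeps only these columns, modifying df in place
```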
Move on to the `04-grouping.jl` script. Begin by showing the `groupby` function, which allows grouping data based on specific columns. If users are curious about the return value of `groupby`, you can mention that it returns a `GroupedDataFrame`, which can be inspected through indexing and manipulated with `transform` and `select` (you can find more details in `groupby`'s documentation). Next, show the common pattern of using `groupby` with `@combine` to apply operations on grouped data and generate aggregated results. Make sure to go over the examples and cover the cases where one or more columns are used to group data. One of the examples includes the use of the `@orderby` macro, so take this opportunity to provide a detailed explanation of how it works.
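
The pattern might look like this, with invented data (`mean` comes from the Statistics standard library):

```julia
using DataFramesMeta, Statistics

df = DataFrame(sex    = ["F", "M", "F", "M", "F"],
               site   = [1, 1, 2, 2, 2],
               weight = [64.3, 82.5, 70.0, 77.9, 58.2])

gdf = groupby(df, :sex)                      # GroupedDataFrame with one group per sex
@combine gdf :mean_weight = mean(:weight)    # one row per group

# grouping on more than one column, then sorting the aggregated result
res = @combine groupby(df, [:sex, :site]) :mean_weight = mean(:weight)
@orderby res :mean_weight
```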
Once participants are comfortable with using `groupby` and `@combine`, you can introduce the `@by` macro, which provides a concise alternative by grouping the data and applying operations in a single call. Use the example provided in the script to show a direct comparison between the two approaches and mention how using `@by` simplifies the code and enhances readability.
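
Using the same made-up df as in the grouping sketch above, the one-call equivalent would be:

```julia
# equivalent to groupby(df, :sex) followed by @combine
@by df :sex :mean_weight = mean(:weight)
```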
The last script of the workshop is `05-chaining.jl`. This script provides two examples of how to use the `@chain` macro to perform all data wrangling operations in a single block. Go over the examples and highlight how it can be more convenient than applying each data wrangling operation separately. Some important points to mention here are that it is not necessary to pass the `DataFrame` as an argument inside the `@chain` block, and that the block is not restricted to `DataFramesMeta.jl` macros (it can also include functions from `DataFrames.jl` such as `rename`).
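
A sketch of what such a block can look like (data and column names invented for illustration):

```julia
using DataFramesMeta, Statistics

df = DataFrame(sex    = ["F", "M", "F", "M"],
               weight = [64.3, 82.5, missing, 77.9],
               height = [1.62, 1.75, 1.70, 1.80])

@chain df begin
    @rsubset !ismissing(:weight)            # the result of each step is passed along automatically
    @rtransform :bmi = :weight / :height^2
    groupby(:sex)
    @combine :mean_bmi = mean(:bmi)
    rename(:mean_bmi => :BMI)               # plain DataFrames.jl functions work too
end
```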
Get in touch
If you have any suggestions or want to get in touch with our education team, please send an email to training@pumas.ai.
License
This content is licensed under Creative Commons Attribution-ShareAlike 4.0 International.