Selected Code
Below are some examples of the work I've done for the Malone Disturbance Ecology Lab and the Long Term Ecological Research (LTER) Network synthesis working groups. I worked on a variety of tasks, which allowed me to broaden my skill set.
Collaborative, Reproducible Workflows
Everglades Fire History
To investigate fire history patterns in the Florida Everglades as part of the Malone Disturbance Ecology Lab, I scripted a workflow that took raw fire perimeter shapefiles and transformed them into a clean, harmonized shapefile. Then, using that shapefile, I derived annual rasters showing the areas that burned in each year. I prepared the metadata for these data products and published them in the Environmental Data Initiative Repository.
Next, I created custom functions to perform some calculations. I calculated the total number of fires and the time since the last fire using the annual rasters. I also calculated statistics using the harmonized fire perimeter shapefile in order to get the annual total area burned, total wildfire area burned, total prescribed fire area burned, mean fire size, and area/perimeter. To allow others to easily explore this spatiotemporal information, I developed an interactive Shiny web app.
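To illustrate the two raster calculations, here is a minimal Python sketch rather than the functions from the actual workflow. It assumes each annual raster can be read as a grid of 1 (burned) and 0 (unburned) values; the function names and the toy 2 x 2 grids are hypothetical.

```python
from typing import Dict, List, Optional

Grid = List[List[int]]  # one annual raster: 1 = burned, 0 = unburned (assumed encoding)


def fire_frequency(annual: Dict[int, Grid]) -> Grid:
    """Count how many years each cell burned across all annual rasters."""
    years = sorted(annual)
    first = annual[years[0]]
    return [
        [sum(annual[y][r][c] for y in years) for c in range(len(first[0]))]
        for r in range(len(first))
    ]


def time_since_fire(
    annual: Dict[int, Grid], reference_year: int
) -> List[List[Optional[int]]]:
    """Years between reference_year and the most recent burn (None if never burned)."""
    years = [y for y in sorted(annual) if y <= reference_year]
    first = annual[years[0]]
    result = []
    for r in range(len(first)):
        row = []
        for c in range(len(first[0])):
            burned = [y for y in years if annual[y][r][c] == 1]
            row.append(reference_year - max(burned) if burned else None)
        result.append(row)
    return result


# Toy example with made-up values: three years of 2 x 2 burn rasters
annual_burns = {
    2000: [[1, 0], [0, 0]],
    2001: [[1, 1], [0, 0]],
    2002: [[0, 1], [0, 0]],
}
total_fires = fire_frequency(annual_burns)
years_since = time_since_fire(annual_burns, reference_year=2002)
```

Cells that never burned are returned as None instead of a large number, so they are easy to mask out when mapping.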
Finally, my supervisor, Dr. Sparkle Malone, compiled a vegetation shapefile for the Everglades, calculated more summary statistics from my results, and designed visualizations for our publication. After I polished the visualizations, we were able to publish our findings in Scientific Data.
Links: GitHub repo, data products, Shiny web app, publication
Flux Gradient
The Flux Gradient working group had an existing pipeline to download and analyze eddy covariance data for 8 NEON sites, but they asked me to rerun and expand their workflow to include 39 more sites. After I downloaded the latest version of the data, I ran into errors, so I had to debug and edit the scripts to account for peculiarities at a few sites.
My debugging later led me to discover that the format of the data had changed, which made it unsuitable for some parts of the workflow. I raised this issue with the other members of the group in our meeting, and they advised me to redownload a larger version of the dataset to get the necessary columns.
Despite the small setback, I finished debugging and running the workflow for all 47 sites just in time for the working group's in-person meeting. Since I did not have the domain knowledge, I explained the issues I ran into to the group whenever I needed them to make a judgment call on the workflow. This teamwork enabled me to seamlessly process around 1 terabyte of data.
As our project evolved, our GitHub repository became bloated with everyone's scripts. To ensure that our work was organized and reproducible, I standardized naming conventions for the main scripts and functions, fixed file paths, formatted scripts for publication, and revamped the README.
Links: GitHub repo
Data Wrangling
Marine Consumer Nutrient Dynamics
The Marine Consumer Nutrient Dynamics working group had marine consumer datasets spanning 8 sites, all varying in format and column names. After much discussion with the group, we designed a workflow in which a data key connects the raw column names to the new, standardized column names. Then I worked on a script that reshaped the raw files and joined them with the data key to automate the column name standardization process for 12 datasets with completely different structures. I also standardized all dates and species names and filled in missing taxonomic information for each species. This workflow resulted in a harmonized CSV file containing marine consumer information for over 700 unique species, which has been published in the Environmental Data Initiative Repository.
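The data key idea can be sketched in a few lines of Python. This is only an illustration, not the script from the project: the key layout (source, raw_name, standardized_name), the function name, and the example column names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical data key: one entry per raw column in each source dataset
DATA_KEY = [
    {"source": "site_a", "raw_name": "Spp", "standardized_name": "scientific_name"},
    {"source": "site_a", "raw_name": "Dt", "standardized_name": "date"},
    {"source": "site_b", "raw_name": "species", "standardized_name": "scientific_name"},
    {"source": "site_b", "raw_name": "sample_date", "standardized_name": "date"},
]


def standardize_columns(
    rows: List[Dict[str, str]], data_key: List[Dict[str, str]], source: str
) -> List[Dict[str, str]]:
    """Rename raw columns via the data key; columns missing from the key are dropped."""
    mapping = {
        entry["raw_name"]: entry["standardized_name"]
        for entry in data_key
        if entry["source"] == source
    }
    standardized = []
    for row in rows:
        new_row = {mapping[col]: value for col, value in row.items() if col in mapping}
        new_row["source"] = source  # keep track of where each record came from
        standardized.append(new_row)
    return standardized


# Two made-up raw tables with different column names
site_a = [{"Spp": "Lutjanus griseus", "Dt": "2019-05-01", "notes": "x"}]
site_b = [{"species": "Caranx hippos", "sample_date": "2020-07-15"}]

# After standardization, both tables share the same columns and can be stacked
harmonized = (
    standardize_columns(site_a, DATA_KEY, "site_a")
    + standardize_columns(site_b, DATA_KEY, "site_b")
)
```

Keeping the mapping in a key table rather than in the code means that adding a dataset only requires new rows in the key.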
Links: GitHub repo, data product
Plant Reproduction
The Plant Reproduction working group wanted plant trait data to research why some plants produce large amounts of seeds every few years. To assist this group, my team and I retrieved data from the TRY Plant Trait Database, which included traits such as plant seed mass, lifespan, flowering season, and seed persistence. Then we wrangled the TRY data so that we could combine it with a larger dataset of plant traits that the working group already had. After much wrangling, we refined the integrated master dataset to include 56 total variables for over 100 species. This dataset was then used for downstream analyses.
Links: GitHub repo
Spatial Data
Silica Export
The Silica Export working group wanted to investigate drivers of riverine silicon exports, which included variables like surface air temperature, lithology, precipitation, evapotranspiration, elevation, and net primary production. They asked my team and me to identify and extract this spatiotemporal information for the 228 watersheds that they were interested in. To accomplish this, we first searched online for the datasets that best suited our needs. Then we extracted the spatial data for each watershed using various R packages. Finally, we summarized the extracted values and exported them in a harmonized format that was easy to use for downstream analyses.
Links: GitHub repo
Text Mining
Ecosystem Transitions
The Ecosystem Transitions working group needed to review over 3000 papers in order to prepare a meta-analysis on ecosystem transitions. They split the reading assignments among their group members, but to speed the process along, they asked me to find a way to quickly decide whether a paper was worth reading.
So I created a script that filters and ranks the abstracts and titles based on positive and negative keywords. The more positive keywords an abstract and its associated title have, the more likely the group is to include the full paper in their meta-analysis. On the other hand, if a paper's abstract and title have more negative keywords, the group will probably not be interested in the paper.
After discussing with the group, we decided that it would be helpful for the group members to see how many negative keywords were in each paper's abstract and title. So I added a column to each person's reading assignment list that shows the count of negative keywords for each abstract and title. That way, they could prioritize reading the abstracts that have the fewest negative keywords and save time by discarding the ones that have the most.
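The keyword screening can be sketched roughly as follows. This is an illustration under stated assumptions, not the script the group used: the keyword lists are made up, each keyword is counted once per paper no matter how often it appears, and the sort order (fewest negative keywords first, then most positive) is one plausible choice.

```python
import re
from typing import Dict, List


def count_keywords(text: str, keywords: List[str]) -> int:
    """Number of distinct keywords that appear as whole words or phrases in text."""
    lowered = text.lower()
    return sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
    )


def rank_papers(
    papers: List[Dict[str, str]], positive: List[str], negative: List[str]
) -> List[Dict[str, object]]:
    """Add keyword-count columns and sort so the most promising papers come first."""
    scored = []
    for paper in papers:
        text = paper["title"] + " " + paper["abstract"]
        scored.append(
            {
                **paper,
                "positive_count": count_keywords(text, positive),
                "negative_count": count_keywords(text, negative),
            }
        )
    # Assumed ordering: fewest negative keywords first, ties broken by most positive
    return sorted(scored, key=lambda p: (p["negative_count"], -p["positive_count"]))


# Hypothetical keyword lists and reading list
POSITIVE = ["regime shift", "tipping point", "threshold"]
NEGATIVE = ["clinical", "mouse"]
reading_list = [
    {"title": "Mouse model of disease", "abstract": "A clinical trial in mice."},
    {"title": "Regime shift in a coral reef",
     "abstract": "We document a regime shift and a tipping point."},
]
ranked = rank_papers(reading_list, POSITIVE, NEGATIVE)
```

Matching on word boundaries avoids counting a keyword that only appears inside a longer word.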
In the end, the group was able to use my work to decide on the 700 or so papers that will be included in round 2 of the meta-analysis.
Links: GitHub repo
Exploratory Graphing
Plant Reproduction
The Plant Reproduction working group asked me to help them explore a potential analysis for one of their manuscripts. They were interested in a time series plot showing the total seed mass production in grams per year at specific plots at a site. I manipulated the columns in the aforementioned integrated master dataset to calculate the grams of seed per species per year. Then I graphed the time series, with a separate panel for each plot.
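The aggregation behind such a plot might look like the sketch below. It is not the project code: the column names are hypothetical, and it assumes grams of seed can be computed as seed count times mass per seed, which may differ from how the master dataset is actually structured.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def seed_mass_by_species_year(records: List[dict]) -> Dict[Tuple[str, str, int], float]:
    """Grams of seed per (plot, species, year), assuming grams = count * mass per seed."""
    totals: Dict[Tuple[str, str, int], float] = defaultdict(float)
    for rec in records:
        key = (rec["plot"], rec["species"], rec["year"])
        totals[key] += rec["seed_count"] * rec["seed_mass_g"]
    return dict(totals)


def plot_time_series(
    species_totals: Dict[Tuple[str, str, int], float]
) -> Dict[str, List[Tuple[int, float]]]:
    """Sum across species to get one (year, grams) series per plot, sorted by year."""
    series: Dict[str, Dict[int, float]] = defaultdict(lambda: defaultdict(float))
    for (plot, _species, year), grams in species_totals.items():
        series[plot][year] += grams
    return {plot: sorted(by_year.items()) for plot, by_year in series.items()}


# Made-up records with hypothetical column names
records = [
    {"plot": "A", "species": "oak", "year": 2001, "seed_count": 10, "seed_mass_g": 2.0},
    {"plot": "A", "species": "maple", "year": 2001, "seed_count": 100, "seed_mass_g": 0.5},
    {"plot": "A", "species": "oak", "year": 2002, "seed_count": 4, "seed_mass_g": 2.0},
    {"plot": "B", "species": "maple", "year": 2001, "seed_count": 10, "seed_mass_g": 0.5},
]
panels = plot_time_series(seed_mass_by_species_year(records))
```

Each key in the result corresponds to one panel, and its list of (year, grams) pairs is the line drawn in that panel.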
Links: GitHub repo