A Redwood Canopy Data Analysis

Introduction

Tolle et al. (2005) completed an initial macroscopic study of the climate surrounding a coastal California redwood tree using a network of wireless sensors.

Fast forward to 2021, and the raw data has been repurposed to teach statistical modeling at Duke University; this report is a product of that coursework. Much of the teaching emphasis of the assignment was on the importance of data preparation and cleaning before beginning any analysis. The structure of the original assignment has been condensed into five parts: original data collection, data cleaning, data exploration, interesting findings, and conclusion.

1 Original Data Collection

Paper Summary

The purpose of the original study was twofold: capture information about a single redwood tree canopy over time and provide a roadmap for future macroscopic studies using a multi-sensor network.

The data was collected in Sonoma, California, from a single redwood tree at consistent time intervals over a period of days during the late spring and early summer.

Based on this study, the researchers were able to verify the existence of dynamic spatio-temporal gradients surrounding the tree and to show that complex biological theories can be validated using this measurement framework. They also highlighted lessons learned that would benefit future studies, notably sensor sensitivity to positioning and yield issues arising from memory and network constraints.

How are sensors deployed?

Nodes (sensor housings) were attached to the body of one redwood tree at various radial, angular, and vertical distances. Sensors were spaced at roughly 2-meter intervals, with the first placed 15m above ground level and the last at 70m. The majority of nodes used in the analysis were placed on the west side of the tree, about 0.1m - 1m from the trunk. Several nodes were placed outside of the measurement envelope to monitor readings in the immediate vicinity.

Nodes had two means of capturing and transferring data: an on-site logger stored readings on the device, and a separate workflow transferred readings over the wireless network to a gateway.

Sensors were calibrated before being deployed in the field using two trials called roof and chamber. In the roof trial, nodes were placed in direct sunlight atop a building to test the PAR measurements, which were compared to a well-known reference. In the chamber trial, the temperature and humidity sensors were exposed to a wide range of conditions: 5-30 degrees Celsius and 20-90 %RH. Before deployment, the data-harvesting queries were also tested in the field on a sample tree under similar conditions to verify communication between the nodes and the on-site, internet-connected gateway.

What is the duration of data recording?

Data was recorded over a period of almost 44 days, from 4/27/2004 5:10pm (epoch 1) to 6/10/2004 2:00pm (epoch 12635). Measurements were taken every 5 minutes, and the battery-operated nodes were duty cycled to conserve power (on for 4 seconds to take measurements, then off until the next reading).
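The relationship between epoch numbers and timestamps described above can be checked with a short sketch. It assumes that epoch 1 falls exactly on the stated start time and that epochs advance in fixed 5-minute steps with no clock drift; the function and constant names are illustrative only.

```python
from datetime import datetime, timedelta

# Assumption: epoch 1 is the first reading (4/27/2004 5:10pm) and every
# later epoch is exactly 5 minutes after the previous one.
FIRST_EPOCH_TIME = datetime(2004, 4, 27, 17, 10)
EPOCH_LENGTH = timedelta(minutes=5)


def epoch_to_time(epoch: int) -> datetime:
    """Map a 1-based epoch number to its nominal wall-clock time."""
    return FIRST_EPOCH_TIME + (epoch - 1) * EPOCH_LENGTH


last_reading = epoch_to_time(12635)
print(last_reading)                     # 2004-06-10 14:00:00
print(last_reading - FIRST_EPOCH_TIME)  # 43 days, 20:50:00
```

Under these assumptions the last epoch lands on 6/10/2004 2:00pm, roughly 43.9 days after the first, which agrees with the dates reported above.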

What are the main variables of interest?

Researchers were interested in the traditional climate variables: temperature, humidity, and light levels, the last of which were measured as Photosynthetically Active Radiation (PAR). Light wavelengths between 350nm and 700nm were captured in two measurements: incident (direct), which provides information about the energy available for photosynthesis, and reflected (ambient), which was used for satellite validation of measurements.

What is the difference between the data in the two files?

Data in “sonoma-data-log.csv” is the sensor data that the logger saved to flash memory on the node itself and that was retrieved after the deployment. Data in “sonoma-data-net.csv” is the data retrieved over the wireless network during the deployment.

The main difference between the files is that the scaling/precision of the voltage measurements appears to differ between the two sources. The network file also appears to have some row duplication.

2 Data Cleaning

In Figure 1, the number of measurements recorded during each epoch is plotted for each data source: logger and network. Epoch range and record frequency are not consistent between sources; however, the distributions look similar over the epoch ranges the sources share. The first network measurement occurs at epoch 2812, which corresponds to May 7th, almost 10 days after the experiment started. Tolle et al. cited packet drops and other network-related issues as potential causes of low network yield, but the raw data provided does not match the findings in Figure 7 of the original report.

Voltage units are not consistent across the two sources. Figure 2 illustrates the scaling difference: the voltage distributions differ by data source. Voltage readings from the network source were converted to logger voltage units.

Hamatop and hamabot measures had consistent units across data sources but did not match the scale reported in the Tolle paper. Both metrics were converted to match by multiplying by a conversion factor of 0.0185.
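As a minimal sketch of the rescaling step, the snippet below multiplies both PAR columns by the factor quoted above. It assumes each record is held as a dictionary keyed by column name; the helper name rescale_par is hypothetical, and missing values are left untouched.

```python
PAR_CONVERSION_FACTOR = 0.0185  # factor quoted in the text above


def rescale_par(row: dict) -> dict:
    """Return a copy of the row with hamatop/hamabot rescaled."""
    converted = dict(row)
    for column in ("hamatop", "hamabot"):
        if converted.get(column) is not None:
            converted[column] = converted[column] * PAR_CONVERSION_FACTOR
    return converted


# Made-up raw values, used only to show the arithmetic.
example = rescale_par({"nodeid": 1, "epoch": 10, "hamatop": 2000.0, "hamabot": None})
print(example["hamatop"], example["hamabot"])
```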

Each record, unique to the data source, has a composite key identifier: nodeid and epoch. During exploratory analysis we identified duplicate entries. Figure 3 displays a distinct count of nodeid/epoch combinations which appear more than once. For example, there are 8,286 affected composite keys from the logger data source that have been duplicated anywhere from two to four times.

Out of this duplicate record subset, how many are unique entries? A unique entry indicates that some dimension differs between the duplicated rows, which could mean the same sensor reported different measurements at the same time.

The pie chart displays the total number of duplicated rows by data source. Distinct refers to rows that differ in some quantity outside of the primary key, while non-distinct means the repeated entries were exact copies. Together, the duplicated rows equate to over 10% of the raw sample. Due to our unfamiliarity with the sensor configuration, the lack of detail about this problem in the Tolle paper, and the large percentage of affected measurements, we decided to average the numeric measures across the affected rows and remove the duplicated entries.
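The averaging rule described above can be sketched as follows. This is an illustration rather than the code used for the report: it assumes each record is a dictionary, that (nodeid, epoch) is the composite key, and that every other column is a numeric measure. The column name "temperature" and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(rows, key_cols=("nodeid", "epoch")):
    """Collapse rows sharing a composite key into one row of averaged measures."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in key_cols)].append(row)

    merged = []
    for key, dupes in groups.items():
        out = dict(zip(key_cols, key))
        measure_cols = [c for c in dupes[0] if c not in key_cols]
        for col in measure_cols:
            # Ignore missing values so one empty duplicate does not erase a reading.
            values = [r[col] for r in dupes if r.get(col) is not None]
            out[col] = mean(values) if values else None
        merged.append(out)
    return merged


sample = [
    {"nodeid": 3, "epoch": 100, "temperature": 20.0},
    {"nodeid": 3, "epoch": 100, "temperature": 22.0},  # distinct duplicate
    {"nodeid": 4, "epoch": 100, "temperature": 18.5},
]
deduped = average_duplicates(sample)
print(len(deduped))  # 2
```

Exact copies average to themselves, so the same function handles both the distinct and the non-distinct duplicates.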

Missing Data

We took several data removal steps to prepare the data for analysis. The table shows each processing step and the record count after its completion. First, duplicates were averaged and removed as described above. Next, we replicated the voltage filter outlined in the Tolle paper: entries with voltage greater than 3V or less than 2.4V were removed. Numerous entries with measurements missing across all dimensions were also removed. To create a holistic dataset covering the maximum amount of time, records from both data sources were combined into a unified view, and numerical measures were averaged for composite keys present in both the network and logger files. Lastly, we visually identified some outliers and removed them. This step is discussed in more detail later.

Data Removal Step     Record Count    Net Change
Ingestion                  416,036             0
Duplicate Removal          393,213       -22,823
Voltage Removal            351,914       -41,299
Drop NAs                   341,659       -10,255
Holistic View              277,241       -64,418
Outlier Removal            264,932       -12,309
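The voltage filter and the missing-data step from the table can be sketched as two small predicates. The measure column names below are placeholders rather than the names in the raw files, and treating a missing voltage as failing the filter is an assumption of this sketch.

```python
# Placeholder measure names; the raw files may use different column names.
MEASURES = ("humidity", "temperature", "incident_par", "reflected_par")


def within_voltage_range(row, low=2.4, high=3.0):
    """Keep rows whose voltage lies in [2.4V, 3V]; missing voltage fails (assumption)."""
    voltage = row.get("voltage")
    return voltage is not None and low <= voltage <= high


def has_any_measure(row, measures=MEASURES):
    """A row is dropped only when every measure is missing."""
    return any(row.get(m) is not None for m in measures)


def clean(rows):
    kept = [r for r in rows if within_voltage_range(r)]
    return [r for r in kept if has_any_measure(r)]


sample = [
    {"voltage": 2.7, "humidity": 55.0, "temperature": 14.2},
    {"voltage": 3.2, "humidity": 60.0, "temperature": 15.0},  # voltage too high
    {"voltage": 2.6, "humidity": None, "temperature": None},  # nothing measured
]
print(len(clean(sample)))  # 1
```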

The time of day of the missing measurements appears uniform across the 24 hours. Measurement issues become more frequent as the experiment progresses than they were at the beginning. This echoes the observation made in the Tolle report.

Outlier Identification

Tolle mentioned outlier issues with humidity measurements, where relative humidity was greater than 100%. We observed a similar problem in our dataset. Visually inspecting the histogram, we see a number of readings over the 100% threshold, which were excluded from analysis.

Visually inspecting the temperature plots, we saw a large number of outliers. A quick Google search for the hottest recorded temperature in Sonoma, CA returned 44 degrees Celsius, much lower than the 100+ degree points indicated on the plots. Tolle’s maximum temperature reading was 32.6 degrees, and because we are unfamiliar with the climate in the region we decided to use this as our cutoff. All measurements greater than 32.6 were removed.

The incident PAR histogram shows a long-tailed distribution. In the boxplot we see two distinct groups of outliers separated by almost 1000 units. We removed the second group by enforcing a manual cutoff of < 2500, because its small number of points suggested sensor failure.

As noted above, we found a large number of duplicate entries in both measurement sources. These duplicated entries were removed to prevent over-weighting the affected measurements. The incident PAR boxplot indicates a large number of potential outliers even after the initial cutoff filtering. However, we decided to include them in the analysis because we believed the nighttime 0 PAR readings were pulling the quantiles towards 0. We are also not familiar with the flux units of the PAR dimensions, which contributed to our decision to keep the remaining points.
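The three cutoffs described in this section can be collected into a single predicate. The thresholds come from the text; the dictionary keys are placeholder column names, and leaving missing values unflagged is an assumption of this sketch.

```python
MAX_HUMIDITY_PCT = 100.0   # readings above 100% were excluded
MAX_TEMP_C = 32.6          # Tolle's maximum temperature reading
MAX_INCIDENT_PAR = 2500.0  # manual cutoff: keep incident PAR < 2500


def is_outlier(row):
    """Flag a row that violates any of the three cutoffs; missing values are not flagged."""
    humidity = row.get("humidity")
    temperature = row.get("temperature")
    incident_par = row.get("incident_par")
    return (
        (humidity is not None and humidity > MAX_HUMIDITY_PCT)
        or (temperature is not None and temperature > MAX_TEMP_C)
        or (incident_par is not None and incident_par >= MAX_INCIDENT_PAR)
    )


# Made-up readings, used only to exercise each rule.
sample = [
    {"humidity": 85.0, "temperature": 18.0, "incident_par": 0.0},
    {"humidity": 104.2, "temperature": 12.0, "incident_par": 0.0},
    {"humidity": 40.0, "temperature": 122.0, "incident_par": 900.0},
    {"humidity": 35.0, "temperature": 25.0, "incident_par": 2600.0},
]
kept = [row for row in sample if not is_outlier(row)]
print(len(kept))  # 1
```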

3 Data Exploration

For pairwise analysis we decided to look at two distinct time periods, sunrise and sunset, as we believed they presented the most dynamic conditions in which to explore potential correlations. Would trends observed during one time period also manifest during the other? Researching sunrise and sunset times in Sonoma, CA during the months of the study led us to select 5-hour intervals encompassing both astronomical events. Sunrise: 5:00am - 10:00am and Sunset: 4:00pm - 9:00pm
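Assigning a reading to one of the two windows might look like the sketch below. The window boundaries are those given above; treating both endpoints as inclusive, and the name window_of, are assumptions.

```python
from datetime import datetime, time

# Window boundaries from the text; inclusive endpoints are an assumption.
WINDOWS = {
    "sunrise": (time(5, 0), time(10, 0)),
    "sunset": (time(16, 0), time(21, 0)),
}


def window_of(timestamp: datetime):
    """Return 'sunrise', 'sunset', or None for readings outside both windows."""
    for name, (start, end) in WINDOWS.items():
        if start <= timestamp.time() <= end:
            return name
    return None


print(window_of(datetime(2004, 5, 7, 6, 30)))   # sunrise
print(window_of(datetime(2004, 5, 7, 12, 0)))   # None
print(window_of(datetime(2004, 5, 7, 18, 45)))  # sunset
```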

Many pairwise scatter plots were analyzed for trends, and we highlight the most interesting findings below. The humidity vs. temperature plot reveals a relationship between the two variables, as many of the points are clustered together. We see that as temperature increases, humidity decreases. This trend appears linear during sunrise but exponential/polynomial during sunset.

Correlation plots were created for each of the measured values and can be viewed as proxies for the scatter plot relationships. Highlights from these figures include a strong positive correlation between incident and reflected PAR and a negative correlation between humidity and temperature. All correlations appear magnified (darker colors) during sunset, indicating stronger relationships.
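For reference, a single cell of such a correlation plot can be computed as follows. The sketch assumes the plots show Pearson correlation coefficients, which the report does not state, and the paired values are invented purely to show a negative relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented values: humidity falls steadily as temperature rises.
temperature = [10.0, 15.0, 20.0, 25.0]
humidity = [90.0, 75.0, 60.0, 45.0]
print(round(pearson(temperature, humidity), 3))  # -1.0
```

A value near -1 corresponds to the darkest negative cells in the plots, and a value near +1 to the darkest positive cells.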

Incident PAR Association

Temperature appears to be a good predictor of incident PAR. We saw positive correlations between the two variables during both time frames. Humidity displayed a negative correlation with incident PAR, which is more pronounced during sunset.

Time Series

The four measured dimensions were plotted over time. To make the plots easier to read, the sensor heights were grouped into 10-meter height classes. Our analysis of each plot follows.
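The height grouping can be sketched as a simple binning function. The report does not say where the class boundaries fall, so starting each class at a multiple of 10m and including the lower bound are assumptions, as is the label format.

```python
def height_class(height_m: float, width_m: int = 10) -> str:
    """Label the 10-meter class containing a sensor height (lower bound inclusive)."""
    lower = int(height_m // width_m) * width_m
    return f"{lower}-{lower + width_m}m"


for height in (15, 32.5, 70):
    print(height, height_class(height))
# 15 10-20m
# 32.5 30-40m
# 70 70-80m
```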

Temperature vs. Time

The high temperature readings discussed earlier were artificially removed, leaving a range of 6.78 - 32.6 degrees Celsius.

There are some interesting trends in the temporal domain. Sensors at all heights track the same general shape but show slight differences based on height, which may indicate a relationship worth investigating. The high temperatures were almost exclusively achieved in the evenings (May 2nd/14th) by sensors mounted higher than 30m. The low temperature readings do not follow this pattern.

One interesting empirical note: we see a local temperature maximum around May 31st, which may indicate a meteorological event (heat wave, cold front, etc.) during that period and could be an interesting focus for future analysis.

Humidity vs. Time

Humidity readings were artificially capped at 0% and 100%. We see a range of 16.3% - 100% over the duration of collection. All sensors tracked the same general trend irrespective of sensor height, which indicates a weak or nonexistent relationship between height and humidity. One interesting feature of this chart appears around May 31st, where humidity drops to a local minimum before increasing again. This corresponds with the temperature increase we observed over the same period.