A Redwood Canopy Data Analysis

Introduction

Tolle et al. (2005) completed an initial macroscopic study of the climate around a coastal California redwood tree using a network of wireless sensors.

In 2021 the raw data was repurposed to teach statistical modeling at Duke University, and this report is the product of one such assignment. The assignment emphasized the importance of data preparation and cleaning before beginning any analysis. The structure of the original assignment has been condensed into five parts: original data collection, data cleaning, data exploration, interesting findings, and conclusion.

1 Original Data Collection

Paper Summary

The purpose of the original study was twofold: capture information about a single redwood tree canopy over time and provide a roadmap for future macroscopic studies using a multi-sensor network.

The data was collected in Sonoma, California, on a single redwood tree at consistent time intervals over a period of roughly 44 days during the late spring and early summer.

Based on this study, the researchers verified the existence of dynamic spatio-temporal gradients surrounding the tree and showed that complex biological theories can be validated using this measurement framework. They also highlighted lessons learned for future studies, including sensor sensitivity to positioning and yield issues caused by memory and network constraints.

How are sensors deployed?

Nodes (sensor housings) were attached to the body of one redwood tree at various radial, angular, and vertical positions. Nodes were spaced at roughly 2-meter intervals, with the first placed 15 m above ground level and the last at 70 m. The majority of nodes used in the analysis were placed on the west side of the tree, about 0.1 m to 1 m from the trunk. Several nodes were placed outside the measurement envelope to monitor readings in the immediate vicinity.

Nodes had two means of capturing and transferring data: a logger stored readings on the device, and a separate workflow transferred readings over the wireless network to a gateway.

Sensors were calibrated before field deployment in two trials, called roof and chamber. In the roof trial, nodes were placed in direct sunlight atop a building to test the PAR measurements, which were compared to a well-known reference. In the chamber trial, temperature and humidity sensors were exposed to a wide range of conditions: 5-30 degrees Celsius and 20-90 %RH. Before deployment, the data-harvesting queries were tested in the field on a sample tree under similar conditions to verify communication between the nodes and the on-site, internet-connected gateway.

What is the duration of data recording?

Data was recorded over a period of almost 44 days, from 4/27/2004 5:10pm (epoch 1) to 6/10/2004 2:00pm (epoch 12635). Measurements were taken every 5 minutes, and the battery-operated nodes were duty cycled to conserve power (on for 4 seconds to take measurements, then off until the next reading).
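The epoch numbers and dates quoted above are consistent with a fixed 5-minute step. A minimal sketch of that mapping, assuming epochs are 1-based and perfectly regular (the function name is ours, and real node clocks may drift from this nominal schedule):

```python
from datetime import datetime, timedelta

# Start time and sampling interval are taken from the text above.
EPOCH_ONE = datetime(2004, 4, 27, 17, 10)
SAMPLE_INTERVAL = timedelta(minutes=5)


def epoch_to_timestamp(epoch: int) -> datetime:
    """Return the nominal wall-clock time of a 1-based epoch number."""
    if epoch < 1:
        raise ValueError("epoch numbers start at 1")
    return EPOCH_ONE + (epoch - 1) * SAMPLE_INTERVAL


last = epoch_to_timestamp(12635)
print(last)              # 2004-06-10 14:00:00
print(last - EPOCH_ONE)  # 43 days, 20:50:00
```

Under the same assumption, epoch 2812 falls on the morning of May 7th.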

What are the main variables of interest?

Researchers were interested in the traditional climate variables of temperature, humidity, and light level, the last of which was measured as Photosynthetically Active Radiation (PAR). Light at wavelengths between 350 nm and 700 nm was captured in two measurements: incident (direct), which provides information about the energy available for photosynthesis, and reflected (ambient), which was used for satellite validation of measurements.

What is the difference between the data in the two files?

Data in “sonoma-data-log.csv” is the sensor data saved by the logger to flash memory on the node itself and retrieved after the deployment. Data in “sonoma-data-net.csv” is the data retrieved over the wireless network during the deployment.

The main difference between the files is that the scaling/precision of the voltage measurements appears to differ between the two sources. The network file also appears to have some row duplication.

2 Data Cleaning

In Figure 1, the number of measurements recorded during each epoch is plotted for each data source: logger and network. Epoch range and record frequency are not consistent between sources; however, the distributions look similar over the epoch ranges the two sources share. The first network measurement occurs at epoch 2812, which corresponds to May 7th, almost 10 days after the experiment started. Tolle et al. cited packet drops and network-related issues as potential culprits for the low network yield, but the raw data provided does not match the findings in Figure 7 of the original report.

Voltage units are not consistent across the two sources. Figure 2 illustrates the scaling difference: the voltage distributions differ by data source. Voltage readings from the network source were converted to logger voltage units.

Hamatop and hamabot measures had consistent units across data sources, but did not match the scale reported in the Tolle paper. These measures were converted to match by multiplying both by a conversion factor of 0.0185.
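The PAR rescaling is a single multiplication. A minimal sketch, assuming the data has been loaded into a pandas DataFrame with columns named hamatop and hamabot (the function name and the sample values are illustrative only):

```python
import pandas as pd

PAR_SCALE = 0.0185  # conversion factor stated in the text


def rescale_par(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with hamatop and hamabot multiplied by PAR_SCALE."""
    out = df.copy()
    out[["hamatop", "hamabot"]] = out[["hamatop", "hamabot"]] * PAR_SCALE
    return out


# Made-up raw readings, only to show the call.
raw = pd.DataFrame({"hamatop": [0.0, 100000.0], "hamabot": [0.0, 2000.0]})
scaled = rescale_par(raw)
print(scaled)
```

Working on a copy keeps the raw readings available for comparison against the original scale.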

Each record, unique to the data source, has a composite key identifier: nodeid and epoch. During exploratory analysis we identified duplicate entries. Figure 3 displays a distinct count of nodeid/epoch combinations which appear more than once. For example, there are 8,286 affected composite keys from the logger data source that have been duplicated anywhere from two to four times.

Out of this duplicate record subset how many are unique entries? A unique entry indicates some dimension differs between the duplicated rows and could indicate different measurement readings from the same sensor at the same time.

The pie chart displays the total number of duplicated rows by data source. Distinct refers to rows whose values differ outside of the composite key, while non-distinct means the repeated entries are exact copies. Together, the duplicated rows equate to over 10% of the raw sample. Given our unfamiliarity with the sensor configuration, the lack of detail about this problem in the Tolle paper, and the large percentage of affected measurements, we decided to average the numeric measures across the affected rows and keep a single entry per composite key.
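The duplicate handling described above can be sketched with pandas. This is an illustration under assumptions rather than the exact code used for the report: the nodeid and epoch keys come from the text, while the temperature column, the function names, and the sample rows are hypothetical.

```python
import pandas as pd

KEY = ["nodeid", "epoch"]  # composite key named in the text


def summarize_duplicates(df: pd.DataFrame) -> dict:
    """Count repeated keys and split their rows into exact copies vs. differing rows."""
    dup_rows = df[df.duplicated(subset=KEY, keep=False)]
    n_keys = len(dup_rows.drop_duplicates(subset=KEY))
    # A row is "non-distinct" when another row matches it in every column.
    exact = dup_rows.duplicated(keep=False)
    return {
        "keys": n_keys,
        "non_distinct_rows": int(exact.sum()),
        "distinct_rows": int((~exact).sum()),
    }


def average_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse each (nodeid, epoch) key to one row by averaging numeric columns."""
    return df.groupby(KEY, as_index=False).mean(numeric_only=True)


sample = pd.DataFrame({
    "nodeid": [1, 1, 2, 2, 3],
    "epoch": [10, 10, 10, 10, 10],
    "temperature": [15.0, 15.0, 20.0, 22.0, 18.0],
})
print(summarize_duplicates(sample))
print(average_duplicates(sample))
```

Averaging leaves exact copies unchanged and replaces conflicting readings with their mean, which is one of several defensible choices when the cause of the conflict is unknown.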

Missing Data

We took several data removal steps before analysis. The table shows each processing step and the record count after its completion. First, duplicates were averaged and removed as described above. Next, we replicated the voltage filter outlined in the Tolle paper: entries with voltage greater than 3V or less than 2.4V were removed. Numerous entries with missing measurements across all dimensions were also removed. To create a holistic dataset covering the maximum amount of time, records from both data sources were combined into a unified view, and numerical measures were averaged for composite keys present in both the network and logger files. Lastly, we removed some outliers identified visually; this step is discussed in more detail later.

Data Removal Step     Record Count     Net Change
Ingestion                  416,036              0
Duplicate Removal          393,213        -22,823
Voltage Removal            351,914        -41,299
Drop NAs                   341,659        -10,255
Holistic View              277,241        -64,418
Outlier Removal            264,932        -12,309
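The voltage filter, missing-data removal, and source merge can be sketched as three small pandas steps. This is a sketch under assumptions: the voltage, humidity, and temperature column names, the function names, and the sample rows are hypothetical, and the network voltages are assumed to have already been converted to logger units.

```python
import pandas as pd

KEY = ["nodeid", "epoch"]
# hamatop/hamabot appear in the text; the other column names are assumed.
MEASURES = ["humidity", "temperature", "hamatop", "hamabot"]


def voltage_filter(df, low=2.4, high=3.0):
    """Keep rows with low <= voltage <= high; rows with missing voltage are dropped too."""
    return df[df["voltage"].between(low, high)]


def drop_empty(df):
    """Drop rows in which every measurement column is missing."""
    return df.dropna(subset=MEASURES, how="all")


def combine_sources(log_df, net_df):
    """Stack both sources and average numeric columns for keys present in both."""
    both = pd.concat([log_df, net_df], ignore_index=True)
    return both.groupby(KEY, as_index=False).mean(numeric_only=True)


log = pd.DataFrame({
    "nodeid": [1, 1, 1], "epoch": [1, 2, 3], "voltage": [2.7, 3.2, 2.5],
    "humidity": [50.0, 55.0, None], "temperature": [10.0, 11.0, None],
    "hamatop": [100.0, 120.0, None], "hamabot": [5.0, 6.0, None],
})
net = pd.DataFrame({
    "nodeid": [1, 2], "epoch": [1, 1], "voltage": [2.7, 2.6],
    "humidity": [54.0, 60.0], "temperature": [12.0, 9.0],
    "hamatop": [110.0, 90.0], "hamabot": [7.0, 4.0],
})
clean = combine_sources(drop_empty(voltage_filter(log)),
                        drop_empty(voltage_filter(net)))
print(clean)
```

Printing the row count after each step is a simple way to build a record-count table like the one above.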

The time of day of the missing measurements appears uniform across the 24 hours. We see more measurement issues later in the experiment than at the beginning, which echoes the observation made in the Tolle report.

Outlier Identification

Tolle mentioned outlier issues in the humidity measurements, where relative humidity exceeded 100%. We observed a similar problem in our dataset. Visual inspection of the histogram shows a number of readings over the 100% threshold, which were excluded from analysis.

Visual inspection of the temperature plots showed a large number of outliers. A quick search for the hottest recorded temperature in Sonoma, CA returned 44 degrees Celsius, much lower than the 100+ degree points on the plots. Tolle's maximum temperature reading was 32.6 degrees, and because we are unfamiliar with the climate of the region we used this value as our cutoff. All measurements greater than 32.6 were removed.

The incident PAR histogram shows a long-tailed distribution. In the boxplot we see two distinct groups of outliers separated by almost 1000 units. We decided to remove the second group by enforcing a manual cutoff of < 2500, because the low number of points in that group suggested sensor failure.
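The three thresholds described above translate into one boolean mask. A minimal sketch, assuming hypothetical column names humidity and temperature, with hamatop holding incident PAR after the unit conversion:

```python
import pandas as pd


def remove_outliers(df, max_humidity=100.0, max_temp=32.6, max_par=2500.0):
    """Keep rows at or below the humidity and temperature caps and below the PAR cutoff."""
    keep = (
        (df["humidity"] <= max_humidity)
        & (df["temperature"] <= max_temp)
        & (df["hamatop"] < max_par)
    )
    return df[keep]


# Made-up rows: one clean, three violating one threshold each, one on the boundary.
sample = pd.DataFrame({
    "humidity": [50.0, 105.0, 60.0, 70.0, 100.0],
    "temperature": [20.0, 18.0, 40.0, 15.0, 32.6],
    "hamatop": [100.0, 200.0, 300.0, 3000.0, 2499.0],
})
filtered = remove_outliers(sample)
print(filtered)
```

Comparisons against a missing value evaluate to False, so rows with a missing measurement in any of the three columns are also dropped by this mask.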

As we noted above, we found a large number of duplicate entries from both measurement sources. These duplicated entries were removed to prevent over-weighting with additional measures. The incident PAR boxplot indicates a large number of potential outliers even after initial cut-off filtering. However, we decided to include them in analysis because we believed the nighttime 0 PAR readings were skewing quantiles closer towards 0. We also are not familiar with the flux units on PAR dimensions which contributed to inclusion of remaining points.

3 Data Exploration

For pairwise analysis we decided to look at two distinct time periods, sunrise and sunset, because we believed they presented the most dynamic conditions in which to explore potential correlations. Would trends observed during one time period also manifest during the other? Researching sunrise and sunset times in Sonoma, CA during the months of the study led us to select 5-hour intervals encompassing both astronomical events. Sunrise: 5:00am - 10:00am and Sunset: 4:00pm - 9:00pm

Many pairwise scatter plots were analyzed for trends, and we highlight the interesting findings below. The humidity vs. temperature plot reveals a relationship between the two variables, with many of the points clustered together. We see that as temperature increases, humidity decreases. This trend appears linear during sunrise, but appears exponential/polynomial during sunset.

Correlation plots were created for each of the measured values and can be viewed as proxies for the scatter plot relationships. Highlights from these figures include a strong positive correlation between incident and reflected PAR and a negative correlation between humidity and temperature. All correlations appear magnified (darker colors) during sunset, indicating stronger relationships.
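Assigning each reading to one of the two windows is a simple time-of-day check. A sketch using only the standard library, assuming both endpoints of each interval are included (the text does not say, and the function name is ours):

```python
from datetime import datetime, time

# Window boundaries as stated in the text.
WINDOWS = {
    "sunrise": (time(5, 0), time(10, 0)),
    "sunset": (time(16, 0), time(21, 0)),
}


def label_window(ts: datetime):
    """Return 'sunrise' or 'sunset' for a timestamp, or None if it is in neither window."""
    t = ts.time()
    for name, (start, end) in WINDOWS.items():
        if start <= t <= end:
            return name
    return None


print(label_window(datetime(2004, 5, 7, 6, 30)))   # sunrise
print(label_window(datetime(2004, 5, 7, 18, 0)))   # sunset
print(label_window(datetime(2004, 5, 7, 12, 0)))   # None
```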
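One way to produce a correlation matrix per time window is shown below. This is a sketch under assumptions: it uses Pearson correlation (the pandas default; the text does not state which coefficient was plotted), and the window, temperature, and humidity column names and the sample values are hypothetical.

```python
import pandas as pd


def window_correlations(df: pd.DataFrame, cols: list) -> dict:
    """Return one Pearson correlation matrix per value of the 'window' column."""
    return {name: grp[cols].corr() for name, grp in df.groupby("window")}


# Made-up readings in which humidity falls as temperature rises.
sample = pd.DataFrame({
    "window": ["sunrise"] * 3 + ["sunset"] * 3,
    "temperature": [10.0, 12.0, 14.0, 20.0, 18.0, 16.0],
    "humidity": [90.0, 80.0, 70.0, 50.0, 60.0, 75.0],
})
corr = window_correlations(sample, ["temperature", "humidity"])
for name, matrix in corr.items():
    print(name)
    print(matrix)
```

Passing method="spearman" to corr would be a reasonable alternative where the relationship looks monotonic but not linear, as the sunset scatter plot suggests.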

Incident PAR Association

Temperature appears to be a good predictor of incident PAR. We saw positive correlations between the two variables during both time frames. Humidity displayed a negative correlation with incident PAR, which is more pronounced during sunset.

Time Series

The four measured dimensions were plotted over time. To make the plots easier to read, the sensor heights were grouped into height classes of 10-meter intervals. Our analysis of each plot follows.
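The height grouping can be sketched with pandas.cut. The bin labels below follow the height classes used later in this report; the height column name is hypothetical, and treating each class as closed on the left (so a sensor at exactly 20 m falls in the 20 - 30 m class) is our assumption.

```python
import pandas as pd

BINS = [10, 20, 30, 40, 50, 60, float("inf")]
LABELS = ["10 - 20 m", "20 - 30 m", "30 - 40 m", "40 - 50 m", "50 - 60 m", "60+ m"]


def add_height_class(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a categorical height_class column; missing heights stay NA."""
    out = df.copy()
    out["height_class"] = pd.cut(out["height"], bins=BINS, labels=LABELS, right=False)
    return out


# Made-up sensor heights in meters, including one missing value.
sample = pd.DataFrame({"height": [15.0, 20.0, 45.5, 66.5, None]})
classed = add_height_class(sample)
print(classed)
```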

Temperature vs. Time

High temperature readings were removed as discussed earlier, so the upper end of the range reflects our cutoff. The remaining readings range from 6.78 to 32.6 degrees Celsius.

There are some interesting trends in the temporal domain. Sensors at all heights track the same general shape, but show slight differences by height that may indicate a relationship worth investigating. The high temperatures were almost exclusively achieved in the evenings (May 2nd/14th) by sensors mounted higher than 30m. This pattern does not hold for the low temperature readings.

One interesting empirical note: we see a local temperature maximum around May 31st that may indicate a meteorological event (heat wave, cold front, etc.) during that period, which could be an interesting focus for future analysis.

Humidity vs. Time

Humidity readings were restricted to the range 0% to 100%. We see a range of 16.3% to 100% over the duration of collection. All sensors tracked the same general trend irrespective of sensor height, which indicates a weak relationship, or none, between height and humidity. One interesting feature of this chart appears around May 31st, where humidity drops to a local minimum before increasing again. This corresponds with the temperature increase we observed over the same period.

Incident PAR vs. Time

We see a range of 0 to 2146 over the duration of collection. The highest-mounted sensors often achieved the highest reading each day. The peaks and valleys reflect the day/night cycle, with peaks during the day and valleys at night. On May 26th the chart shows a much lower local maximum of ~800 than on other days, which might indicate some sort of sun-blocking event (cloud cover, storm, etc.).

Reflected PAR vs. Time

We see a range of 0 to 175 over the duration of collection. Most of the daily high readings correspond to the highest-mounted sensors and most of the daily lows to the lowest-mounted sensors. One interesting observation from this plot concerns the daily maximums: there is much greater variability in this metric than in the incident PAR plot. Several days recorded maximum values that are much lower than the rest. One corresponds with the cloudy day (May 26th) noted in the incident PAR chart, but there are several other days that may be worth investigating.

PCA Analysis & Scree Plot

PCA was performed on the following dimensions: humidity, temperature, incident PAR, reflected PAR, and height. The scree plot shows that the first three components explain almost 90% of the variability in the data, which indicates that a lower dimensional representation is possible. We discuss the lower dimensional representation in the first interesting finding.
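The quantities behind a scree plot can be computed directly from a singular value decomposition. The sketch below is an illustration, not the code used for this report: it assumes the columns were standardized before PCA (the text does not say), the function name is ours, and the input is synthetic data with five columns standing in for the five dimensions.

```python
import numpy as np


def pca_explained_variance(X):
    """Return (explained variance ratios, component scores) for standardized columns of X."""
    X = np.asarray(X, dtype=float)
    # Standardize so that variables on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    variances = s ** 2 / (len(Z) - 1)   # eigenvalues of the correlation matrix
    ratio = variances / variances.sum()
    scores = Z @ vt.T                   # rows projected onto the principal axes
    return ratio, scores


# Synthetic stand-in data: two correlated pairs plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    -base[:, 1] + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
ratio, scores = pca_explained_variance(X)
print(np.cumsum(ratio))  # cumulative share of variance, the basis of a scree plot
```

The first two columns of scores are the values one would plot to compare groups in the plane of the first two principal components.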

4 Interesting Findings

Lower Dimensional Height Dispersion

Are measurements recorded at different heights distinguishable?

Here the first two principal component score vectors were plotted for each of the height classes. Interestingly, we see tight groupings for sensors mounted below 30 m and much more variability for measurements recorded at higher elevations. We initially suspected this variability was due to the sample size recorded at each height, with more samples recorded at higher levels because of the larger number of sensors. However, the table shows the most measurements in the 40-50 meter height range, yet much more variability in the 60+ meter class. Intuitively, we suspect that the top of the tree presents a harsher environment that is more exposed to weather patterns than the base, which is more densely protected by foliage. This could explain the wide dispersion of points at higher measurement levels.

Height class     NA       10 - 20 m   20 - 30 m   30 - 40 m   40 - 50 m   50 - 60 m   60+ m
Record count     2,201    2,638       32,633      29,644      82,201      64,121      51,494

Reflected PAR Max Reading Variability

Does the time of day determine the max reflected PAR?

We noticed in the reflected PAR time series plots that the daily maximum reading varied more across days than the incident PAR maximum did. The scatter plot shows the maximum reading by day, with a blue line representing the average maximum value. Notice the large amount of variability. We wanted to know whether the hour of the day at which the maximum occurs, which corresponds to the sun's height, was centered around a specific time and maximum value. The histogram shows that most of our maximum measurements occur around the evening hours. Tolle mentioned that the PAR sensors were incredibly sensitive to changes in light angle, and we believe this finding supports that observation. We would expect the reflected PAR maximum trend to be similar to that of incident PAR, but we believe the difference could be explained by cloud cover between 16:00 and 17:00, which would account for the wide dispersion in the histogram.
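The daily maximum and the hour at which it occurs can be extracted with a grouped idxmax. A sketch, assuming a hypothetical timestamp column and hamabot holding reflected PAR; the function name and sample readings are made up for illustration.

```python
import pandas as pd


def daily_max_reflected(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per day: the largest hamabot reading and the hour it occurred."""
    # idxmax gives the row label of the first maximum within each calendar day.
    idx = df.groupby(df["timestamp"].dt.date)["hamabot"].idxmax()
    peaks = df.loc[idx, ["timestamp", "hamabot"]].copy()
    peaks["hour"] = peaks["timestamp"].dt.hour
    return peaks.reset_index(drop=True)


sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2004-05-10 09:00", "2004-05-10 16:00", "2004-05-10 20:00",
        "2004-05-11 10:00", "2004-05-11 17:00",
    ]),
    "hamabot": [50.0, 120.0, 10.0, 30.0, 90.0],
})
peaks = daily_max_reflected(sample)
print(peaks)
print("average daily max:", peaks["hamabot"].mean())
```

A histogram of the hour column then shows when the daily maximum tends to occur.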

Covariate Height Dependence

How much does the humidity and temperature vary with height during the day?

Our hypothesis is that the temperature measured during the daytime should be lowest at the bottom of the redwood and should have a positive relationship with height. Humidity should behave the opposite way and have a negative relationship with height. The explanation is that the top of the redwood receives a larger amount of incident and reflected PAR, and therefore has more energy and higher temperatures. Correspondingly, the higher the temperature, the faster the water vapor molecules move, which leads to a lower humidity.

The stacked box plot by height class during the daytime (6am-8pm) partially supports our hypothesis. The 20-30 m range shows lower mean temperature and higher mean humidity. However, outside of the 20-30 m range, temperature does not vary significantly across the 30-60 m range, so the linear relationship we assumed is not supported by our data. Our findings correspond to those of Tolle et al. (2005), whose stacked box plot shows that the middle part of the redwood has a comparatively consistent temperature throughout the day and night. In conclusion, we suppose that redwoods have a spatial homeostasis mechanism that regulates temperature and humidity in the middle of the tree. It would be an interesting research topic to determine how this homeostasis works.
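The comparison behind the stacked box plot can be approximated with a grouped summary. A sketch under assumptions: the timestamp, height_class, temperature, and humidity column names are hypothetical, daytime is taken as 6:00 up to but not including 20:00, and the sample rows are made up.

```python
import pandas as pd


def daytime_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean temperature and humidity per height class for readings from 6:00 to 19:59."""
    hours = df["timestamp"].dt.hour
    day = df[(hours >= 6) & (hours < 20)]
    return day.groupby("height_class")[["temperature", "humidity"]].mean()


sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2004-05-10 07:00", "2004-05-10 12:00",
        "2004-05-10 12:00", "2004-05-10 23:00",
    ]),
    "height_class": ["20 - 30 m", "20 - 30 m", "40 - 50 m", "40 - 50 m"],
    "temperature": [10.0, 14.0, 18.0, 8.0],
    "humidity": [90.0, 80.0, 60.0, 95.0],
})
summary = daytime_summary(sample)
print(summary)
```

Swapping mean for median, or calling describe on the grouped columns, gives the quartiles that a box plot displays.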

5 Conclusion

This project looked at data collected by Tolle et al. (2005) as part of a predictive/statistical modeling course at Duke University. Much of the raw data required extensive cleaning before analysis, which was a point of emphasis from the start. Interesting findings were compiled through detailed exploratory data analysis, clustering, and PCA.