Sunday, May 4, 2025

Insights into endometriosis symptom trajectories and assessment of surgical intervention outcomes using longitudinal actigraphy

Study design

Participants were recruited from gynecology out-patient departments and endometriosis service, including those on waiting lists for surgery. Inclusion criteria were defined as follows: being aged 16 or over, a diagnosis of endometriosis on imaging (for deep or ovarian endometriosis only) or at previous laparoscopy (all subtypes), no malignancy, and not currently pregnant. The study design is illustrated in Fig. 9. Participants took part in the study for up to three 4-6-week periods, which herein are referred to as smartwatch cycles (the term is used only to refer to cycles of wearing the smartwatch, not menstrual cycles). Smartwatch cycles could be completed whenever convenient, generally over a maximum period of 12 months. Participants were contacted by telephone by research nurses to schedule subsequent smartwatch cycles. A subset of patients who had previously been diagnosed with deep endometriosis and who planned to receive surgery to excise the endometriosis, with or without additional total hysterectomy±BSO, were recruited to the surgical “sub-study”, where they were asked to complete the first smartwatch cycle at any point prior to surgery, the second immediately following surgery, and the third approximately 4-6 months following surgery. Other participants could potentially have surgery during the study, but only those recruited to this “sub-study” and with deep or ovarian endometriosis found at surgery were required to complete the smartwatch cycles on this schedule and thus could be compared in further analysis. Consistent timing of smartwatch cycles was not required for all other participants for several reasons: to encourage ongoing participation by allowing hospital visits at intervals convenient for participants, and because no time-wise comparisons were made aside from the surgical sub-study (i.e., first smartwatch cycles were not compared directly to other first smartwatch cycles, etc.).

Fig. 9: Diagram of study design and analysis summary.

Included in the analysis summary is a glossary of indicative actigraphy measures and daily PROMs used in the analysis. For definitions of all daily actigraphy measures used in the analysis see Supplementary Data 1 (under the tab “Variable Definitions”).

During each smartwatch cycle, participants were asked to complete daily self-reports (PROMs) of symptoms and to wear the GENEActiv smartwatch. In the final week of each smartwatch cycle, participants were also asked to complete QoL questionnaires. At baseline and following each smartwatch cycle, details of any medication changes, surgery, menstruation, holidays, and other relevant information were collected by research nurses. At the end of the third smartwatch cycle or at withdrawal, participants were also asked to complete an acceptability questionnaire. All self-reported data, medical history, and surgical data were collected using REDCap electronic data capture tools hosted at the University of Edinburgh36,37.

Patient-reported outcome measures (PROMs)

Participants submitted daily pain and fatigue scores using (i) two questions assessing “average pain today” and “worst pain today” on a linear 10-point numeric scale from 1 (“no pain”) to 10 (“worst pain imaginable”), and (ii) the BFI. The BFI consists of nine items rated on an 11-point numeric scale, where the first three items ask to rate fatigue “right now”, “usual level of fatigue during the past 24 hours”, and “worst level of fatigue during the past 24 hours”38. The following six items assessed how, over the past 24 h, fatigue interfered with the following: (1) general activity, (2) mood, (3) walking ability, (4) normal work, (5) relations with other people, and (6) enjoyment of life. Daily PROMs could be completed through a daily SMS, or email link, or on paper, for inclusivity and adapting to participants’ preferences.

Information on demographics (ethnicity, education), medication use, and medical history, was collected by research nurses at baseline. After each smartwatch cycle, participants additionally reported any issues wearing the smartwatch, holidays, dates of menstruation, shift work, pregnancy, hormonal and pain medication taken, and any surgery or hospital visits.

Participants completed the EHP-30 questionnaire at baseline and in the final week of each smartwatch cycle. The EHP-30 is a clinically validated questionnaire that assesses the impact of endometriosis on health-related quality of life (HRQoL), consisting of 30 questions, each on a Likert scale from 0 (Never) to 4 (Always). The 30 questions are categorized into five subdomains related to “pain”, “control and powerlessness”, “social support”, “emotional wellbeing”, and “self-image”39. The questionnaire requires participants to report on their symptoms retrospectively, “during the last 4 weeks”39. Other questionnaires were also completed along with the EHP-30 but were not analyzed in this study.

Smartwatch data

Consented participants were asked to wear the GENEActiv Original smartwatch (https://activinsights.com/) for the duration of each smartwatch cycle. The GENEActiv watches were configured to collect tri-axial acceleration data at 10 Hz (which allows for a data collection period of at least 30 days on a single charge), which we have shown in previous work is sufficient for day-to-day PA and sleep assessments17. Collecting data longitudinally (in the actigraphy setting this is often taken to refer to > 2-3 weeks) is particularly useful for quantifying diurnal rhythms (see below for details): many actigraphy studies collect data for only a single week, which severely limits the information that can be extracted. The watches captured a dynamic range of ±8 g, where g is the unit of acceleration (equal to the acceleration due to gravity), with a resolution of 12 bit (3.9 mg). In addition, the watches incorporate an ambient light sensor with a range of 0–3000 lux and a resolution of 5 lux, and a temperature sensor with a range of 0–60 degrees Celsius and a resolution of 0.25 degrees.
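As a quick check of the stated resolution, the short sketch below reproduces the 3.9 mg figure from the ±8 g range and 12-bit depth. The function name is illustrative only, and the calculation assumes the quantization levels are spread evenly across the full range.

```python
def adc_resolution_mg(range_g: float = 8.0, bits: int = 12) -> float:
    """Smallest acceleration step (in milli-g) for a symmetric +/- range_g sensor.

    Assumes the 2**bits quantization levels are spread evenly over the full span.
    """
    full_span_g = 2 * range_g   # +/-8 g spans 16 g in total
    levels = 2 ** bits          # 12 bits give 4096 levels
    return 1000.0 * full_span_g / levels


print(round(adc_resolution_mg(), 1))  # 16 g / 4096 levels, expressed in mg
```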

Adherence analysis

To compute the average adherence of participants for both smartwatches and daily PROMs, smartwatch cycles with valid smartwatch data or with more than one recorded daily PROM were included. Thus, adherence for participants that dropped out after the first smartwatch cycle, for instance, would only be computed for that first smartwatch cycle. To account for the bias due to participants that did not participate in all three smartwatch cycles, we also compared adherence between the dropout and non-dropout groups (i.e., also computing adherence per smartwatch cycle for those participants that completed all three smartwatch cycles).

Processing of actigraphy data

To process the raw actigraphy data we used the R-package GGIR (version 3.0.6, see Supplementary Table 2 for the exact configuration parameters used). GGIR provides open-access tools for device calibration, non-wear and sleep detection, and extraction of a large range of sleep and PA measures40. Additionally, we processed the raw data using a further open-source approach with the MATLAB Actigraphy Toolbox developed in house and then implemented in R, which we have reported on in previous work41.

Raw actigraphy data must first be calibrated prior to any further processing, as each individual tri-axial accelerometer has device-specific offsets and therefore outputs must be aligned. For data processed using GGIR, the autocalibration process built into the package was used as detailed in van Hees et al.42, relying on periods of non-wear to take local gravity into account. For data processed independently of GGIR, outputs were calibrated using the offsets stored in the device by the manufacturer using the device-specific GENEAread package (version 2.0.10).

To detect periods where the device was not worn (non-wear), GGIR uses rolling 15-min intervals (each centered in a 60-min interval to take into account the periods before and after) and identifies acceleration below a specific threshold within that interval43. However, we found frequent misdetection of non-wear using this method, and thus we also adapted an established non-wear detection algorithm by Zhou et al. that uses the temperature sensor incorporated into the GENEActiv watches in addition to acceleration data, as temperature can provide a clear indication of when the device is worn44. For the endometriosis participants in this study, classification of non-wear periods indicated that adapted temperature thresholds were needed. The Zhou et al. method uses a threshold based on the standard deviation of acceleration values as well as a temperature threshold within a rolling window. Here, we first computed rolling averages of temperature across 5-minute windows (Tsmooth), and then detected periods of at least 90 minutes where Tsmooth was lower than a chosen threshold T0 or the change in Tsmooth from the previous minute was lower than -0.5 °C. Of those periods, those where the rate of change acceleration movement (ROCAM) (see below, paragraph on acceleration summary measures) was below 0.025 were set as non-wear, and if two non-wear periods had less than 15 min in between, the intervening period was also designated as non-wear. T0 was chosen as either 26 degrees Celsius or the 5th percentile of temperature readings, whichever was larger.
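The long non-wear step described above can be sketched roughly as follows. This is an illustrative simplification, not the study's implementation: it assumes one temperature and one ROCAM value per minute, a trailing 5-minute smoothing window, and that "ROCAM below 0.025" is judged on the mean ROCAM over each candidate period. All function names are hypothetical.

```python
from statistics import mean


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def rolling_mean(values, width=5):
    """Trailing rolling mean (assumption); early samples use what is available."""
    return [mean(values[max(0, i - width + 1): i + 1]) for i in range(len(values))]


def runs(flags):
    """(start, end) index pairs, end exclusive, of consecutive True values."""
    out, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(flags)))
    return out


def long_nonwear(temp, rocam, min_len=90, rocam_max=0.025, max_gap=15):
    """Flag long non-wear periods from per-minute temperature and ROCAM series."""
    t_smooth = rolling_mean(temp, 5)
    t0 = max(26.0, percentile(temp, 5))  # the larger of 26 degrees C and the 5th percentile
    candidate = [
        t < t0 or (i > 0 and t - t_smooth[i - 1] < -0.5)
        for i, t in enumerate(t_smooth)
    ]
    # Keep candidate periods that are long enough and show little movement
    # (mean ROCAM per period is an assumption of this sketch).
    periods = [
        (a, b) for a, b in runs(candidate)
        if b - a >= min_len and mean(rocam[a:b]) < rocam_max
    ]
    # Join non-wear periods separated by fewer than max_gap minutes.
    merged = []
    for a, b in periods:
        if merged and a - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged
```

For example, a recording with a two-hour block of cool, motionless readings between two worn stretches would yield a single (start, end) pair covering that block.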

Detecting non-wear periods of at least 90 min was first performed for a more accurate assessment of the average temperature when the smartwatch was worn. Subsequently, to detect non-wear periods between 15 and 90 min, a new temperature threshold T1 was then chosen as one standard deviation below the mean temperature (up to a maximum temperature of 24 degrees Celsius) after excluding periods already designated as non-wear. If the 5th percentile of all temperature readings was larger than T1, then this temperature was used instead, and if the smartwatch was worn for fewer than 3 full days after the first iteration, T1 was set to 24 degrees Celsius. The same thresholding approach was used as described for long non-wear periods but followed by a filtering method where only short non-wear periods were kept, where the first 5 minutes of the interval was at least 2 degrees higher than the final 5 minutes, as within shorter non-wear periods it is typical to see a sudden drop in temperature. This non-wear detection algorithm was found to be accurate for the cohort in our study, as each participant’s data was visually assessed following non-wear classification.
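One possible reading of the T1 rule and the sudden-drop filter for short non-wear periods is sketched below. The use of the sample standard deviation, the comparison of mean temperatures over the first and final 5 minutes, and the function names are all assumptions made for illustration.

```python
from statistics import mean, stdev


def short_nonwear_threshold(worn_temps, p5_all_temps, worn_days):
    """Choose T1 for detecting 15 to 90 minute non-wear periods.

    worn_temps: temperatures after excluding long non-wear periods.
    p5_all_temps: 5th percentile of all temperature readings.
    worn_days: number of full days worn after the first iteration.
    """
    if worn_days < 3:
        return 24.0
    # One standard deviation below the mean, capped at 24 degrees C.
    t1 = min(mean(worn_temps) - stdev(worn_temps), 24.0)
    # Use the 5th percentile instead when it is larger than T1.
    return max(t1, p5_all_temps)


def has_sudden_drop(period_temps, edge=5, min_drop=2.0):
    """Keep a short period only if its first `edge` minutes are, on average,
    at least `min_drop` degrees warmer than its final `edge` minutes."""
    return mean(period_temps[:edge]) - mean(period_temps[-edge:]) >= min_drop
```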

The sleep detection method incorporated into GGIR was primarily used in this study, as described in ref. 45; it detects periods of time when changes in the arm angle are below a threshold of 5 degrees over at least 5 minutes (“sustained inactivity”). As no self-reported sleep logs were collected in this study, further processing was used to detect the most likely sleep period time (SPT) window using a heuristic algorithm detailed in van Hees et al.46. Subsequently, periods of wake during the sleep period (WASO) were computed as periods not detected as sustained inactivity within the SPT-window. However, many sleep detection methods, including that by van Hees et al.45,46, were designed to work well primarily in healthy individuals, and although some work has been undertaken to explore accuracy in cohorts with sleep disorders47 (e.g., van Hees et al.45 validated their method against polysomnography from cohorts with sleep disorders), it is unknown how well these algorithms perform in free-living contexts for individuals with conditions that may affect sleep, such as endometriosis. To provide a comparator to GGIR-based sleep detection, we also used the sleep detection method presented in Tsanas et al.41, which was developed and validated both for healthy controls and for people with post-traumatic stress disorder, where the hallmark symptoms include frequent sleep disturbances. The method by Tsanas et al. uses thresholds based on measures of acceleration and ambient light collected by the devices, and crucially allows for the detection of multiple sleep periods, which is relevant for cohorts with severely disturbed sleep. We remark that although that sleep detection algorithm was not validated against polysomnography data, it was demonstrated to match self-reported sleep onset and offset times in participants’ sleep diaries well.

After observation of sleep detection results in specific participants in this study cohort, we also decided to detect periods of ‘low variation’ using a similar method to sleep detection by Tsanas et al., with low variation defined using the average of successive differences of each axis (the threshold was chosen using visualization of nights of sleep). Measures relating to low-variation periods, such as overall duration and percentage duration within the SPT period, were subsequently extracted.

Similarly to sleep detection, to extract daily measures of PA and diurnal rhythms, we utilized both GGIR and additional approaches as presented in Tsanas et al.41. A typical pre-processing step towards characterizing actigraphy data is using an acceleration summary measure to project the three-dimensional acceleration data onto a vector. Van Hees et al. used the Euclidean Norm Minus One (ENMO) acceleration summary measure (with negative values rounded to zero, also referred to as ENMONZ) for the actigraphy measures computed using GGIR48, whereas all additional measures utilized the recently proposed ROCAM, which has been shown to outperform alternative widely used acceleration summary measures in terms of mapping onto PA levels and sleep17. Using the resulting vectors from the application of the acceleration summary measures, we subsequently computed actigraphy measures to characterize the magnitude and patterns of movement per day, for instance the most active 10 h (M10) or least active 5 h (L5) of the day, or the relative amplitude of most and least active hours (RA). Additionally, using GGIR we extracted PA measures such as inactivity and light, moderate, or vigorous PA, as well as moderate-to-vigorous PA (MVPA). PA intensity measures were extracted using default cut-points for the GGIR algorithm. Although these cut-points were similar to those established by previous studies using GENEActiv devices on the non-dominant wrist49, participants in our study were allowed to choose the wrist placement of the smartwatch. Although these differences could introduce bias to comparisons between participants, they would not generally influence the findings within participants (only two participants recorded changing wrists partway through the study). A full list of extracted actigraphy measures using both GGIR and other approaches is detailed in Supplementary Data 1 (we clarify that the source indicated therein refers to the implementation we used in this study rather than reflecting where each actigraphy measure was first proposed).
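To make the ENMO summary measure and the M10, L5, and RA measures concrete, the sketch below uses their standard definitions, with RA taken as (M10 - L5) / (M10 + L5). For simplicity it works on 24 hourly activity values within a single day and does not wrap windows around midnight. The exact epochs and windowing used by GGIR and the other implementations in this study may differ, and the function names are hypothetical.

```python
import math


def enmo(x, y, z):
    """Euclidean Norm Minus One (in g), with negative values set to zero."""
    return max(math.sqrt(x * x + y * y + z * z) - 1.0, 0.0)


def window_means(hourly, width):
    """Mean activity of every run of `width` consecutive hours."""
    return [sum(hourly[i:i + width]) / width for i in range(len(hourly) - width + 1)]


def m10_l5_ra(hourly):
    """Most active 10 h, least active 5 h, and their relative amplitude."""
    m10 = max(window_means(hourly, 10))
    l5 = min(window_means(hourly, 5))
    ra = (m10 - l5) / (m10 + l5) if (m10 + l5) > 0 else float("nan")
    return m10, l5, ra


# A day with 12 active hours surrounded by rest.
day = [0.0] * 7 + [10.0] * 12 + [0.0] * 5
print(m10_l5_ra(day))
```

An RA close to 1 indicates a strong contrast between the most and least active parts of the day.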

Additionally, to quantify sleep and circadian rhythms, we utilized a measure of sleep regularity, the sleep regularity index (SRI), proposed by Phillips et al.50 and modified to apply to day-pairs as implemented in GGIR. The SRI measure compares the sleep state at 30-s time-points 24 h apart (e.g., day k-1 and day k as applied for day-pairs as in GGIR). The resulting value ranges from -100 to 100, with 100 representing perfectly aligned sleep periods50,51. Lastly, the temperature sensors incorporated into the devices were also used to extract further daily measures relating to diurnal rhythms, as proposed in our previous work19. Daily actigraphy measures were ignored if the smartwatch was worn for less than 75% of the 24-h period either starting at 12 am (for measures assessing daytime sleep/PA or across the full 24 h) or at 12 pm (for sleep measures). Sleep regularity values were also only included if the GGIR-assessed validity (using non-wear detection) was greater than 80%.
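For a single day-pair, the SRI reduces to a rescaled proportion of epochs in which the sleep/wake state matches 24 h apart. The sketch below illustrates that calculation under the assumption that each day is supplied as an equal-length list of per-epoch states (for instance 2880 epochs of 30 s). It is not GGIR's code, and the function name is hypothetical.

```python
def sri_day_pair(day_prev, day_curr):
    """Sleep regularity index for one day-pair.

    day_prev, day_curr: equal-length sequences of per-epoch sleep states
    (True = asleep, False = awake), aligned so that index k in both lists
    refers to the same clock time on consecutive days.
    """
    if not day_prev or len(day_prev) != len(day_curr):
        raise ValueError("both days need the same, non-zero number of epochs")
    same = sum(a == b for a, b in zip(day_prev, day_curr))
    # Rescale the matching proportion from [0, 1] to [-100, 100].
    return -100.0 + 200.0 * same / len(day_prev)
```

Identical days give 100, days with the opposite state at every epoch give -100, and a value of 0 corresponds to matching states in half of the epochs.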

Processing of PROMs

For the daily PROMs reporting pain (average and worst) and fatigue (BFI), both the individual question scores and global scores were used. To compute the global scores, we used the mean of the two pain scores (global pain), and similarly for the BFI we took the mean of the scores on the nine questions (global fatigue)38. For the EHP-30 questionnaire, which was evaluated near/at the end of each smartwatch cycle, we examined the scores from the subdomains (pain_ehp30, control_ehp30, emotion_ehp30, social_ehp30, self_ehp30) in addition to the total of all 30 questions (ehp30_overall)52 after normalizing to a scale of 0 to 100.
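The scoring described above can be illustrated with a few lines of code. The sketch assumes that normalization to 0 to 100 means dividing the summed item scores by the maximum possible sum (4 per item); function names are illustrative.

```python
from statistics import mean


def global_pain(average_pain, worst_pain):
    """Mean of the two daily pain scores."""
    return mean([average_pain, worst_pain])


def global_fatigue(bfi_items):
    """Mean of the nine BFI item scores."""
    if len(bfi_items) != 9:
        raise ValueError("the BFI has nine items")
    return mean(bfi_items)


def ehp30_score(item_scores, max_item=4):
    """Summed item scores (0 to 4 each) rescaled to 0 to 100.

    Works for a single subdomain or for all 30 items together.
    """
    return 100.0 * sum(item_scores) / (max_item * len(item_scores))
```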

The PROMs in this study were collected using REDCap and the collected data was processed in R after being exported. Ad-hoc corrections to the raw data (e.g., where PROM dates were clearly mistakenly entered) were made prior to any further processing. Daily self-reports completed between 5 pm on the associated date and 5 am the following morning were considered within the “correct” time-window for the purposes of de-duplicating entries, where if duplicate entries were present, only the entry within the correct timeframe, or if unavailable then within a “feasible” timeframe (anytime on the associated day or following day) was included.
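The de-duplication rule can be expressed as a small classifier on submission time. The sketch below is one reading of the rule: when several entries fall in the same window it keeps the earliest, which is an assumption the text does not specify, and the names are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def window(submitted, day):
    """Classify a submission time relative to the day the report refers to."""
    start = datetime.combine(day, time(17, 0))                    # 5 pm on the day
    end = datetime.combine(day + timedelta(days=1), time(5, 0))   # 5 am next morning
    if start <= submitted <= end:
        return "correct"
    if day <= submitted.date() <= day + timedelta(days=1):
        return "feasible"
    return "outside"


def pick_entry(entries, day):
    """entries: list of (submitted_datetime, payload) for the same associated day.

    Prefers an entry in the correct window, then a feasible one; among
    several candidates the earliest is kept (assumption of this sketch).
    """
    for wanted in ("correct", "feasible"):
        matches = [e for e in entries if window(e[0], day) == wanted]
        if matches:
            return min(matches, key=lambda e: e[0])
    return None
```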

Summarizing daily actigraphy measures and PROMs

To compare the daily PROMs and daily actigraphy measures with end-of-cycle questionnaires and participant-level data (e.g., BMI, age, diagnosis), we computed summary measures of the day-level data across each smartwatch cycle and across individuals. To limit bias from participants or smartwatch cycles with large amounts of missing data, when computing measures across participants, we included only participants with at least 10 non-missing values to compute the mean, and at least 20 values to compute all other summary measures, where “missing” means either no PROM was submitted for that day or the 24-h wear-time was below 75%. Similarly, when computing across smartwatch cycles, we included only smartwatch cycles with at least 10 non-missing values to compute the mean, and at least 20 values to compute all other summary measures.

We then extracted the mean value, standard deviation, skewness, interquartile range (IQR), the upper quartile (75th percentile) and lower quartile (25th percentile) using actigraphy data only, and the mean of upper and lower quartile values (i.e., highest and lowest 25% of daily values, respectively) using both self-report and actigraphy data. For PROMs, the mean of the upper and lower quartile values was used to better summarize the ordinal scales with bounded and discrete values (e.g., with only 10 values).
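A minimal sketch of the minimum-count rule and of the "mean of upper and lower quartile values" follows. It assumes the highest and lowest 25% are taken as the top and bottom ceil(n / 4) daily values; the function name and the returned keys are hypothetical.

```python
import math
from statistics import mean


def summarize(values, min_for_mean=10, min_for_other=20):
    """Summarize daily values, with None marking a missing day."""
    valid = sorted(v for v in values if v is not None)
    out = {"mean": None, "lower_q_mean": None, "upper_q_mean": None}
    if len(valid) >= min_for_mean:
        out["mean"] = mean(valid)
    if len(valid) >= min_for_other:
        k = math.ceil(len(valid) / 4)        # size of each 25% tail (assumption)
        out["lower_q_mean"] = mean(valid[:k])   # mean of the lowest 25% of days
        out["upper_q_mean"] = mean(valid[-k:])  # mean of the highest 25% of days
    return out


print(summarize(list(range(1, 21))))
```

With fewer than 20 valid days only the mean is returned, and with fewer than 10 nothing is summarized.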

To compute further variability measures across smartwatch cycles, we first imputed any remaining missing values using linear interpolation (through the imputeTS package in R) with a maximum gap of three missing values, such that if more than three consecutive values were missing, they would not be imputed. This imputation was performed before computing variability measures that utilize consecutive daily values, as missing data may introduce unrealistic ‘jumps’ between values that are falsely treated as consecutive days. On the imputed data (with remaining missing values removed), we then computed the Teager-Kaiser energy operator (TKEO) and root mean squared successive differences (RMSSD), as have been used in other studies53, which are defined in Eq. (1) and Eq. (2), respectively:

$${TKEO}=\frac{1}{N}\sum_{i=2}^{N-1}\left(x_{i}^{2}-x_{i-1}x_{i+1}\right) \tag{1}$$

$${RMSSD}=\sqrt{\frac{1}{N}\sum_{i=1}^{N-1}\left(x_{i+1}-x_{i}\right)^{2}} \tag{2}$$
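Eq. (1) and Eq. (2), together with the gap-limited interpolation, can be sketched as below. The study used the imputeTS package in R. This Python version is only an illustration, and it assumes that gaps at the very start or end of a series are left unfilled.

```python
import math


def interpolate_short_gaps(x, max_gap=3):
    """Linearly fill interior runs of None no longer than max_gap."""
    y = list(x)
    i, n = 0, len(y)
    while i < n:
        if y[i] is None:
            j = i
            while j < n and y[j] is None:
                j += 1
            gap = j - i
            if i > 0 and j < n and gap <= max_gap:
                left, right = y[i - 1], y[j]
                for k in range(i, j):
                    y[k] = left + (right - left) * (k - i + 1) / (gap + 1)
            i = j
        else:
            i += 1
    return y


def tkeo(x):
    """Eq. (1): mean Teager-Kaiser energy, normalized by N."""
    n = len(x)
    return sum(x[i] ** 2 - x[i - 1] * x[i + 1] for i in range(1, n - 1)) / n


def rmssd(x):
    """Eq. (2): root mean squared successive differences, normalized by N."""
    n = len(x)
    return math.sqrt(sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1)) / n)


series = [1.0, None, None, 4.0]
filled = [v for v in interpolate_short_gaps(series) if v is not None]
print(tkeo(filled), rmssd(filled))
```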

Statistical associations

Repeated measures correlations were used to examine pairwise associations between daily self-reports and the daily actigraphy measures, using R-package rmcorr54 (the package also computes p-values using the F-ratio). This approach allowed for capturing only the within-subject variation rather than between-subject variation. Statistical associations were interpreted as “statistically significant” at a threshold of p < 0.05. Intra-person correlations were defined by computing Pearson correlations using each individual participant’s data separately. Partial correlations were also computed using Pearson or Spearman correlations using the R-package ppcor55. Partial correlations were used to control for the effect of other variables (e.g., possible confounding factors) when examining the association between two variables. To fully investigate associations between daily actigraphy variables and daily PROMs, we used linear mixed-effects models to further account for potential autocorrelation within a participant’s repeated measures (given that measures on days close to each other may be more strongly correlated). Specifically, we modeled the daily PROMs (“fatigue” is the global BFI score, and “pain” is the global pain score) according to Eq. (3) and Eq. (4):

$${Fatigue}_{i,j}=\beta_{0}+\beta_{1}\cdot {pain}_{i,j}+\beta_{2}\cdot {M10}_{i,j}+\beta_{3}\cdot {M10}_{i-1,j}+\beta_{4}\cdot {SleepDur}_{i,j}+\beta_{5}\cdot {SleepDur}_{i-1,j}+\beta_{6}\cdot {WASO}_{i,j}+\beta_{7}\cdot {WASO}_{i-1,j}+\beta_{8}\cdot {SleepReg}_{i,j}+u_{j}+\varepsilon_{i,j}, \tag{3}$$

$${Pain}_{i,j}=\beta_{0}+\beta_{1}\cdot {fatigue}_{i,j}+\beta_{2}\cdot {M10}_{i,j}+\beta_{3}\cdot {M10}_{i-1,j}+\beta_{4}\cdot {SleepDur}_{i,j}+\beta_{5}\cdot {SleepDur}_{i-1,j}+\beta_{6}\cdot {WASO}_{i,j}+\beta_{7}\cdot {WASO}_{i-1,j}+\beta_{8}\cdot {SleepReg}_{i,j}+u_{j}+\varepsilon_{i,j}, \tag{4}$$

where \({{Fatigue}}_{i,j}\) represents the global BFI score on day i from participant j, the \(\beta\) coefficients represent the fixed effects, \({u}_{j}\) represents the random effect of participant j, and \({\varepsilon }_{i,j}\) represents the residual error. Similarly, \({{Pain}}_{i,j}\) refers to the global pain score on day i from participant j. Additionally, we imposed an autoregressive process (order 1) as the correlation structure between residual errors for repeated measures from an individual participant, where distance between two errors was determined by the day of enrollment in the study (such that adjacent repeated measures are considered more strongly correlated). We refer to Gałecki et al.56 for further clarification of linear mixed-effects models with specific correlation structures. The models were implemented in R using package nlme (version 3.1-166). Only a small indicative set of actigraphy variables was chosen to avoid collinearity and illustrate associations between the main constructs: sleep, physical activity, and diurnal rhythms. All variables were standardized prior to model fitting.
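The models in Eq. (3) and Eq. (4) use both same-day and previous-day (i-1) predictors on standardized variables. The sketch below shows one way such inputs could be prepared for a single participant before fitting. It covers data preparation only, not the nlme model itself. The day-index representation, the use of the population standard deviation, and the function names are assumptions.

```python
from statistics import mean, pstdev


def lag_by_day(series):
    """Previous-day values for one participant.

    series: dict mapping study-day index to a value. A lagged value is
    produced only when the immediately preceding day is present, so days
    following a gap are dropped rather than paired with an older day.
    """
    return {day: series[day - 1] for day in series if (day - 1) in series}


def standardize(values):
    """Z-score a list of values (population standard deviation assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


m10 = {1: 5.0, 2: 6.0, 4: 8.0}
print(lag_by_day(m10))  # only day 2 has a recorded previous day
```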

When computing associations, days where any of the variables of interest (i.e., two variables in pairwise associations and all relevant variables for partial correlations and linear mixed-effects models) were missing were excluded. Spearman correlations were used to examine pairwise associations between measures summarized at a smartwatch cycle or participant level, and thus all correlation coefficients presented throughout as ‘R’ refer to the Spearman coefficient unless indicated otherwise. Correlations were regarded as statistically strong when \({|R|} > 0.3\), which is common in clinical applications31. Comparisons between subgroups, such as by treatment received, were visually examined using boxplots or distribution plots; due to the limited sample size and exploratory nature of the study, these visualization approaches were used to provide an overview of potential group differences (including outliers) which should be tentatively interpreted.
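The repeated measures correlation used for the daily associations captures only within-subject variation. Its coefficient can be understood as the Pearson correlation of values centered on each participant's own mean, after excluding days where either variable is missing. The sketch below illustrates that idea. It is not the rmcorr package, it does not compute p-values, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def rm_corr(rows):
    """Within-subject correlation from (participant_id, x, y) rows.

    Rows where x or y is None are excluded, mirroring the exclusion of
    days with any missing variable of interest.
    """
    by_id = defaultdict(list)
    for pid, x, y in rows:
        if x is not None and y is not None:
            by_id[pid].append((x, y))
    sxx = syy = sxy = 0.0
    for pairs in by_id.values():
        mx = sum(p[0] for p in pairs) / len(pairs)
        my = sum(p[1] for p in pairs) / len(pairs)
        for x, y in pairs:
            # Deviations from the participant's own means remove
            # between-subject differences in level.
            sxx += (x - mx) ** 2
            syy += (y - my) ** 2
            sxy += (x - mx) * (y - my)
    return sxy / math.sqrt(sxx * syy)


rows = [("A", 1, 1), ("A", 2, 2), ("A", 3, 3),
        ("B", 1, 11), ("B", 2, 12), ("B", 3, None), ("B", 3, 13)]
print(rm_corr(rows))
```

In this toy example the two participants differ greatly in their overall level of y, yet the within-subject relationship is perfectly linear, so the coefficient reflects that shared within-person trend rather than the between-person offset.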

As an exploratory study assessing statistical associations with a large number of actigraphy measures (many of which were highly correlated) with no pre-specified analysis, no adjustment for multiple comparisons was applied, as this would likely result in overlooking many potential key statistical associations. Thus, in this study we primarily focused on the statistical strength of associations (rather than statistical significance) as an indication of potential relationships that should be interpreted tentatively. Assumptions relating to the independence of samples were addressed by using methods that account for repeated measures, and non-parametric alternatives (Spearman correlation coefficient) were used where appropriate. Although Pearson correlations were used to compute intra-person correlations, only point estimates were used and only computed where enough valid entries (> 20) were present; these correlations were used to illustrate variation in the data that was further investigated in mixed-effects models that accounted for autocorrelation in repeated measures.

Ethical approval

The data in this study were collected as part of the EndoTECH study ‘Understanding the symptoms of endometriosis using Smartwatch technology’ performed in NHS Lothian, Scotland, UK, and the study was approved by the relevant research ethics committee (West of Scotland REC 5, ref. 21/WS/0092). Informed consent to participate in the study was obtained from all participants. The research has been performed in accordance with the Declaration of Helsinki.
