3  Can we do data better?

3.1 2024

Chairs: Stephanie Lussier and Doug Kelkhoff

3.1.1 Databases

There is currently a lot of thought about databases, but not many companies use them in primary data flows (although they are used for curated trial data for secondary use, e.g. Novartis’ Data42 and Roche’s EDIS).

3.1.2 Blockers

  • Dependence on CROs who deliver SAS datasets generated by SAS code is a factor.
  • IT groups often fear the cloud, which is sometimes confusing given that platforms like Medidata are already cloud-based and other companies already hold SDTM/ADaM data in AWS S3 or similar cloud storage.
  • The justification for change is unclear, particularly what databases would add for current SDTM/ADaM primary use; existing systems are mostly functional.
  • Some file-based approaches struggle with concurrent data access by multiple teams, leading to errors.

3.1.3 An Approach Around TortoiseSVN

  • One company has been using TortoiseSVN for some time and is considering moving to Snowflake.
  • Pros: Integration with version control and modern cloud storage solutions.
  • Cons:
    • Higher entry threshold for users.
    • Lack of a user-friendly GUI.
    • Storing data in ‘normal’ version control rather than tools designed for data versioning rapidly leads to bloated repositories.

3.1.4 Version Control and Data Storage

  • Alignment: code versioning belongs in Git; data versioning belongs in tools built for it, such as S3 object versioning.
  • S3 can be accessed both as a mounted drive (e.g. Lustre) and via the S3 API (see the versioned-retrieval sketch below).
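As a rough illustration of the S3-versioning side, the sketch below retrieves a prior version of a dataset with the paws R package. The bucket, prefix, and file names are hypothetical, not something discussed in the session.

```r
library(paws)

s3 <- s3()

# List all stored versions of a dataset in a versioned bucket
# ("trial-data" and the key are illustrative names)
versions <- s3$list_object_versions(
  Bucket = "trial-data",
  Prefix = "adam/adsl.parquet"
)

# Fetch a specific earlier version by VersionId, e.g. to reproduce an
# analysis as of a given point in time
obj <- s3$get_object(
  Bucket    = "trial-data",
  Key       = "adam/adsl.parquet",
  VersionId = versions$Versions[[1]]$VersionId
)
writeBin(obj$Body, "adsl_as_of_snapshot.parquet")
```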

3.1.5 Denodo as Data Fabric Mesh

One company uses Denodo as a data fabric/mesh; users interact via Denodo, which serves as an API layer, so they never touch the source data directly.
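From R, interaction with such a virtualization layer would typically go through a standard database interface. The sketch below is a minimal, assumed setup: the DSN name "denodo_vdp" and the virtual view "v_adsl" are illustrative only, not confirmed from the session.

```r
library(DBI)

# Connect to the virtual layer via an assumed ODBC DSN; the platform
# resolves queries against the underlying sources on the user's behalf
con <- dbConnect(odbc::odbc(), dsn = "denodo_vdp")

# Users query virtual views rather than touching source systems directly
adsl <- dbGetQuery(
  con,
  "SELECT USUBJID, ARM, AGE FROM v_adsl WHERE SAFFL = 'Y'"
)

dbDisconnect(con)
```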

3.1.6 Nontabular Data

  • Not common for statistical programmers working on clinical trial data.

3.1.7 CDISC Dataset JSON vs. Manifest JSON

Writing CDISC Dataset-JSON is very slow and potentially not suitable for day-to-day working data.
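The slowdown is easy to reproduce in spirit: the sketch below times a columnar Parquet write against a row-oriented JSON write of the same synthetic data frame. Note it uses plain jsonlite rather than an actual Dataset-JSON serializer, so it only illustrates the general text-serialization overhead.

```r
library(arrow)
library(jsonlite)

# Synthetic stand-in for a working dataset (one million rows)
df <- data.frame(
  USUBJID = sprintf("SUBJ-%06d", seq_len(1e6)),
  PARAMCD = sample(c("SYSBP", "DIABP"), 1e6, replace = TRUE),
  AVAL    = rnorm(1e6)
)

# Columnar binary write: typically fast and compact
system.time(write_parquet(df, "working.parquet"))

# Row-oriented JSON write (the shape underlying Dataset-JSON):
# typically far slower and far larger on disk
system.time(write_json(df, "working.json", dataframe = "rows"))

file.size(c("working.parquet", "working.json"))
```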

3.1.8 Popularity and Concerns with Parquet Datasets

  • The admiral pipeline generates Parquet directly; others convert from SAS to Parquet (see the conversion sketch after this list).
  • Questions remain about the longevity and maintenance requirements of Parquet, since it is a binary blob (vs. a ‘human readable’ format like CSV/JSON).
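For the conversion route, a minimal sketch using haven and arrow follows; the file paths are illustrative.

```r
library(haven)
library(arrow)

# Read a CRO-delivered SAS dataset (illustrative path)
adsl <- read_sas("deliveries/adsl.sas7bdat")

# Write it out as Parquet, which preserves column types and is smaller
# and faster to read than the SAS binary or a CSV export
write_parquet(adsl, "parquet/adsl.parquet")

# Reading back for analysis
adsl2 <- read_parquet("parquet/adsl.parquet")
```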

3.1.9 Handling Legacy Data

  • Suggestion: stack legacy data into a database if it is intended for secondary use (see the sketch below).
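One lightweight way to do this stacking is an embedded analytical database such as DuckDB. The sketch below is an assumption-laden illustration: the study folders and file names are hypothetical.

```r
library(DBI)
library(duckdb)
library(haven)

con <- dbConnect(duckdb(), dbdir = "legacy.duckdb")

# Stack one DM dataset per legacy study into a single table,
# tagged by study, so secondary users can query across studies
studies <- c("STUDY001", "STUDY002")
for (s in studies) {
  dm <- read_sas(file.path("legacy", s, "dm.sas7bdat"))
  dm$STUDYID <- s
  dbWriteTable(con, "dm_all", dm, append = TRUE)
}

dbGetQuery(con, "SELECT STUDYID, COUNT(*) AS n FROM dm_all GROUP BY STUDYID")
dbDisconnect(con, shutdown = TRUE)
```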

3.1.10 Change Management

  • For statistical programmers, direct instruction on new systems is necessary.
  • Emphasize direct support over broad training.
  • Simplify systems for users to reduce friction.
  • Consider a GUI similar to Azure.
  • Focus on reducing the user burden.

3.1.11 Different Data Use Cases

Data use differs by application (e.g., a Shiny app vs. regulatory documents); dashboards can access the EDC directly without needing snapshots. A minimal sketch of that live-access pattern follows.
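The sketch below shows the snapshot-free pattern in Shiny, reading from a reporting database that stands in for a live EDC feed. The DSN, table, and column names are assumptions, since real EDC access is vendor-specific.

```r
library(shiny)
library(DBI)

# Assumed ODBC DSN for a reporting database mirroring the EDC
con <- dbConnect(odbc::odbc(), dsn = "edc_reporting")

ui <- fluidPage(
  titlePanel("Enrollment by site (live, no snapshot)"),
  tableOutput("enrollment")
)

server <- function(input, output, session) {
  output$enrollment <- renderTable({
    # Each render queries current data rather than a frozen snapshot
    dbGetQuery(con, "SELECT site_id, COUNT(*) AS n FROM dm GROUP BY site_id")
  })
}

shinyApp(ui, server)
```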

3.1.12 Summary

Uncertain value in moving from CDISC data standards to databases. Limited interest and action in this area across the organization. Not a high priority given other ongoing organizational changes. Ongoing shift away from SAS-based datasets and file storage to cloud-based systems, with increasing use of Parquet.

3.1.13 Action Items

  • SCE whitepaper - Mark Bynum from J&J
  • Is there actual value / gain in databases?
  • Not the best investment relative to other non-data changes going on across the organization (e.g. R, containers, etc.)