Opaque enables a Never Decrypt Data Lake that runs computation on top of existing cloud infrastructure. In particular, Opaque offers rich analytics, built on top of Apache Spark, that compute only on encrypted data. Opaque leverages secure enclaves to ensure that the entire software stack outside of the enclave cannot access decrypted (plaintext) data, promising greater security than existing cloud solutions.
Researchers
- Chester Leung, UC Berkeley
- Fletcher Liverance, Amazon
- Raluca Ada Popa, UC Berkeley
- Rishabh Poddar, UC Berkeley
- Wenting Zheng, UC Berkeley
Overview
Since customer data is often private or confidential, there is a strong need to keep data encrypted while computing on it. For example, an organization storing sensitive data on AWS, such as in a data lake, might want to keep this data encrypted at all times, run analytics on it while it remains encrypted, and share only the results with others according to privacy policies it sets. Similarly, organizations often want to run aggregate analytics on sensitive data from different sources that cannot be shared; they could instead conduct such analytics on encrypted data.
To tackle this problem, we are designing and building a prototype for a “Never Decrypt Data Lake” within our open-source platform MC2. The proposed project leverages hardware enclaves to perform data analytics so that the software stack outside of an enclave can access only encrypted data, never decrypted data. Only the code loaded in an enclave (such as analytics operators) can access decrypted data, and any data or computation results are encrypted before they leave the enclave.
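The enclave boundary described above can be illustrated with a minimal Python sketch. It is not Opaque's or MC2's actual API: `ToyEnclave`, `sum_column`, `encrypt`, and `decrypt` are hypothetical names, and the cipher is a toy stand-in (an HMAC-SHA256 keystream plus an HMAC tag) for a real authenticated encryption scheme such as AES-GCM. A real enclave also depends on hardware isolation and remote attestation, neither of which is modeled here. The sketch shows only the data flow: ciphertext goes in, plaintext exists only inside the enclave object, and the result is encrypted before it leaves.

```python
import hashlib
import hmac
import json
import os
from typing import List


def _subkey(key: bytes, label: bytes) -> bytes:
    """Derive separate encryption and MAC keys from one master key."""
    return hmac.new(key, label, hashlib.sha256).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 in counter mode. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt then authenticate; returns nonce || ciphertext || tag."""
    nonce = os.urandom(16)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Verify the tag first, then decrypt; reject tampered ciphertexts."""
    nonce, body, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("ciphertext failed authentication")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


class ToyEnclave:
    """Hypothetical stand-in for a hardware enclave.

    The key and all decrypted rows stay inside this object; callers
    only ever hand in and receive ciphertext.
    """

    def __init__(self, key: bytes):
        self._key = key

    def sum_column(self, encrypted_rows: List[bytes], column: str) -> bytes:
        total = 0
        for blob in encrypted_rows:
            row = json.loads(decrypt(self._key, blob))  # plaintext lives only here
            total += row[column]
        # Encrypt the result before it crosses the enclave boundary.
        return encrypt(self._key, json.dumps({"sum": total}).encode())


# Data owner side: encrypt rows before uploading them to the cloud.
key = os.urandom(32)  # assumed to be provisioned to the enclave securely
rows = [{"amount": 10}, {"amount": 32}]
encrypted_rows = [encrypt(key, json.dumps(r).encode()) for r in rows]

# Untrusted cloud side: sees only ciphertext going in and coming out.
enclave = ToyEnclave(key)
encrypted_result = enclave.sum_column(encrypted_rows, "amount")

# Data owner side: decrypt the result locally.
result = json.loads(decrypt(key, encrypted_result))
```

In this sketch the untrusted host handles only `encrypted_rows` and `encrypted_result`; the authentication tag means a modified ciphertext is rejected rather than silently decrypted.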
While the use of hardware enclaves provides greater security, it does come with a higher performance cost. In particular, in prior work we’ve found that a system providing outsourced computation with data encryption, authentication, and computation verification incurs up to a 3.3x overhead. Compared to a system performing computation locally, a system providing outsourced computation additionally incurs only network transfer overhead, which is linear in the data size.