De-identify your data wherever they are, with only 3 lines of code.
OUR DE-IDENTIFICATION SUITE
A brief review of the Legal Handbook of Artificial Intelligence and Machine Learning.
The new TensorFlow Lite XNNPACK delegate enables best-in-class performance on x86 and ARM CPUs.
What is it and how do we achieve it? We identified four pillars of privacy-preserving machine learning.
Some techniques to improve DALI resource usage and create a completely CPU-based pipeline. Up to 4x faster PyTorch training.
The Fourier Transform
A tentative decision tree for the privacy-conscious programmer
The basics of homomorphic encryption, followed by a brief overview of the open source homomorphic encryption libraries that are currently available, ending with a tutorial on how to use one of those libraries (namely, PALISADE).
Why we should bother creating natural language processing (NLP) tools that preserve privacy. Apparently not everyone spends hours upon hours thinking about data breaches and data privacy infringements.
Symmetric encryption, asymmetric encryption, homomorphic encryption, differential privacy, and secure multi-party computation.
We frame the problem of de-identifying unstructured text within the greater landscape of privacy-enhancing technologies. We then cover what sort of background knowledge can be gained from only stylistic information about a written document and how we can use research on authorship attribution and author profiling to improve our understanding of the sorts of inferences that can be made from an otherwise de-identified text. Finally, we provide a risk score for determining the likelihood that a message will be attributed to a particular author within a dataset using only author profiling tools.
We describe a method for extracting MFCCs and BFCCs from an encrypted signal without having to decrypt any intermediate values. To do so, we introduce a novel approach for approximating the value of logarithms given encrypted input data. This method works over any interval for which logarithms are defined and bounded. Extracting spectral features from encrypted signals is the first step towards achieving secure end-to-end automatic speech recognition over encrypted data. We experimentally determine the appropriate precision thresholds to support accurate WER for ASR over the TIMIT dataset.
We assess the current state of the art in speech summarization by comparing a typical summarizer on two different domains: lecture data and the SWITCHBOARD corpus. Our results cast significant doubt on the merits of this area's accepted evaluation standards in terms of: baselines chosen, the correspondence of results to our intuition of what "summaries" should be, and the value of adding speech-related features to summarizers that already use transcripts from automatic speech recognition (ASR) systems.
We show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. Special structures in CNNs, such as local connectivity, weight sharing, and pooling, exhibit some degree of invariance to small shifts of speech features along the frequency axis, which is important for dealing with speaker and environment variations.
Some of the most sensitive information we generate is either written or spoken using natural language. Privacy-preserving methods for natural language processing are therefore crucial, especially considering the ever-growing number of data breaches. However, there has been little work in this area up until now. In fact, no privacy-preserving methods have been proposed for many of the most basic NLP tasks. We propose a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption.
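For readers unfamiliar with the quantity being computed, here is a minimal plaintext sketch of character bigram and trigram probabilities. It is not the encrypted method from the paper: no homomorphic encryption is involved, and the function name `char_ngram_probs` and the sample string are hypothetical, chosen only for illustration.

```python
from collections import Counter


def char_ngram_probs(text, n):
    """Maximum-likelihood estimate of P(last char | preceding n-1 chars).

    Plaintext illustration only; the paper computes these values over
    encrypted data, which this sketch does not attempt.
    """
    # Count every character n-gram in the text.
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Count how often each (n-1)-character context starts an n-gram.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    # Conditional probability = n-gram count / context count.
    return {gram: count / contexts[gram[:-1]] for gram, count in ngrams.items()}


bigrams = char_ngram_probs("abracadabra", 2)
trigrams = char_ngram_probs("abracadabra", 3)
print(bigrams["ab"])   # probability of "b" following "a"
print(trigrams["abr"])  # probability of "r" following "ab"
```

The probabilities for all n-grams that share a context sum to 1, which is a quick sanity check on any implementation, encrypted or not.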
We propose a set of baseline heuristics for identifying genuinely tabular information and news links in HTML documents. A prototype implementation of these heuristics is described for delivering content from news providers' home pages to a narrow-bandwidth device such as a portable digital assistant or cellular phone display. Its evaluation on 75 Web sites is provided, along with a discussion of topics for future research.
Universities have long relied on written text to share knowledge. As more lectures are made available on-line, these must be accompanied by textual transcripts in order to provide the same access to information as textbooks. While Automatic Speech Recognition (ASR) is a cost-effective method to deliver transcriptions, its accuracy for lectures is not yet satisfactory. One approach for improving lecture ASR is to build smaller, topic-dependent Language Models (LMs) and combine them (through LM interpolation or hypothesis space combination) with general-purpose, large-vocabulary LMs. In this paper, we propose a simple solution for lecture ASR with similar or better Word Error Rate reductions (as well as topic-specific keyword identification accuracies) than combination-based approaches. Our method eliminates the need for two types of LMs by exploiting the lecture slides to collect a web corpus appropriate for modelling both the conversational and the topic-specific styles of lectures.
Patricia Thaine is a Computer Science PhD Candidate at the University of Toronto and a Postgraduate Affiliate at the Vector Institute doing research on privacy-preserving natural language processing, with a focus on applied cryptography. She also does research on computational methods for lost language decipherment. Patricia is a recipient of the NSERC Postgraduate Scholarship, the RBC Graduate Fellowship, the Beatrice “Trixie” Worsley Graduate Scholarship in Computer Science, and the Ontario Graduate Scholarship. She has eight years of research and software development experience, including at the McGill Language Development Lab, the University of Toronto's Computational Linguistics Lab, the University of Toronto's Department of Linguistics, and the Public Health Agency of Canada.
Pieter Luitjens has a Bachelor of Science in Physics and Mathematics and a Bachelor of Engineering from the University of Western Australia, as well as a Master's from the University of Toronto. He worked on software for Mercedes-Benz and developed the first deep learning algorithms for traffic sign recognition deployed in cars made by one of the most prestigious car manufacturers in the world. He has over 10 years of engineering experience, with code deployed in multi-billion dollar industrial projects. Pieter specializes in ML edge deployment & model optimization for resource-constrained environments.
Gerald Penn is a Professor of Computer Science at the University of Toronto, where he studies spoken language processing and computational linguistics. He has over 100 publications, with the top one accruing 1,581 citations. He is a senior member of IEEE and AAAI, and a past recipient of the Ontario Early Researcher Award. His lab revolutionized speech recognition with its work on neural networks, which received the IEEE Signal Processing Society's Best Paper Award. He has led numerous research projects, including ones funded by Avaya, Bell Canada, CAE, the Connaught Fund, Microsoft, NSERC, the German Ministry for Training and Research, SMART Technologies, the U.S. Army and the U.S. Office of the Director of National Intelligence. Gerald has also worked at Bell Labs and NASA.
John has a Master's in Journalism from Concordia and a BA in Russian History from UofT, and has tried to be the dumbest guy in the room ever since. With a decade of experience in software sales split between London and Toronto at Airbnb and Limelight, he specializes in customer acquisition for start-ups, growth marketing, and scaling high-performance teams. He's extremely excited to be working with such a talented group of individuals and to be maintaining user privacy while unlocking more datasets for exploration. If you ever need to bribe him, beer and/or pancakes are your best friend.
Peizhao Hu is an Assistant Professor in the Department of Computer Science at Rochester Institute of Technology (RIT), New York. His research focuses on (1) privacy-preserving cloud data analytics, specifically homomorphic encryption and multiparty computations; (2) distributed systems, including mobile and pervasive computing. Before joining RIT, he was Senior Research Engineer at NICTA (Australia's centre of research excellence; now Data61@CSIRO).
Remi Daviet is a Post-Doctoral researcher at the Wharton School of the University of Pennsylvania. His research focuses on marketing analytics and behavior modelling. He is interested in the development and promotion of privacy respecting marketing practices.
11.5-year veteran of InterActiveCorp (Nasdaq: IAC). Experience as a COO, CDO, Chief Legal Officer and GM.
Former Information and Privacy Commissioner of Ontario (three terms). Inventor of Privacy by Design.
Assistant Professor at the University of Waterloo with a PhD in Computer Science from MIT.
Over 12 years of experience in communications management. MEd in higher education theory and policy.
Over 27 years of B2B sales and BD experience, including managing roles at Rogers Communications, ADP, and SAP.
Manager, Data Science at the Globe and Mail with a PhD in Computer Science from the University of Toronto.
International Management Consultant with an MSc from the London School of Economics.
Helped found the equities trading group at Renaissance Technologies. PhD in Computer Science from Stanford.
Risk management and treasury analytics specialist with a PhD from the University of Toronto.