I'm a Consulting CTO based in Seattle, helping build products and teams that apply 'big data' and 'data science' in healthcare and life science. My interests and projects include large-scale distributed systems, machine learning, data mining, natural language processing and agile methods.


Natural Language Understanding

Data Science in Production

Machine Learning for Fraud Detection

  • Hunting Criminals with Hybrid Analytics.
    At Data by the Bay, San Francisco, May 2017; Global Data Science Conference 2016, in Santa Clara, CA, March 2016; and IBM Datapalooza, in Seattle, WA, February 2016.
  • Architecting a predictive, petabyte-scale, self-learning fraud detection system.
    Global Predictive Analytics Conference, Santa Clara, CA, March 2017.
  • Online Predictive Modeling of Fraud Schemes from Multiple Live Streams.
    With Claudiu Branzan, at Spark Summit East, in New York, NY, February 2016.
  • Online fraud detection: A reference architecture for adversarial learning.
    At MLConf Atlanta, in Atlanta, GA, September 2015.
  • Hunting Criminals with Hybrid Analytics, Semi-supervised Learning & Agent Feedback.
    With Claudiu Branzan, at the Smart Data Conference, in San Jose, CA, August 2015 & at Strata + Hadoop World, in London, UK, May 2015.
  • Active learning from streams of graph, language & time series signals.
    With Claudiu Branzan, at the Data Science Summit & Dato Conference, in San Francisco, CA, July 2015.

Big Data Architecture

  • Building a new predictive model & API in 30 minutes.
    With Claudiu Barbura, at PAPIs.io — The Predictive APIs and Apps Conference, in Barcelona, Spain, November 2014.
  • Building an intelligent big data app in 30 minutes.
    With Claudiu Barbura, at the Strata Barcelona Conference, in Barcelona, Spain, November 2014.
  • Lessons Learned from Embedding Cassandra in an enterprise-grade big data platform.
    With Claudiu Barbura, at Cassandra Day Seattle, in Bellevue, WA, July 2014.
  • Leveraging a big data infrastructure to accelerate the data science workflow.
    At the 5th Timisoara Big Data Meetup, in Timisoara, Romania, June 2014.

Data Driven Healthcare

  • Clinical natural language understanding at scale.
    EU Data Science Summit, in Tel Aviv, Israel, June 2016.
  • Moving Beyond Templates and Coercion to Improve Physician Documentation.
    With Jill Wolf, at the 23rd Annual WEDI National Conference, in Hollywood, CA, May 2014.
  • Data driven approach to revenue capture process improvement.
    With Gene Boerger, at the AHIMA 85th Convention, in Atlanta, GA, October 2013.
  • Data driven models to minimize hospital readmissions.
    With Miriam Paramore, at the 2013 Strata Rx Conference, in Boston, MA, September 2013.


Parallel Computer Scheduling & Workload Modeling

Agile Software Development

Model Driven Software Engineering



Data Science

Software Engineering


Natural Language Processing

An Apache-licensed natural language processing library built on top of Apache Spark and its Spark ML library. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.

Visual Co-Plot

A statistical analysis tool tailored for datasets with few observations and many variables, which may be intercorrelated. Co-Plot enables visual analysis of observations, variables and the correlations between them, all together.

Parallel Workload Analyser

A tool for analysing parallel computer workloads in standard workload format. Computes self-similarity, auto-correlation, distributions, time series, per-month and summary statistics.