A summary of some presentations at SIGMOD 2017
Keynote: Democratizing Advanced Analytics Beyond Just Plumbing
Advanced analytics = data management + ML (Machine Learning). Machine Learning puts more focus on accuracy (or other evaluation metrics) and runtime efficiency while advanced analytics puts more focus on scalability, usability, manageability, developability.
There are cloud ML services (e.g., BigML, AzureML, Google and Amazon cloud ML) that offer end users models pre-trained on large training datasets. Training data is becoming a hot commodity for ML projects.
ML processes are complex and manual. We can apply data management ideas to automate and systematize them, e.g., the Columbus framework provides a declarative set of operations for feature selection over in-RDBMS data.
For structured data, we still need manual feature engineering. Deep learning works well for unstructured data (e.g., text, images, videos…).
Machine Learning for Recommender Systems at Twitter
Objective: recommending the most relevant content to Twitter users
The candidate content to recommend comes from:
* Interest model of each user
* Trends, e.g., based on location
* Human curation, e.g., breaking news
* Content that the users you follow have engaged with
They take into account the following behaviors to evaluate their recommender system’s performance:
* Positive: user logins, likes, retweets
* Negative: turning push notifications off
A model is trained for each action/behavior, and the models are then combined:
* User’s action on a notification
* User’s behavior, e.g., the probability that a user logs in every day
* User’s characteristics: # follows, # tweets, reaction history…
* Candidate content’s characteristics: # likes, # retweets
* Social proof features, e.g., content that several important users have engaged with
* Gradient boosted decision tree, Logistic Regression
* DeepBird: Twitter’s deep learning framework
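These notes do not say how the per-action models are combined. One simple possibility, shown here purely as an illustration, is a weighted sum of the predicted probabilities, with negative weights for negative behaviors. The action names, weights, and candidate structure below are hypothetical and are not taken from the talk.

```python
# Hypothetical weights: positive actions add to the score, negative ones subtract.
ACTION_WEIGHTS = {
    "like": 1.0,
    "retweet": 2.0,
    "notifications_off": -10.0,
}


def combined_score(action_probabilities, action_weights=ACTION_WEIGHTS):
    """Weighted sum of per-action probabilities predicted by separate models."""
    return sum(action_weights[action] * prob
               for action, prob in action_probabilities.items())


def rank_candidates(candidates, action_weights=ACTION_WEIGHTS):
    """Sort candidate items from highest to lowest combined score."""
    return sorted(candidates,
                  key=lambda c: combined_score(c["probs"], action_weights),
                  reverse=True)


# Toy candidates with made-up model outputs.
CANDIDATES = [
    {"id": "a", "probs": {"like": 0.3, "retweet": 0.1, "notifications_off": 0.01}},
    {"id": "b", "probs": {"like": 0.5, "retweet": 0.2, "notifications_off": 0.08}},
]
```

In this toy data, candidate "b" is more likely to be liked and retweeted, but its higher chance of making the user turn notifications off pushes it below candidate "a".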
EMT: End To End Model Training for MSR Machine Translation
A fully automated system responsible for gathering new data, training systems, and shipping them to production with little or no guidance from an administrator
Snorkel: Creating Noisy Training Data to Overcome Machine Learning’s Biggest Bottleneck
The biggest bottleneck in machine learning is the lack of training data. Hand-labeling training data is slow, expensive, and does not fit real-world problems well.
This paper introduces a data programming pipeline:
* Domain experts write labelling functions (e.g., simple regular expressions) to label training data
* A generative model identifies the most probable labels: it compares the agreements and disagreements among the labels produced by the labelling functions to estimate the accuracy of each labelling function
* A noise-aware discriminative model is then trained, giving more weight to labels that multiple labelling functions agree on
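The pipeline above can be sketched in a few lines. This is a minimal illustration, not Snorkel's API: the labelling functions and label names are hypothetical, and a plain majority vote stands in for the generative model, which in the paper learns a separate accuracy for each labelling function.

```python
import re
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


# Hypothetical labelling functions for a toy "is this a complaint?" task.
def lf_contains_refund(text):
    return POSITIVE if re.search(r"\brefund\b", text, re.IGNORECASE) else ABSTAIN


def lf_contains_thanks(text):
    return NEGATIVE if re.search(r"\bthanks?\b", text, re.IGNORECASE) else ABSTAIN


def lf_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_refund, lf_contains_thanks, lf_many_exclamations]


def apply_lfs(text):
    """Collect one vote (or ABSTAIN) from every labelling function."""
    return [lf(text) for lf in LABELING_FUNCTIONS]


def majority_vote(votes):
    """Return (label, confidence); ties and all-abstain rows yield ABSTAIN.

    Confidence is the share of non-abstaining votes that agree with the
    winning label. It could serve as a sample weight when training the
    downstream discriminative model.
    """
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN, 0.0
    (top, top_n), *rest = counted.most_common()
    if rest and rest[0][1] == top_n:  # conflicting functions with equal support
        return ABSTAIN, 0.0
    return top, top_n / sum(counted.values())
```

For example, `majority_vote(apply_lfs("I want a refund!! Now!"))` yields a positive label because two functions agree and none disagree, while a text on which the functions conflict evenly is left unlabeled.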
Versioning for end-to-end machine learning pipelines
Store the outputs (features, model performance…) of different experiments across the ML pipeline (feature engineering, parameter tuning…)
Dataset derivations: dataset x configuration -> new dataset
Use a hash function to map each derivation to its own storage location, which makes it possible to keep track of the data processing and model selection steps
Data processing results can then be reused instead of recomputed
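A minimal sketch of hash-addressed derivations, assuming an in-memory dictionary as the store and SHA-256 over a canonical JSON encoding as the hash function. The class and function names are hypothetical; the talk's actual system is not described in these notes.

```python
import hashlib
import json


def derivation_key(parent_key, config):
    """Hash the pair (parent dataset, configuration) into a stable key."""
    payload = json.dumps({"parent": parent_key, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DerivationStore:
    """Keeps every derived dataset under the hash of how it was produced."""

    def __init__(self):
        self._data = {}
        self.computations = 0  # counts how often a transform actually ran

    def add_raw(self, rows):
        """Register a raw dataset, keyed by the hash of its content."""
        encoded = json.dumps(rows, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(encoded).hexdigest()
        self._data[key] = rows
        return key

    def derive(self, parent_key, config, transform):
        """Apply transform(rows, config) unless the same derivation is stored."""
        key = derivation_key(parent_key, config)
        if key not in self._data:
            self._data[key] = transform(self._data[parent_key], config)
            self.computations += 1
        return key

    def get(self, key):
        return self._data[key]


def scale(rows, config):
    """Example transform: multiply every value by config['factor']."""
    return [r * config["factor"] for r in rows]
```

Because the key depends only on the parent key and the configuration, the configuration should identify the processing step as well as its parameters (here via a `step` entry); otherwise two different transforms with the same parameters would collide.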
Using Word Embedding to Enable Semantic Queries in Relational Databases
Cognitive database: capture and exploit semantic contextual similarities using standard SQL queries
Step 1: the data in a database table is first textified into a meaningful text format (the authors call the resulting words tokens). A modified version of the word2vec algorithm is then used to learn vectors for the words (database tokens) in the extracted text. This phase can also use an external source, e.g., Wikipedia, as the source of text for model training.
Step 2: the resultant vectors are stored in a relational system table
Step 3: compute distances between vectors in a semantic vector space using the cosine distance metric to determine contextual semantic similarities between corresponding source database tokens. The similarity results are then used in the relational query execution, thus enabling the relational engine to exploit latent semantic information for answering enhanced relational queries.
* Similarity query, e.g., identifying similar customers by comparing their purchases
* Analogy query, e.g., associating customers with either the most common or the least common purchases in a given domain (e.g., books, electronics, etc.)
* Query using an external source, e.g., querying the sales database to find out which customers have bought fruits that may be allergenic, where an item is considered allergenic if it appears in the list of allergenic fruits from Wikipedia (the external source)
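The similarity computation in Step 3 can be illustrated with plain Python. This sketch assumes the token vectors have already been learned and loaded into a dictionary; the customer tokens and three-dimensional vectors are made up, and a real system would run this inside the relational engine rather than in application code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(token, vectors, top_k=3):
    """Rank all other tokens by cosine similarity to the given token."""
    target = vectors[token]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != token]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical vectors for three customer tokens.
TOY_VECTORS = {
    "cust_1": [0.9, 0.1, 0.0],
    "cust_2": [0.8, 0.2, 0.1],
    "cust_3": [0.0, 0.1, 0.9],
}
```

With these toy vectors, `most_similar("cust_1", TOY_VECTORS, top_k=1)` ranks `cust_2` first, since its vector points in nearly the same direction as that of `cust_1`.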
MacroBase: Prioritizing Attention in Fast Data
MacroBase is a new analytic monitoring engine designed to prioritize human attention in large-scale datasets and data streams.
It reduces the amount of data that people need to look at and analyze.
Source code: https://github.com/stanford-futuredata/macrobase