Linkshare archives

SUBSCRIBE

Receive a short weekly e-mail with resources about Machine Learning, Big Data and Natural Language Processing.


An example post (March 21st, 2012):

This week’s links:
* Using NLP to spot anachronistic language in period TV and movie productions like Downton Abbey, Mad Men or Pride & Prejudice: http://blog.revolutionanalytics.com/2012/03/anachronism-machine.html (apparently, “black market” wasn’t used until the 1940s. They were more PC back then)
* Common Crawl is a nonprofit providing free S3 buckets with a pre-parsed, pre-stripped crawl of a billion sites on the net, which saves you the crawling. A good talk with the founder reviews the project, if you can tune out Calacanis: http://commoncrawl.org/author/allison-domicone/page/2/
* You are probably all familiar with the great Stanford courses on Machine Learning (ml-class.org), NLP (nlp-class.org), AI (ai-class.org) and Probabilistic Graphical Models (pgm-class.org), but do you also know Harvard’s Advanced Quantitative Research Methodology by Gary King (his “replication replication” idea is invaluable for anyone who wants to publish quickly in academia) and Caltech’s machine learning course (http://www.work.caltech.edu/telecourse.html)?
* There are a million videos from the recent PyCon up on http://pyvideo.org/category/17/pycon-us-2012, but I found Olivier Grisel’s 3-hour scikit-learn tutorial especially wide-ranging: it shows how to use techniques from SVMs through PCA to clustering in scikit-learn: http://pyvideo.org/video/622/introduction-to-interactive-predictive-analytics. The pyvideo link has other tutorials too, such as ones on social network analysis and Pandas (an R-like data massaging library for Python).
* Twitter recently released an in-house graph processing library. If you’ve used JUNG before, you know it’s a nightmare to use for large graphs. Cassovary (http://engineering.twitter.com/2012/03/cassovary-big-graph-processing-library.html) is a Scala-based library that they claim scales up to billions of nodes from the get-go and has built-in functions geared towards social network analysis, like calculatePersonalizedReputation.
