Tuesday, November 17, 2015

Something cool made with our API

Very cool


For 20 years, the Smithsonian/NASA Astrophysics Data System (ADS) has kept professional astronomers worldwide up to date through its digital library of 12 million records, which provides links to ScienceDirect and other platforms for full-text retrieval. The ADS maintains relationships with all major publishers and offers users four million full-text article links, some of which point to 40 full-text Elsevier journals on ScienceDirect.
To increase the visibility of, and encourage linking to, their subscribed full text (especially articles written by NASA researchers), NASA had the idea of adding thumbnails of the graphics appearing within an article to the abstract view of the publication. To do this, they turned to the ScienceDirect Object Retrieval and Object Search APIs to mine the images and then linked them to the corresponding articles on ScienceDirect. So far, the ADS has implemented this feature for 32,000 publications.
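To give a feel for how such a pipeline might look, here is a minimal Python sketch of fetching image thumbnails for one article. The endpoint path, query parameters, and JSON layout are illustrative assumptions for this post, not the documented Elsevier API contract; the example runs against a mocked response, with no network call.

```python
# Sketch of an ADS-style pipeline step: build an Object-search request for an
# article and pull thumbnail references out of the response. The URL shape and
# the response fields below are assumptions, not the official API contract.
from urllib.parse import urlencode

BASE = "https://api.elsevier.com/content/object"  # assumed base URL

def build_object_search_url(doi, api_key):
    """Build a hypothetical object-search request URL for one article DOI."""
    params = urlencode({"apiKey": api_key, "httpAccept": "application/json"})
    return f"{BASE}/doi/{doi}?{params}"

def extract_thumbnail_refs(response_json):
    """Pull thumbnail object URLs out of an assumed JSON response layout."""
    objects = (response_json.get("attachment-metadata-response", {})
                            .get("attachment-metadata", []))
    return [o["url"] for o in objects if o.get("type") == "IMAGE-THUMBNAIL"]

# Mocked response, standing in for a real API reply:
sample = {"attachment-metadata-response": {"attachment-metadata": [
    {"type": "IMAGE-THUMBNAIL", "url": "https://example.org/thumb1.gif"},
    {"type": "IMAGE-HIGH-RES", "url": "https://example.org/fig1.tif"},
]}}
print(extract_thumbnail_refs(sample))  # ['https://example.org/thumb1.gif']
```

The extracted thumbnail URLs could then be rendered on the abstract page, each linking back to the full-text article on ScienceDirect.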

A view of the ADS abstract page
ADS abstract page

A view of the ADS graphics page with thumbnails linking to the full text of the article
ADS thumbnails page

“My experience with the ScienceDirect API was exemplary. A well-designed API with a very efficient and friendly support team to back it up!”
- Edwin Henneken, IT Specialist for the Smithsonian/NASA Astrophysics Data System, employed at the Smithsonian Astrophysical Observatory in Cambridge, Massachusetts.

The redesigned ADS remains in beta release and can be easily accessed; more information about the ADS in general is also available.

Example of ScienceDirect article page with images
ScienceDirect homepage

ScienceDirect APIs are designed to help developers retrieve and integrate full-text content from publications on ScienceDirect into their websites or applications. Visit the ScienceDirect API page to learn more, watch videos and get started.

API text mining videos

Friday, November 13, 2015

The Higgs Boson

According to Wikipedia, "On 4 July 2012, the discovery of a new particle with a mass between 125 and 127 GeV/c² was announced; physicists suspected that it was the Higgs boson."

However, Google Scholar returns results from 1960 and 1990, the latter 22 years before the scientific discovery. One result is from Elsevier.

ScienceDirect returns fresher, more relevant results.

Thursday, November 12, 2015

Instantaneous Recommendation: real time suggestions for your Academic Library

One of my favorite features shipped during the last round is a form of instantaneous recommendation: relevant new papers are suggested in real time, as soon as my library is updated.

So suppose that I add a few papers about deep learning to my library, and that this is the first time papers on this research topic appear in my library.

The suggestions are immediately updated, and I see papers about Deep Neural Networks for speech recognition, Convolutional Networks, and LVCSR

and relevant papers published by Yann LeCun

I believe that this feature is useful for exploring a subject you are not yet familiar with, and for making sure that your next paper has a solid "Related Work" section in which the most important papers for your research activity are mentioned.

Wednesday, November 11, 2015

Stats is Big Data

Feature: Stats
If you are a published author, Mendeley’s “Stats” feature provides you with a unique, aggregated view of how your published articles are performing in terms of citations, Mendeley sharing, and (depending on who your article was published with) downloads/views. You can also drill down into each of your published articles to see the statistics on each item you have published. This powerful tool allows you to see how your work is being used by the scientific community, using data from a number of sources including Mendeley, Scopus, NewsFlo, and ScienceDirect.
Stats gives you an aggregated view of the performance of your publications, including metrics such as citations, Mendeley readership and group activity, the academic discipline and status of your readers, as well as any mentions in the news media, helping you to understand and evaluate the impact of your published work. With our ScienceDirect integration, you can find information on views (PDF and HTML downloads), the search terms used to reach your article, the geographic distribution of your readership, and links to the various source data providers.
Please keep in mind that Stats is only available for published authors whose works are listed in the Scopus citation database. To find out if your articles are included, visit www.mendeley.com/stats and begin the process of claiming your Scopus author profile. If your articles are not yet included, please be patient as we continue to develop this feature.

Tuesday, November 10, 2015

Satisfying exploratory search needs: the query {dyscalculia}

{dyscalculia} is severe difficulty in making arithmetical calculations as a result of a brain disorder. It is the scientific term for a cognitive problem affecting 3%-6% of the world population, so many people are interested in better understanding the topic.

Google Scholar returns Elsevier content from 1992 and 1985, and Wiley content from 1996.

Undoubtedly, science has made significant progress in the last 9 years, but this progress is not easily found in Google Scholar for this query.

ScienceDirect finds fresh Elsevier content for {dyscalculia}, including books and articles. All the results are from 2015 and 2016 (pre-print).

Monday, November 9, 2015

New research features on Mendeley.com - Recommends

(posted at http://blog.mendeley.com/academic-features/new-research-features-on-mendeley-com/)

Mendeley’s Data Science team has been working to crack one of the hardest “big data” problems of all: how to recommend interesting articles that users might want to read. For the past six months they have been integrating six large data sets from three different platforms to create the basis for a recommender system. These data sets often contain tens of millions of records each, and represent different dimensions which can all be applied to the problem of understanding what a user is looking for and providing them with a high-quality set of recommendations.

With the (quite literally) massive base data set in place, the team then tested over 50 different recommender algorithms against a “gold standard” (which was itself revised five times for the best possible accuracy). Over 500 experiments have been run to tweak our algorithms so they can deliver the best possible recommendations. The basic principle is to combine our vast knowledge of what users store in their Mendeley libraries with the richness of the citation graph (courtesy of Scopus) and a predictive model that can be validated against what users actually did. The end result is a tailored set of recommendations for each user who has a minimum threshold of documents in their library.

We are happy to report that two successive rounds of qualitative user testing have indicated that 80% of our test users rated the quality of their tailored recommendations as “Very good” (43%) or “Good” (37%), which gives us confidence that the vast majority of Mendeley reference management users will receive high-quality recommendations that will save them time in discovering important papers they should be reading.

For those who are new to Mendeley, we have made it easy for you to get started and import your documents – simply drag-and-drop your papers, and get high-quality recommendations.

On our new “Suggest” page you’ll be getting improved article suggestions, driven by four different recommendation algorithms to support different scientific needs:
  • Popular in your discipline – Shows you the seminal works, for all time, in your field
  • Trending in your discipline – Shows you what articles are popular right now in your discipline
  • Based on the last document in your library – Gives you articles similar to the one you just added
  • Based on all the documents in your library – Provides the most tailored set of recommended articles by comparing the contents of your library with the contents of all other users on Mendeley.
Suggestions you receive will be frequently recalculated and tailored to you based on the contents of your library, making sure that there is always something new for you to discover. This is no insignificant task, as we calculate over 25 million new recommendations with each iteration. This means that even if you don’t add new documents to your library, you will still get new recommendations based on the activity of other Mendeley users with libraries similar to yours.

To find your recommended articles, check out www.mendeley.com/suggest and begin discovering new papers in your field!

Sunday, November 8, 2015

Academic Search and Relevance: basic normalization for matching

One more post about academic search and relevance. This time around it's back to basics: there is little you can do for relevance if you do not match the article first. To do so, you need to assume that users will make mistakes while they type, so you must be proactive and correct those mistakes on their behalf. Let's see a few examples.

Here the mistake is made on purpose, simulating a user with a different keyboard. A search engine should support this normalization automatically; Google Scholar does not, while ScienceDirect does.

Here the idea is to search for a specific item related to prostate cancer, named {ARN-509}. By mistake, it is written with a stray space as {ARN -509}, and no match is given.

ScienceDirect simply matches it regardless of the mistake,

while Google Scholar matches only the exact term. In this case, it is not able to correct the user's mistake automatically.
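A minimal sketch of this kind of query normalization: Unicode compatibility folding (to handle, say, full-width characters from a different keyboard) plus removal of stray whitespace around hyphens, so that "ARN -509" matches "ARN-509". This is an illustrative toy, not any engine's actual pipeline.

```python
# Toy query normalizer: fold Unicode compatibility forms, fix stray spaces
# around hyphens, collapse whitespace, and case-fold for matching.
import re
import unicodedata

def normalize_query(q: str) -> str:
    q = unicodedata.normalize("NFKC", q)   # fold full-width/compat characters
    q = re.sub(r"\s*-\s*", "-", q)         # "ARN -509" -> "ARN-509"
    q = re.sub(r"\s+", " ", q).strip()     # collapse runs of whitespace
    return q.casefold()                    # case-insensitive matching

print(normalize_query("ARN -509"))    # arn-509
print(normalize_query("ＡＲＮ-509"))   # arn-509 (full-width input folded)
```

In a real engine this step would run on both the indexed terms and the incoming query, so both sides agree on the canonical form.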

Saturday, November 7, 2015

TOC for my new book: A collection of Advanced Data Science and Machine Learning Interview Questions Solved in Python and Spark

Table of Contents
1. Why is Cross Validation important? 11
Solution 11
Code 11
2. Why is Grid Search important? 12
Solution 12
Code 12
3. What are the new Spark DataFrame and the Spark Pipeline? And how can we use the new ML library for Grid Search 13
Solution 13
Code 14
4. How to deal with categorical features? And what is one-hot-encoding? 16
Solution 16
Code 17
5. What are generalized linear models and what is an R Formula? 18
Solution 18
Code 18
6. What are the Decision Trees? 19
Solution 19
Code 21
7. What are the Ensembles? 22
Solution 22
8. What is a Gradient Boosted Tree? 22
Solution 22
9. What is a Gradient Boosted Trees Regressor? 23
Solution 23
Code 23
10. Gradient Boosted Trees Classification 24
Solution 24
Code 25
11. What is a Random Forest? 26
Solution 26
Code 26
12. What is an AdaBoost classification algorithm? 27
Solution 27
13. What is a recommender system? 28
Solution 28
14. What is a collaborative filtering ALS algorithm? 29
Solution 29
Code 30
15. What is the DBSCAN clustering algorithm? 32
Solution 32
Code 32
16. What is a Streaming K-Means? 33
Solution 33
Code 34
17. What is Canopy Clustering? 34
Solution 34
18. What is Bisecting K-Means? 35
Solution 35
19. What is the PCA Dimensional reduction technique? 36
Solution 36
Code 37
20. What is the SVD Dimensional reduction technique? 38
Solution 38
Code 38
21. What is Latent Semantic Analysis (LSA)? 39
Solution 39
22. What is Parquet? 39
Solution 39
Code 39
23. What is the Isotonic Regression? 40
Solution 40
Code 40
24. What is LARS? 41
Solution 41
25. What is GMLNET? 42
Solution 42
26. What is SVM with soft margins? 43
Solution 43
27. What is the Expectation Maximization Clustering algorithm? 44
Solution 44
28. What is a Gaussian Mixture? 45
Solution 45
Code 45
29. What is the Latent Dirichlet Allocation topic model? 46
Solution 46
Code 47
30. What is the Associative Rule Learning? 48
Solution 48
31. What is FP-growth? 50
Solution 50
Code 50
32. How to use the GraphX Library? 50
Solution 50
33. What is PageRank? And how to compute it with GraphX 51
Solution 51
Code 52
Code 52
34. What is Power Iteration Clustering? 54
Solution 54
Code 54
35. What is a Perceptron? 55
Solution 55
36. What is an ANN (Artificial Neural Network)? 56
Solution 56
37. What are the activation functions? 57
Solution 57
38. How many types of Neural Networks are known? 58
39. How can you train a Neural Network? 59
Solution 59
40. What applications do ANNs have? 59
Solution 59
41. Can you code a simple ANN in Python? 60
Solution 60
Code 60
42. What support has Spark for Neural Networks? 61
Solution 61
Code 62
43. What is Deep Learning? 63
Solution 63
44. What are autoencoders and stacked autoencoders? 68
Solution 68
45. What are convolutional neural networks? 69
Solution 69
46. What are Restricted Boltzmann Machines, Deep Belief Networks and Recurrent networks? 70
Solution 70
47. What is pre-training? 71
Solution 71
48. An example of Deep Learning with nolearn and Lasagne package 72
Solution 72
Code 73
Outcome 73
Code 74
49. Can you compute an embedding with Word2Vec? 75
Solution 75
Code 76
Code 77
50. What are Radial Basis Networks? 77
Solution 77
Code 78
51. What are Splines? 78
Solution 78
Code 78
52. What are Self-Organized-Maps (SOMs)? 78
Solution 78
Code 79
53. What is Conjugate Gradient? 79
Solution 79
54. What is exploitation-exploration? And what is the armed bandit method? 80
Solution 80
55. What is Simulated Annealing? 81
Solution 81
Code 81
56. What is a Monte Carlo experiment? 81
Solution 81
Code 82
57. What is a Markov Chain? 83
Solution 83
58. What is Gibbs sampling? 83
Solution 83
Code 84
59. What is Locality Sensitive Hashing (LSH)? 84
Solution 84
Code 85
60. What is minHash? 85
Solution 85
Code 86
61. What are Bloom Filters? 86
Solution 86
Code 87
62. What is Count Min Sketches? 87
Solution 87
Code 87
63. How to build a news clustering system 88
Solution 88
64. What is A/B testing? 89
Solution 89
65. What is Natural Language Processing? 90
Solution 90
Code 90
Outcome 92
66. Where to go from here 92
Appendix A 95
67. Ultra-Quick introduction to Python 95
68. Ultra-Quick introduction to Probabilities 96
69. Ultra-Quick introduction to Matrices and Vectors 97
70. Ultra-Quick summary of metrics 98
Classification Metrics 98
Clustering Metrics 99
Scoring Metrics 99
Rank Correlation Metrics 99
Probability Metrics 100
Ranking Models 100
71. Comparison of different machine learning techniques 101
Linear regression 101
Logistic regression 101
Support Vector Machines 101
Clustering 102
Decision Trees, Random Forests, and GBTs 102
Associative Rules 102
Neural Networks and Deep Learning 103

The art of news clustering: modern metrics for Researchers

The team shipped another cool feature. The modern researcher is no longer limited to academic papers and the lab: nowadays breakthrough research is covered by news sources, and articles in generalist magazines and newspapers discuss the progress made by science in all disciplines.

One key aspect is having fast algorithms, based on machine learning and data analysis, for grouping related articles as soon as they are published. In this way, data science can help infer the importance of each piece of information.

My group recently acquired Newsflo, an innovative company in London, and together we shipped an engine for clustering news articles that mention academic papers and research. This engine is progressively being rolled out across Elsevier's products. Here is the integration with myresearchdashboard.com

Friday, November 6, 2015

Search terms as an automatic way to annotate scientific articles

Another feature has been shipped by the team.

Search terms are an automatic way to annotate scientific articles. Here we show the aggregated (and anonymized) queries which were submitted by users to retrieve my article.

Thursday, November 5, 2015

How to build a news clustering system

(excerpt from my new book, question #65)

News clustering is a hard problem to solve. News articles typically arrive at our clustering engine in a continuous streaming fashion, so a plain vanilla batch approach is not feasible. For instance, the simple idea of using k-means cannot work for two reasons. First, it is not possible to know the number of clusters a priori, because the topics evolve dynamically. Second, the articles themselves are not available a priori. Therefore, more sophisticated strategies are required.

One initial idea is to split the data into mini-batches (perhaps processed with Spark Streaming) and to cluster the content of each mini-batch independently. Then, clusters from different epochs (i.e. mini-batches) can be chained together.

An additional intuition is to start with k seeds and then extend those initial k clusters whenever a new article arrives that is not similar enough to the existing groups. In this way, new clusters are dynamically created when needed. In one additional variant, we could re-cluster all the articles after a number of epochs, under the assumption that this will improve our target metric.
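The seed-and-grow idea can be sketched in a few lines. This toy version uses plain bag-of-words counts and a fixed cosine-similarity threshold (both simplifying assumptions): an incoming article joins its most similar cluster if the similarity clears the threshold, otherwise it opens a new cluster.

```python
# Toy seed-and-grow clustering: an article joins its nearest centroid if the
# cosine similarity clears a threshold, otherwise it starts a new cluster.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(article_tokens, clusters, threshold=0.3):
    """Assign one article to an existing cluster or create a new one."""
    vec = Counter(article_tokens)
    best_i, best_sim = None, threshold
    for i, centroid in enumerate(clusters):
        sim = cosine(vec, centroid)
        if sim >= best_sim:
            best_i, best_sim = i, sim
    if best_i is None:
        clusters.append(vec)            # not similar enough: new cluster
        return len(clusters) - 1
    clusters[best_i].update(vec)        # fold the article into the centroid
    return best_i

clusters = []
assign("higgs boson particle cern".split(), clusters)      # opens cluster 0
assign("cern announces higgs particle".split(), clusters)  # joins cluster 0
assign("stock market rally".split(), clusters)             # opens cluster 1
print(len(clusters))  # 2
```

A production version would use TF-IDF weighting, decay old centroids over time, and run the assignment step inside each streaming mini-batch.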

In addition, we can look at the data and perhaps notice that many articles are near-duplicates. Hence, we could reduce the computational complexity by applying pseudo-linear techniques such as minHash shingling.
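As a rough sketch of why minHash helps: two articles whose shingle sets overlap heavily get many identical minimum hash values, so comparing short signatures approximates their Jaccard similarity without comparing full texts. The hash scheme and signature length below are arbitrary choices for illustration.

```python
# minHash sketch for near-duplicate detection: signature agreement
# approximates Jaccard similarity between shingle sets.
import hashlib

def shingles(text, k=3):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the higgs boson was discovered at cern in july 2012"
b = "the higgs boson was discovered at cern in summer 2012"
sig_a = minhash_signature(shingles(a))
sig_b = minhash_signature(shingles(b))
print(estimated_jaccard(sig_a, sig_b) > 0.3)  # near-duplicates agree often: True
```

With signatures in hand, locality-sensitive hashing can bucket likely duplicates so only candidates within a bucket are ever compared.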

More sophisticated methods aim at ranking the articles by importance. This is an even harder problem, again because of the dynamic nature of the content and the absence of links, which would otherwise allow PageRank-type computations. If that is not possible, a two-layer model can be considered, where the importance of a news article depends on the importance of the originating news source, which in turn depends on the importance of the articles it emits. It can be proven that this recurrent definition has a fixed-point solution[1].
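The mutually recursive definition can be illustrated with a toy fixed-point iteration. The data, the base article signal, and the exact update rules here are invented for illustration; they are not the scheme from the cited paper, just the general shape of the idea.

```python
# Toy two-layer importance model: a source is important if it emits important
# articles, and an article inherits importance from its source. Iterating the
# two update rules converges to a fixed point.
articles = {"a1": "reuters", "a2": "reuters", "a3": "smalltownblog"}
base = {"a1": 3.0, "a2": 2.0, "a3": 1.0}  # e.g. size of the article's cluster

source_imp = {s: 1.0 for s in set(articles.values())}
article_imp = dict(base)

for _ in range(50):
    # Source importance: mean importance of the articles it emitted.
    for s in source_imp:
        emitted = [article_imp[a] for a, src in articles.items() if src == s]
        source_imp[s] = sum(emitted) / len(emitted)
    total = sum(source_imp.values())
    source_imp = {s: v / total for s, v in source_imp.items()}
    # Article importance: its own signal weighted by its source's importance.
    article_imp = {a: base[a] * source_imp[src] for a, src in articles.items()}

print(max(source_imp, key=source_imp.get))  # reuters
```

The normalization step keeps the scores bounded, which is what makes the iteration settle instead of blowing up.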

Even more sophisticated methods aim at extracting entities from the clusters; this is typically achieved by running topic model detection on the evolving mini-batches.

[1] Ranking a stream of news, WWW '05 Proceedings of the 14th international conference on World Wide Web, G. M. Del Corso, A. Gulli, F. Romani.

Wednesday, November 4, 2015

Disrupting the Academic Research Arena with Recommenders

We live in a world where the quantity of available information is massive. How many movies, songs, news articles, and apps are out there, and what is the best way to find the content relevant to every single user?

Search is not the only solution. Search assumes that you are already aware of what you are looking for. Perhaps you have already heard the latest song by Nicky Jam and you search for a few words, or you want to see the latest movie by Paolo Sorrentino and so you search for the title. However, the problem is that you need to know in advance what you are looking for and then explicitly submit a query to pull (retrieve) the content. What if there is some piece of information which is very relevant, but you are not aware of it? Search will not necessarily help.

To overcome this limitation, Netflix, Spotify, Google Play, Apple Genius, and Amazon use recommender-based technologies that suggest fresh and relevant information to users with no need to explicitly submit queries. You can watch your favourite movie, listen to your songs, read your news articles, and discover new items to buy even if you are not aware in advance of what is relevant for you.

Surprisingly enough, recommenders are still not widely adopted by research communities. How many fresh new papers are relevant for your research discipline, and how long does it take to discover them? Traditionally, discovery is based on word-of-mouth: someone in your community suggests what papers to read and what the new research trends are. But this takes time, and time is fundamental in research. That's why we worked hard with our team in London to create a breakthrough technology: we needed to solve this problem and help the communities.

So, the team shipped an Academic Recommendation engine which adopts sophisticated machine learning algorithms to learn how to discover scientific articles that are relevant for you. Moreover, recommendation is personalized, based on your own scientific interests. What is cool is that the algorithm makes recommendations tailored to you, the researcher.

Let's see how this works. Browse mendeley.com/suggest/

First, recommendations are based on what I have read previously and stored in my library. It's clear that I have an interest in data mining and usage statistics. Plus, there is a surprising article related to a new research topic that I was considering recently: that's the serendipity effect. Then, there are also recommendations based on my research discipline (Computer Science).


More important, experimental data showed that freshness is very important for research, so we developed a special set of recommenders focused on my own very recent research activity. In my case, this is related to different methodologies for sampling the size of the Web, and search, of course. Then, we also show what is trending in my discipline right now.


Obviously, we encourage users to interact with our system and fine-tune the suggestions so that the quality of the personalized recommendations improves over time. The more you interact, the better the suggestions will be.


So, try this cool technology, which I believe will disrupt the way research is done and will help researchers save time.

Antonio

Tuesday, November 3, 2015

What is a recommender system?

(excerpt from my new book)

Recommender systems produce a list of recommendations such as news to read, movies to see, music to listen to, research articles to read, books to buy, and so on. The recommendations are generated through two main approaches, which are often combined:

  • Collaborative filtering approaches learn a model from a user's past behaviour (items previously purchased or clicked, and/or numerical ratings attributed to those items) as well as from similar choices made by other users. The learned model is then used to predict items (or ratings for items) that the user may have an interest in. Note that in some situations ratings and choices are made explicitly, while in others they are implicitly inferred from users' actions. Collaborative filtering has two variants: 
  1. User-based collaborative filtering: a user's interest is taken into account by looking for users who are somehow similar. Each user is represented by a profile, and different kinds of similarity metrics can be defined. For instance, a user can be represented by a vector, and the similarity could be the cosine similarity. 
  2. Item-based collaborative filtering: a user's interest is taken into account by recommending items similar to those the user has already engaged with. 
  • Content-based filtering approaches learn a model based on a series of features of an item in order to recommend additional items with similar properties. For instance, a content-based filtering system can recommend an article similar to other articles seen in the past, or it can recommend a song with a sound similar to ones implicitly liked in the past.
Recommenders generally have to deal with a bootstrap problem when suggesting recommendations to new, unseen users for whom very little information about tastes is available. In this case, one solution is to cluster new users according to criteria such as gender, age, and location, and/or to leverage additional signals such as the time of day, day of the week, etc. One easy approach is to recommend what is popular, where the definition of popularity can be either global or conditioned on a few simple criteria.

More sophisticated recommenders can also leverage additional structural information. For instance, an item can be referred to by other items, and those references can enrich the feature set. As an example, think about a scientific publication which is cited by other scientific publications: the citation graph is a very useful source of information for recommendations.
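A minimal sketch of the user-based variant described above: represent each user as a vector of item ratings, find the most similar users with cosine similarity, and recommend items they liked that the target user lacks. The users, items, and ratings are invented toy data.

```python
# Toy user-based collaborative filtering with cosine similarity.
import math

ratings = {
    "alice": {"paper_dl": 5, "paper_cnn": 4},
    "bob":   {"paper_dl": 5, "paper_cnn": 5, "paper_lstm": 4},
    "carol": {"paper_bio": 5},
}

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user, k=1):
    """Recommend unseen items from the k most similar users."""
    sims = sorted(((cosine(ratings[user], ratings[o]), o)
                   for o in ratings if o != user), reverse=True)
    scores = {}
    for sim, other in sims[:k]:
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # ['paper_lstm']
```

Alice's profile overlaps with Bob's, so Bob's unread paper is surfaced, while Carol's biology paper never enters the candidate set.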

Monday, November 2, 2015

Benchmarking your Academic Profile is BigData Computation

Exciting day today: let's ship it. The team has built a modern Big Data pipeline on Apache Spark to help researchers benchmark their academic profiles.

Here I am checking how I am doing; it's clear that I've moved to industry, since I have no recent publications and few recent citations.

First, an overall summary of Antonio Gulli's views & citations over time

Then an in-depth view of one selected article with its citations

Check this out at http://mendeley.com/stats/

Sunday, November 1, 2015

Academic Search and Relevance: deep searching what you are looking for

Let's see some more examples of academic search and relevance, this time from my domain of expertise, machine learning. Again, a side-by-side comparison, and we will show why directly matching the user's needs is important.

{deep learning autoencoders}

Here I am interested in finding a specific innovation in deep learning. As discussed in a previous post, autoencoders are deep learning machines which are able to learn automatically which features in a dataset are important, with no human intervention: the machine picks the right features on your behalf, with no handcrafted work.
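To make the idea concrete, here is a tiny numpy sketch of an autoencoder (assuming numpy is available): a 4-dimensional input is squeezed through a 2-unit hidden layer and reconstructed, so the hidden layer is forced to learn a compressed representation with no manual feature engineering. The sizes, learning rate, and data are arbitrary toy choices.

```python
# Tiny autoencoder: encode 4 features into 2 hidden units, decode back,
# and train by gradient descent on the reconstruction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((32, 4))              # toy dataset: 32 samples, 4 features
W1 = rng.normal(0, 0.1, (4, 2))      # encoder weights
W2 = rng.normal(0, 0.1, (2, 4))      # decoder weights
lr = 0.1

mse_before = np.mean((np.tanh(X @ W1) @ W2 - X) ** 2)
for _ in range(2000):
    H = np.tanh(X @ W1)              # encode: 4 -> 2
    X_hat = H @ W2                   # decode: 2 -> 4
    err = X_hat - X                  # reconstruction error
    W2 -= lr * H.T @ err / len(X)
    W1 -= lr * X.T @ ((err @ W2.T) * (1 - H ** 2)) / len(X)
mse_after = np.mean((np.tanh(X @ W1) @ W2 - X) ** 2)

print(mse_after < mse_before)  # True: training reduced reconstruction error
```

The learned hidden activations `H` are exactly the auto-learned features the post refers to; stacking several such layers gives a stacked autoencoder.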

Google Scholar returns the seminal paper from 2006 which is considered the starting point for the renaissance of neural networks and their evolution into modern deep learning systems.

However, this paper DOES NOT talk about autoencoders. Instead, it talks about deep belief nets, a related but distinct topic. At the time of that paper, autoencoders were NOT YET popular for deep learning (and "deep learning" had not even been coined as a term yet).

Therefore, I'd consider this a DSAT (a dissatisfied result), because it does not immediately satisfy my very specific search need.

So Google Scholar is not returning a very relevant result.

ScienceDirect instead returns very relevant and recent results discussing deep learning and autoencoders.