Thursday, August 11, 2016

Are institutional repositories a dead end?

Long-time readers of my blog know that I'm a bit of a worrier when it comes to libraries.

I've written tongue-in-cheek posts such as "The day library discovery died (for libraries)"; I've been a skeptic of trends such as SMS reference, QR codes and mobile sites (nailed the first two), Altmetrics and 3D printers (the jury is still out on them); and I've generally worried about the death of libraries like any good librarian.

Consider this post yet another entry in the series where I play skeptic or devil's advocate. This time I focus on institutional repositories.

To avoid doubt, I define an institutional repository as failing (a dead end) if it is unable to capture most of its institution's scholarly output and make it publicly available to all.

The one great strength of Institutional Repositories

Let me set the scene: it was May 2016, and the scholarly world was still reacting to the shock purchase of SSRN by Elsevier.

On the GOAL mailing list, it was pointed out that the distributed nature of institutional repositories, each owned by an individual university, was a great defense against monopolistic takeovers, as no single commercial entity could buy up all the institutional repositories in the world. No one could do with IRs what Elsevier did by purchasing SSRN, taking a big slice of the OA market (in certain disciplines) in one blow.

A response to that by a certain Eric F. Van de Velde caught my eye. He outlined why he thought institutional repositories would fail and why subject repositories, or even commercially owned sites like ResearchGate, were winning out.

It resonated with me because I was coming to the same conclusion.

Last month, I found he had expanded his short reply into a post provocatively entitled "Let IR RIP".

How provocative? It begins "The Institutional Repository (IR) is obsolete. Its flawed foundation cannot be repaired. The IR must be phased out and replaced with viable alternatives."

Eric, as he explains, was an early believer in and advocate of institutional repositories (going way back to 1999). This is someone who has managed and knows IRs and was hoping they could eventually disrupt the scholarly communication system. Such a person now thinks IRs are a "dead end".

I don't have even a tenth of his experience in this field, but as a humble librarian working on the ground, I must concur with his points.

It seems to me that no matter how we librarians try, most researchers don't have half the enthusiasm (assuming they had any in the first place) for depositing full text in institutional repositories that they have for subject repositories or even social networking sites like ResearchGate.

Why is this so? You should really read his post, but here's my rambling take from a librarian's point of view.

1. Institutional affiliations will change and control is lost when it happens.

Many faculty will move at least once in their career (twice if you include their time as a PhD student), so they have little incentive to learn how to use or manage any one local IR system.

Compare this to someone who invests in setting up his profile and/or deposits in ResearchGate or SSRN. This is something they will own and control throughout their career no matter where they go.

ORCID helps solve part of this problem, but even in an ideal world where you update ORCID once and it pushes to various profiles, the full text still has to live somewhere.

And if you upload it to an IR, the moment you leave, you lose control of everything there. Some progressive IRs include public statistics such as downloads and views of your papers, which is all well and good (especially if you are smart enough to create metadata records in multiple venues but link back to your IR copy) until you leave the institution and can't bring those statistics over to aggregate with your future papers.

Why would someone devote so much time to something they may not fully own? Compare this to setting up a SSRN or ResearchGate profile, where all the work you do and all the statistics you accumulate (downloads etc.) will forever be with you, centralized in one place.

SSRN Statistics

Incidentally, that's also why I suspect implementing the "copy request button" idea on institutional repositories tends not to work so well.

STORRE: Stirling Online Research Repository

For those of you who are unaware, the idea here is that you can (legally?) circumvent an embargo by adding a "copy request button": list the record (with no full text) in the repository, and a visitor to the metadata-only record can click the button to instantly request a copy from the author. You, as the author, get the email and can either reply with the file or, in some systems, simply give approval and the file will be released automatically to that individual.

This idea works very well in theory, but in practice, when you leave an institution, the IR will likely continue to list your old, invalid email!

Since I started my profile on ResearchGate, I've gotten requests for theses and papers written when I was an undergraduate and later a library masters student.

I would not have seen these requests if I relied on my old institution's IR "Copy request" buttons!

2. Lack of consistency across IRs

Though most university IRs use a relatively small set of common software, such as Digital Commons, DSpace and EPrints, they can differ greatly depending on customization and feature set, and this can be very off-putting to researchers.

It's not just surface usability and features. Because there are no standards for metadata, content etc., it becomes, as Eric says, "a mishmash of formats" when you try to search across IRs using aggregators like CORE, BASE etc. Each IR has its own scheme for classifying research, subjects, fields and so on. This is also familiar to those of us who have tried to include IR contents in discovery services, only to find to our dismay that we often have to turn them off.

A researcher who wants to use the IR when he switches institutions will have to struggle with all this, and why would he, when he could use something more familiar that he has been using since his grad school days...

3. Subject/Discipline affiliations are stable while institution affiliations are not. 

@aarontay @lisalibrarian @helent13 another disadvantage is scholars think in fields/disciplines not institutions.
— R. David Lankes (@rdlankes) July 30, 2016

This is a complementary point to point number 1.

Subject Repositories have the advantage of greater familiarity to scholars and can have systems custom built for each researcher's community.

4. IRs generally lag behind in terms of features and sophistication  

Not every institution is a rich, top-tier university capable of investing the time and money to provide a useful and usable IR that can compete with the best in the commercial world.

For example, there's a belief floating around (which I think might be justified, though I have no evidence) that it's better to put your outputs on sites like ResearchGate than in IRs because they have greater visibility in Google.

I'm no expert, but I find systems like ResearchGate just more usable. I've deposited into DSpace and Digital Commons systems before, and it easily takes me 30 minutes to get through a deposit, and I'm a librarian!

ResearchGate and company are also more aggressive in encouraging deposits: for example, if I list a metadata-only record, it will often check SHERPA/RoMEO automatically for me and encourage me to deposit when it's allowed.

Maybe there are DSpace, EPrints etc. systems out there with such features, but the few I have used don't seem to do that. (CRIS systems do that, I believe?)

While many find ResearchGate and its ilk annoying and intrusive, you can see they work on human psychology to encourage deposits, whether through gamification techniques or by evoking old-fashioned human curiosity.

For example, ResearchGate can tell you who viewed your record and who downloaded and read your paper (if they were signed in while doing so), and you can even act on that information by asking the identified readers for a review!

Not everyone thinks such features are a positive (privacy!), but the point here is that these sites are innovating much more quickly, and IRs, at least the average ones, are lagging. Often it feels akin to library vendors talking about bringing "social features" into catalogues in 2012 and expecting us librarians to cheer.

Others, such as Dorothea Salo in "Innkeeper at the Roach Motel", have long pointed out the many shortcomings of IR software like DSpace. Under the section "Institutional repository software", she lists a depressing inventory of problems with IRs.

These include poor UX, lack of tracking statistics, siloed repositories that lack interoperability, the absence of batch upload and download tools, and the inability to support document versioning (something subject repositories do decently well), which means faculty won't use IRs even for the final version.

Add outdated protocols like OAI-PMH (which Google Scholar ignores) and the reality that most IRs are a mix of full text and metadata-only records, rather than 100% full text as envisioned, and IRs have had an uphill task.

Most of the above was written back in 2007; I'm unsure if much has changed since then.

5. IRs lack mass  

When was the last time you went specifically to the IR homepage to do something besides deposit a paper?

How about the last time you decided to go to your IR homepage to search for a topic?

IRs simply don't have enough critical mass (one institution's output is insignificant even if it were all full text) to be worth visiting to browse and search, compared to, say, a typical subject repository.

As such, the most common way for a user to end up on an IR page, or more likely just a PDF download, is via Google Scholar.

Is this a problem? In a way it is, because the lack of reasons for authors to visit IRs means that any possible social networking effect is absent, and as the saying goes, out of sight, out of mind.


I would like to say here that I fully respect the efforts and achievements of my colleagues and librarians around the world who directly manage IRs. It can't be an easy task, particularly since many labour under what The Loon calls the coordinator syndrome (though hopefully this problem has diminished over the years as scholarly communication jobs have become better understood; see also the tongue-in-cheek "How to Scuttle a Scholarly Communication Initiative").

Still, looking at my points, a big unifying theme is that economies of scale matter, and the institutional level isn't the right level to work at. Lorcan Dempsey would put it as researchers preferring to work at the network scale as opposed to the institution scale.

The point here is that while some IRs have achieved some success, e.g. MIT hitting 44% of total output deposited (and consider that MIT is an early pioneer and leader of the open access movement), many have failed to attract all but the most minimal number of deposits.

Perhaps this is purely anecdotal, but my impression is that while you can find researchers who put their papers on subject repositories/social networking sites AND institutional repositories (i.e. researchers who crave visibility and are willing to juggle multiple profiles and sites), or those who use only the former, it's rare to find those who put things only in the IR and nowhere else.

Various studies (e.g. this and this) are starting to show that more and more free full text resides on sites like ResearchGate rather than in institutional repositories.

This doesn't augur well.

I'm not saying, though, that it's impossible to coerce researchers into depositing in IRs.

For example, an Immediate-Deposit/Optional-Access mandate like the University of Liège's seems to achieve much success by making researchers deposit all their papers on publication, whether or not they can be released open access immediately or at all. This, coupled with an understanding that papers not submitted to the IR will not be considered for performance evaluation, seems sufficient to produce high rates of compliance.

However, doing so goes against the wishes of researchers, who seem naturally not to favor open access via IRs and would seemingly rather do it via subject repositories, ResearchGate, or even gold OA (if money is available).

Many of the problems I've suggested for IRs have potential solutions. More standardization of IRs would be one; more resources poured into UX work to understand the needs and motivations of researchers is another. Librarians could push or pull full text to/from subject repositories on behalf of authors (via SWORD), or work out a way to aggregate statistics across repositories. I've read COUNTER is working on standardising downloads, but I wonder if one could have an ORCID-like system that aggregates such COUNTER statistics for all papers registered to you?
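To make the aggregation idea concrete, here is a minimal sketch. The report shape is invented for illustration (it is not an actual COUNTER schema), and the ORCID iD used below is the well-known sandbox example: the point is simply that a neutral service could sum usage for the same author and paper across repositories.

```javascript
// Sketch: merge per-repository download counts for the same author,
// keyed by ORCID iD, then by DOI. Report shape is illustrative only.
function aggregateByOrcid(reports) {
  const totals = new Map();
  for (const r of reports) {
    // r: { orcid, doi, downloads, repository }
    const byDoi = totals.get(r.orcid) || {};
    byDoi[r.doi] = (byDoi[r.doi] || 0) + r.downloads;
    totals.set(r.orcid, byDoi);
  }
  return totals;
}

// Usage: the same paper deposited in an IR and in SSRN.
const totals = aggregateByOrcid([
  { orcid: "0000-0002-1825-0097", doi: "10.1234/a", downloads: 5, repository: "IR" },
  { orcid: "0000-0002-1825-0097", doi: "10.1234/a", downloads: 7, repository: "SSRN" },
]);
```

Whoever held such a service (ORCID itself, or a library consortium) would give authors the portable, career-long statistics that currently lock them into commercial profiles.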

But one wonders: perhaps this is a space librarians should cede if other methods work better.

With the rise of solutions like SocArXiv, bioRxiv and engrXiv, perhaps institutions should start running, or sharing responsibility for, aggregation of output at higher levels, such as subject repositories or even national repositories?

Of course, we all agree "solutions" like ResearchGate are not solutions at all, because they are owned by commercial entities and might disappear at any moment.

But is it possible to have both the advantage of scale and centralization and yet be immune if not resistant to take-overs by commercial entities? Can subject repositories be the solution?

In any case, let me end with Eric's words.

"The IR is not equivalent with Green Open Access. The IR is only one possible implementation of Green OA. With the IR at a dead end, Green OA must pivot towards alternatives that have viable paths forward: personal repositories, disciplinary repositories, social networks, and innovative combinations of all three."

What do you think? Are institutional repositories a dead end? Or are they needed as part of the ecosystem alongside subject repositories? I am frankly unsure myself.

Additional note: As I write this, there is some discussion about the idea of retiring IRs in favour of CRIS. The idea seems to be that instead of running two systems that barely talk to one another, one should opt for an all-in-one system. There is grave suspicion among some against such a move because of the entities who own the software. How this factors into my arguments above I am still mulling over.

On a personal note, I will be taking a month off my usual blogging schedule and will resume in Oct 2016. 

Friday, July 8, 2016

5 Extensions to help access full text when starting outside the library homepage

2016 seems to be the year Sci-Hub broke out into popular consciousness. The service, which provides access to academic papers for free and is often dubbed "the Napster of academic papers" by the media, is having its moment in the sun.

To me, though, the most interesting bit was finding out how much usage of Sci-Hub seems to be by people (whether researchers or academics) who have access to academic library services.

In Science's "Who's downloading pirated papers? Everyone", John Bohannon in the section "Need or convenience?" suggests "Many U.S. Sci-Hub users seem to congregate near universities that have good journal access."

Bastian Greshake went even further and asked Sci-Hub for logs segmented by university/college IP ranges. The list of university IP ranges he used to determine whether usage occurred on campus looks inaccurate to me (e.g. it misses the second-biggest university here in Singapore), but it's still an interesting piece of analysis.

The percentage of usage from within university IP ranges varies by country but is surprisingly high for some, like Australia (where just below 20% of Sci-Hub usage comes from university IP ranges).

We can't tell whether users with access to academic libraries use Sci-Hub because their library doesn't provide immediate access and they are too lazy to wait for document delivery, or worse, because they just find it easier to use Sci-Hub than to fiddle with library access!

(As an aside, this is why it is truly a bone-headed move by publishers to suggest Universities introduce more barriers like two factor authentication to access articles. That's going to drive even more people away!)

But I'll bet one reason most users don't use library subscriptions to access articles is that our systems generally don't make access easy when users don't start their search in library systems (discovery services, databases etc.). Roger C. Schonfeld's "Meeting Researchers Where They Start: Streamlining Access to Scholarly Resources" is a great recent exploration of these issues, and it is unusual, and useful for explaining the problem to publishers, because it comes from one of their own. (Most librarians working in this space are already aware of these issues.)

Similarly, since the inception of this blog, I have regularly explored and shared various tools that try to close this access gap; some methods I posted about have become obsolete, but new ones have risen to take their place.

This is a summary of the tools I am aware of as of 2016 that can help improve matters, though none are close to solving the whole issue.

1. Proxy bookmarklet - Tried and trusted method

Land on an article page off campus with no access to full text because the site doesn't recognize your institutional affiliation?

A quick click on this bookmarklet and login and you will be proxied and recognised.

Unsure what I am talking about? Have a look at the video below

There are various ways to quickly append the proxy string, but adding it via a bookmarklet is still the most popular.

This method is lightweight, works on most browsers including many mobile ones (though the initial setup can be tricky), and with some fancier tricks you can even track usage, but essentially the idea has been around for years. (As a side note, the earliest mention I can find of this idea is from 2005, by Tony Hirst of the Open University, UK.)
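For the curious, here is a minimal sketch of what such a bookmarklet does. The EZproxy prefix below is a placeholder; substitute your own library's proxy login URL.

```javascript
// A minimal proxy bookmarklet sketch. Saved as a bookmark, this
// one-liner reloads the current page through the library proxy so
// the publisher sees your institution's credentials:
//
//   javascript:void(location.href='https://ezproxy.example.edu/login?url='+encodeURIComponent(location.href))
//
// The same logic as a plain function, for clarity:
function proxify(url, prefix = "https://ezproxy.example.edu/login?url=") {
  // Encode the target URL so its own query string survives the round trip.
  return prefix + encodeURIComponent(url);
}

console.log(proxify("https://www.jstor.org/stable/1234?seq=2"));
```

The whole trick is that one string concatenation: everything else (login, session cookies, rewriting links on the proxied page) is handled by the proxy server itself.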

A quick search on Google or YouTube will find hundreds of academic libraries that mention or offer a variation of this idea, though I strongly suspect for many it's an experiment someone set up and quickly forgot without popularizing much (with some exceptions).

2. UU Easy Access Chrome extension - An improved proxy bookmarklet in the form of a Chrome extension

My former institution heavily promoted the proxy bookmarklet method and it proved very popular. However with high usage came feedback and I soon realized the proxy bookmarklet had several issues.

Firstly, users did not understand why the proxy bookmarklet would occasionally fail. Part of it was that they would try to proxy pages where it made no logical sense (for example Scribd, institutional repositories, or free abstracting and indexing sites) because they were taught "whenever you are asked to pay for something, click the button". They loved it when it worked but were bewildered when it didn't.

Failure could also occur for certain resources where the subdomain or even the domain differed slightly depending on your country or institution (e.g. LexisNexis sites).

Secondly, occasionally the library would have access to the full text of an item via another source, but users would land on a site where proxying led to an error.

A very common scenario: someone lands on a publisher site via Google, but the library has access via an aggregator like ProQuest or EBSCO. Users would happily click the proxy bookmarklet, fail, and give up thinking the library didn't have access.

While some institutions might see fewer such failures (e.g. bigger institutions that have "everything" and subscribe mostly through publishers rather than aggregators), in general failures can cause a lot of confusion, and users may lose confidence in the tool after failing many times without knowing why.

The next idea, from Utrecht University, avoids the first issue and provides what I consider the next step in the evolution of the proxy bookmarklet.

Utrecht University Library is regularly mentioned and credited for the idea of "Thinking the unthinkable: doing away with the library catalogue" and for focusing mainly on delivery over discovery, so it's no surprise they are working on ways to improve access.

Their solution is UU Easy Access, a Chrome extension currently in beta.

The extension avoids the first problem described above, where users are confused about when they can add the proxy, by natively including a list of proxyable domains; when you land on such a page, it recognises it and invites you to proxy it.

You can also click the extension button to try to proxy any site, but it will check against the list of allowed domains and display an informative message if the site cannot be proxied.
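A sketch of the kind of check such an extension presumably performs before offering to proxy a page. The domain list and proxy prefix here are illustrative placeholders, not UU Easy Access's actual configuration.

```javascript
// Check a page against an allowlist of proxyable domains BEFORE
// sending the user through the proxy login, so failures are
// explained up front rather than after a cryptic error page.
const PROXYABLE_DOMAINS = ["sciencedirect.com", "jstor.org", "link.springer.com"];
const PROXY_PREFIX = "https://proxy.library.example.edu/login?url=";

function tryProxy(pageUrl) {
  const host = new URL(pageUrl).hostname;
  const allowed = PROXYABLE_DOMAINS.some(
    (d) => host === d || host.endsWith("." + d)
  );
  if (!allowed) {
    // A human-readable message instead of a failed proxy login.
    return { ok: false, message: "This site is not covered by the library proxy." };
  }
  return { ok: true, url: PROXY_PREFIX + encodeURIComponent(pageUrl) };
}
```

Keeping the allowlist inside the extension is the key design choice: the decision "can this page be proxied at all?" is made locally, before any login round trip.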

This is much better than a system that makes you login and then issue a typically cryptic message like "You are trying to access a resource that the Library Proxy Service has not been configured to work."

I've found users sometimes interpret this message as saying the library just needs to configure things and they will then be able to access the item they want.

The UU Easy Access Chrome extension avoids all these problems and, like my souped-up proxy bookmarklet idea above, uses Google Analytics to track usage.

Still, installing a proxy bookmarklet is somewhat clunky compared to installing an extension, and less savvy users might not be able to follow the instructions on their own.

Currently UU Easy Access only has a Chrome extension and does not yet support Firefox.

3. LibX - A browser plugin to aid library access

Both methods #1 and #2 above are unable to deal with the fact that a user may have access to full text via another source other than the site they are on. In such a case, adding the proxy will still fail.

LibX, a project licensed under the Mozilla Public License, can occasionally work around this issue.

Some of the nice features it has include
  • the ability to proxy any page you are on (same as the bookmarklet)
  • autolinking for supported identifiers such as ISBNs, ISSNs and DOIs
  • cues that show availability of items on book vendor sites like Amazon and Barnes & Noble
So, for example, if a page has an embedded identifier like a DOI or PMID, it will be hyperlinked such that when you click on it, you are sent to your library's link resolver and redirected to an appropriate copy you can access, wherever it is.
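The autolinking part can be sketched roughly as follows. The resolver base URL is a placeholder for your institution's OpenURL resolver, and the regular expression is a common (not exhaustive) pattern for modern DOIs.

```javascript
// Sketch of DOI autolinking: find DOI-shaped strings in page text
// and turn each into a link-resolver URL. Resolver URL is a
// placeholder; the regex covers typical Crossref-style DOIs.
const RESOLVER = "https://resolver.library.example.edu/openurl?";
const DOI_RE = /\b10\.\d{4,9}\/[-._;()/:A-Za-z0-9]+\b/g;

function doiLinks(text) {
  return (text.match(DOI_RE) || []).map(
    (doi) => RESOLVER + "id=doi:" + encodeURIComponent(doi)
  );
}
```

A real extension would then walk the DOM and wrap each matched string in an anchor pointing at the generated URL; the resolver does the rest, redirecting the user to whichever copy the library can supply.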

Sadly, a lot of the functionality has been deprecated over the years and/or now only works with libraries using Summon.

LibX currently supports Firefox and Chrome and has a nice LibX Edition Builder to help libraries create their own version.

4. Google Scholar button - Finding free and full text

So far, the solutions we've talked about only try to get the user to subscribed articles. But with the rise of open access (one study found that 50% of papers published in 2011 were freely available by 2013), more and more freely available material can be found.

I've also mused about the impact of the rise of open access on academic libraries (here, here and here), so an extension that helps users find an alternative free version when they land on a paywall is definitely important.

Most would agree that Google Scholar is probably one of the easiest ways to find free full text: just pop the article title into Google Scholar and see if there is a PDF or HTML link beside the results. With a huge index built on permission from many vendors to crawl full text, unbeatable web crawling, and the ability to recognise "scholarly work", it can find free articles wherever they lurk on the web, not just free PDFs on scholarly sites or institutional repositories.

Add the ability to see if your institution has access to a subscribed version via a link resolver link (most academic libraries support Google's Library Links program), and Google Scholar is the ultimate full text finder.

Never used Google Scholar before? Below is an example of a result.

Highlighted in yellow is the free full text; "Find it@SMU Library" provides full text via the library link resolver.

But what happens if you don't start from Google Scholar and land on a page asking you to pay, and you are too lazy to open another tab and search for the article in Google Scholar? Use the Google Scholar Button extension, released by Google last year.

On any page, you can click the Google Scholar Button extension and it will attempt to figure out the article title you are looking for, run the search in Google Scholar in the background, and display

a) the free full text (if any)
b) the link resolver link (if your library has a copy of the article)

If the title detection isn't working or if you want to check for other articles say in the reference, you can highlight the title and click on the button.

A secondary function is the ability to create citations similar to the "cite" function in Google Scholar.

This extension supports both Chrome and Firefox.

5. Lazy Scholar button - Google Scholar button + extras

Interestingly enough, the idea of using Google Scholar to find full text was already available in this extension called Lazy Scholar. I've covered Lazy Scholar when it was new in 2014.

Created by Colby Vorland, a PhD student in nutrition, as a personal project, it has evolved a lot and goes beyond just helping you find full text.

In terms of finding full text it does the following

1. Ability to proxy any page (Same as functionality in Proxy bookmarklet)

2. Scrape Google Scholar for free pdfs and link resolver links (Same as functionality in Google Scholar button)

3. It goes beyond Google Scholar by searching multiple additional places (including PubMed Central and Europe PMC) for free full text.

4. It is also capable of scanning non-scholarly websites to locate scholarly links

5. Unlike Google scholar button, you can set it to auto-detect full text and load without pressing a button

6. Checks for the author's email (from PubMed), presumably allowing you the option to email the author if all the methods above fail!

It helps you assess the quality of the article you are looking for

1. Provides citation counts from Google Scholar and other sources

2. Shows the impact factor of the journal (I wonder where this comes from)

3. Does an automated check against Beall's list of predatory journals and warns you

4. Shows any comments available from PubPeer

5. Checks if a study has a listing

6. Supports annotation

That's not all. Other useful functions 

1. The citation option supports over 900 styles compared to just a handful in Google Scholar button

2. Ability to block non-scholarly sites for a period (for self control)

3. More sharing options to not just reference managers but also to Facebook etc

4. Many more I probably missed out.

Here's how it looks:

I'm really impressed by the variety of functions; the main criticism I can make is that it might be overkill for many users, with a very complicated interface.

For example, in the screenshot above, under the full text check, you see 8 options!

The official site says "The green icons are non-PDF full texts that Lazy Scholar is highly confident are 100% free, whereas the yellow icon means that Lazy Scholar is moderately confident that it is a free full text".

The EZ icon next to it allows you to add the proxy string to the URL (like the bookmarklet) and the icon with books is the link resolver link scraped from Google Scholar.

Offhand, I would say it would be cleaner to offer, say, the top three options (including the link resolver option) and hide the rest under a dropdown menu.

Still, it's crazy impressive for a personal project by someone with no ties to any library. The variety of sources/APIs he pulls from is seriously amazing.

Many are well known, such as Google Scholar, but some are lesser known, like the comment and annotation system PubPeer, or even, dare I say, pretty obscure, like DOAI (Digital Open Access Identifier), which tries to resolve to a free version of a paper.


Can we ever make our systems for accessing articles truly 100% seamless and frictionless? Even on campus, or off campus with VPN, users can still find it tough to determine whether we have access to full text via alternative venues.

Anyone know of other useful tricks or tools that can help?

Perhaps this is one of the other attractions of open access: in a world where open access is dominant, we need not waste time and effort creating these workarounds to make access friendly.

Friday, June 17, 2016

Emaze, a new presentation tool & on becoming an academic librarian in Singapore

It's probably a coincidence, but I have recently been getting queries about pursuing a career as an academic librarian in Singapore, so I decided to write "So you want to be an academic librarian in Singapore?"

This is my way of giving back a little to the profession. I hope it will be useful to people curious or interested in potentially joining us in academic libraries in Singapore. For qualified librarians feel free to send me your comments and suggestions for improvements.

As always, everything here is my own personal opinion and not endorsed or supported by my employers or the Library Association of Singapore.

I've also recently started playing around with an interesting presentation tool called Emaze. It's hard to describe, but it's basically PowerPoint-like, with slides similar to Google Slides or SlideRocket, except the available templates make your presentations really different.

Below I tried converting part of my long boring FAQ into a slide presentation using the "gallery" template.

Use arrow keys to advance.

Powered by emaze

Saturday, May 21, 2016

Does the type of Open Access matter for future of academic libraries?

In Aug 2014, I wrote the speculative "How academic libraries may change when Open Access becomes the norm".

I argued that the eventual triumph of open access will have far-reaching impacts on academic libraries, with practically no domain of librarianship escaping unscathed. The article predicted that in a mostly open access environment, the library's traditional role in fulfillment, and to some extent discovery, will diminish (arguably the library's role in some aspects of discovery is already mostly gone).

Given that faculty currently view academic libraries mainly in the role of purchasers, I suggested that to survive, academic libraries will start shifting towards expertise-based services like research data management, GIS, information literacy etc.

Libraries may move towards supporting the publishing of open access journals (perhaps via layered journals or similar) or focusing on special collections, supporting Lorcan Dempsey's inside-out view.

I ended by suggesting that the trick for academic libraries is to figure out the right way and time to shift resources away from current traditional roles. Perhaps the percentage of content your faculty use/cite that is available for free could be a useful indicator of when to shift roles.

What about the nature of open access that emerges?

One thing I shied away from speculating on was the type of open access that would emerge, and how the transition would occur. When open access becomes the norm (defined as, say, 80% of yearly scholarly output freely available), would most open access be provided predominantly via green OA or gold OA, or some fair mix of the two? Would it be provided via subject repositories or institutional repositories (or maybe even modules of CRIS systems like Pure or Converis)?

Heck would it even matter if a Sci-Hub like system prevails and everyone pirates articles?  (That was a joke, I think....)

In other words, did it matter for the future of academic libraries no matter how articles were made freely available?

Elsevier, SSRN and the civil war in Open Access

What led to this article was, of course, the news that the very dominant social science & humanities subject repository SSRN (Social Science Research Network) had been bought by Elsevier.

I knew institutional repositories in general were not gaining much traction, and if I were a betting man, I would venture that faculty preference for open access - or rather, for publicising their work by placing output online - generally runs

a) Gold Open Access (if payment is not an issue)
b) Green Open Access (via subject repository) - for disciplines with traditions such as RePEc, arXiv, SSRN, etc.
c) Commercial academic sharing networks (e.g. ResearchGate)
d) Green Open Access (via institutional repository)

and when (if?) open access became dominant, open access would be provided mostly in this order.

Still, I must admit until this happened it never occurred to me that subject repositories could be bought by legacy publishers!

Barbara Fister and Roger Schonfeld as usual have very good takes on the situation.

Roger's article points out that Elsevier is likely to pursue a strategy very similar to the one that led it to purchase Mendeley (leverage user information and analytics, and get into the user workflow):

"Given the nature of the emphasis that Elsevier has been making on data and analytics, we should expect to see over time other integrations between an article repository like SSRN and Elsevier’s other services. There is a wealth of information in the usage logs of services like SSRN that could help guide editors trying to acquire manuscripts for publication or that could assist business development efforts for journal acquisitions. Also important to watch are SciVal, Pure, and some of Elsevier’s other “research intelligence” offerings."

In addition, SSRN's strength in the social sciences nicely complements Mendeley's strength in STEM fields.

To me, though, this purchase of SSRN also shows how much of a force Elsevier now is in the open access arena.

Here are three moves it made in the open access space in May 2016 alone. 

First off, just five days ago it was announced that Elsevier is now the world's largest open access publisher: in terms of the number of Gold Open Access journal titles, it is now in the lead.

Their acquisition of SSRN gives them a foothold in the social science preprint/postprint world. Will arXiv (which, I remember, had to resort to begging for donations a few years back) or other subject repositories be next? (RePEc apparently is safe.) Will other publishers or companies in the library space start doing the same?

Just a few days earlier, they announced a pilot program with the University of Florida that allows metadata from ScienceDirect to automatically populate the institutional repository.

On the GOAL (Global Open Access) mailing list, I see talk that the distributed nature of institutional repositories is the best defense against such take-overs.

But one wonders if all this makes any difference if our institutional repositories fail to compete.

Given the large investments that Elsevier can pour into SSRN, plus the synergies it can create through its ownership of other parts of the ecosystem, can institutional repositories truly compete? Institutional repositories today are often mostly metadata rather than full text. Even as a librarian, I find uploading my papers to university institutional repositories extremely painful compared to commercial alternatives like ResearchGate, due to the complicated online forms.

Sure, most universities running DSpace or EPrints can in theory fix the interface and add functionality that isn't in the standard set, but this would apply only to their own installations and not the base package. Compared to a centralised subject repository, researchers would find uploading their output an extremely fragmented and uneven experience: e.g. some institutional repositories would send them usage statistics, some wouldn't. Compare this to someone uploading to SSRN, which has a consistent set of data available for comparison (institution, researcher, paper) across the whole output posted on SSRN.

So much for my hope that one of the tasks academic libraries could take on, once the purchaser role was phased out, would be that of a publisher via institutional repositories or even overlay journals.

Also, as Jennifer Howard notes, we are slowly getting cut out of researcher workflows. In the past, such publishers would still consult librarians to get a sense of how their material was used. In the digital era, they can see a lot more via web analytics. With the acquisition of tools used across the whole research cycle (e.g. citation managers, preprint servers, etc.), they can arguably be closer to and know more about faculty than any liaison librarian can hope to!

One bright spot exists, though. Current research information systems (CRIS) (e.g. Thomson Reuters' Converis or Elsevier's Pure) do have the potential to sit in researcher workflows, and it's logical for institutions to leverage those systems to provide traditional institutional repository functions. But as noted here, such systems are mainly internally rather than externally focused (though this might change), and libraries are generally secondary partners in them, unlike institutional repositories, where they typically lead.

So it's hard to say if this will pan out or, if it does, what roles libraries will play.


"Librarians certainly should be thinking about what we can contribute to an open access world – after all, we’ve been advocating for it for decades. We need to figure out how we can contribute to a more open, more accessible world of knowledge."

Let's start thinking seriously now...

Personal Note

I was recently awarded the LAS (Library Association of Singapore) Professional Service Award 2015 at a ceremony at the Singapore National Gallery last week.

I am truly humbled and thankful for this incredible honor. I truly did not expect this.

I would like to thank Gulcin Gribb, my University Librarian, for nominating me, as well as the awards panel.

I was cited for my contributions to the library profession through sharing knowledge and ideas, and this blog is definitely a very big part of that, so I thank everyone I have worked with, corresponded with and exchanged ideas with, including all of you, dear readers, who give me the motivation to blog.

Tuesday, April 26, 2016

A quick comparison of online infographics makers - Infogram, Piktochart and Venngage

When I was back in school, I dreaded art class as I was simply horrible at it. I was never a visual type of person, and even today I favor words and numbers and avoid most "artistic" endeavors. So you can understand why, when I decided to try creating an infographic for the library, I expected it to be a big disaster.

Fortunately, many tools have appeared that help even artistically impaired people like me not fail too badly.

Creating infographics, to me, involves three parts:

1) Pulling out the data you need from various library systems
2) Creating interesting infographic "objects" (images, charts, visualizations)
3) Organizing everything in an interesting structure

I am decently well versed in the first step and can happily pull library data from Google Analytics, Primo Analytics, Alma Analytics etc so this part wasn't the problem.

For the organization of the infographic, I kept it simple and used one of the numerous templates available.

So the last part involved doing charts and other visualizations of the data I had extracted. While Excel has become increasingly capable of creating all types of charts (Excel 2013 has donut charts, radar charts, combo charts, etc., while Excel 2016 adds histograms, treemaps, waterfall, sunburst, box & whisker, Pareto and more), there are still some typical visualizations used in infographics that Excel can't do, and this is where the online infographics makers come into play.

In particular, a very common visualization is the "X in Y people" type of statistic.

Another similar visualization often seen represents a percentage by proportionally shading an icon.

While it's possible to create the above by hand using, say, PowerPoint, it can be pretty exhausting. This is where the free online tools help.

I tried the free versions of Infogram, Piktochart and Venngage, and these are my impressions.

Infogram - good for more than 2 categories

Infogram has the usual charts and visualizations you expect, and also some less commonly used ones like treemap, bubble, hierarchy, etc.

But it is the pictorial ones that are interesting to me.

Pictorial bar is a downright easy way to visualize X-in-Y type statistics.

For example, if you want to show say 1 in 4 history students visit the library daily, it looks like this.

You can easily change the colors by editing the data , then clicking settings

You can also change the shape of the icon, to say a female icon or any of the preset ones. The selection is very limited though compared to others for the free version.

What if you want to create a visualization of three or more categories? Say you want to show of 10 students who visit the library, three are from business, five are from history and two are from Science?

For that you use the Pictorial chart.

I admit that I am puzzled that when I first enter the data, by default it gives me a grid of icons that is 12 by 24 = 288 icons.

If you fiddle around with other switches, such as turning off "round values" and using "absolute distributions", you can see that some of the icons are partly filled.

But I still wonder what the point of such a weird 12x24 distribution is. I may have missed something, but I can't change it to something saner like 10x10 to get a "For every 100 students...." view.

In any case, you can always turn on the "actual" switch, to get the exact number of icons you included.

Do also check out the gauge, progress bar and size visualizations, but in general Infogram's visualizations are fairly simple compared to the ones below.

Piktochart - upload your own icons

Piktochart has roughly the same types of visualizations as Infogram, via its "icon matrix".

However, Piktochart seems to have far more options. You can

a) choose from a far larger set of icons than Infogram offers
b) change to an icon you uploaded (SVG file)
c) set the number of columns the icons are arranged in.

In the above example, I changed the data to Business = 20, History = 30 and Science = 50.

I also changed the columns to 10, so there are 10 icons per row.

You can of course use this to do various tricks. In general, I find Piktochart has slightly more options than Infogram, and the ability to upload your own icon is a big win.

Venngage - my favourite

Venngage is by far my favourite tool of the bunch, at least for the purposes I am using it for.

First off, if you just want to represent two categories (e.g. use/non-use), you select Pictograms.

Like Piktochart, you get a huge library of icons to select from. But unlike Piktochart, you can't upload your own (not in the free version, at least).

Still, with the wide variety of icons available, you can easily create high-quality, professional-looking graphics like this.

By default, you get a 5 by 5 set of icons and you get 13 icons colored blue.

You can easily change it to say 10 by 10 with a value of 35. I've also changed the color.

Besides the fact that this visualization can't handle more than two categories (say Faculty/Postgraduate/Undergraduate), it also can't show partial shading of icons. So, for example, if you wanted 2.5 icons shaded in a 5x2 grid, it can't be done.

A fairly unique visualization that Venngage offers is the icon column and icon bar. Below is an example of an icon column that visualizes queries at the desk by source.

All you have to do is enter a table with values and choose the icons you want, and Venngage will automatically calculate and create the icons scaled proportionally to your values.

In the above example, below is what I entered as values.

I also changed each icon to an appropriate one from the available set. It doesn't seem possible to upload your own, but fortunately there seem to be hundreds available.

Have you seen infographics with icons that are proportionally filled up to X%? Seems like a lot of work to create? Venngage makes it easy.

In this example, I wanted to show that the library has an average occupancy rate of 80% at 10pm by creating an icon of a chair that is 80% filled.

The way to do so in Venngage is a bit hidden. First go to charts (on the left), scroll down and select Icon chart.

Drag the icon chart (the partly filled Twitter icon) to the canvas on the right. But how do you change it from the Twitter icon to something else?

This is done by choosing an icon (again on the left), selecting one of the hundreds of icons available and then dragging it to the canvas. If you have done it correctly, clicking on the icon shows, at the top, a way of adjusting the colors and the percentage fill.

Other nice stuff to explore include icons showing percentages (see below), bubble, stacked bubble and cloud bubble.


Canva has a very nice set of icons and other graphical elements, but it is relatively lacking in the pictorials I have covered above. It is still worth looking at if you want to use its large number of templates and other graphical elements.


This is just a quick overview of these online tools in one particular aspect that I was looking for.

Most of these tools are also capable of creating map visualizations, something I didn't try this time.

This is something I might cover in future posts, together with a quick comparison of desktop visualization/business intelligence tools including Qlik Sense Desktop, Tableau Public and Microsoft Power BI Desktop.

I am obviously still a beginner at this, so any corrections, comments and tips are welcome.

Sunday, March 20, 2016

Ezpaarse - an easier way to analyze ezproxy logs?

I've recently been trying to analyse ezproxy logs for various reasons (e.g. supplementing vendor usage reports, cost allocation, studying library impact, etc.), and for those of you who have done so before, you will know it can be a pretty tricky task given the size of the files involved.

In fact, there is nothing really special or difficult about ezproxy logs other than the size. A typical log line will look something like

jRIuNWHATOzYTCI p9234212-1503252-1-0 [17/May/2011:10:01:44 +1000] "GET HTTP/1.1" 200 120

Your library's logs may show slightly more detail, such as capturing user login information, the user-agent (i.e. the type of browser) and the referrer (i.e. the URL the user was on before).

In fact, you could even import this into Excel using space as a delimiter to get perfectly analyzable data. The main issue is that you couldn't go very far this way, because Excel is limited to about 1 million rows.

Overcoming the size issue

So one can't use Excel, what about exporting the data into a SQL database?

One idea is to use sed - a stream editor - to convert the files into CSV and import them into a SQL database (which can manage a large number of records, though you may still run up against the memory limits of your machine).

In any case, I personally highly recommend sed; as a stream editor, it is capable of finding, replacing and extracting from even very large text files efficiently. For example, I can use it to go over 15 GB of ezproxy logs and extract the lines that contain a certain string in less than 10 minutes on a laptop with 4-8 GB of RAM.

I messed around with it for a day or two and found it relatively easy to use.
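For those who prefer a script over a one-liner, the same streaming extraction can be sketched in Python; the file names and search string here are hypothetical examples, not anything from my actual logs.

```python
# Stream-filter a large ezproxy log without loading it into memory,
# similar in spirit to the sed extraction described above.

def extract_matching_lines(in_path, out_path, needle):
    """Copy only the lines containing `needle` from in_path to out_path."""
    matched = 0
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:            # reads one line at a time
            if needle in line:
                dst.write(line)
                matched += 1
    return matched

# e.g. extract_matching_lines("ezproxy.log", "jstor.log", "jstor.org")
```

Because it reads line by line, memory use stays flat no matter how large the log is, just like sed.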

What if you don't want to use a SQL database and just want to quickly generate the statistics?

Typically, most methods involve either

a) working with a homebrew Perl or Python script - e.g. see the ones shared by other libraries here or here

b) using a standard weblog analyzer like Sawmill, AWStats, AnalogX, etc.

These can run through your logs and generate statistics on any reasonable machine.

Still too big? Another alternative is to do the analysis over so-called SPUs (starting point URLs), which capture only the very first time a user logs in via ezproxy and creates a session. This results in much smaller files; depending on the size of your library, you will probably be able to analyse them even in Excel.

You may have to set up your ezproxy configuration files to generate SPU logs, as they are not logged by default.

Session based analysis

But regardless of the method I studied, I realized that they fundamentally gave the same results: what I call session-based analysis.

Example output from this script

These methods tell you how many sessions were generated and, combined with the domains in the HTTP requests, can tell you the number of sessions or users for each domain (say Scopus or JSTOR).

But sometimes sessions alone are not enough. If you want a more in-depth analysis, like the number of PDFs downloaded or pages viewed from, say, Ebsco or ScienceDirect, you are stuck.
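To make the idea of session-based analysis concrete, here is a minimal sketch of a session-per-domain tally; the field positions follow the sample log line shown earlier (session token first, target URL inside the quoted request), so adjust the parsing for your own log format.

```python
import re
from collections import defaultdict

# Domain of the proxied target, taken from the quoted HTTP request.
REQUEST_RE = re.compile(r'"GET https?://([^/\s:]+)')

def sessions_per_domain(lines):
    """Count distinct ezproxy sessions seen for each target domain."""
    seen = defaultdict(set)              # domain -> set of session tokens
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        session = parts[0]               # first field: ezproxy session token
        m = REQUEST_RE.search(line)
        if m:
            seen[m.group(1)].add(session)
    return {domain: len(ids) for domain, ids in seen.items()}
```

This is roughly all a session-based report can tell you: how many sessions touched each platform, not what was done there.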

The difficulty lies in the fact that it isn't always obvious from the HTTP request whether the user is requesting a PDF download or even an HTML view on that platform.

Certainly, if you wanted to, you could do a quick ad hoc analysis of the URLs for one or two platforms, but doing it for every platform you subscribe to (and most libraries subscribe to hundreds) would be a huge task, especially if you started from scratch.
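As an illustration of what such an ad hoc analysis looks like, here is a hypothetical per-platform heuristic; these URL patterns are invented for the example and do not describe any real vendor's scheme, which is exactly why maintaining rules like this for hundreds of platforms is painful.

```python
def classify_request(url):
    """Guess whether a request is a PDF download, an HTML view, or neither.
    The patterns below are illustrative only; real platforms differ and change."""
    lowered = url.lower()
    if lowered.endswith(".pdf") or "/pdf" in lowered or "pdfviewer" in lowered:
        return "PDF"
    if lowered.endswith((".html", ".htm")) or "/full" in lowered or "/abs/" in lowered:
        return "HTML"
    return "OTHER"
```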

Is there a better way?

Going beyond session based analysis with ezpaarse

What if I told you there was a free open source tool - ezpaarse - that already has URL patterns for parsing over 60 commonly subscribed library resources and can produce data-rich reports like the ones below?

Starting out with Ezpaarse

Ezpaarse comes in two versions: a local version you can host and run on your own servers, and, more interestingly, a cloud-based version.

The cloud-based version is perfectly serviceable and great to use if you don't have the resources or permission to run your own servers, but obviously one must weigh the risk of sending user data over the internet, even if you trust the people behind ezpaarse. (The ezproxy log you upload to the cloud version doesn't seem to be secured, I think.)

One can reduce the risks by anonymizing IP addresses, masking emails, cleaning HTTP requests, etc. before sending the logs off to the cloud, of course. (I personally recommend using sed to clean the logs.)

Choosing the right log format 

Your logs might be in slightly different formats, so the first step after you sign in is to specify the format of your logs. You do so by clicking on the "Design my log format" tab, then throwing in a few lines of your logs to test.

If you are lucky, it may automatically recognise your log format; if not, you need to specify it.

Typically you need to look into your ezproxy.config for the ezproxy log directive. Look for something like

LogFormat %h %l %u %t "%r" %s %b

If you did it correctly, it should interpret the sample lines nicely like this (scroll down).
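Under the hood, a format string like that corresponds to a regular expression; the sketch below shows the general idea (my own rough reconstruction, not ezpaarse's actual code).

```python
import re

# Map each LogFormat token to a named capture group.
TOKEN_PATTERNS = {
    "%h": r"(?P<host>\S+)",          # client host or session token
    "%l": r"(?P<ident>\S+)",
    "%u": r"(?P<user>\S+)",
    "%t": r"\[(?P<time>[^\]]+)\]",   # e.g. [17/May/2011:10:01:44 +1000]
    '"%r"': r'"(?P<request>[^"]*)"',
    "%s": r"(?P<status>\d{3})",
    "%b": r"(?P<bytes>\S+)",
}

def logformat_to_regex(fmt):
    """Turn an Apache/ezproxy-style LogFormat string into a compiled regex."""
    parts = [TOKEN_PATTERNS.get(tok, re.escape(tok)) for tok in fmt.split()]
    return re.compile(r"\s+".join(parts))
```

With `logformat_to_regex('%h %l %u %t "%r" %s %b')`, matching a log line gives you named fields like `status` and `request` to work with.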

If you are having problems getting this to work do let the people at Ezpaarse know, they will help you figure out the right syntax. My experience so far is they are very helpful.

In fact, for ease of reuse, the ezpaarse people have helped some institutions create preset parameter sets already. Click on "parameters".

You can see some predefined parameters for various institutions. They are mostly from France and Europe in the screenshot, but as you scroll down you will see that libraries from the US and Australia are already included, showing that word of this tool is spreading.

You can look at other options, including the ability to have it email you when the process is complete, but most intriguing to me is the ability to simulate COUNTER (JR1) reports.

I haven't tried it yet, but it could be used to compare against vendor reports as a sanity check (differences are expected, of course, because of off-campus access etc.).

Loading the file and analyzing

Once that's done, the rest is simple. Just click on the Logfiles tab and add the files you want to upload.

I haven't tried it with huge files (e.g. >4 GB), so there may be file limits, but it does seem to work for reasonably sized files, as it appears to read the file line by line.

As the file is processed line by line, you can see the number of platforms recognized and the accesses recorded so far. In my own experience, it occasionally choked on the first line and refused to work, so it may be worthwhile clicking on "system traces" to see what error messages occur.

Downloading and reporting

Once the file is 100% processed you can just download the processed file.

It is a simple CSV file where the data is delimited by semicolons, so you can open it with many tools such as Excel.

You can see the processed file below.

There is a ton of information that ezpaarse manages to extract from the ezproxy log, including but not limited to

a) Platform
b) Resource type (Article, Book, Abstract, TOC etc)
c) File type (PDF, HTML, Misc)
d) Various identifiers - ISSN, DOIs, Subjects (extracted from DOIs) etc.
e) Geocoding - By country etc
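Because the output is just semicolon-delimited text, simple tallies need nothing more than the standard library. In the sketch below, the `platform` column name is an assumption, so check it against your own file's header row.

```python
import csv
from collections import Counter

def hits_by_platform(path, column="platform"):
    """Tally rows in an ezpaarse output file by the given column."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=";")
        return Counter(row[column] for row in reader if row.get(column))
```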

It's not compulsory, but you can also download the Excel template and load the processed file through it to generate many beautiful charts.

Some disadvantages of using Ezpaarse

When I got the cloud based version of Ezpaarse to work, I was amazed at how easy it was to create detailed information rich reports from my ezproxy logs.

Ezpaarse was capable of extracting very detailed information that I wouldn't have thought possible. This is due to the very capable parsers built in for each platform.

This is also its weakness, because ezpaarse will totally ignore lines in your logs from platforms for which it does not have a parser.

You can see the current list of parsers available, as well as the ones currently being worked on.

While over 60 platforms have parsers - such as Wiley, T&F, ScienceDirect and Ebscohost - many popular ones such as Factiva, Ebrary, Westlaw, LexisNexis, MarketLine and Euromonitor are still not available, though they are in progress.

Of course, if you subscribe to obscure or local resources, the chances of them being covered are nil unless you contribute a parser yourself.

Overall, it seems to me that ezpaarse currently has more parsers for the traditional large journal publishers and fewer for business- and law-type databases, so institutions specializing in law or business may get less benefit from it.

In some ways, many of the parsers cover platforms that libraries typically get COUNTER statistics from anyway, but ezproxy log analysis goes beyond simplistic COUNTER statistics, allowing you, for example, to consider other factors like user group or discipline, where such data is available in your ezproxy logs.

A lot of the documentation is also in French, but nothing Google Translate can't handle.


Ezpaarse is a really interesting piece of software. The fact that it is open source and allows libraries to contribute parsers for each platform without reinventing the wheel is a potential game changer.

What do you think? I am a newbie at ezproxy analysis with limited skills, so do let me know what I have missed or misstated. Are there alternative ways to do this?


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.