Google and Google Scholar - Bodleian Libraries

Roger Mills and Judy Reading, May 2007

Welcome to the Web: the world's biggest haystack. What can you do in a haystack? Romp. Get hay fever. Have unexpected encounters. Sleep. Not do research. So what do you fancy?

Finding needles: Google helps you find needles in haystacks. But Google is an index of web pages, and a journal article is not a web page, so Google is not good at finding journal articles. However, an image of a journal article may be placed on a web page, so Google may find it, if it's free and not behind a firewall. How do you know?

Google is fast.

Very fast. Proudly fast. Tells you how fast: it found the OUCS home page in 0.09 seconds. It also found 350,000 other relevant pages, but put the home page first.

Brilliant. How does it do it? Not telling. Did I need 350,000 references? Nobody looks at all the references Google retrieves, so why display them?

The algorithm takes into account links made by other pages, and click-throughs. So the top result for a given search is determined over time by the people who make that search. Is that the same as the best result?
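Google does not publish its ranking method, so the following is only a toy illustration of the general idea that links from other pages push a page up the list. It is a minimal sketch of a link-based score computed by repeated redistribution; the function name, the damping value and the four-page mini-web are assumptions made for the example, and click-throughs are not modelled at all.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Toy link-based scoring: a page scores well when well-scored pages link to it.

    `links` maps each page to the list of pages it links to.
    This is an illustration only, not Google's actual algorithm.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score...
        new_score = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # ...and passes the rest of its score to the pages it links to.
            # A page with no outgoing links shares its score with every page.
            targets = links.get(page) or pages
            share = damping * score[page] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return score


# Hypothetical mini-web: three pages link to "home", which links to "about".
example_links = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home"],
    "news": ["home"],
}
scores = link_rank(example_links)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # the most linked-to page, "home", comes first
```

Note that "home" wins because other pages point at it, not because anyone judged its content. That is the question the slide asks: is the most linked result the same as the best result?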

OK, how would you do it? To index a document, I'd read it first. Google can't read. We don't read the web, we view it. We remember references visually: "that red book on the third shelf down".

If Google can list all the red books on all the third shelves down in all the world, I'm bound to find it, right? Actually, I remember I saw it in Oxford, so I just need to list all the red books in Oxford. Doddle. That's not really how Google works, is it?

So you read the article, and then? Give it some index terms. Not ones I've just made up, but ones from a standard list. That way, everyone will know what the article's about, and every article on the same topic can be found. Provided everyone agrees what the article's about.

Then I'd list the authors in a standard form, so everything by Roger Mills, Roger Anthony Mills, Roger A Mills, R Anthony Mills, Anthony Mills and R A Mills can be found in one go. That's a controlled vocabulary. Works for journal titles too.
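The standard-form idea can be sketched as a small lookup table (an authority file) that maps every known variant of a name to one agreed heading. This is a minimal sketch: the heading "Mills, Roger Anthony", the function names and the ASCII-only clean-up rule are all assumptions made for illustration, not the practice of any particular database.

```python
import re

# Hypothetical authorised heading for the author used as the example above.
AUTHORISED = "Mills, Roger Anthony"

# The name variants listed above, all pointing at the same person.
VARIANTS = [
    "Roger Mills",
    "Roger Anthony Mills",
    "Roger A Mills",
    "R Anthony Mills",
    "Anthony Mills",
    "R A Mills",
]


def normalise(name):
    """Lower-case, drop punctuation and collapse spaces (ASCII letters only)."""
    letters_only = re.sub(r"[^a-z\s]", "", name.lower())
    return " ".join(letters_only.split())


# The authority file: normalised variant -> authorised heading.
AUTHORITY = {normalise(variant): AUTHORISED for variant in VARIANTS}


def authorised_form(name):
    """Return the authorised heading for a name, or None if it is not listed."""
    return AUTHORITY.get(normalise(name))


print(authorised_form("R. A. Mills"))
print(authorised_form("Judy Reading"))
```

Someone has to build and maintain the table by hand. The sketch does not guess that two names belong to the same person, which is exactly the human indexing effort that a controlled vocabulary represents.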

Google doesn't do that. No controlled terms. So you must think of synonyms, different forms of name, title abbreviations, etc. You must define the context that matters.

Knitting according to Google

OK, we get it. So let's invent Google Scholar.
Let's team up with publishers so they let us search behind their firewalls.
Let's modify our algorithm so it excludes non-scholarly material (how do we define that?).
Let's look at citations, so when one article we index cites another one we index, we can move it higher up the relevance ranking.
Let's link together different versions of the same article.
Let's include library locations for full-text access.
Let's see how it goes.

But let's not allow:
creation of sets
or controlled vocabularies
or combining of searches
or hit rate figures for individual search terms
or proximity searching
or saving and e-mailing results
or creation of alerts
or standardisation of journal names/abbreviations
or info on what is included and what is not
or info on how the system decides what is scholarly
or an indication of update frequency (it seems slower than normal Google)
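The citation idea in the wish list above (an indexed article that is cited by other indexed articles moves up the ranking) can be sketched as a "cited by" index plus a sort. This is a minimal sketch under stated assumptions: the function names and the paper labels are invented for the example, and how Google Scholar actually weighs citations is not disclosed.

```python
from collections import defaultdict


def cited_by_index(citations):
    """Build a 'cited by' index from (citing, cited) pairs of indexed papers."""
    index = defaultdict(set)
    for citing, cited in citations:
        if citing != cited:  # ignore a paper citing itself
            index[cited].add(citing)
    return index


def rank_by_citations(results, index):
    """Put the most-cited papers first; ties keep their original order."""
    return sorted(results, key=lambda paper: len(index.get(paper, ())), reverse=True)


# Hypothetical citation pairs: A cites C (listed twice), B cites C, A cites B.
pairs = [("A", "C"), ("B", "C"), ("A", "B"), ("A", "C")]
index = cited_by_index(pairs)
print(rank_by_citations(["A", "B", "C"], index))
```

Only citations between papers the system already knows about are counted, so a well-cited paper whose citing articles are not indexed gains nothing.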

Which of these statements is true?
Google is comprehensive
Google is all I need
Google is up-to-date
Google is not evil
Google is commercial
Google is independent
Google is secretive
Google wants to rule the world
Google wants to beat Microsoft
Google loves me
I love Google

Google is a family: a range of products under a common brand. Some add value to the basic search engine; others are nothing to do with searching. Google Scholar is a variant of the standard search engine. It uses a different algorithm, but we don't know how it differs.

What's in Google Scholar?
Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: peer-reviewed papers, theses, books, abstracts and articles, from academic publishers, professional societies, preprint repositories, universities and other scholarly organizations. Google Scholar helps you identify the most relevant research across the world of scholarly research.

NB: only in Beta. Features may change. Developing in tandem with Google Books, which will include digitised texts from Oxford collections and others. In competition with WoK, ScienceDirect, SCOPUS, Scirus, etc.

Content:
Algorithm to identify scholarly materials crawled by Google from the open web
Access to materials locked behind subscription barriers
Must include abstract
Full-text access requires institutional subscriptions or individual payment
Includes peer-reviewed papers, theses, books, preprints, abstracts, full text, citations, etc.

Library links:
Includes OpenURL links to local library holdings
In Oxford, displays as "Oxford Full Text" beside the title

Includes citation data:
Uses citation extraction to build connections between papers
"Cited by" link lists items (known to Google Scholar) that cite the original paper
Cited items not available online are listed with the prefix [citation]
Citation analysis puts the most-cited papers at the top of the results list

Searching:
AND is implied between words, as in normal Google
+ to include common words, letters or numbers that Google's search technology generally ignores
quote marks to search for a phrase
minus sign to exclude a term from a search
OR for either search term
author: for author search
intitle: to search document title
restrict by date and publication
advanced search screen available

Exercise:
Try searching for: French national identity
In Google and Google Scholar
With and without quotation marks
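The two searches in the exercise, and most of the operators listed under Searching, can be put together as plain query strings. This is a minimal sketch: `build_query` and its parameter names are invented for the example, and it only assembles the text you would type into the search box; it does not call Google.

```python
def build_query(words=(), phrase=None, any_of=(), exclude=(), author=None, intitle=None):
    """Assemble a search-box query from the operators listed above."""
    parts = list(words)  # plain words: AND is implied between them
    if phrase:
        parts.append(f'"{phrase}"')  # quote marks search for the exact phrase
    if any_of:
        parts.append(" OR ".join(any_of))  # OR matches either term
    parts.extend(f"-{word}" for word in exclude)  # minus sign excludes a term
    if author:
        # Quote multi-word author names so the operator applies to the whole name.
        value = f'"{author}"' if " " in author else author
        parts.append(f"author:{value}")
    if intitle:
        parts.append(f"intitle:{intitle}")
    return " ".join(parts)


# The exercise: the same three words as keywords and as a phrase.
keyword_query = build_query(words=["French", "national", "identity"])
phrase_query = build_query(phrase="French national identity")
print(keyword_query)
print(phrase_query)

# A combined example using hypothetical search terms.
print(build_query(words=["identity"], any_of=["France", "French"],
                  exclude=["cuisine"], author="r mills"))
```

The keyword form matches pages containing all three words anywhere, while the phrase form requires them together and in order, which is why the two searches return very different numbers of results.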

Now try searching in Web of Science (or another relevant database).
Is it clear why results differ?
Which approach provides the most useful results:
for writing a paper for publication
for quoting in a thesis
for preparing a speech
for preparing for a pub quiz
or any other purpose?

Help screens

Earlier version

Alternatives to Google:
Google it! See Charles Knight's up-to-date Top 100 list in Read/Write Web: _search_engines_mar07.php
Use Intute for reputable human-selected sites, chosen for a UK academic audience.
Check OxLIP for a complete listing and subject guide to university-subscribed databases. Most list the sources they cover and use controlled vocabularies for indexing.

An example of Google's strengths and weaknesses in finding a specific article: a search done in 2005 and repeated in Nov 2006.

Biology search: glutathione in green Arabidopsis

WoS: exact article in one step
Scholar phrase search 2005: 15 results, this one at 7
Scholar phrase search 2006: 16 results, this one first
Scholar keyword search 2005: 2420 results, this one at 10
Scholar keyword search 2006: 4800 results, this one first
Google keyword search 2005: 17600 results, this one first
Google keyword search 2006: 169000 articles, this one first
Google phrase search 2005: 59 results, this first
Google phrase search 2006: 86 results, this first
Scholar 2005: all 7 versions
Scholar 2005: cited by 2
Scholar 2006: cited by 14

WoS 2005: cited by 3
WoS 2006: cited by 15

Comparing citations data: 2005 [figure; labels GS and SC]
Comparing citations data: 2006 [figure; label GS]
Citations arranged by most cited

SCIRUS phrase search: 2 journals, this first; 8 other web sources (inc. previous versions of this talk!)
SCIRUS keyword search: 735 journals, this first; 6996 others
Biological Abs phrase search: exact match in 1; note controlled keywords

SCIRUS: very similar to Scholar, but can also:
mark records
save records
e-mail records
export set in RIS format (for EndNote)

Search on controlled terms in Biological Abstracts:
Omitting green, 14 results
Not including this one, first on Scholar
Need wildcard arabidopsis-*

Conclusion

Maintain a balanced diet! Five a day: WoK, Scopus, Intute, subject-specific database, Google Scholar.

