Grids for Chemical Informatics Chemistry, IU Bloomington Oct.

Grids for Chemical Informatics
Chemistry, IU Bloomington, Oct. 21 2005
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
[email protected]

Why are Grids Important

  Grids are important for Chemistry because they support key functionalities that grow in importance as we are deluged with data from instruments and simulations
  Grids provide information access, storage and management
  Grids manage multiple simulations with different defining parameters
  Grids allow complex workflows with data flowing between filters
  Grids define models for portals
  Grids are built on top of commodity web service technology with broad industry support: the next generation information technology
  Grids are used in multiple NIH and other life science/chemistry projects across the world (BIRN, caBIG, myGrid, Comb-e-Chem)

Internet Scale Distributed Services
  Grids use Internet technology and are distinguished by managing or organizing sets of network-connected resources
    Classic Web allows independent one-to-one access to individual resources
    Grids integrate together and manage multiple Internet-connected resources: people, sensors, computers, data systems
  Organization can be explicit, as in
    TeraGrid, which federates many supercomputers
    Deep Web Technologies IR Grid, which federates multiple data resources
    CrisisGrid, which federates first responders, commanders, sensors, GIS, (Tsunami) simulations, science/public data
  Organization can be implicit, as in Internet resources such as curated databases and simulation resources that harmonize a community

Different Visions of the Grid
  Grid just refers to the technologies
    Or Grids represent the full system/Applications
  DoD's vision of Network Centric Computing can be considered a Grid (linking sensors, warfighters, commanders, backend resources) and they are building the GiG (Global Information Grid)
  Utility Computing or X-on-demand (X=data, computer ..) is a major computer industry interest in Grids, and this is a key part of enterprise or campus Grids
  e-Science or Cyberinfrastructure are virtual organization Grids supporting global distributed science (note that sensors, instruments and people are all distributed)
  Skype (Kazaa) VOIP system is a Peer-to-peer Grid (and VRVS/GlobalMMCS like Internet A/V conferencing are Collaboration Grids)
  Commercial 3G Cell-phones and DoD ad-hoc network initiative are forming mobile Grids

Types of Computing Grids

  Running Pleasingly Parallel Jobs as in United Devices, Entropia (Desktop Grid) cycle stealing systems
  Can be managed (inside the enterprise, as in Condor) or more informal (as in SETI@home)
  Computing-on-demand in Industry, where the jobs spawned are perhaps very large (SAP, Oracle ...)
  Support distributed file systems as in Legion (Avaki), Globus with (web-enhanced) UNIX programming paradigm
    Particle Physics will run some 30,000 simultaneous jobs
  Distributed Simulation: HLA/RTI style Grids
  Linking Supercomputers as in TeraGrid
  Pipelined applications linking data/instruments, compute, visualization
  Seamless Access, where Grid portals allow one to choose one of multiple resources with a common interface
  Parallel Computing is typically NOT suited for a Grid (latency)

Old Style Metacomputing Grid
  [Figure labels: Analysis and Visualization; Advanced Visualization, Analysis; Large Disks; Computational Resources; Large-Scale Databases; Large Scale Parallel Computers]
  Original: Spread a single large problem over multiple supercomputers
  Now-1: Control multiple smallish jobs, each on independent computers
  Now-2: Choose which of a few supercomputers to use

Towards an International Compute Grid Infrastructure
  [Figure: map of sites and links. Labels: Starlight (Chicago), US TeraGrid, SDSC, NCSA, PSC, UK NGS, Leeds, Manchester, Oxford, RAL, UCL, Netherlight (Amsterdam), UKLight, SC05. Legend: Computation, Network PoP, Steering clients, Service Registry]
  All sites connected by production network (not all shown)
  Local laptops in Seattle and UK

Information/Knowledge Grids
  Distributed (10s to 1000s) of data sources (instruments, file systems, curated databases ...)
  Data Deluge: 1 (now) to 100s petabytes/year (2012)
    Moore's law for Sensors
  Possible filters assigned dynamically (on-demand)
    Run image processing algorithm on telescope image
    Run Gene sequencing algorithm on compiled data
  Needs decision support front end with what-if simulations
  Metadata (provenance) critical to annotate data
  Integrate across experiments as in multi-wavelength astronomy

  Data Deluge comes from pixels/year available

Data Deluged Science
  Now particle physics will get 100 petabytes from CERN using around 30,000 CPUs simultaneously 24x7
  Exponential growth in data; compare to:
    The Bible = 5 Megabytes
    Annual refereed papers = 1 Terabyte
    Library of Congress = 20 Terabytes
    Internet Archive (1996-2002) = 100 Terabytes
  Weather, climate, solid earth (EarthScope)
  Bioinformatics curated databases (Biocomplexity only 1000s of data points at present)
  Virtual Observatory and SkyServer in Astronomy
  Environmental Sensor nets
  In the past, the HPCC community worried about data in the form of parallel I/O or MPI-IO, but we didn't consider it as an enabler of new science and new ways of computing
    Data assimilation was not central to HPCC
    DoE ASCI set up because didn't want test data!

Virtual Observatory Astronomy Grid
  Integrate Experiments
  [Figure panels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]

International Virtual Observatory Alliance

  Reached international agreements on Astronomical Data Query Language, VOTable 1.1, UCD 1+, Resource Metadata Schema
  Image Access Protocol, Spectral Access Protocol and Spectral Data Model, Space-Time Coordinates definitions and schema
  Interoperable registries by Jan 2005 (NVO, AstroGrid, AVO, JVO) using OAI publishing and harvesting
  So each Community of Interest builds data AND service standards that build on GS-* and WS-*

myGrid Project
  Imminent deluge of data
  Highly heterogeneous
  Highly complex and inter-related
  Convergence of data and literature archives

The Williams Workflows
  A: Identification of overlapping sequence
  B: Characterisation of nucleotide sequence
  C: Characterisation of protein sequence

Web services
  Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on the SOA (service oriented architecture) principles.
  Web Services interact by exchanging messages in SOAP format.
  The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.
  [Figure labels: Devices, Humans, Databases, Programs, Computational resources ...; SOAP messages]

A typical Web Service
  In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI messages, CGI Web invocations, or totally compiled away (inlining)
  The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python
  [Figure labels: Web Services; WSDL interfaces; Portal Service; Security; Payment, Credit Card; Catalog; Warehouse; Shipping control]
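
The message exchange described above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not code from the talk: the service namespace `urn:example:catalog`, the `GetPrice` operation and its parameters are hypothetical, and a real deployment would normally rely on a SOAP toolkit rather than hand-built XML. The only fixed value is the SOAP 1.1 envelope namespace.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "urn:example:catalog"                         # hypothetical service namespace


def build_request(operation, params):
    """Wrap an operation name and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SVC_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SVC_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_request(message):
    """Recover the operation name and parameters on the receiving side."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # the first child of Body is the call element
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


# Hypothetical request: the operation and parameter names are made up.
message = build_request("GetPrice", {"item": "benzene", "quantity": 2})
operation, params = parse_request(message)
print(operation, params)
```

Because both sides agree only on the XML message, the sender and receiver could be written in different languages, which is the loose coupling the slides emphasize. Note that all parameter values arrive as text; typing them is the job of the interface description.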

Two-level Programming I
  The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model
  We make a Service (same as a distributed object or computer program running on a remote computer) using conventional technologies
    C++, Java or Fortran Monte Carlo module
    Data streaming from a sensor or Satellite
    Specialized (JDBC) database access
  Such services accept and produce data from users, files and databases
  [Figure labels: Service, Data]
  The Grid is built by coordinating such services, assuming we have solved the problem of programming the service

Two-level Programming II
  The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams
  [Figure labels: Service1, Service2, Service3, Service4]
  Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs
  Such interpretative environments are the single processor analog of Grid Programming
  Some projects like GrADS from Rice University are looking at integration between service and composition levels, but the dominant effort looks at each level separately

Grid of Grids: Research Grid and Education Grid
  [Figure labels: Repositories, Federated Databases; Database; Sensors; Streaming Data; Field Trip Data; Sensor Grid; Database Grid; Research Compute Grid; Data Filter Services; Research Simulations; SERVOGrid; ?GIS; Discovery Grid; Services; Education; Customization Services; From Research to Education; Analysis and Visualization Portal; Education Grid; Computer Farm]

SERVOGrid Requirements
  Seamless Access to Data repositories and large scale computers
  Integration of multiple data sources including sensors, databases, file systems with analysis system
    Including filtered OGSA-DAI (Grid database access)
  Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid
  Portals with component model for user interfaces and web control of all capabilities
  Collaboration to support world-wide work
  Basic Grid tools: workflow and notification
  NOT metacomputing

SERVOGrid Portal Screen Shots

Earthquake Grid

  [Figure: layered Grid diagram. Labels: DoD NCOW Grid; C2 (JBI CEE etc.); NCOW-IS Services; CoI Specific Grids/Services; Earthquake Data & Simulation Service; ServoIS; Information Grid; 7: Portals; Compute Grid; 6: Collaboration Grid; GIS Grid; Sensor Grid; 9: Application Services; 10: Policy (ECS); 8: Data Access/Storage; 4: Discovery; 2: Security; Core Low Level Grid Services; 3: Messaging; 5: Mediation; 11: Metadata; 1: Management; Physical Network]
  n: Service refers to core services identified by DoD
  CoI: Community of Interest
  GIS: Geographical Information System

  [Figure: layered Grid diagram. Labels: BioInformatics Grid; Chemical Informatics Grid; HTS Tools; Quantum Calculations; CIS; Domain Specific Grids/Services; 7: Portals; Compute Grid; MIS Grid; Instrument Grid; Information Grid; 6: Collaboration Grid; 9: Application Services; 10: Policy; 8: Data Access/Storage; 4: Discovery; 2: Security; Sequencing Tools; Biocomplexity Simulations; BIS; Core Low Level Grid Services; 3: Messaging; 5: Workflow; 11: Metadata; 1: Management; Physical Network]
  M(B,C)IS: Molecular (Bio, Chem) Information System

GIS Grid with WMS, WFS, data sources and GML

  [Figure labels: WMS Client; WFS Server; SQL Query (twice). Sample feature record: Northridge2, Wald D. J., coordinates -118.72,34.243 and 118.591,34.176. Map layers: Railroads, Interstate Highways, Rivers, Bridges]
  GML becomes CML, CellML, SBML

  [Figure: Electric Power and Natural Gas data from LANL Interdependent Critical Infrastructure Simulations. Map controls: Zoom-in, Zoom-out, FeatureInfo mode, Measure distance mode, Clear Distance, Drag and Drop mode, Refresh to initial map]

Integrating Archived Web Feature Services and Google Maps
  Google maps can be integrated with Web Feature Service Archives to filter and browse seismic records.

What is Happening?

  Grid ideas are being developed in (at least) four communities
    Web Service: W3C, OASIS, (DMTF)
    Grid Forum (High Performance Computing, e-Science)
    Enterprise Grid Alliance (Commercial Grid Forum with a near term focus)
  Service Standards are being debated
  Grid Operational Infrastructure is being deployed
  Grid Architecture and core software being developed
    Apache has several important projects, as do academia and large and small companies
  Particular System Services are being developed centrally: OGSA or GS-* framework for this in GGF; WS-* for OASIS/W3C/Microsoft-IBM
  Lots of fields are setting domain specific standards and building domain specific services
  USA started but now Europe is probably in the lead, and Asia will soon catch USA if momentum (roughly zero for USA) continues

The Grid and Web Service Institutional Hierarchy
  4: Application or Community of Interest Specific Services, such as Run BLAST or Look at Houses for sale
  3: Generally Useful Services and Features, such as Access a Database or Submit a Job or Manage Cluster or Support a Portal or Collaborative Visualization (OGSA GS-* and some WS-*; GGF/W3C/...)
  2: System Services and Features: Handlers like WS-RM, Security; Programming Models like BPEL; or Registries like UDDI (WS-* from OASIS/W3C/Industry)
  1: Container and Run Time (Hosting) Environment (Apache Axis, .NET etc.)
  Must set standards to get interoperability

Location of software for Grid Projects in Community Grids Laboratory

  http:// provides Web service (and JMS) compliant distributed publish-subscribe messaging (software overlay network)
  http:// is a service oriented (Grid) collaboration environment (audio-video conferencing)
  is an OGC (Open Geospatial Consortium) Geographical Information System (GIS) compliant GIS and Sensor Grid (with POLIS center)
  has WS-Context, Extended UDDI etc.
  The work is still in progress but NaradaBrokering is quite mature
  All software is open source and freely available

Project Goals
  Establish Requirements from stakeholders
    Research
    Pharmaceutical Industry
    Government
  Consider educational implications
    e-Science v Bio/Chem/Molecular Informatics
  Consider other national and international projects to ensure we either lead or use best practice
  Design a Grid architecture and staged implementation
  Start pilot projects led by Chemistry/Chemical Informatics
  Evaluate and iterate
  Design and implement ?(Chem, Life Science, Science, Molecular) Informatics educational program that will attract students
  Write winning center grant in 2006-7

Web Services Introduction
  What are Web Services?
    A distributed invocation system built on Grid computing
      Independent of platform and programming language
      Built on existing Web standards
    A service oriented architecture with
      Interfaces based on Internet protocols
      Messages in XML (except for binary data attachments)

Web Services Introduction
  A web-based architecture providing for interoperability among resources
  Centralized service registry
  Solves problems associated with finding, using, and combining online resources
  Employ standard Internet protocols for:
    Communication with resources
    Automated discovery using centralized registries
  Communicate with devices, people, and each other with the protocols and computer languages

Service Oriented Architecture (SOA)
  Goal is to achieve loose coupling among interacting software agents
  Define service: a unit of work done by a service provider to achieve desired end results for a service consumer
  Both provider and consumer are roles played by software agents on behalf of their owners.

How does SOA work?
  Two architectural constraints are employed
    Small set of simple and ubiquitous interfaces to all participating software agents
    Descriptive messages constrained by an extensible schema delivered through the interfaces

Web Services Architectures
  Individual services are registered globally
  Broken down into individual services with inputs and outputs specified
    Services are published
    Services are requested
  Open registry, publishing, and requesting

Service-Oriented Architecture
  From Curcin et al. DDT, 2005, 10(12), 867

Web Services for Science

  Invisible Services, Semantic Web, and Grid
  Easy-to-use tools for any scientist
  High throughput, resource intensive computing done for low cost/resources
  Shared community
    Collaborations between labs and fields
    Shared data
    Shared tools

e-Science and the Grid 1
  e-Science: Major UK Program
    global collaboration in key areas of science and the next generation of infrastructure that will enable it
    reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams
    total investment of some 200M over the five-year period from 2001 to 2006
  CyberInfrastructure: the analogous US initiative
  Grid Technology: supports e-Science & Cyberinfrastructure

Basic Architectures: Servlets/CGI and Web Services
  [Figure labels: Browser; GUI Client; HTTP GET/POST; Web Server; WSDL; SOAP; JDBC; DB or MPI Appl.]

Importance of Web Services
  Building a true science community
  Enabling interoperability between tools and the integration of data
  Less time coding, more time for science
  Change the way scientists work by achieving new levels of integration

When To Use Web Services?
  Applications do not have severe restrictions on reliability and speed.
  Two or more organizations need to cooperate.
    One needs to write an application that uses another's service.
  Services can be upgraded independently of clients.
  Services can be easily expressed with simple request/response semantics and simple state.

Web Services Benefits
  Web services provide a clean separation between a capability and its user interface.
  Increase in productivity
  Increase in flexibility
  Rapid return on investment
  Integration across multiple applications

Web Services Advantages
  Output in human- and computer-readable formats
  I/O formats based on standard Internet protocols
  Resources accessible server to server allow automated I/O
  Integration based on specific services: you select services or data needed without downloading the entire data set

Web Services Advantages
  Description protocols provide details of service provided and interface components
  Semantic Web standards increase efficiency
    Use a central registry and standardized description of services
  Quality and status of the information is dynamically available

Web Services Drawbacks

  Based on new technologies
    Time and commitment required to learn
    Standards still in a state of rapid flux
  Issues with quality of data (and, for chemistry, quantity of open data), security, and privacy

Components of Web Services
  Protocols
    SOAP
    WSDL
    UDDI
  XML as a basis for the protocols
  Ontologies
    OWL: Web Ontology Language
  Semantic Web

Components of the Semantic Web for Chemistry
  XML: eXtensible Markup Language
  RDF: Resource Description Framework
  RSS: Rich Site Summary
  Dublin Core allows metadata-based newsfeeds
  OWL for ontologies
  BPEL4WS for workflow and web services
  Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.

SOAP: Simple Object Access Protocol
  Flexible protocol to communicate information between server and server or client and server using XML
  Supports Remote Procedure Calls
  Allows layers (security, authentication, transactions) over the basic SOAP elements

WSDL: Web Services Description Language
  Describes a service's interface to clients
  Services register themselves with Web Services
  WSDL describes how to contact and interact with services
    I/O, operations and messages to aid interaction with client
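
To make the "operations and messages" idea concrete, the sketch below reads a small WSDL 1.1 fragment with Python's standard library and lists each operation with its input and output message. Everything about the service is hypothetical: the `MolWeight` interface, its `getWeight` operation and the `urn:example:molweight` namespace are invented for illustration, and the fragment shows only the abstract interface, leaving out the protocol binding and the service location.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical interface description for an imaginary molecular-weight service.
WSDL_DOC = """
<definitions name="MolWeight"
    targetNamespace="urn:example:molweight"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="urn:example:molweight"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <message name="WeightRequest">
    <part name="smiles" type="xsd:string"/>
  </message>
  <message name="WeightResponse">
    <part name="weight" type="xsd:double"/>
  </message>
  <portType name="MolWeightPort">
    <operation name="getWeight">
      <input message="tns:WeightRequest"/>
      <output message="tns:WeightResponse"/>
    </operation>
  </portType>
</definitions>
"""


def list_operations(wsdl_text):
    """Map each operation name to its (input message, output message) pair."""
    root = ET.fromstring(wsdl_text)
    operations = {}
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        for op in port_type.findall("wsdl:operation", WSDL_NS):
            request = op.find("wsdl:input", WSDL_NS).get("message")
            response = op.find("wsdl:output", WSDL_NS).get("message")
            operations[op.get("name")] = (request, response)
    return operations


operations = list_operations(WSDL_DOC)
print(operations)
```

A client that has never seen the service can discover from this description alone which calls exist and what each expects, which is what lets tools generate client stubs automatically.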

WSDL Overview
  An XML-based Interface Definition Language.
    You can define the APIs for all of your services in WSDL.
  WSDL docs are broken into five major parts:
    Data definitions (in XML) for custom types
    Abstract message definitions (request, response)
    Organization of messages into ports and operations (classes and methods).
    Protocol bindings (to SOAP, for example)
    Service point locations (URLs)
  Some interesting features
    A single WSDL document can describe several versions of an interface.
    A single WSDL doc can describe several related services.

UDDI: Universal Description, Discovery, and Integration
  Provides ways for clients and services to interact with other services
  Uses XML
  Defines the means of access, e.g., URL, E-Mail
  Defines services hosted by an entity
  Business-oriented tags
  Uses SOAP for communicating

XML: eXtensible Markup Language
  Allows definitions of types of documents
  Tags are used to specify components of documents
  Allows specification of namespaces to differentiate between identical tag names
  Tag names do not provide semantics other than simple hierarchical relations

XML Overview
  A language for building languages
  Basic rules: be well formed and be valid
  Particular XML dialects are defined by XML schemas.
    XML itself is defined by its own schema.
  Extensible via namespaces
  Many non-Web services dialects
    RDF, SVG, GML, CML, XForms, XHTML
  Many basic tools available: parsers, XPath and XQuery for searching/querying, etc.

XML and Web services
  XML lends itself to distributed computing:
    It's just a data description.
    Platform, programming language independent
  Web Services Description Language (WSDL)
    Describes how to invoke a service
    Can bind to SOAP, other protocols for actual invocation
  Simple Object Access Protocol (SOAP)
    Wire protocol extension for conveying RPC calls
    Can be carried over HTTP, SMTP

OWL: Web Ontology Language
  Builds on RDF and RDFS and adds a means for richer descriptions of properties and classes
    Disjoint classes
    Cardinality of classes
    Characteristics of relations, like symmetry
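
Returning to the XML slides above, the point about namespaces differentiating identical tag names can be shown with a short Python sketch. The two vocabularies (`urn:example:chemistry` and `urn:example:publication`) and the record itself are made up for illustration; only the standard-library parser is assumed.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that both define a <title> tag.
DOC = """
<record xmlns:chem="urn:example:chemistry" xmlns:pub="urn:example:publication">
  <chem:title>Benzene</chem:title>
  <pub:title>Grids for Chemical Informatics</pub:title>
</record>
"""
NS = {"chem": "urn:example:chemistry", "pub": "urn:example:publication"}

root = ET.fromstring(DOC)

# The same local name resolves to different elements depending on the namespace.
compound_name = root.find("chem:title", NS).text
talk_title = root.find("pub:title", NS).text

# Internally each tag carries its full namespace, so the two never collide.
titles = [element.tag for element in root]
print(compound_name, talk_title, titles)
```

The prefixes are only local shorthand; what identifies each tag is the namespace URI, so two documents may use different prefixes for the same vocabulary and still interoperate.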

Standards for Web Services
  Business Process Execution Language for Web Services (BPEL4WS)
  Ontology Web Language Semantics (OWL-S)
  Web Service Modeling Ontology (WSMO)

Standards Setting Boards
  OASIS: Organization for the Advancement of Structured Information Standards
    ebXML: e-business XML
    UDDI: Universal Description, Discovery and Integration
  Global Grid Forum
    community of users, developers, and vendors leading the global standardization effort for grid computing

Standards Setting Boards
  W3C: World Wide Web Consortium
    OWL: Web Ontology Language
    RDF/RDFS: Resource Description Framework/Schema
    SOAP: Simple Object Access Protocol
    URI/URL/URN: Uniform Resource Identifier/Locator/Name
    WSDL: Web Services Description Language
    XML: eXtensible Markup Language

SWWS: Semantic Web-Enabled Web Services
  Main objectives:
    Provide a comprehensive Web Service description framework
    Define a Web Service discovery framework
    Provide a scalable Web Service mediation middleware
  A program of the European Commission to run 2002-2005

Web Services Integration Projects: Biosciences
  myGrid
  BIOPIPE
  BioMOBY

Web Services for Chemistry: Problems
  Performance and scalability
  Proprietary data
  Competition from high-performance desktop applications
  -- Geoff Hutchison, it's a puzzle blog, 2005-01-05
  ALSO:
    Lack of a substantial body of trustworthy Open Access databases
    Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)

Missing Ingredients in Chemistry
  Chemical communities to assemble Open Access databases
  Well-defined quality assurance procedures performed by distributed peer-review systems
  Software underlying the databases needs to be open source.

Chemistry Databases on the Web
  Marc Nicklaus lists 37 databases as of October 2001
    Must have structure searching and at least 100 molecules
  SoaringBear's List has 15 databases

Institutional Repositories
  NARSTO Quality Systems Science Center
    Pollutant species in the troposphere over North America
    Part of the Carbon Dioxide Information Analysis Center at ORNL
    NARSTO Data and Information Sharing Tool

Public Data Repositories
  Developmental Therapeutics Program/NCI
    Some assay data for download
    Structures for over 200,000 compounds
  Zinc and other screening databases
  NIST computational chemistry database
  Environmental fate and exposure databases

Other Public Repositories 1
  ChemExper Chemical Directory
    > 200,000 substances; > 10,000 IR spectra
  HIC-Up; Hetero-Compound Identification Centre Uppsala
    5384 substances as of 1/15/05
  Chemicals with Pharmaceutical Activity; a 3D Structural Database
    400 3D structures

Other Public Repositories 2
  41 data sets in 9 categories as of 8/18/05
  WebReactions

Other Public Repositories 3
  MolTable
  MatWeb Materials Property Data
  Spectral Database for Organic Compounds (SDBS)
    Over 32,000 compounds
    Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR
  NMRShiftDB (Christoph Steinbeck)
    14,753 structures as of 8/19/05
    Features peer-reviewed submission of data sets

Other Public Repositories: Commercial Teasers
  (Thermo Electron)
    Demo file of 575 spectra from 87,000 in the full database
  ChemACX
    30 of >350 suppliers' catalog data
  Sunset Molecular Discovery, LLC
    Wombat (World of Molecular BioAcTivity)
    117,007 entries with over 230,000 biological activities
    Wombat PK Database for Clinical Pharmacokinetics: 643 substances with 4668 measurements
    Three sample files from Wombat containing 341 Histamine-1 receptor antagonists

  A group of chemists, programmers, and informaticians working collaboratively on projects such as:
    Chemistry Development Kit (CDK)
    JChemPaint
    Jmol
    JUMBO
    NMRShiftDB
    Octet
    Open Babel
    QSAR
    World Wide Molecular Matrix (WWMM)

Indiana University Existing Projects
  System for the Integration of Bioinformatics Services (SIBIOS)
  PlatCom: A Platform for Computational Comparative Genomics
  Reciprocal Net

Indiana University Planned Projects
  Design of a Grid-based distributed data architecture
  Development of tools for HTS data analysis and virtual screening
  Database for quantum mechanical simulation data
  Chemical prototype projects
    Novel routes to enzymatic reaction mechanisms
    Mechanism-based drug design
    Data-inquiry-based development of new methods in natural product synthesis

Web Services Future
  Depends on
    Adoption of standards
    Incorporation of WS in current and newly developed applications
    Security, privacy, quality of data issues
    Development of WS tools and resources for eScience
