• IBM Consulting

    DBA Consulting can help you with IBM BI and web-related work. IBM Linux is also part of our portfolio.

  • Oracle Consulting

    For Oracle-related consulting, database work, support, and migration, call DBA Consulting.

  • Novell/RedHat Consulting

    For all Novell SUSE Linux and SAP on SUSE Linux questions related to OS and BI solutions, and of course also for the great Red Hat products such as Red Hat Enterprise Linux, JBoss middleware, and BI on Red Hat.

  • Microsoft Consulting

    Consulting services for Windows Server 2012 onwards, Windows 7 and higher clients, and Microsoft cloud services (Azure, Office 365, etc.).

  • Citrix Consulting

    Citrix VDI-in-a-Box, desktop virtualization, and Citrix NetScaler security.

  • Web Development

    Web development: static websites, CMS websites (Drupal 7/8, WordPress, Joomla), and responsive and adaptive websites.

27 March 2017

IBM Power9 CPU: A Game Changer


IBM Power 9 CPU

IBM is looking to take a bigger slice out of Intel’s lucrative server business with Power9, the company’s latest and greatest processor for the datacenter. Scheduled for initial release in 2017, the Power9 promises more cores and a hefty performance boost compared to its Power8 predecessor. The new chip was described at the Hot Chips event.


IBM Power9 CPU


The Power9 will end up in IBM’s own servers, and if the OpenPower gods are smiling, in servers built by other system vendors. Although none of these systems have been described in any detail, we already know that bushels of IBM Power9 chips will end up in Summit and Sierra, two 100-plus-petaflop supercomputers that the US Department of Energy will deploy in 2017-2018. In both cases, most of the FLOPS will be supplied by NVIDIA Volta GPUs, which will operate alongside IBM’s processors.

Power 9 Processor For The Cognitive Era


The Power9 will be offered in two flavors: one for single- or dual-socket servers for regular clusters, and the other for NUMA servers with four or more sockets, supporting much larger amounts of shared memory. IBM refers to the dual-socket version as the scale-out (SO) design and the multi-socketed version as the scale-up (SU) design. They basically correspond to the Xeon E5 (EP) and Xeon E7 (EX) processor lines, although Intel is apparently going to unify those lines post-Broadwell.

The SU Power9 is aimed at mission-critical enterprise work and other applications where large amounts of shared memory are desired. It has extra RAS features and buffered memory, and will tend to have fewer cores running at faster clock rates. As such, it carries on many of the traditions of the Power architecture through Power8. The SU Power9 is going to be released in 2018, well after the SO version hits the streets.

The SO Power9 is going after the Xeon dual-socket server market in a more straightforward manner. These chips will use direct attached memory (DDR4) with commodity DIMMs, instead of the buffered memory setup mentioned above. In general, this processor will adhere to commodity packaging so that Power9-based servers can utilize industry standard componentry. This is the platform destined for large cloud infrastructure and general enterprise computing, as well as HPC setups. It’s due for release sometime next year.



Distilling out the differences between the two varieties, here are the basics of the new Power9 (Power8 specs in parentheses for comparison):

  • 8 billion transistors (4.2 billion)
  • Up to 24 cores (Up to 12 cores)
  • Manufactured using 14nm FinFET (22nm SOI)
  • Supports PCIe Gen4 (PCIe Gen3)
  • 120 MB shared L3 cache (96 MB shared L3 cache)
  • 4-way and 8-way simultaneous multithreading (8-way simultaneous multithreading)
  • Memory bandwidth of 120 or 230 GB/sec (230 GB/sec)

From the looks of things, IBM spent most of the extra transistor budget it got from the 14nm shrink on extra cores and a little bit more L3 cache. New on-chip data links were also added, with an aggregate bandwidth of 7 TB/sec, which is used to feed each core at the rate of 256 GB/sec in a 12-core configuration. The bandwidth fans out in the other direction to supply data to memory, additional Power9 sockets, PCIe devices, and accelerators. Speaking of which, there is special support for NVIDIA GPUs in the form of NVLink 2.0 support, which promises much faster communication speeds than vanilla PCIe. An enhanced CAPI interface is also supported for accelerators that support that standard.



The accelerator story is one of the key themes of the Power9, which IBM is touting as “the premier platform for accelerated computing.” In that sense, IBM is taking a different tack than Intel, which is bringing accelerator technology on-chip and making discrete products out of it, as it has done with Xeon Phi and is in the process of doing with Altera FPGAs. By contrast, IBM has settled on the host-coprocessor model of acceleration, which offloads special-purpose processing to external devices. This has the advantage of flexibility; the Power9 can connect to virtually any type of accelerator or special-purpose coprocessor as long as it speaks PCIe, CAPI or NVLink.

Understanding the IBM Power Systems Advantage


Thus the Power9 sticks with an essentially general-purpose design. As a standalone processor it is designed for mainstream datacenter applications (assuming that phrase has meaning anymore). From the perspective of floating point performance, it is about 50 percent faster than Power8, but that doesn’t make it an HPC chip, and in fact, even a mid-range Broadwell Xeon (E5-2600 V4) would likely outrun a high-end Power9 processor on Linpack. Which is fine. That’s what the GPUs and NVLink support are for.

IBM Power Systems Update 1Q17


If there is any deviation from the general-purpose theme, it’s in the direction of data-intensive workloads, especially analytics, business intelligence, and the broad category of “cognitive computing” that IBM is so fond of talking about. Here the Power processors have had something of a historical advantage in that they offered much higher memory bandwidth than their Xeon counterparts, in fact, about two to four times higher. The SO Power9 supports 120 GB/sec of memory bandwidth; the SU version, 230 GB/sec. The Power9 also comes with a very large (120 MB) L3 cache, which is built with eDRAM technology that supports speeds of up to 256 GB/sec. All of which serves to greatly lessen the memory bottleneck for data-intensive applications.

IBM Power Systems Announcement Update


According to IBM, Power9 was about 2.2 times faster for graph analytics workloads and about 1.9 times faster for business intelligence workloads. That’s on a per-socket basis, comparing a 12-core Power9 with a 12-core Power8 at the same 4 GHz clock frequency. That is a pretty impressive performance bump from one generation to the next, although it should be pointed out that IBM offered no comparisons against the latest Broadwell Xeon chips.


The official Power roadmap from IBM does not say much in terms of timing, but thanks to the “Summit” and “Sierra” supercomputers that IBM, Nvidia, and Mellanox Technologies are building for the U.S. Department of Energy, we knew Power9 was coming out in late 2017. Here is the official Power processor roadmap from late last year:



And here is the updated one from the OpenPower Foundation that shows how compute and networking technologies will be aligned:




IBM revealed that the Power9 SO chip will be etched in the 14 nanometer process from Globalfoundries and will have 24 cores, which is a big leap for Big Blue.

That doubling of cores in the Power9 SO is a big jump for IBM, but not unprecedented. IBM made a big jump from two cores in the Power6 and Power6+ generations to eight cores with the Power7 and Power7+ generations, and we have always thought that IBM wanted to do a process shrink and get to four cores on the Power6+ and that something went wrong. IBM ended up double-stuffing processor sockets with the Power6+, which gave it an effective four-core chip. It did the same thing with certain Power5+ machines and Power7+ machines, too.

The other big change with the Power9 SO chip is that IBM is going to allow the memory controllers on the die to reach out directly and control external DDR4 main memory rather than have to work through the “Centaur” memory buffer chip that is used with the Power8 chips. This memory buffering has allowed for very high memory bandwidth and a large number of memory slots as well as an L4 cache for the processors, but it is a hassle for entry systems designs and overkill for machines with one or two sockets. Hence, it is being dropped.

The Power9 SU processor, which will be used in IBM’s own high-end NUMA machines with four or more sockets, will be sticking with the buffered memory. IBM has not revealed what the core count will be on the Power9 SU chip, but when we suggested that, based on the performance needs and thermal profiles of big iron, this chip would probably have fewer cores, possibly more cache, and higher clock speeds, McCredie said these were all reasonable and good guesses without confirming anything about future products.

LINUX on Power


The Power9 chips will sport an enhanced NVLink interconnect (which we think will have more bandwidth and lower latency but not more aggregate ports on the CPUs or GPUs than is available on the Power8), and we think it is possible that the Power9 SU will not have NVLink ports at all. (Although we could make a case for having a big NUMA system with lots and lots of GPUs hanging off of it using lots of NVLink ports instead of using an InfiniBand interconnect to link multiple nodes in a cluster together.)

The Power9 chip with SMT8 cores is aimed at analytics workloads that are wrestling with lots of data, in terms of both capacity and throughput. The 24-core variant of the Power9 with SMT8 has 512 KB of L2 cache per core, and 120 MB of L3 cache is shared across the die in 10 MB segments, one per pair of cores. The on-chip switch fabric can move data in and out of the L3 cache at 256 GB/sec. Add in the various interconnects for the memory controllers, the PCI-Express 4.0 controllers, the “Bluelink” 25 Gb/sec ports that are used to attach accelerators to the processors as well as to underpin the NVLink 2.0 protocol that will be added to next year’s “Volta” GV100 GPUs from Nvidia, and IBM’s own remote SMP links for creating NUMA clusters with more than four sockets, and you have an on-chip fabric with over 7 TB/sec of aggregate bandwidth.




The Power9 chips will have 48 lanes of PCI-Express 4.0 peripheral I/O per socket, for an aggregate of 192 GB/sec of duplex bandwidth. In addition to this, the chip will support 48 lanes of 25 Gb/sec Bluelink bandwidth for other connectivity, with an aggregate bandwidth of 300 GB/sec. On the Power9 SU chips, the 48 lanes of 25 Gb/sec Bluelink will be used for remote SMP links between quad-socket nodes to make a 16-socket machine, and the 48 lanes of PCI-Express 4.0 will be used for PCI-Express peripherals and CAPI 2.0 accelerators. The Power9 chip has integrated 16 Gb/sec SMP links for gluelessly making the four-socket modules. In addition to the CAPI 2.0 coherent links running atop PCI-Express 4.0, there is a further enhanced CAPI protocol that runs atop the 25 Gb/sec Bluelink ports; it is much more streamlined, and we think it is akin to something like NVM-Express for flash running over PCI-Express in that it eliminates a lot of protocol overhead from the PCI-Express bus. But that is just a hunch. It doesn’t look like the big bad boxes will be able to support these new CAPI or NVLink ports, by the way, since the Bluelink ports are eaten by NUMA expansion.
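
As a sanity check on those per-socket figures, here is a back-of-the-envelope calculation. It assumes roughly 2 GB/sec of usable bandwidth per PCI-Express 4.0 lane per direction and counts both directions for the duplex numbers:

```python
# Rough check of the per-socket I/O bandwidth quoted above (approximate figures).
pcie4_gbytes_per_lane_per_dir = 2                       # ~16 GT/s with 128b/130b encoding
pcie4_duplex = 48 * pcie4_gbytes_per_lane_per_dir * 2   # 48 lanes, both directions

bluelink_gbits_per_lane = 25
bluelink_duplex = 48 * bluelink_gbits_per_lane / 8 * 2  # 48 lanes, both directions, in GB/sec

print(pcie4_duplex, bluelink_duplex)                    # 192 300.0  (GB/sec)
```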

More Information:

http://marketrealist.com/2016/09/intels-server-chips-see-first-signs-competition-idf-2016/

https://www.nextplatform.com/2016/08/24/big-blue-aims-sky-power9/

http://www-03.ibm.com/systems/power/

http://www-03.ibm.com/systems/power/hardware/

https://www.nextplatform.com/2016/04/07/ibm-unfolds-power-chip-roadmap-past-2020/

https://www.nextplatform.com/2015/08/10/ibm-roadmap-extends-power-chips-to-2020-and-beyond/

https://en.wikipedia.org/wiki/IBM_POWER_microprocessors#POWER9

21 February 2017

Why Cloudera's Hadoop and Oracle?


Oracle 12c & Hadoop: Optimal Store and Process of Big Data

This post shows how to use Hadoop ecosystem tools to extract data from an Oracle 12c database, use the Hadoop framework to process and transform that data, and then load the data processed within Hadoop back into an Oracle 12c database.

Oracle big data appliance and solutions



This blog covers basic concepts:

  • What is Big Data? Big Data is data that a single machine cannot store and process. The data comes in different formats (structured and unstructured), from different sources, and grows at great velocity.
  • What is Apache Hadoop? It is a framework that allows distributed processing of large data sets across many (potentially thousands of) machines. The concepts behind Hadoop were first introduced by Google. The Hadoop framework consists of HDFS and MapReduce.
  • What is HDFS? HDFS (Hadoop Distributed File System) is the Hadoop file system that enables storing large data sets across multiple machines.
  • What is MapReduce? The data processing component of the Hadoop framework, consisting of a Map phase and a Reduce phase.
  • What is Apache Sqoop? Apache Sqoop(TM) is a tool to transfer bulk data between Apache Hadoop and structured data stores such as relational databases. It is part of the Hadoop ecosystem.
  • What is Apache Hive? Hive is a tool to query and manage large datasets stored in Hadoop HDFS. It is also part of the Hadoop ecosystem.
  • Where does Hadoop fit in? We will use Apache Sqoop to extract data from an Oracle 12c database and store it in the Hadoop Distributed File System (HDFS). We will then use Apache Hive to transform the data and process it using MapReduce (Java programs could do the same). Finally, Apache Sqoop will be used to load the data already processed within Hadoop back into an Oracle 12c database (a minimal sketch of this flow appears below). The following image describes where Hadoop fits in the process. This scenario represents a practical solution for processing big data coming from an Oracle database as a source; the only condition is that the data source must be structured. Note that Hadoop can also process unstructured data such as videos, log files, etc.
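
As a rough illustration of that extract-transform-load flow, the sketch below drives Sqoop and Hive from Python. The connection string, credentials, table names, HDFS directories, and the transform_sales.hql script are all placeholders, and the exact Sqoop and Hive options will vary with your cluster setup:

```python
import subprocess

def run(cmd):
    """Run a CLI command and fail loudly if it returns a non-zero exit code."""
    subprocess.run(cmd, check=True)

# 1. Extract: Sqoop pulls the Oracle 12c table into HDFS.
run(["sqoop", "import",
     "--connect", "jdbc:oracle:thin:@//dbhost:1521/orcl",
     "--username", "etl_user", "--password-file", "/user/etl/.oracle.pwd",
     "--table", "SALES", "--target-dir", "/staging/sales"])

# 2. Transform: a Hive script processes the staged data (runs as MapReduce jobs).
run(["hive", "-f", "transform_sales.hql"])

# 3. Load: Sqoop exports the processed result back into Oracle 12c.
run(["sqoop", "export",
     "--connect", "jdbc:oracle:thin:@//dbhost:1521/orcl",
     "--username", "etl_user", "--password-file", "/user/etl/.oracle.pwd",
     "--table", "SALES_SUMMARY", "--export-dir", "/warehouse/sales_summary"])
```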



Why Cloudera + Oracle?
For over 38 years Oracle has been the market leader in RDBMS database systems and a major influencer of enterprise software and hardware technology. Besides leading the industry in database solutions, Oracle also develops tools for software development, enterprise resource planning, customer relationship management, supply chain management, business intelligence, and data warehousing. Cloudera has a long-standing relationship with Oracle and has worked closely with them to develop enterprise-class solutions that enable enterprise customers to quickly manage big data workloads.
As the leader in Apache Hadoop-based data platforms, Cloudera has the enterprise quality and expertise that make them the right choice to work with on Oracle Big Data Appliance.
— Andy Mendelson, Senior Vice President, Oracle Server Technologies
Joint Solution Overview
Oracle Big Data Appliance
The Oracle Big Data Appliance is an engineered system optimized for acquiring, organizing, and loading unstructured data into Oracle Database 12c. The Oracle Big Data Appliance includes CDH, Oracle NoSQL Database, Oracle Data Integrator with Application Adapter for Apache Hadoop, Oracle Loader for Hadoop, an open source distribution of R, Oracle Linux, and Oracle Java HotSpot Virtual Machine.

Extending Hortonworks with Oracle's Big Data Platform



Oracle Big Data Discovery
Oracle Big Data Discovery is the visual face of Hadoop that allows anyone to find, explore, transform, and analyze data in Hadoop. Discover new insights, then share results with big data project teams and business stakeholders.

Oracle Big Data SQL Part 1-4


Oracle NoSQL Database
Oracle NoSQL Database Enterprise Edition is a distributed, highly scalable, key-value database. Unlike competitive solutions, Oracle NoSQL Database is easy-to-install, configure and manage, supports a broad set of workloads, and delivers enterprise-class reliability backed by enterprise-class Oracle support.

Oracle Data Integrator Enterprise Edition
Oracle Data Integrator Enterprise Edition is a comprehensive data integration platform that covers all data integration requirements: from high-volume, high-performance batch loads to event-driven, trickle-feed integration processes. Oracle Data Integrator Enterprise Edition (ODI EE) provides native Cloudera integration, allowing the use of the Cloudera Hadoop cluster as the transformation engine for all data transformation needs. ODI EE utilizes Cloudera’s foundation of Impala, Hive, HBase, Sqoop, Pig, and Spark, as well as many others, to provide best-in-class performance and value. Oracle Data Integrator Enterprise Edition enhances productivity and provides a simple user interface for creating high-performance processes to load and transform data to and from Cloudera data stores.

Oracle Loader for Hadoop
Oracle Loader for Hadoop enables customers to use Hadoop MapReduce processing to create optimized data sets for efficient loading and analysis in Oracle Database 12c. Unlike other Hadoop loaders, it generates Oracle internal formats to load data faster and use less database system resources.

How the Oracle and Hortonworks Handle Petabytes of Data


Oracle R Enterprise
Oracle R Enterprise integrates the open-source statistical environment R with Oracle Database 12c. Analysts and statisticians can run existing R applications and use the R client directly against data stored in Oracle Database 12c, vastly increasing scalability, performance and security. The combination of Oracle Database 12c and R delivers an enterprise-ready deeply-integrated environment for advanced analytics.

Discover Data Insights and Build Rich Analytics with Oracle BI Cloud Service


Oracle NoSQL Database, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, and Oracle R Enterprise will also be available as standalone software products, independent of the Oracle Big Data Appliance.

Learn More
Download details about the Oracle Big Data Appliance
Download the solution brief: Driving Innovation in Mobile Devices with Cloudera and Oracle

Oracle is the leader in developing software to address enterprise data management. Typically known as a database leader, Oracle also develops and builds tools for software development, enterprise resource planning, customer relationship management, supply chain management, business intelligence, and data warehousing. Cloudera has a long-standing relationship with Oracle and has worked closely with them to develop enterprise-class solutions that enable end customers to more quickly get up and running with big data.

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle Big Data Discovery



The Oracle Big Data SQL product will be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies (HDFS, Hive, and schema-on-read databases) provide an extension to the more structured relational Oracle data warehouses, making it possible to store and analyse much larger sets of data with much more diverse data types and structures. The issue that customers face when trying to implement this architecture is that Hadoop is a bit of a “wild west” in terms of data access methods, security, and metadata, making it difficult for enterprises to come up with a consistent, over-arching data strategy that works for both types of data store.

Bringing Self Service Data Preparation to the Cloud; Oracle Big Data Preparation Cloud Services


Oracle Big Data SQL attempts to address this issue by providing a SQL access layer over Hadoop, managed by the Oracle database and integrated with the regular SQL engine within the database. Where it differs from SQL-on-Hadoop technologies such as Apache Hive and Cloudera Impala is that there’s a single unified data dictionary, a single Oracle SQL dialect, and the full management capabilities of the Oracle database over both sources, giving you the ability to define access controls over both sources and use full Oracle SQL (including analytic functions, complex joins and the like) without having to drop down into HiveQL or other Hadoop SQL dialects. Those of you who follow the blog or work with Oracle’s big data connector products probably know of a couple of current technologies that sound like this: Oracle Loader for Hadoop (OLH) is a bulk unloader for Hadoop that copies Hive or HDFS data into an Oracle database, typically faster than a tool like Sqoop, whilst Oracle Direct Connector for HDFS (ODCH) gives the database the ability to define external tables over Hive or HDFS data and then query that data using regular Oracle SQL.

Storytelling with Oracle Analytics Cloud


Where ODCH falls short is that it treats the HDFS and Hive data as a single stream, making it easy to read once but, like regular external tables, slow to access frequently as there’s no ability to define indexes over the Hadoop data; OLH is also good but you can only use it to bulk-load data into Oracle, you can’t use it to query data in-place. Oracle Big Data SQL uses an approach similar to ODCH but crucially, it uses some Exadata concepts to move processing down to the Hadoop cluster, just as Exadata moves processing down to the Exadata storage cells (so much so that the project was called “Project Exadoop” internally within Oracle up to the launch) - but also meaning that it's Exadata only, and not available for Oracle Databases running on non-Exadata hardware.

As explained by the launch blog post by Oracle’s Dan McClary https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one  , Oracle Big Data SQL includes components that install on the Hadoop cluster nodes that provide the same “SmartScan” functionality that Exadata uses to reduce network traffic between storage servers and compute servers. In the case of Big Data SQL, this SmartScan functionality retrieves just the columns of data requested in the query (a process referred to as “column projection”), and also only sends back those rows that are requested by the query predicate.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

 Yes, but what Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That both causes poor parallelism and removes the value of schema-on-read.

  The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only join it with internal data, providing SQL over both Hadoop and the relational database.
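
To make the idea concrete, here is a hedged sketch of what querying such an external table might look like from Python with cx_Oracle. WEB_LOGS_HIVE is a hypothetical external table defined over a Hive table via Big Data SQL, CUSTOMERS is an ordinary Oracle table, and the connection details are placeholders:

```python
import cx_Oracle  # assumes the Oracle client libraries are installed

conn = cx_Oracle.connect("app_user", "app_pwd", "dbhost/orcl")  # placeholder credentials
cur = conn.cursor()

# One SQL dialect, one data dictionary: join Hadoop-resident data (external table)
# with relational data (regular table) in a single statement.
cur.execute("""
    SELECT c.cust_id, c.cust_name, COUNT(*) AS page_views
    FROM   customers c
    JOIN   web_logs_hive w ON w.cust_id = c.cust_id
    GROUP  BY c.cust_id, c.cust_name
    ORDER  BY page_views DESC
""")
for row in cur.fetchmany(10):
    print(row)
```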

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data Analyics using Oracle Advanced Analytics12c and BigDataSQL


Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Oracle Exadata X6: Technical Deep Dive - Architecture and Internals


Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct-path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement that impedes performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.
Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field for every document, we're able to cut down hundreds of TB to hundreds of GB.
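
The snippet below is only a toy model of that idea, not Oracle's implementation: it shows how evaluating the predicate and projecting two fields where the data lives shrinks what has to travel over the network. The field names mirror the example above:

```python
import json

def smart_scan(json_lines, fields=("email", "status"), state="TX"):
    """Toy illustration of predicate filtering plus column projection."""
    for line in json_lines:
        doc = json.loads(line)
        if doc.get("state") == state:              # predicate applied where the data lives
            yield {f: doc.get(f) for f in fields}  # only the requested columns leave the node

docs = [
    '{"state": "TX", "email": "a@example.com", "status": "gold", "history": "..."}',
    '{"state": "CA", "email": "b@example.com", "status": "silver", "history": "..."}',
]
print(list(smart_scan(docs)))   # only the TX row, and only email/status, survive
```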

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-Time and Predictive Analytics


The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

Data Reduction in the Database:

Oracle In-database MapReduce in 12c (big data)
There is some interest from the field about what the in-database MapReduce option is, and why and how it differs from a Hadoop solution.
I thought I would share my thoughts on it.

 In-database MapReduce is an umbrella term that includes two features:
  • "SQL MapReduce", or SQL pattern matching.
  • An in-database container for Hadoop, to be released in a future release.




"SQL MapReduce" : Oracle database 12c introduced a new feature called PATTERN MATCHING using "MATCH_RECOGNIZE" clause in SQL. This is one of the latest ANSI SQL standards proposed and implemented by Oracle. The new sql syntax helps to intuitively solve complex queries that are not easy to implement using 11g analytical functions alone. Some of the use cases are fraud detection, gene sequencing, time series calculation, stock ticker pattern matching . Etc.  I found most of the use case for Hadoop can be done using match_recognize in database on structured data. Since this is just a SQL enhancement , it is there in both Enterprise & Standard Edition database.

Big Data gets Real time with Oracle Fast Data


"In database container for Hadoop  (beta)" : if you have your development team more skilled at Hadoop and not SQL , or want to implement some complex pre-packaged Hadoop algorithms, you could use oracle container for Hadoop (beta). It is a Hadoop prototype APIs  which run within the java virtual machine in the database.

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)


It implements the Hadoop Java APIs and interfaces with the database using parallel table functions to read data in parallel. One interesting fact about parallel table functions is that they can run in parallel across a RAC cluster and can also route data to specific parallel processes. This functionality is key to making Hadoop scale across clusters, and it has existed in the database for over 15 years. The advantages of in-database Hadoop are:


  • No need to move data out of the database to run MapReduce functions, saving time and resources.
  • More real-time data can be used.
  • Fewer redundant copies of data, and hence better security and less disk space used.
  • The servers can be used not just for MapReduce work but also to run the database, making for better resource utilization.
  • The output of the MapReduce job is immediately available to analytic tools, and this functionality can be combined with database features such as the in-memory option to get near-real-time analysis of big data.
  • Database features for security, backup, auditing, and performance can be combined with the MapReduce API.
  • The ability to stream the output of one parallel table function as input to the next parallel table function has the advantage of not needing to maintain any intermediate stages.
  • Features like graph, text, spatial, and semantic analysis within Oracle Database can be used for further analysis.


In addition to this, Oracle 12c will support schema-less access using JSON. That will help NoSQL-style big data use cases run on data within the Oracle database as well.
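
A small, hedged sketch of that schema-less style, again via cx_Oracle. The ORDERS_JSON table (a text column named DOC with an IS JSON check constraint), the document shape, and the credentials are all assumptions for illustration:

```python
import json
import cx_Oracle

conn = cx_Oracle.connect("app_user", "app_pwd", "dbhost/orcl")  # placeholder credentials
cur = conn.cursor()

# Store a document without declaring its structure up front...
order = {"cust": "C42", "items": [{"sku": "ABC123", "qty": 2}], "channel": "web"}
cur.execute("INSERT INTO orders_json (doc) VALUES (:doc)", doc=json.dumps(order))
conn.commit()

# ...then query individual attributes with SQL/JSON functions such as JSON_VALUE.
cur.execute("""
    SELECT JSON_VALUE(doc, '$.cust')    AS customer,
           JSON_VALUE(doc, '$.channel') AS channel
    FROM   orders_json
""")
print(cur.fetchall())
```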

Conclusion.
Having these features will help solve MapReduce challenges when the data is mostly within the database, reducing data movement and making better use of available resources.
If most of your data is outside the database, then the SQL connectors for Hadoop and Oracle Loader for Hadoop can be used.

More Information:

https://www.oracle.com/big-data/index.html

https://www.cloudera.com/partners/solutions/oracle.html

http://www.datasciencecentral.com/video/video/listFeatured

https://cloud.oracle.com/bigdata

https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one

http://www.oracle.com/technetwork/articles/bigdata/hadoop-optimal-store-big-data-2188939.html

https://www.oracle.com/engineered-systems/big-data-appliance/index.html

https://www.oracle.com/database/big-data-sql/index.html

https://www.oracle.com/big-data/big-data-discovery/index.html

http://www.dwbisummit.com/?gclid=COvD85_joNICFeQp0wodyvIMsA

http://www.toadworld.com/platforms/oracle/w/wiki/10911.loading-data-from-hdfs-into-oracle-database-12c-with-oracle-loader-for-hadoop-3-0-0-and-cdh4-6

http://www.oracle.com/technetwork/database/database-technologies/bdc/hadoop-loader/overview/index.html

http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-hadoop-oracle-194542.pdf

https://blogs.oracle.com/bigdataconnectors/

http://www.oracle.com/technetwork/database/bigdata-appliance/overview/bigdatasql-datasheet-2934203.pdf

25 January 2017

IBM Predictive Analytics

About Big Data Analytics

What's New in IBM Predictive Analytics

The 5 V’s of Big Data

Too often in the hype and excitement around Big Data, the conversation gets complicated very quickly. Data scientists and technical experts bandy around terms like Hadoop, Pig, Mahout, and Sqoop, making us wonder if we’re talking about information architecture or a Dr. Seuss book. Business executives who want to leverage the value of Big Data analytics in their organisation can get lost amidst this highly-technical and rapidly-emerging ecosystem.

Overview - IBM Big Data Platform


In an effort to simplify Big Data, many experts have referenced the “3 V’s”: Volume, Velocity, and Variety. In other words, is information being generated at a high volume (e.g. terabytes per day), with a rapid rate of change, encompassing a broad range of sources including both structured and unstructured data? If the answer is yes then it falls into the Big Data category along with sensor data from the “internet of things”, log files, and social media streams. The ability to understand and manage these sources, and then integrate them into the larger Business Intelligence ecosystem can provide previously unknown insights from data and this understanding leads to the “4th V” of Big Data – Value.



There is a vast opportunity offered by Big Data technologies to discover new insights that drive significant business value. Industries are seeing data as a market differentiator and have started reinventing themselves as “data companies”, as they realise that information has become their biggest asset. This trend is prevalent in industries such as telecommunications, internet search firms, marketing firms, etc., who see their data as a key driver for monetisation and growth. Insights such as footfall traffic patterns from mobile devices have been used to assist city planners in designing more efficient traffic flows. Customer sentiment analysis through social media and call logs has given new insights into customer satisfaction. Network performance patterns have been analysed to discover new ways to drive efficiencies. Customer usage patterns based on web click-stream data have driven innovation for new products and services to increase revenue. The list goes on.

IBM predictive analytics with Apache Spark: Coding optional, possibilities endless


Key to success in any Big Data analytics initiative is to first identify the business needs and opportunities, and then select the proper fit-for-purpose platform. With the array of new Big Data technologies emerging at a rapid pace, many technologists are eager to be the first to test the latest Dr. Seuss-termed platform. But each technology has a unique specialisation, and might not be aligned to the business priorities. In fact, some identified use cases from the business might be best suited by existing technologies such as a data warehouse while others require a combination of existing technologies and new Big Data systems.



With this integration of disparate data systems comes the 5th V – Veracity, i.e. the correctness and accuracy of information.



Behind any information management practice lies the core doctrines of Data Quality, Data Governance, and Metadata Management, along with considerations for Privacy and Legal concerns.

Big Data & Analytics Architecture


Big Data needs to be integrated into the entire information landscape, not seen as a stand-alone effort or a stealth project done by a handful of Big Data experts.



Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types such as structured/unstructured and streaming/batch, and different sizes from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. It has one or more of the following characteristics: high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at a very large scale.

What’s new in predictive analytics: IBM SPSS and IBM decision optimization


Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independently or together with their existing enterprise data to gain new insights, resulting in significantly better and faster decisions.


  • Advanced analytics enables you to find deeper insights and drive real-time actions.
  • With advanced analytics capabilities, you can understand what happened, what will happen and what should happen.
  • Easily engage both business and technical users to uncover opportunities and address big issues, and operationalize analytics into business processes.

Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical platform

Prescriptive analytics

What if you could make strategic decisions based not only on what has occurred or is likely to occur in the future, but through targeted recommendations based on why and how things happen? Prescriptive analytics technology recommends actions based on desired outcomes, taking into account specific scenarios, resources and knowledge of past and current events. This insight can help your organization make better decisions and have greater control of business outcomes.

Prescriptive analytics is the next step on the path to insight-based actions. It creates value through synergy with predictive analytics, which analyzes data to predict a future outcome. Prescriptive analytics takes that insight to the next level by suggesting the optimal way to handle that future situation. Organizations that can act fast in dynamic conditions and make superior decisions in uncertain environments gain a strong competitive advantage.

IBM prescriptive analytics solutions provide organizations in commerce, financial services, healthcare, government and other highly data-intensive industries with a way to analyze data and transform it into recommended actions almost instantaneously. These solutions combine predictive models, deployment options, localized rules, scoring and optimization techniques to form a powerful foundation for decision management. For example, you can:

  • Automate complex decisions and trade-offs to better manage limited resources.
  • Take advantage of a future opportunity or mitigate a future risk.
  • Proactively update recommendations based on changing events.
  • Meet operational goals, increase customer loyalty, prevent threats and fraud, and optimize business processes.

The information management big data and analytics capabilities include:

Data Management & Warehouse: Gain industry-leading database performance across multiple workloads while lowering administration, storage, development and server costs; Realize extreme speed with capabilities optimized for analytics workloads such as deep analytics, and benefit from workload-optimized systems that can be up and running in hours.

Hadoop System: Bring the power of Apache Hadoop to the enterprise with application accelerators, analytics, visualization, development tools, performance and security features.

Stream Computing: Efficiently deliver real-time analytic processing on constantly changing data in motion and enable descriptive and predictive analytics to support real-time decisions. Capture and analyze all data, all the time, just in time. With stream computing, store less, analyze more and make better decisions faster.

Content Management: Enable comprehensive content lifecycle and document management with cost-effective control of existing and new types of content with scale, security and stability.

Information Integration & Governance: Build confidence in big data with the ability to integrate, understand, manage and govern data appropriately across its lifecycle.

From insight to action: Predictive and prescriptive analytics

The 5 game-changing big data use cases

While much of the big data activity in the market up to now has been experimenting and learning about big data technologies, IBM has been focused on also helping organizations understand what problems big data can address.

We’ve identified the top 5 high value use cases that can be your first step into big data:

Big Data Exploration
Find, visualize, understand all big data to improve decision making. Big data exploration addresses the challenge that every large organization faces: information is stored in many different systems and silos and people need access to that data to do their day-to-day work and make important decisions.

What is the Big Data Exploration use case?

Big data exploration addresses the challenge faced by every large organization: business information is spread across multiple systems and silos and people need access to that data to meet their job requirements and make important decisions. Big Data Exploration enables you to explore and mine big data to find, visualize, and understand all your data to improve decision making. By creating a unified view of information across all data sources - both inside and outside of your organization - you gain enhanced value and new insights.

Ask yourself:

  • Are you struggling to manage and extract value from the growing volume and variety of data and need to unify information across federated sources?
  • Are you unable to relate “raw” data collected from system logs, sensors, or click streams with customer and line-of-business data managed in your enterprise systems?
  • Do you risk exposing unsecured personal information and/or privileged data due to a lack of information awareness?
If you answered yes to any of the above questions, the big data exploration use case is the best starting point for your big data journey.


Introduction to apache spark v3



Enhanced 360º View of the Customer
Extend existing customer views by incorporating additional internal and external information sources. Gain a full understanding of customers—what makes them tick, why they buy, how they prefer to shop, why they switch, what they’ll buy next, and what factors lead them to recommend a company to others.

IBM Watson Analytics Presentation


What is the Enhanced 360º View of the Customer big data use case?

With the onset of the digital revolution, the touch points between an organization and its customers have increased many times over; organizations now require specialized solutions to effectively manage these connections. An enhanced 360-degree view of the customer is a holistic approach that takes into account all available and meaningful information about the customer to drive better engagement, more revenue and long-term loyalty. It combines data exploration, data governance, data access, data integration and analytics in a solution that harnesses the volume, velocity and variety of big data. IBM provides several important capabilities to help you make effective use of big data and improve the customer experience.

Ask yourself:

  • Do you need a deeper understanding of customer sentiment from both internal and external sources?
  • Do you want to increase customer loyalty and satisfaction by understanding what meaningful actions are needed?
  • Are you challenged to get the right information to the right people to provide customers what they need to solve problems, cross-sell, and up-sell?


If you answered yes to any of the above questions, the enhanced 360 view of the customer use case is the best starting point for your big data journey.

With Enhanced 360º View of the Customer, you can:

  • Improve campaign effectiveness
  • Accurate, targeted cross-sell / up-sell
  • Retain your most profitable customers
  • Deliver superior customer experience at the point of service

Security Intelligence Extension

Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types (e.g. social media, emails, sensors, Telco) and sources of under-leveraged data to significantly improve intelligence, security and law enforcement insight.

What is the Security Intelligence big data use case?

The growing number of high-tech crimes - cyber-based terrorism, espionage, computer intrusions, and major cyber fraud - poses a real threat to every individual and organization. To meet the security challenge, businesses need to augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new data types (e.g. social media, emails, sensors, Telco) and sources of under-leveraged data. Analyzing data in-motion and at rest can help find new associations or uncover patterns and facts to significantly improve intelligence, security and law enforcement insight.

Ask yourself:

  • Do you need to enrich your security or intelligence system with underleveraged or unused data sources (video, audio, smart devices, network, Telco, social media)?
  • Are you able to address the need for sub-second detection, identification, and resolution of physical or cyber threats?
  • Are you able to follow activities of criminals, terrorists, or persons in a blacklist and detect criminal activity before it occurs?

If you answered yes to any of the above questions, the security intelligence extension use case is the best starting point for your big data journey.
There are three main areas for Security Intelligence Extension:

Enhanced intelligence and surveillance insight. Analyzing data in-motion and at rest can help find new associations or uncover patterns and facts. This type of real or near real-time insight can be invaluable and even life-saving.

Real-time cyber attack prediction & mitigation. So much of our lives is spent online, and the growing number of high-tech crimes, including cyber-based terrorism, espionage, computer intrusions, and major cyber fraud, poses a real threat to potentially everyone. By analyzing network traffic, organizations can discover new threats early and react in real time.

Crime prediction & protection. The ability to analyze internet (e.g. email, VOIP), smart device (e.g. location, call detail records) and social media data can help law enforcement organizations better detect criminal threats and gather criminal evidence. Instead of waiting for a crime to be committed, they can prevent crimes from happening in the first place and proactively apprehend criminals.

With Security Intelligence Extension, organizations can:

  • Sift through massive amounts of data - both inside and outside your organization - to uncover hidden relationships, detect patterns, and stamp out security threats
  • Uncover fraud by correlating real-time and historical account activity to uncover abnormal user behavior and suspicious transactions
  • Examine new sources and varieties of data for evidence of criminal activity, such as internet, mobile devices, transactions, email, and social media

Operations Analysis
Analyze a variety of machine and operational data for improved business results. The abundance and growth of machine data, which can include anything from IT machines to sensors, meters and GPS devices, requires complex analysis and correlation across different types of data sets. By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.

What is the Operations Analysis big data use case?

Operations Analysis focuses on analyzing machine data, which can include anything from IT machines to sensors, meters and GPS devices. It’s growing at exponential rates and comes in large volumes and a variety of formats, including in-motion, or streaming data. Leveraging machine data requires complex analysis and correlation across different types of data sets. By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.

Ask yourself:

  • Do you have real-time visibility into your business operations including customer experience and behavior?
  • Are you able to analyze all your machine data and combine it with enterprise data to provide a full view of business operations?
  • Are you proactively monitoring end-to-end infrastructure to avoid problems?


If you answered yes to any of the above questions, the Operations Analysis use case is the best starting point for your big data journey.

Through Operations Analysis, organizations can:

  • Gain real-time visibility into operations, customer experience and behavior
  • Analyze massive volumes of machine data with sub-second latency to identify events of interest as they occur
  • Apply predictive models and rules to identify potential anomalies or opportunities
  • Optimize service levels in real-time by combining operational and enterprise data

Data Warehouse Modernization
Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Offload infrequently accessed or aged data from warehouse and application databases using information integration software and tools.

IBM Big Data Analytics Concepts and Use Cases


What is the Data Warehouse Modernization big data use case?

Data Warehouse Modernization (formerly known as Data Warehouse Augmentation) is about building on an existing data warehouse infrastructure, leveraging big data technologies to ‘augment’ its capabilities. There are three key types of Data Warehouse Modernizations:

  • Pre-Processing - using big data capabilities as a “landing zone” before determining what data should be moved to the data warehouse
  • Offloading - moving infrequently accessed data from data warehouses into enterprise-grade Hadoop
  • Exploration - using big data capabilities to explore and discover new high value data from massive amounts of raw data and free up the data warehouse for more structured, deep analytics.

Ask yourself:

  • Are you integrating big data and data warehouse capabilities to increase operational efficiency?
  • Have you taken steps to migrate rarely used data to new technologies like Hadoop to optimize storage, maintenance and licensing costs?
  • Are you using stream computing to filter and reduce storage costs? 
  • Are you leveraging structured, unstructured, and streaming data sources required for deep analysis?
  • Do you have a lot of cold, or low-touch data that is driving up costs or slowing performance?

If you answered yes to any of the above questions, the Data Warehouse Modernization use case is the best starting point for your big data journey.

With Data Warehouse Modernization, organizations can:

  • Combine streaming and other unstructured data sources with existing data warehouse investments
  • Optimize data warehouse storage and provide query-able archive
  • Rationalize the data warehouse for greater simplicity and lower cost
  • Provide better query performance to enable complex analytical applications
  • Deliver improved business insights to operations for real-time decision-making

Analytics and big data are pointless without good, accurate data. That is why IBM launched IBM DataFirst:  http://www.ibmbigdatahub.com/blog/chief-takeaways-ibm-datafirst-launch-event








More Information:

http://enterprisearchitects.com/the-5v-s-of-big-data/

https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/

http://www.ibm.com/analytics/us/en/technology/biginsights/

https://developer.ibm.com/predictiveanalytics/2015/11/06/spss-algorithms-optimized-for-apache-spark-spark-algorithms-extending-spss-modeler/

https://www.opendatascience.com

https://aws.amazon.com/blogs/big-data/crunching-statistics-at-scale-with-sparkr-on-amazon-emr/

https://developer.ibm.com/predictiveanalytics/category/predictive-analytics/

http://www.ibm.com/analytics/us/en/events/datafirst/

http://www.ibm.com/analytics/us/en/services/datafirst.html

http://www.ibmbigdatahub.com/blog/chief-takeaways-ibm-datafirst-launch-event

https://www.devrelate.com/ibm-developerworks-datafirst/

https://www.devrelate.com/blog/

https://www.youracclaim.com/org/ibm/badge/the-ibm-datafirst-method

https://www.eventbrite.com/e/ibm-datafirst-launch-event-registration-26725283041

https://www.ibm.com/blogs/bluemix/2016/01/applications-and-microservices-with-docker-and-containers/

https://www.ibm.com/blogs/bluemix/2015/12/bluemix-framework-for-microservices-architecture/

https://www.researchgate.net/figure/281404634_fig1_Figure-1-The-five-V's-of-Big-Data-Adapted-from-IBM-big-data-platform-Bringing-big

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

https://bigdatauniversity.com

https://bigdatauniversity.com/courses/

https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

https://www-01.ibm.com/software/data/bigdata/use-cases/security-intelligence.html

23 December 2016

Software Defined Storage and Ceph - What Is all the Fuss About?


Ceph: What It Is

Ceph is open source, software-defined distributed storage, maintained by Red Hat since its acquisition of Inktank in April 2014.

The power of Ceph can transform your organization’s IT infrastructure and your ability to manage vast amounts of data. If your organization runs applications with different storage interface needs, Ceph is for you! Ceph’s foundation is the Reliable Autonomic Distributed Object Store (RADOS), which provides your applications with object, block, and file system storage in a single unified storage cluster—making Ceph flexible, highly reliable and easy for you to manage.
Ceph’s RADOS provides you with extraordinary data storage scalability—thousands of client hosts or KVMs accessing petabytes to exabytes of data. Each one of your applications can use the object, block or file system interfaces to the same RADOS cluster simultaneously, which means your Ceph storage system serves as a flexible foundation for all of your data storage needs. You can use Ceph for free, and deploy it on economical commodity hardware. Ceph is a better way to store data.

OBJECT STORAGE
Ceph provides seamless access to objects using native language bindings or radosgw, a REST interface that’s compatible with applications written for S3 and Swift.


Ceph’s software libraries provide client applications with direct access to the RADOS object-based storage system, and also provide a foundation for some of Ceph’s advanced features, including RADOS Block Device (RBD), RADOS Gateway, and the Ceph File System.

LIBRADOS
The Ceph librados software libraries enable applications written in C, C++, Java, Python and PHP to access Ceph’s object storage system using native APIs. The librados libraries provide advanced features, including:
  • partial or complete reads and writes
  • snapshots
  • atomic transactions with features like append, truncate and clone range
  • object level key-value mappings
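
A minimal sketch of that API from Python, using the librados binding that ships with Ceph; the pool name and config file path are assumptions and would need to match your cluster:

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # standard config path (adjust as needed)
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")                # 'mypool' is a hypothetical pool
    try:
        ioctx.write_full("greeting", b"hello ceph")     # store an object
        print(ioctx.read("greeting"))                   # read it back
        ioctx.set_xattr("greeting", "lang", b"en")      # attach key/value metadata
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```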

REST GATEWAY
RADOS Gateway provides Amazon S3 and OpenStack Swift compatible interfaces to the RADOS object store.
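
Because the gateway speaks the S3 dialect, a stock S3 client can talk to it directly. Here is a hedged sketch with boto3; the endpoint, port, bucket name and credentials are placeholders you would normally create with radosgw-admin:

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",   # placeholder RADOS Gateway endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored in RADOS")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```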

BLOCK STORAGE
Ceph’s RADOS Block Device (RBD) provides access to block device images that are striped and replicated across the entire storage cluster.


Ceph’s object storage system isn’t limited to native binding or RESTful APIs. You can mount Ceph as a thinly provisioned block device! When you write data to Ceph using a block device, Ceph automatically stripes and replicates the data across the cluster. Ceph’s RADOS Block Device (RBD) also integrates with Kernel Virtual Machines (KVMs), bringing Ceph’s virtually unlimited storage to KVMs running on your Ceph clients.

HOW IT WORKS
Ceph RBD interfaces with the same Ceph object storage system that provides the librados interface and the Ceph FS file system, and it stores block device images as objects. Since RBD is built on top of librados, RBD inherits librados capabilities, including read-only snapshots and revert to snapshot. By striping images across the cluster, Ceph improves read access performance for large block device images.

BENEFITS
  • Thinly provisioned
  • Resizable images
  • Image import/export
  • Image copy or rename
  • Read-only snapshots
  • Revert to snapshots
  • Ability to mount with Linux or QEMU KVM clients!
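To make the block story concrete, here is a hedged sketch using the rbd Python binding: it creates a thin-provisioned image, writes to it, snapshots it, and resizes it. The pool and image names are assumptions, and the cluster setup mirrors the librados example above.

    # Sketch: create, write, snapshot and resize an RBD image (Python rbd binding).
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                        # assumed pool
    try:
        rbd.RBD().create(ioctx, 'demo-image', 4 * 1024**3)   # 4 GiB image, thin-provisioned
        image = rbd.Image(ioctx, 'demo-image')
        try:
            image.write(b'hello block device', 0)            # write at offset 0
            image.create_snap('clean')                       # read-only snapshot
            image.resize(8 * 1024**3)                        # grow the image to 8 GiB
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()

No space is consumed up front for those gigabytes: the objects backing the image are only created as data is actually written.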

FILE SYSTEM
Ceph provides a POSIX-compliant network file system that aims for high performance, large data storage, and maximum compatibility with legacy applications.

Ceph’s object storage system offers a significant feature compared to many object storage systems available today: Ceph provides a traditional file system interface with POSIX semantics. Object storage systems are a significant innovation, but they complement rather than replace traditional file systems. As storage requirements grow for legacy applications, organizations can configure their legacy applications to use the Ceph file system too! This means you can run one storage cluster for object, block and file-based data storage.

HOW IT WORKS
Ceph’s file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and file names of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.

BENEFITS
The Ceph file system provides numerous benefits:
  • It provides stronger data safety for mission-critical applications.
  • It provides virtually unlimited storage to file systems.
  • Applications that use file systems can use Ceph FS with POSIX semantics. No integration or customization required!
  • Ceph automatically balances the file system to deliver maximum performance.
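Because Ceph FS exposes plain POSIX semantics, a legacy application really does need no integration work: once the file system is mounted, ordinary file I/O just works. In the sketch below the mount point /mnt/cephfs is an assumption for illustration; any standard language or tool would do the same thing.

    # Ordinary POSIX-style file I/O against an assumed CephFS mount at /mnt/cephfs.
    import os

    CEPHFS_MOUNT = "/mnt/cephfs"                          # hypothetical kernel or FUSE mount point
    path = os.path.join(CEPHFS_MOUNT, "reports", "2016-12-23.txt")

    os.makedirs(os.path.dirname(path), exist_ok=True)     # normal directory operations
    with open(path, "w") as f:                            # normal file operations
        f.write("unchanged legacy application code, backed by RADOS\n")

    with open(path) as f:
        print(f.read())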




Red Hat Ceph Storage customer presentation



It's capable of block, object, and file storage, though only block and object are currently deployed in production. It is scale-out, meaning multiple Ceph storage nodes (servers) cooperate to present a single storage system that easily handles many petabytes (1 PB = 1,000 TB = 1,000,000 GB) and increases both performance and capacity as nodes are added. Ceph has many basic enterprise storage features, including replication (or erasure coding), snapshots, thin provisioning, tiering (the ability to shift data between flash and hard drives), and self-healing capabilities.


Why Ceph is HOT

In many ways Ceph is a unique animal. It is the only storage solution that delivers four critical capabilities:
  • open-source
  • software-defined
  • enterprise-class
  • unified storage (object, block, file).
Many other storage products are open source, scale-out, software-defined, unified, or rich in enterprise features, and some combine two or three of these, but almost nothing else offers all four together.

Red Hat Ceph Storage: Past, Present and Future



  • Open source means lower cost.
  • Software-defined means deployment flexibility, faster hardware upgrades, and lower cost.
  • Scale-out means it's less expensive to build large systems and easier to manage them.
  • Block + object means more flexibility (most other storage products are block only, file only, object only, or file + block; block + object is very rare).
  • Enterprise features mean a reasonable amount of efficiency and data protection.

Quick and Easy Deployment of a Ceph Storage Cluster with SLES 


Ceph includes many basic enterprise storage features: replication (or erasure coding), snapshots, thin provisioning, auto-tiering (the ability to shift data between flash and hard drives), and self-healing capabilities.

Red Hat Storage Day New York - What's New in Red Hat Ceph Storage



Despite all that Ceph has to offer, there are still two camps: those who love it and those who dismiss it.

I Love Ceph!
The nature of Ceph means some of the storage world loves it, or at least has very high hopes for it. Generally server vendors love Ceph because it lets them sell servers as enterprise storage, without needing to develop and maintain complex storage software. The drive makers (of both spinners and SSDs) want to love Ceph because it turns their drive components into a storage system. It also lowers the cost of the software and controller components of storage, leaving more money to spend on drives and flash.

Ceph, Meh!
On the other hand, many established storage hardware and software vendors hope Ceph will fade into obscurity. Vendors who already developed richly featured software don’t like it because it’s cheaper competition and applies downward price pressure on their software. Those who sell tightly coupled storage hardware and software fear it because they can’t revise their hardware as quickly or sell it as cheaply as the commodity server vendors used by most Ceph customers.

Battle of the Titans – ScaleIO vs. Ceph at OpenStack Summit Tokyo 2015 (Full Video)



To be honest, Ceph isn't perfect for everyone. It's not the most efficient at using flash or CPU (though it's getting better), the file storage feature isn't fully mature yet, and it is missing key efficiency features like deduplication and compression. And some customers just aren't comfortable with open-source or software-defined storage of any kind. But every release of Ceph adds new features and improves performance, while system integrators build turnkey Ceph appliances that are easy to deploy and come with integrated hardware and software support.

What's Next for Ceph?

EMC- Battle of the Titans: Real-time Demonstration of Ceph vs. ScaleIO Performance for Block Storage


Ceph continues to evolve, backed by Red Hat (which acquired Inktank in 2014) and by a community of users and vendors who want to see it succeed. With every release it gets faster, gains new features, and becomes easier to manage.

The Future of Cloud Software Defined Storage with Ceph: Andrew Hatfield, Red Hat



Ceph is basically a fault-tolerant, distributed, clustered filesystem. If it works, that's like nirvana for shared storage: you have many servers, each one pitches in a few disks, and there's a filesystem that sits on top that is visible to all servers in the cluster. If a disk fails, that's okay too.

Those are really cool features, but it turns out that Ceph is really more than just that. To borrow a phrase, Ceph is like an onion – it’s got layers. The filesystem on top is nifty, but the coolest bits are below the surface.
If Ceph proves to be solid enough for use, we'll need to train our sysadmins all about Ceph. That means pretty diagrams and explanations, which we thought would be more fun to share with you.

Building exascale active archives with Red Hat Ceph Storage



Diagram
This is the logical diagram that we came up with while learning about Ceph. It might help to keep it open in another window as you read a description of the components and services.



Ceph components
We’ll start at the bottom of the stack and work our way up.

OSDs
OSD stands for Object Storage Device, and roughly corresponds to a physical disk. An OSD is actually a directory (e.g. /var/lib/ceph/osd-1) that Ceph makes use of, residing on a regular filesystem, though it should be assumed to be opaque for the purposes of using it with Ceph.

Use of XFS or btrfs is recommended when creating OSDs, owing to their good performance, feature set (including support for XATTRs larger than 4 KiB), and data integrity.
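Extended attributes (XATTRs) are just per-file key-value metadata provided by the underlying filesystem, and Ceph's FileStore backend relies on them for object metadata, hence the 4 KiB concern. As a quick illustration using only the Python standard library (the file path is hypothetical and assumes a Linux filesystem with user xattr support, such as XFS, btrfs or ext4):

    # Illustration: extended attributes are per-file key-value metadata (Linux-only calls).
    import os

    path = "xattr-demo.bin"                       # hypothetical file on an xattr-capable filesystem
    open(path, "w").close()

    os.setxattr(path, "user.comment", b"object metadata can ride along with the file like this")
    print(os.getxattr(path, "user.comment"))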

We’re using btrfs for our testing.

Using RAIDed OSDs
A feature of Ceph is that it can tolerate the loss of OSDs. This means we can theoretically achieve fantastic utilisation of storage devices by obviating the need for RAID on every single device.

However, we’ve not yet determined whether this is awesome. At this stage we’re not using RAID, and just letting Ceph take care of block replication.


Placement Groups
Placement groups, also referred to as PGs, help ensure performance and scalability; as the official docs note, tracking metadata for each individual object would be too costly.

A PG collects objects from the next layer up and manages them as a collection. It represents a mostly-static mapping to one or more underlying OSDs. Replication is done at the PG layer: the degree of replication (the number of copies) is set higher up, at the pool level, and all PGs in a pool will replicate stored objects onto multiple OSDs.

As an example in a system with 3-way replication:


  • PG-1 might map to OSDs 1, 37 and 99
  • PG-2 might map to OSDs 4, 22 and 41
  • PG-3 might map to OSDs 18, 26 and 55
  • Etc.


Any object that happens to be stored in PG-1 will be written to all three of its OSDs (1, 37, 99). Any object stored in PG-2 will be written to its three OSDs (4, 22, 41). And so on.

Pools
A pool is the layer at which most user interaction takes place: the important operations like GET, PUT and DELETE for objects in a pool.

Pools contain a number of PGs that are not shared with other pools (if you have multiple pools). The number of PGs in a pool is chosen when the pool is first created and is awkward to change afterwards, so it pays to size it carefully. You can think of PGs as providing a hash mapping for objects into OSDs, ensuring that the OSDs are filled evenly as objects are added to the pool.
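To make the object-to-PG-to-OSD flow concrete, here is a toy Python sketch of the idea. It is purely illustrative: real Ceph placement uses the CRUSH algorithm and the cluster map, not this simplified hash, and the pg_num and OSD assignments below are invented.

    # Toy illustration of object -> PG -> OSD placement (NOT Ceph's real CRUSH logic).
    import hashlib

    PG_NUM = 8                                    # hypothetical pg_num chosen at pool creation
    PG_TO_OSDS = {                                # invented mostly-static PG -> OSD map, 3-way replication
        0: [1, 37, 99], 1: [4, 22, 41], 2: [18, 26, 55], 3: [2, 14, 61],
        4: [7, 33, 90], 5: [5, 29, 48], 6: [11, 40, 77], 7: [9, 25, 83],
    }

    def place_object(name):
        """Hash the object name into a PG, then return that PG and its replica OSDs."""
        pg = int(hashlib.md5(name.encode()).hexdigest(), 16) % PG_NUM
        return pg, PG_TO_OSDS[pg]

    for obj in ("vm-image-001", "photo.jpg", "logs/2016-12-23.gz"):
        pg, osds = place_object(obj)
        print("{!r} -> PG-{} -> OSDs {}".format(obj, pg, osds))

Every object written to the pool lands in exactly one PG, and every copy of that object lands on that PG's OSDs; growing the cluster only requires remapping PGs, not tracking every object individually.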

The Future of Cloud Software Defined: Andrew Hatfield, Red Hat


CRUSH maps
CRUSH mappings are specified on a per-pool basis and serve to skew the distribution of objects onto OSDs according to administrator-defined policy. This is important for ensuring that replicas don't end up on the same disk/host/rack/etc., which would defeat the entire point of having replica copies.

A CRUSH map is written by hand, then compiled and passed to the cluster.

Focus on: Red Hat Storage big data


Still confused?
This may not make much sense at the moment, and that's completely understandable. Someone on the Ceph mailing list provided a brief summary of the components, which we found helpful for clarifying things.


Ceph services
Now we're into the good stuff. Pools full of objects are all well and good, but what do you do with them now?

RADOS
What the lower layers ultimately provide is a RADOS cluster: a Reliable Autonomic Distributed Object Store. At a practical level, this translates to storing opaque blobs of data (objects) in high-performance shared storage.

Because RADOS is fairly generic, it’s ideal for building more complex systems on top. One of these is RBD.

Decoupling Storage from Compute in Apache Hadoop with Ceph



RBD
As the name suggests, a RADOS Block Device (RBD) is a block device stored in RADOS. RBD offers useful features on top of raw RADOS objects. From the official docs:

  • RBDs are striped over multiple PGs for performance
  • RBDs are resizable
  • Thin provisioning means on-disk space isn’t used until actually required

RBD also takes advantage of RADOS capabilities such as snapshotting and cloning, which would be very handy for applications like virtual machine disks.
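As a hedged sketch, the Python rbd binding exposes these snapshot and clone operations directly. The image, snapshot and clone names below are assumptions, and the parent is assumed to be a format-2 image with the layering feature enabled (the default for recent Ceph releases), since only those can be cloned.

    # Sketch: snapshot an existing RBD image and make a copy-on-write clone for a new VM disk.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')                 # assumed pool
    try:
        image = rbd.Image(ioctx, 'demo-image')        # assumed existing format-2 image
        try:
            image.create_snap('golden')               # point-in-time, read-only snapshot
            image.protect_snap('golden')              # snapshots must be protected before cloning
        finally:
            image.close()
        rbd.RBD().clone(ioctx, 'demo-image', 'golden', ioctx, 'vm-disk-01',
                        features=rbd.RBD_FEATURE_LAYERING)   # copy-on-write child image
    finally:
        ioctx.close()
        cluster.shutdown()

The clone shares unmodified data with the 'golden' snapshot, so spinning up many virtual machine disks from one base image is fast and space-efficient.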

Red Hat Storage Day Boston - Why Software-defined Storage Matters



CephFS
CephFS is a POSIX-compliant clustered filesystem implemented on top of RADOS. This is very elegant because the lower layers of the stack provide really awesome features (such as snapshotting), while the CephFS layer just needs to translate them into a usable filesystem.

CephFS isn’t considered ready for prime-time just yet, but RADOS and RBD are.

Kraken Ceph Dashboard



More Information:

http://slides.com/karansingh-1/deck

https://www.redhat.com/en/technologies/storage/ceph

https://scalableinformatics.com/unison

http://storagefoundry.net/collections/nautilus/ceph

http://www.fujitsu.com/global/products/computing/storage/eternus-cd/

http://www.mellanox.com/page/ethernet_switch_overview

http://www.mellanox.com/page/products_overview

http://www.mellanox.com/page/infiniband_cards_overview

http://ceph.com/category/webinars/

http://www.virtualtothecore.com/en/adventures-ceph-storage-part-1-introduction/

http://ceph.com/community/blog/

http://docs.ceph.com/docs/master/architecture/

http://karan-mj.blogspot.nl/2014/01/how-data-is-stored-in-ceph-cluster.html

https://www.redhat.com/en/about/press-releases/red-hat-unveils-red-hat-ceph-storage-2-enhanced-object-storage-capabilities-improved-ease-use

http://www.anchor.com.au/

http://www.anchor.com.au/blog/2012/09/a-crash-course-in-ceph/

https://www.hastexo.com/blogs/florian/2012/03/08/ceph-tickling-my-geek-genes

https://github.com/cholcombe973/ceph-dash-charm

Apache: Big Data North America 2016   https://www.youtube.com/watch?v=hTfIAWhd3qI&list=PLGeM09tlguZQ3ouijqG4r1YIIZYxCKsLp


DISTRIBUTED STORAGE PERFORMANCE FOR OPENSTACK CLOUDS: RED HAT STORAGE SERVER VS. CEPH STORAGE   http://docplayer.net/2905788-Distributed-storage-performance-for-openstack-clouds-red-hat-storage-server-vs-ceph-storage.html


Red Hat Announces Ceph Storage 2  http://www.storagereview.com/red_hat_announces_ceph_storage_2


Red Hat Ceph Storage
https://access.redhat.com/products/red-hat-ceph-storage