Big Data: Hadoop

Apache Mahout
Apache Whirr
Google MapReduce
Google File System
Google Bigtable

====================================================================================================================
=============Hadoop Ecosystem============================================================
====================================================================================================================

Become familiar with the Hadoop ecosystem: these projects can help you store and access your data, and they scale.
========================================================================================================================
Cloudera:
Cloudera Inc. is a Palo Alto-based enterprise software company which provides Apache Hadoop-based software and services. It contributes to Hadoop and related Apache projects and provides a distribution for Hadoop for the enterprise.[1] Cloudera has two products: Cloudera's Distribution including Apache Hadoop (CDH) and Cloudera Enterprise. CDH is a data management platform which incorporates HDFS, Hadoop MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, ZooKeeper and Hue and is available free under an Apache license. Cloudera Enterprise is a package which includes Cloudera's Distribution including Apache Hadoop, production support and tools designed to make it easier to run Hadoop in a production environment. Cloudera offers services including support, consulting services and training (both public and private).
========================================================================================================================
HBase:
HBase is an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Hadoop project and runs on top of HDFS (the Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data. HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper.[1] Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro, or Thrift gateway APIs. HBase is not a direct replacement for a classic SQL database, although its performance has improved recently, and it now serves several data-driven websites, including Facebook's messaging platform.
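
A minimal sketch of the Java API mentioned above, using the Connection/Table client from HBase 1.x+. The "users" table, "info" column family, and row key are made-up placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // "users" is a placeholder table
            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the same cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}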
========================================================================================================================
Zookeeper:
Apache ZooKeeper is a software project of the Apache Software Foundation that provides an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.
ZooKeeper's architecture supports high availability through redundant services: if one ZooKeeper server fails to answer, clients can ask another. ZooKeeper nodes store their data in a hierarchical namespace, much like a file system or a trie data structure. Clients can read from and write to the nodes and in this way share a configuration service. Updates are totally ordered.
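
A minimal sketch of that shared-configuration pattern with the ZooKeeper Java client. The ensemble address "zkhost:2181" and the "/config" znode are placeholders:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ensemble; the watcher fires once the session is established.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Write a configuration value into the hierarchical namespace
        // (fails with NodeExistsException if "/config" already exists).
        byte[] data = "max_workers=8".getBytes(StandardCharsets.UTF_8);
        zk.create("/config", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client can read it back, so the znode acts as shared configuration.
        byte[] read = zk.getData("/config", false, null);
        System.out.println(new String(read, StandardCharsets.UTF_8));

        zk.close();
    }
}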
========================================================================================================================
Hive:
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, the language allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
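
A small sketch of running HiveQL from Java over JDBC against HiveServer2. The host, credentials, and the page_views table are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; "hivehost" and the database name are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hivehost:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL but is compiled down to MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views " +
                "GROUP BY page ORDER BY hits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}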
========================================================================================================================
Mahout:
Mahout is scalable to reasonably large data sets. Its core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, contributions are not restricted to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to give good performance for non-distributed algorithms too.
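
A sketch of that single-node side: a user-based recommender built with Mahout's Taste collaborative-filtering library. The ratings.csv file (lines of userID,itemID,preference) is a placeholder:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds lines of userID,itemID,preference (placeholder file).
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}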
========================================================================================================================
Apache Solr:
Solr is an open source search engine based on the Java Lucene library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, and an administration interface. It runs in a Java servlet container such as Apache Tomcat.
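
A small faceted-search sketch with the SolrJ client (the HttpSolrClient.Builder API from SolrJ 6+). The "articles" core and the title/category fields are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearch {
    public static void main(String[] args) throws Exception {
        // URL and core name are placeholders for a local Solr install.
        HttpSolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/articles").build();

        // Faceted keyword search: match "hadoop" and count results per category.
        SolrQuery query = new SolrQuery("title:hadoop");
        query.setFacet(true);
        query.addFacetField("category");
        query.setRows(10);

        QueryResponse response = client.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("title"));
        }
        client.close();
    }
}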
========================================================================================================================
Pig:
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin.[1] Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to what SQL does for relational database systems. Pig Latin can be extended using UDFs (user-defined functions), which the user can write in Java, Python, or JavaScript[2] and then call directly from the language.
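
A sketch of such a Java UDF, essentially the upper-casing example from the Pig documentation. The myudfs package/jar names in the usage comment are placeholders:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its first argument.
// Registered and called from Pig Latin, e.g. (placeholder names):
//   REGISTER myudfs.jar;
//   B = FOREACH A GENERATE myudfs.Upper(name);
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).toUpperCase();
    }
}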
========================================================================================================================
Sqoop:
Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database.
========================================================================================================================
Flume:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
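
A sketch of the producing side using the Flume NG client SDK, pushing one event into an agent's Avro source. The host and port are placeholders:

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSend {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent exposing an Avro source on flumehost:41414 (placeholders).
        RpcClient client = RpcClientFactory.getDefaultInstance("flumehost", 41414);
        try {
            // Wrap a log line as a Flume event and hand it to the agent's channel.
            Event event = EventBuilder.withBody("app started", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}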
========================================================================================================================
Oozie:
Oozie is a workflow/coordination system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (Java map-reduce, Streaming map-reduce, Pig, DistCp, etc.). Oozie is a scalable, reliable, and extensible system.
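
A sketch of submitting one of those workflow DAGs with the Oozie Java client. The server URL, HDFS paths, and JobTracker address are placeholders:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Point at the Oozie server (placeholder host).
        OozieClient wc = new OozieClient("http://ooziehost:11000/oozie");

        // Job properties: where the workflow.xml DAG lives, plus its parameters.
        Properties conf = wc.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/myapp");
        conf.setProperty("jobTracker", "jobtracker:8021");
        conf.setProperty("nameNode", "hdfs://namenode:8020");

        // Submit and start the workflow, then check its status.
        String jobId = wc.run(conf);
        System.out.println("Workflow job " + jobId + ": " + wc.getJobInfo(jobId).getStatus());
    }
}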
========================================================================================================================

HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single datanode, and a cluster of datanodes forms the HDFS cluster; this is only typical, though, because not every node needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The filesystem uses the TCP/IP layer for communication, and clients use RPC to talk to the namenode and datanodes. HDFS stores large files (an ideal file size is a multiple of 64 MB[9]) across multiple machines.
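A minimal sketch of writing and reading a file through the Hadoop FileSystem Java API. It assumes a core-site.xml on the classpath points fs.defaultFS at the cluster, and the /tmp/hello.txt path is a placeholder:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the namenode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks replicated across datanodes.
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same FileSystem abstraction.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}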

==================================================================================================================================
===========================================Others: MySQL Cluster===============================================================
==================================================================================================================================

- Replication
- Horizontal data partitioning
- Vertical data partitioning
- Hybrid storage
- Shared-nothing architecture
- No single point of failure
- Auto-sharding for write scalability: MySQL Cluster automatically shards (partitions) tables across nodes, enabling databases to scale horizontally on low-cost commodity hardware while maintaining complete application transparency. With its distributed, shared-nothing architecture, MySQL Cluster is designed to deliver 99.999% availability, ensuring resilience to failures and the ability to perform scheduled maintenance without downtime.
- SQL & NoSQL APIs (see the JDBC sketch after this list)
- Real-time response time and throughput that meet the needs of the most demanding web, telecommunications, and enterprise applications
- Geographic replication, which enables multiple clusters to be distributed geographically for disaster recovery and the scalability of global web services
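
A small JDBC sketch of the SQL side: creating a table on the NDBCLUSTER storage engine, so the data nodes shard and synchronously replicate it automatically. The SQL-node host, credentials, and table are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NdbTableExample {
    public static void main(String[] args) throws Exception {
        // Connect through any SQL node of the cluster (placeholder host/credentials).
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://sqlnode1:3306/test", "user", "password");
             Statement stmt = conn.createStatement()) {
            // ENGINE=NDBCLUSTER stores the table in the data nodes, where it is
            // automatically sharded and synchronously replicated.
            stmt.executeUpdate(
                "CREATE TABLE accounts (" +
                "  id INT NOT NULL PRIMARY KEY," +
                "  balance DECIMAL(12,2) NOT NULL" +
                ") ENGINE=NDBCLUSTER");
        }
    }
}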


==================================================================================================================================
=========================================================HOOVER ARCH==============================================================
==================================================================================================================================
1. Hoover logs           (COLLECT)
2. Parse hoover logs     (SUMMARIZE)
3. Query hoover logs     (QUERY)
Scale: TERABYTES

IN MORE DEPTH:
1. TRACKING DATA
2. MINING DATA
3. QUERYING DATA

ONE MORE FACTOR:
Converting the old database to the new architecture.
