Paper: Recommendation for GlobosatPlay

This article describes how GlobosatPlay's personalized recommendation was built using Collaborative Filtering with Latent Factors, inspired by the work published around the Netflix Prize. The solution uses Mahout, Hadoop, and Kafka to compute recommendations for millions of active users, thousands of varied items in constant flux, and a sparse interaction history. The results showed a 50% gain in conversion over User-User Collaborative Filtering, and the expectation is to apply this algorithm to other Globo.com products.
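
The latent-factor approach mentioned above boils down to learning a low-rank factorization of the sparse user-item interaction matrix. Below is a minimal single-machine sketch using SGD, not the Mahout/Hadoop pipeline the paper actually describes; all names and hyperparameters here are illustrative:

```python
import random

def train_latent_factors(interactions, n_users, n_items, k=8,
                         epochs=300, lr=0.05, reg=0.02, seed=42):
    """Fit user/item latent-factor matrices by SGD over observed interactions.

    interactions: list of (user, item, rating) triples (a sparse history).
    Returns (P, Q) such that the predicted score for (u, i) is dot(P[u], Q[i]).
    """
    rnd = random.Random(seed)
    P = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rnd.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in interactions:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Regularized gradient step on user and item factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Score item i for user u as the dot product of their latent factors."""
    return sum(a * b for a, b in zip(P[u], Q[i]))
```

A real deployment would train these factors in a distributed batch job and then rank the top-scoring unseen items per user; the sketch only shows the core model.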

Report INF2979 (PDF)

Written by Ciro Cavani

2015-01-02 at 2:38 pm

Singularity @ Globo.com HackDay 2014-12-02

Written by Ciro Cavani

2014-12-03 at 12:21 am

Globo.com's BigData Platform (Recommender System) @ Rio BigData Meetup, Oct 2014

Written by Ciro Cavani

2014-12-02 at 2:17 pm

Rio Big Data Meetup Oct/2014

My talk:

The goal of this talk is to ground some BigData concepts and explore the dynamics of handling a large volume of data to extract value from it. The idea is to present the data solution in Globo.com's BigData Platform used by the Recommender System and to comment on the experience of developing it.


Written by Ciro Cavani

2014-09-16 at 7:47 pm

Recommender Systems (Machine Learning Summer School 2014 @ CMU)

Machine Learning is a foundational discipline that forms the basis of much modern data analysis. It combines theory from areas as diverse as Statistics, Mathematics, Engineering, and Information Technology with many practical and relevant real life applications. The focus of the current summer school is big data analytics, distributed inference, scalable algorithms, and applications to the digital economy. The event is targeted at research students, IT professionals, and academics from all over the world.



Recommender Systems (Machine Learning Summer School 2014 @ CMU) from Xavier Amatriain


Written by Ciro Cavani

2014-08-06 at 11:46 pm

Sean Owen: Design Patterns for Large-Scale Real-Time Learning (video)

Sean Owen provides examples of operational analytics projects in the field, presenting a reference architecture and algorithm design choices for a successful implementation based on his experience with customers and Oryx/Cloudera.

Sean Owen on Apr 15, 2014
43 min

Sean Owen is Director of Data Science at Cloudera, based in London. Before Cloudera, he founded Myrrix Ltd, a company commercializing large-scale real-time recommender systems on Apache Hadoop. He has been a primary committer and VP for Apache Mahout, and co-author of Mahout in Action. Previously, Sean was a senior engineer at Google.

Written by Ciro Cavani

2014-04-18 at 1:09 am

Apache Spark – References

Apache Spark is a framework for distributed processing of large amounts of data, with a focus on interactive analysis.

The project started at Berkeley as an alternative to MapReduce (MR) for processing data on a cluster using the machines' memory as the primary 'storage', and it has been evolving into a complete 'framework' for data analysis. The project already includes Shark (SQL-like, Hive-compatible), MLlib (implementations of Machine Learning algorithms, in place of Mahout), and Spark Streaming (stream processing, in place of Storm). Recently GraphX (graph processing, in place of Giraph) was integrated as well.

A Hadoop cluster, composed of HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager, and ApplicationMaster), can be used as the infrastructure for Spark: applications run as an ApplicationMaster and use the cluster's distributed resources (creating 'workers' that run inside containers).

Yahoo already uses this platform in production, and more companies are heading down the same path.



Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Apache Spark: The Next Big Data Thing?

I played around with the Scala API (Spark is written in Scala), and to be honest, at first I was pretty underwhelmed, because Spark looked, well, so small. The basic abstractions are Resilient Distributed Datasets (RDDs), basically distributed immutable collections, which can be defined based on local files or files stored on Hadoop via HDFS, and which provide the usual Scala-style collection operations like map, foreach and so on.
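
The model the quote describes (immutable collections built by chaining lazy transformations, evaluated only when an action is called) can be mimicked on a single machine in a few lines. This toy class only illustrates the semantics; it is not the Spark API:

```python
class ToyRDD:
    """Single-machine sketch of an RDD: an immutable dataset defined either
    by source data or by a transformation of a parent dataset, evaluated lazily."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # concrete data, for the root dataset
        self._parent = parent        # parent ToyRDD, for derived datasets
        self._transform = transform  # function applied to the parent's elements

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return ToyRDD(parent=self, transform=lambda it: (fn(x) for x in it))

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda it: (x for x in it if pred(x)))

    def collect(self):
        # An 'action': walks the chain of transformations and materializes it.
        if self._parent is None:
            return list(self._source)
        return list(self._transform(iter(self._parent.collect())))

even_squares = ToyRDD(source=range(10)).map(lambda x: x * x) \
                                       .filter(lambda x: x % 2 == 0)
```

Calling `even_squares.collect()` materializes the chain and returns `[0, 4, 16, 36, 64]`; until then, no element is ever computed.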

How companies are using Spark

Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where Spark is today.

Apache Spark: Distributed Machine Learning using MLbase
46 min, Twitter on August 6th 2013.

Ameet Talwalkar and Evan Sparks present their work on the MLbase project which will be a distributed Machine Learning platform on top of Apache Spark.


Spark Summit 2013 (videos)

It featured production users of Spark, Shark, Spark Streaming and related projects. Speakers came from organizations including Yahoo, Adobe, Intel, Amazon, Red Hat, Databricks, and more.

Spark Summit 2013 – Integration into the Yahoo! Data and Analytics Platform – Tim Tully
22 min, December 12 2013

Spark Summit 2013 – Spark in the Hadoop Ecosystem – Eric Baldeschwieler
8 min (plus another 5 min about the Summit), December 12 2013

Spark Summit 2013 – The State of Spark, and Where We’re Going Next – Matei Zaharia
32 min, December 12 2013

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
(14 pages)

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
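
The "coarse-grained transformations" the abstract refers to are what make fault tolerance cheap: instead of replicating data, each RDD remembers the lineage of transformations that produced it, so a lost partition can be rebuilt by replaying that lineage over the base data. A toy single-machine sketch of the idea (all names here are illustrative):

```python
def lineage_recompute(partition_id, lineage, base_partitions):
    """Rebuild one lost partition by replaying its lineage of
    coarse-grained transformations from the base data."""
    data = base_partitions[partition_id]
    for transform in lineage:  # e.g. [a map step, a filter step, ...]
        data = transform(data)
    return data

# Base data split into two partitions; the same lineage applies to each
# partition independently, which is what enables parallel recovery.
base = {0: [1, 2, 3], 1: [4, 5, 6]}
lineage = [
    lambda xs: [x * 10 for x in xs],       # a map transformation
    lambda xs: [x for x in xs if x > 15],  # a filter transformation
]

# Suppose partition 1's in-memory copy is lost: replay the lineage to rebuild it.
rebuilt = lineage_recompute(1, lineage, base)
```

Because the transformations are deterministic functions of whole partitions, only this small recipe needs to survive a failure, not a second copy of the data.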

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
(16 pages)

Many “big data” applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup schemes in streaming databases—parallel recovery of lost state—and unlike previous systems, also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.
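
The discretization idea is easy to sketch: chop the incoming stream into small deterministic batches, run each one as an ordinary batch job, and fold its result into the running state. A toy Python illustration (not the Spark Streaming API; the word-count state is just an example):

```python
from collections import Counter

def discretize(records, batch_size):
    """Chop an unbounded record stream into small deterministic batches,
    the core idea behind D-Streams."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly partial batch
        yield batch

def running_word_count(stream, batch_size):
    """Treat each micro-batch as an independent, replayable batch job and
    merge its result into the running state, yielding a snapshot per batch."""
    state = Counter()
    for batch in discretize(stream, batch_size):
        state.update(Counter(batch))
        yield dict(state)

snapshots = list(running_word_count(["a", "b", "a", "c", "a", "b"], batch_size=2))
```

Since each batch computation is deterministic, a failed batch can simply be recomputed (in parallel, across nodes) rather than recovered from a replica, which is the recovery mechanism the paper argues for.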

More about Hadoop:


Written by Ciro Cavani

2014-01-20 at 3:07 pm