BI Blogs

Bringing together Business Intelligence voices from across the web

Second Day of the Miracle DB Forum, Edinburgh

Posted on the May 31st, 2007. Read 249 times

Source: Mark Rittman's Oracle Weblog [link]

Today was Day Two of the Miracle DBF up at Edinburgh Castle, and for those of us that got a reasonably early night it was a Nine O’Clock start with Mogens Norgaard on “You Probably Need RAC”. Following on from Mogens was Dan Fink with “No Mogens…” and James Morle with “Brewing Benchmarks”.

After lunch was a talk on tuning multi-tier J2EE applications, plus Jonathan Lewis on Understanding Statspack reports. I was particulary keen to hear Jonathan’s talk on Statspack - the general orthodox view nowadays seems to be that Statspack is worthless due to the fact that all the data is averaged and aggregated across all sessions, whereas Jonathan gave some good examples of where it can be useful - principally, because it lets you look back to a previous period of time, whereas SQL traces have to be specifically enabled and put in place in the hope that the issue will happen again in the future. Interestingly, Graham Wood was in the audience and also gave a talk (”You Probably Don’t Need SQL Trace”) that looked at ASH and showed how it got around this issue, gave the same aggregated results as Statspack but also lets you drill down by session, application, time period and so on. Overall, a very good set of presentations, thanks to Thomas for arranging everything.

I had a brief chance to look around Edinburgh and specifically the Royal Mile, and yet again spent the early evening in the Jolly Judge just down the road.

Like all these sorts of events, the presentations themselves have only been part of the attraction; I also got the chance to spend a bit of time with old friends such as Doug Burns, Lisa Dobson and Alex Gorbachev, plus I got the time to chat with people such as Dan Fink, Julian Dyke, Simon Kelsey and of course Thomas Presslie about everything Oracle and how they’re getting on with the projects they’re working on. Tomorrow, I’ve got a day’s leave and I’m going up north with Thomas and few others on the “Whisky Adventure”, then it’s back to work to get on with preparation for the next round of the seminars and to catch up with some of our customers. Overall, a great few days, as usual learnt a lot, and a good chance to catch up with others in the Oracle database world.

Helpful Resources for Scaling Web Sites and Applications

Posted on the May 31st, 2007. Read 264 times

Source: Clickstream [link]

There’s a good collection of presentations about high-end scaling of web applications at Peter Van Dijck’s blog. Most are hosted on slideshare, but there are some mp3s and PDFs as well. It’s hard to find quality information on scaling applications of any sort. I used to deal with scalability a lot in the early days of the web. Technologies have changed quite a bit since then so I focus on scalability mostly in the data warehouse and business intelligence world now. I always pay attention when someone goes through the trouble of collecting a set of good quality material like this. I’ve only seen two of the presentations he’s got listed, and never read the O’Reilly book he mentioned, which means more summer reading in the works.

The “Force-Field” of Data Governance

Posted on the May 31st, 2007. Read 223 times

Source: Data Doghouse - performance management, business intelligence, and data warehousing [link]

Data governance has a big job: establishing the processes, policies, standards, organization, and technologies required to manage and ensure the availability, accessibility, quality, consistency, auditability and security of data in an organization.

In many instances it is seen as too overwhelming or daunting to undertake, like Don Quixote chasing windmills. Too often people are immobilized by analysis paralysis and fail to move forward after initial meetings and excitement. But don’t stop before you get started. Data governance is like the force-field around your data, protecting that “single version of the truth.”

OK, so maybe “force-field” isn’t be best analogy. Perhaps data governance would be better explained with a story. After all, there’s nothing “sci-fi” about reality or customer success.

The story is about Nationwide Insurance — I recently learned about the experience they had with data governance and their Enterprise Data Management (EDM). They won a TDWI 2006 Best Practices Award
in the master data management category and an InfoWorld 2006 Project of the Year Award.

>>>continue to the rest of this blog post, The "Force-Field" of Data Governance

Aggregation for Data Warehouse, Part 4

Posted on the May 30th, 2007. Read 235 times

Source: Dylan's BI Study Notes [link]

I have briefly mentioned the various approaches of providing aggregation for data warehouse. This post describes the third approach of using the Materialized View (Oracle) or Automated Summary Table (AST, DB2) from the database system.  I will describe the Pro’s and Con’s of using this approach.

A materialized view, like a table or a view, is a database object. A materialized view can contain the results of a query. It enables you to have a very efficient data access to the pre-aggregated data. It also helps you to maintain such aggregated data by simply defining the query, like creating a database view. (more…)

Real Estate Data Visualization

Posted on the May 30th, 2007. Read 264 times

Source: Clickstream [link]

Trulia, a real estate search service, has an interactive map of US homes animated based on when the homes were built. It’s fun to play with. The search service (the first link) has a well-designed navigation and display interface, better than others I’ve used. Worth looking at to see how information displays can be made interactive.

Drop views from a schema

Posted on the May 30th, 2007. Read 226 times

Source: SQL BI [link]

Today I wrote a procedure to drop all views from a schema (and the schema itself) in SQL Server 2005. I use views to expose objects to SSAS DSV and it’s useful having a cleanup code. Here’s the script: DropSchema drop all the views and the schema itself, DropSchemaViews is called by DropSchema and drop all the views within the specified schema. No error checking code, use it at your own risk!

1:  IF EXISTS(SELECT * FROM sys.procedures WHERE NAME = ‘DropSchema’ AND SCHEMA_ID = 1) DROP PROCEDURE dbo.DropSchema2:  GO3:  IF EXISTS(SELECT * FROM sys.procedures WHERE NAME = ‘DropSchemaViews’ AND SCHEMA_ID = 1) DROP PROCEDURE dbo.DropSchemaViews4:  GO

5:  CREATE PROCEDURE dbo.DropSchemaViews( @schema NVARCHAR(128) )6:  AS BEGIN

7:      DECLARE @view NVARCHAR(256)8:      DECLARE @cmd NVARCHAR(500) 

9:      DECLARE v CURSOR FOR10:          SELECT name 11:          FROM sys.VIEWS12:          WHERE SCHEMA_ID = 13:                    (SELECT schema_id 14:                     FROM sys.schemas 15:                     WHERE NAME = @schema)16:      OPEN v17:      FETCH NEXT FROM v INTO @view

18:      WHILE @@fetch_status = 0 BEGIN19:          SET @cmd = N‘DROP VIEW [’ + @schema + N‘].[’ + @view + N‘]’20:          EXEC ( @cmd )21:          FETCH NEXT FROM v into @view22:      END23:      CLOSE v24:      DEALLOCATE v

25:  END26:  GO

27:  CREATE PROCEDURE dbo.DropSchema( @schema NVARCHAR(128) )28:  AS BEGIN

29:      DECLARE @view NVARCHAR(256)30:      DECLARE @cmd NVARCHAR(500) 

31:      EXEC DropSchemaViews @schema 

32:      IF EXISTS(SELECT schema_id 33:                FROM sys.schemas 34:                WHERE NAME = @schema)35:      BEGIN36:          SET @cmd = N‘DROP SCHEMA [’ + @schema + N‘]’37:          EXEC ( @cmd )38:      END39:      40:  END41:  GO

 


Cross-posted from SQLBlog! - http://www.sqlblog.com

Posting from the Miracle DBF, Edinburgh

Posted on the May 30th, 2007. Read 574 times

Source: Mark Rittman's Oracle Weblog [link]

I’m currently sitting in a pub just down from Edinburgh Castle, in a break between the morning and afternoon sessions. Just across the table from me is Doug Burns, and behind is Thomas Presslie and Carel-Jan Engel, who are working on the Miracle musical for tonight’s gala dinner. We’ve slipped out for a cheeky one before the afternoon’s presentations, and to update the blogs on yesterday’s events.

The event started off with a reception in Edinburgh Castle and a tour around the Whisky Heritage Centre - and there lies my dilemma for the rest of the week: it’s a whisky-centered event, I’m going on the extra Whisky Adventure after the main event, up to the Minmore Hotel and a number of distilleries north of Edinburgh, and I can’t stand whisky - I’m a bit like Doug Burns and his fish fingers, beans and sausages - I tend to go for exotic lagers, bitters and so on, not spirits and especially smokey old whisky. For this week though, my objective is to try and “get in” to whisky, understand the difference between blends and single malts, and as a gentle introduction to all of this Thomas has just bought me a Whisky Mac, which is sitting beside me as I type.

Anyway, the evening started off with an introduction by Thomas….

… a keynote from Jonathan, about Ironing and Oracle….

and another keynote from Graham Wood, comparing Oracle features (such as row-level locking and writers not blocking readers, as Mogens is demonstrating below) to Whisky.

After the keynotes, there was a Whisky-tasting session, a sponsor talk from IBM at 10.45pm with distinct sniggering from Doug and Lisa on the second row, and then a free bar on from 11.15pm. In the end, Peter Scott and I left just afterwards - apart from not being a Whisky drinker, I’m a total lightweight as well - and got back to the Travelodge about 11.30pm, to turn in for the night.

Today, things started off with Doug Burns on DTrace, so far the best talk of the conference….

For now though, I’m in the official pub just down from the conference venue, updating the blog and planning to get back for Kurt’s talk on DUDE and data recovery. For now, that’s it, but hopefully I’ll get a chance to update the blog in a break tomorrow.

¿Se ha lanzado HP al mercado Business Intelligence?

Posted on the May 30th, 2007. Read 376 times

Source: Todo BI: Business Intelligence, Data Warehouse, CRM y mucho mas... [link]

HP apuesta por el Business Intelligence

Que las grandes y ‘tradicionales’ empresas del Hardware: IBM, HP, Sun… se estaban moviendo cada vez mas hacia el mundo de los servicios y de las aplicaciones de gestión empresarial venía siendo un hecho. Pero una de las novedades es la apuesta directa por áreas concretas, como es el caso de HP con el Business Intelligence. Vayamos a los hechos:

- HP ha adquirido knightsbridge, para mi una de las mejores empresas de consultoría especializadas en Business Intelligence y DataWarehouse, con mas de 700 empleados especializados en BI y DW y auténticos generadores de estudios y análisis muy valiosos. Se trata, sin duda alguna, de un musculo humano muy importante para sacar adelante proyectos e implementaciones.

- Como knigthsbridge estan posicionados basicamente en EEUU. Para Europa se ha adquirido TTP, empresa con base en Italia, y que cuenta con mas de 300 especialistas en este campo.

- Han incorporado a personal de peso en el mundo del Business Intelligence, autentica materia gris: Dan Holley de Teradata, Ali El Khardobi (el diseñador del ‘entredicho tras la compra de Sunopsis’ Oracle Warehouse Builder y Goetz Graf , experto en Teradata y Neoview.

- Han lanzado Neoview (probablemente lo mas importante). Se trata de una solución integrada de software y hardware, muy en la linea de Teradata y algunos appliances (parece que quieren cubrir ese hueco), destinado a DataWarehouses empresariales que puede soportar cientos de Terabytes.
Parece que una tendencia de futuro: una solución totalmente integrada, rápida de montar, sencilla (dentro de lo que pueda ser montar un DW), que es fácil de administrar y que es compatible con otros sistemas Business Intelligence.
HP propone con NeoView una aplicación DW que puede ser más amplia y menos costosa que la de Teradata, gracias a la experiencia de HP como desarrollador y fabricante de un amplio abanico de servidores hardware y a su gigantesca rama de servicios.

- El CEO de HP viene del BI: La estrategia de HP en este campo está potenciada por la trayectoria de su CEO, Mark Hurd, y su CIO, Randy Mott, en IBM, Oracle y Teradata.

- Estan dedicando muchos recursos y esfuerzo al I+D en Business Intelligence, incluyendo especialistas en estas tecnologías, incluyendo capacidades semánticas, mejoras en tipos de datos, consultas paralelas, potenciando los parnters, etc…

- Un ejemplo de ello es el acuerdo con Microstrategy: MicroStrategy 8 ha sido certificada para poder combinarse con HP Neoview.

Lo dicho: podemos afirmar sin lugar a equivocarnos que HP ha aterrizado en el mundo del Business Intelligence y DataWarehouse, y viene para quedarse!!

Fuente:
Is HP Serious about Business Intelligence?

Saber mas sobre HP Neoview:
HP Neoview Enterprise Data Warehouse

When Right-Time Isn’t Real-Time Enough

Posted on the May 30th, 2007. Read 249 times

Source: BI this week [link]

The bottom line: companies need to figure out a “right time” window that’s right for them.

As Right-Time Approaches Real-Time, Data Integration Is Key

Posted on the May 30th, 2007. Read 254 times

Source: BI this week [link]

At bottom, real-time is a data integration problem: it involves getting fresh data to business users as rapidly as possible.

LOLTrek: This Meme Has Jumped the Shark

Posted on the May 29th, 2007. Read 275 times

Source: Clickstream [link]

If you don’t know what this means, you probably need to stay in more. In this case, go find the “I can has cheezburger?” image, then visit a tribute to tribbles, LOLTrek. If don’t know what “jump the shark” means then you’re definitely having too much life and not enough computer and need to spend some time on Wikipedia.

Even funnier is Google correcting “I can has cheezburger” to “I can has cheezeborger” when I tried to track down one of the original images:

To finish off my news reading for today, “I’m in ur cellz killin ur mitochondrias” (Fanta screws with your mitochondria?) Good thing I don’t drink those fizzy drinks.

Story About the First Computer

Posted on the May 29th, 2007. Read 274 times

Source: Clickstream [link]

The New Yorker has a story about the Antikythera Mechanism, possibly the first mechanical computer, built by hand in ancient Greece. I liked the conclusion best, that the device embodied the world view of an era, in much the way that our current world view is embodied in our own technology.

“One day in the spring of 1900, a party of Greek sponge divers returning from North Africa was forced by a storm to take shelter in the lee of the small island of Antikythera, which lies between Crete and Kythera. After the storm passed, one of the divers, Elias Stadiatis, put on a weighted suit and an airtight helmet that was connected by an air hose to a compressor on the boat, and went looking for giant clams, with which to make a feast that evening.”
continue reading story

US Terrorism Data: Worst Practices for Data Quality and Governance

Posted on the May 29th, 2007. Read 322 times

Source: Clickstream [link]

The Department of Homeland Security is a perfect counter-example of how to deal with data. A way to learn good practices is to do the opposite of what the government does. Here’s a list of worst practices to not follow from the DHS.

Worst practice #1: load all the data you can lay your hands on. Terror Database Has Quadrupled In Four Years discusses the “Terrorist Identities Datamart Environment” or TIDE, the source of data for airline, police, border and consulate watch lists. Their policy is to shovel everything possible in there, needed or not, because it might be useful some day. This runs counter to years of data warehouse practices.

“Each day, waves of new information are fed into terrorism-suspect databases… Ballooning from fewer than 100,000 files in 2003 to about 435,000, the growing database threatens to overwhelm the people who manage it.”

Aside from this problem, do we really believe there are 435,000 suspected terrorists in the US? That’s almost four times larger than our military in Iraq. What are they waiting for? They could take over the US now. Obviously something is wrong with all that data.

Worst practice #2: don’t do anything about data quality. “The single biggest worry that I have is long-term quality control,” said Russ Travers, in charge of TIDE at the National Counterterrorism Center in McLean.” Yet he’s doing nothing to QA the data before dumping it into the system. To make things worse, there’s no way to remove or correct data once it’s loaded.

“The bar for inclusion is low, and once someone is on the list, it is virtually impossible to get off it.”

“In 2004 and 2005, mis-identifications accounted for about half of the tens of thousands of times a traveler’s name triggered a watch-list hit.”

“Sen. Ted Stevens (R-Alaska) said last year that his wife had been delayed repeatedly while airlines queried whether Catherine Stevens was the watch-listed Cat Stevens. The listing referred to the Britain-based pop singer who converted to Islam and changed his name to Yusuf Islam. The reason Islam is not allowed to fly to the United States is secret.”

TSA can’t tell the difference between a 70’s musician and a senator’s wife? That’s a serious data quality problem. Given the 50% miss rate, it would be better for the police to toss a coin each time they arrest someone instead of consulting the watch lists.

The data management practices go one step further. They actually have a process to include names on the list that are no longer valid. Their idea is that if someone is dead (for example), then a terrorist might use that name since it won’t be on the list. Perhaps all the dead people account for the growth of the database. Why not go one step further and put the names of everyone who died this year onto the list?

Worst practice #3: ignore the problems of your users.

“TIDE is a vacuum cleaner for both proven and unproven information, and its managers disclaim responsibility for how other agencies use the data. “What’s the alternative?” Travers said.”

Multiple agencies are complaining about wasted man-hours due to mis-identification. Airlines are routinely stopping people in airports, like Ted Steven’s wife, even though the data is obviously wrong. God forbid that we should have any data management processes in place. Travers should probably get out to a data warehousing conference once in a while. Particularly since he said earlier that data quality is a problem. His current alternative of doing nothing isn’t feasible.

But the problems don’t stop with TIDE…

Worst practice #4: if the users aren’t sold on the concept, build it and they will come.
Remember
TIA ? Congress killed the program, so the people involved did it for the government of Singapore instead. Son of TIA: Pentagon Surveillance System Is Reborn in Asia tells how Snowden and company built a system for tracking people in a totalitarian state. Now they they want to sell the system they developed back to the US government. It’s like doing an IT-driven data warehouse project, only with Orwellian overtones.

Worst practice #5: ignore the users. DHS has a monumental mess on their hands. Different government departments need different data, just like the finance department has different needs than the marketing department. Yet the DHS systems are being centrally mandated by (mostly) intelligence people with no idea of how other groups like the police or the border patrol need information. Combine this with poorly managed data integration and no governance and you have data being copied and misapplied all over the place.

A number of the articles I linked to came up via Bruce Schneier’s crypto-gram mailing list. Always interesting reading, even if you aren’t a security professional. Choice items from this month’s issue:

“…tips on preventing terrorism” indeed. (Tip #7: When transporting nuclear wastes, always be sure to padlock your truck.)

YOU are big brother: Control and track your car from the ‘net Or, as Schneier says, have someone hack into their web site and control it for you.

AMEX is Watching You AMEX has a patent application titled “Method and System for Facilitating a Shopping Experience,” that:

“describes a Minority Report style blueprint for monitoring consumers through RFID-enabled objects, like the American Express Blue Card.

According to the patent, RFID readers called “consumer trackers” would be placed in store shelving to pick up “consumer identification signals” emitted by RFID-embedded objects carried by shoppers. These would be used to identify people, track their movements, and observe their behavior.”

Breaches of personal data: blaming the myth and punishing the victim Can we stop blaming hackers for theft of information already?

The report states that “60 percent of the incidents involve missing or stolen hardware, insider abuse or theft, administrative error, or accidentally exposing data online.”

Given that its data suggests that a significant portion of the blame should go to those who hold the data, the report argues forcefully for legislation that requires they meet minimum data safety standards.

Windows Vista code-signing to keep out evil spyware? I don’t think so: VBootkit bypasses Windows Vista’s code-signing mechanisms Microsoft spent a lot trying to secure Vista. So far, no dice.

Local Sheriff Suspects Al-Qaeda Or Teens

“This activity matches up with the M.O. of a terrorist casing a potential target,” Steinhorst said. “It also matches the M.O. of a group of teens drinking beer and fooling around.”

I don’t read The Onion enough.

Is ASSM evil in a DW?

Posted on the May 29th, 2007. Read 297 times

Source: oramoss oracle [link]

I saw an interesting article on ASSM from Howard the other day. I wanted to comment on it but unless I’m being a bit thick, it doesn’t seem to be a blogpost and therefore I couldn’t and instead, I thought I’d do it here.

As I said, I found the article interesting and informative but I did have a few “But what about…?” issues that popped into my head as I read through it.

It probably won’t surprise anyone who reads my blog that my queries relate to the use of ASSM on a datawarehouse environment given my interest in that area.

Firstly, Howard asserted that ASSM wouldn’t be helpful in scenarios where you recieved sorted data from a warehouse source which you wanted to load up into a target table in order that you could gain a performance benefit from creating an index on that data whilst taking advantage of the NOSORT option - thereby speeding up the index creation considerably. Nothing Howard said was untrue although there are a number of things that could be discussed in relation to this point…

Firstly, in my experience, most warehouse processing uses Oracle parallel execution for performance - but if one were to load up data into a target table using parallel processing, then even if it were sorted in the source, the use of parallel processing would result in the target being unsorted and hence this advantage would be lost. Of course you could resort to serial processing but would that be more efficient than doing the work in parallel and then sorting the data for the index build again using parallel processing techniques - each process is probably different but I guess what I’m saying is that it’s possible that the NOSORT issue is not necessarily relevant in all cases especially those on a DW using parallel execution.

There are a number of caveats with the use of the NOSORT option such as:

You cannot specify REVERSE with this clause.
You cannot use this clause to create a cluster index partitioned or bitmap index.
You cannot specify this clause for a secondary index on an index-organized table.

Read more here

The use of Direct Path operations on a DW might also mean that the space wastage doesn’t occur given Oracle constructs full blocks from clean ones rather than via free list management or the ASSM equivalent.

Howard suggested that tests he has undertaken showed a 1.5% excess space utilisation when ASSM was involved on a 1Tb table. A long time ago I would have been horified to lose 15Gb but these days I probably wouldn’t be as concerned, particularly as I often find ways to save that kind of storage just as easily in other areas, e.g. Ensuring we use compression appropriately for tables and indexes and ensuring we only have indexes we need and which are of the right type and structure. If the solution to save that 1.5% were simple then I’d consider that too - not using ASSM sounds like a simple solution but I guess it depends on what the DBA standards are for the organisation amongst other factors.

Howard mentioned that with ASSM “wasting” space and using more blocks to store the same number of rows, that would mean a full scan would need to read more blocks for a given table and that the numerous, repeatedly accessed and therefore “hot”, ASSM bitmap blocks could lead to other, “warm” data being aged out sooner than would otherwise be the case - It’s a reasonable hypothesis and if we were full scanning a 1Tb table that had 1.5% or 15Gb of “waste” - I’m making a big assumption due to my lack of research here, in guessing that the 15Gb is largely made up of these ASSM bitmap blocks - then yes, 15Gb of blocks could have a considerable impact on a buffer cache…but that’s assuming you’d be reading the whole 1Tb table via a full scan - one would imagine the table is partitioned by some reasonable key (hopefully including time) and that therefore you’d only be reading a much smaller subset with the consequent reduced effect on the buffer cache. The effect is there but it’s impact would be system/query dependent.

Since the read (full scan) of the table itself would result in blocks going on the LRU end of the LRU list for the buffer cache then that wouldn’t affect the caching of hot/warm data at all - it’s just a matter of whether the ASSM bitmap blocks would be aging out other “warm” data to the detriment of overall system performance.

Whilst it raised a number of queries, it’s definitely the case that the article Howard posted has raised quite a debate within the DBA team on the warehouse I’m currently working on…should be fun this week on the warehouse as we contemplate peforming a number of benchmarks to try and ascertain which way we should jump - if any!

On another note, even though he’s moved on to pastures new, I think Doug Burns must have had something to do with the latest Sky TV adverts as the phone number they wanted you to call was 08702 42 42 42 - now that’s what I call leaving a lasting impression!

Contextual Shifts in Metadata Understanding

Posted on the May 28th, 2007. Read 264 times

Source: Blog: Dan E. Linstedt [link]

Context fascinates me, not to mention I love a good challenge…. Perhaps it’s the fever I have tonight and perhaps it’s just a wandering mind. In this entry I’m going to explore a couple of perspectives that I’ve been creating lately, along with a few theories that I’m proving out - and they have to do with (what else?) The obvious. The fact that metadata drives our data model structures, but rarely are metadata synchronized with definitional context (derived automatically from unstructured data sources), and rarely are they visualized beyond the standard 2 dimensional data models that we are so used to seeing and working with.

This entry is a thought experiment that dives into a land of “what-if” analysis, and attaches it to what I call Dynamic Data Warehousing - which also leads to Dynamic Automated Architecture Manageability. The problem is: how can we build a consistent, standardized, and solid foundational data model that will adapt its self going forward as the business changes and the needs change? (All of this of course without loosing sight of all the history that has already been collected). Impossible you say? Not at all…

Kettle ETL Jobs over the Internet

Posted on the May 28th, 2007. Read 347 times

Source: bayon blog [link]

Kettle recently switched over from accessing the file system via standard Java libraries to using the Apache VFS libraries. So what does that mean? Why does that matter? Well, it really opens up a whole slew of deployment options, and provides even more options for managing Kettle code. You can now manage your actual ETL jobs and transforms in Zipfiles, on Web Servers, FTP servers, WebDav locations, etc. Basically, you can making Kettle even thinner than it is right now.

For instance, you can now run Kettle ETL jobs with ONLY the base Kettle installation and a remote URL. Consider the following example.

I have two Kettle jobs:

http://www.nicholasgoodman.com/kettle/parentjob.kjb
200705281727

-

://www.nicholasgoodman.com/kettle/subjob.kjb
200705281729

The parent job, does something very simple: It executes the subjob.kjb using relative addressing. You can use the well known and supported ${Internal.Job.Filename.Directory} variable so that you don’t have to hard code the physical location of another jobs/transform/datafile/etc. In this case, it doesn’t matter if the subjob is on the local file system or on a webserver. The dialog to setup the relative addressing looks like this:

200705281733

Now, using nothing but the standard Kettle 2.5.0 download I can execute these two jobs without any client side ETL Jobs or Transforms.

./kitchen.sh -file=http://www.nicholasgoodman.com/kettle/parentjob.kjb

17:42:30,075 INFO [Kitchen] Kitchen - Start of run.
17:42:30,644 INFO [Kettle] Kettle - Reading repositories XML file: /Users/ngoodman/.kettle/repositories.xml
17:42:30,647 ERROR [Kettle] Kettle - Error opening file: /Users/ngoodman/.kettle/repositories.xml : java.io.FileNotFoundException: /Users/ngoodman/.kettle/repositories.xml (No such file or directory)
ERROR: No repositories defined on this system.
2007/05/28 17:42:30:695 PDT [INFO] DefaultFileReplicator - Using “/tmp/vfs_cache” as temporary files store.
17:42:31,991 INFO [Thread[parentjob (parentjob (Thread-2)),5,main]] Thread[parentjob (parentjob (Thread-2)),5,main] - Sleeping: 0 minutes
17:42:31,992 INFO [parentjob] parentjob - Starting entry [subjob]
17:42:32,133 INFO [Thread[subjob (subjob (Thread-3)),5,main]] Thread[subjob (subjob (Thread-3)),5,main] - Sleeping: 0 minutes
17:42:32,134 INFO [subjob] subjob - Starting entry [Dummy]
17:42:32,135 INFO [subjob] subjob - Finished jobentry [Dummy] (result=true)
17:42:32,233 INFO [parentjob] parentjob - Starting entry [Dummy]
17:42:32,234 INFO [parentjob] parentjob - Finished jobentry [Dummy] (result=true)
17:42:32,234 INFO [parentjob] parentjob - Finished jobentry [subjob] (result=true)
17:42:32,235 INFO [Kitchen] Kitchen - Finished!
17:42:32,235 INFO [Kitchen] Kitchen - Start=2007/05/28 17:42:30.630, Stop=2007/05/28 17:42:32.235
17:42:32,235 INFO [Kitchen] Kitchen - Processing ended after 1 seconds.

Just another nice feature that allows even more interesting ways to manage a deployment of ETL jobs and Transforms. Great work Kettle team!

Balanced Scorecard: El mapa estratégico (parte III)

Posted on the May 28th, 2007. Read 875 times

Source: Sistemas Decisionales, algo mas que Business Intelligence [link]

El mapa estratégico, es sin duda el elemento más importante de un BSC, por desgracia en muchas implantaciones nos olvidamos de este elemento fundamental.
El mapa estratégico debe describir, en forma clara y visual, la estrategia de la organización. Debe permitir comprender, de un solo vistazo, la estrategia de la empresa, incluidas las relaciones causa-efecto y las dependencias entre los objetivos.

Veamos un ejemplo de mapa estratégico (mapa publicado en la revista DATA.TI)

En primer lugar, para crear un mapa estratégico tenemos que identificar las perspectivas por la cuales queremos analizar el estado de nuestra empresa.

Cuatro de ellas ya las sabemos, pero quizás necesitemos añadir alguna más.

Seguidamente debemos definir los objetivos estratégicos y trazar la forma de conseguir con objetivos locales a cada perspectiva. La representación de estas relaciones causa-efecto será nuestro mapa estratégico. Si al acabar el mapa no tenemos ningún objetivo local en alguna de las dimensiones, veremos claramente que nuestra estrategia no esta equilibra, no esta “balanceada” como pretendemos con el BSC.
El mapa estratégico puede a su vez estar subdividido por líneas de actuación, como puede ser la de crecimiento, la de productividad o la de calidad.

Por último no debemos olvidar, que un mapa estratégico no es algo que se crea una vez y se arrincona, es algo dinámico, que debe reflejar la estrategia cambiante según las necesidades del entorno en el que nos movemos nosotros y nuestros competidores.

Por eso debe mantenerse periódicamente y mostrar de forma gráfica y clara (por ejemplo mediante el uso de colores) el estado de consecución de cada uno de los objetivos.

La importancia de los mapas estratégicos en el Balanced Scorecard es tanta que los mismo Kaplan y Norton, ya practicamente no hablan de otra cosa en los ultimos años, priorizándolo por encima del cuadro de mando y la representación de indicadores de la que hablaré en el próximo post.

Are ad-hoc query tools less used now?

Posted on the May 28th, 2007. Read 277 times

Source: Pete-s random notes [link]

For once I don’t have a strong opinion of my own, so I am seeking your input.
Over the past few years pervasive BI is the big must-have in corporate intelligence systems; that is, making the whole organisation intelligence consumers by giving users performance dashboards and embedding fixed queries into line of business applications. But what […]

Como instalar Pentaho

Posted on the May 28th, 2007. Read 5754 times

Source: Todo BI: Business Intelligence, Data Warehouse, CRM y mucho mas... [link]

Una de las cuestiones que mas nos planteais en vuestros correos es la dificultad incial para empezar con Pentaho, dado que tenemos una gran cantidad de opciones y productos que descargar, pero no se hace fácil saber cómo. Por ello, os vamos a presentar una novedad que hará muy sencillo iniciarse con Pentaho. Un instalador de Pentaho, que va a estar disponible desde la próxima release 1.5.4, y que os mostramos como novedad.

Pero si antes queréis saber un poco más sobre Pentaho y hasta dónde podéis llegar, no olvidéis echar un ojo a estos enlaces imprescindibles:

Primera Formación Oficial de Pentaho en España
Pentaho: la solucion Open Source Business Intelligence
Presentación Business Intelligence Open Source
Recortes de Pentaho
Demo Online de Pentaho
Pentaho Reporting
Pentaho OLAP


Paso 1:

Descargarse un único archivo: pentaho-1.5.4.0-windows-opensource-installer, que estará disponible desde la web de descargas de Pentaho y de Sourceforge.


Paso 2:

Aparece la pantalla de presentación del instalador en donde se dice que se va a instalar la demo preconfigurada que trae Pentaho. Sirve para hacerse una idea de lo que ofrece. Pero para ver funcionalidades mas avanzadas os aconsejamos echar un ojo a los enlaces superiores que os indicábamos.

Pentaho Install 1

Paso 3:

Licencia Open Source MPL (Mozilla Public License), lo que nos da muchas garantias de uso, ya que cumple con todos los criterios del Software Libre.

Pentaho Install 2

Paso 4:

Indicamos el directorio donde queremos instalar:

Pentaho Install 3

Paso 5:

Elegimos entre las opciones Pentaho Reporting Server y Pentaho BI Server. Para ver una comparativa de ambas opciones echar un ojo al siguiente enlace. Nosotros os recomendamos la Pentaho BI Server, pues da una visión mas amplia.

Pentaho Install 4

Paso 6:

Podemos elegir entre instalar Jboss como servidor de aplicaciones (viene por defecto, pero tambien se pueden usar Tomcat u otros comerciales) y la base de datos, bien una embebida: Hypersonic o un MySQL (existente o nuevo). Igualmente, mas adelante se puede configurar para cualquier base de datos.
La opción mas rápida es la primera: usar una Base de Datos integrada

Pentaho Install 5

Paso 7:

Elegimos una instalación por defecto o avanzada. La diferencia es que en la avanzada podemos configurar los siguientes puertos: Aplication Server Port, AJP connector port, Jboss Rmiport, Jboss Default Port, Jboss Web Service Port, Jboss RMI Object Port, Jboss Server bind invoker port, y Jboss Server bind UIL2 port.

Pentaho Install 6

Paso 8:

Decimos si queremos que se installe Jboss como un servicio para se arranque cada vez que reinicializamos la máquina.

Pentaho Install 7

Paso 9:

En nuestro menu tenemos las siguientes opciones para administrar Pentaho.

Pentaho Install 8


…. y ya está!!!!

Tags: Destacado

Counting Days in MDX

Posted on the May 27th, 2007. Read 483 times

Source: Mosha Pasumansky [link]

Time dimension is special in OLAP. Many MDX functions usually only make sense when applied to Time dimension (PrevMember, Lag, ParallelPeriod, PeriodsToDate, ClosingPeriod etc); semiadditive measures work differently with Time etc. Today, however, we will talk about much simpler subject - counting number of days in the currently selected period. There are all kinds of uses for this metric, for example computing averages over time (this is usually interesting in inventory applications to get average level of inventory). For this article we will use Adventure Works cube and compute average of [Internet Sales Amount] over [Ship Date] dimension. We could say that really for computing averages over time, one should use AverageOfChildren semiadditive measure. This is true with two caveats. First, semantics of AverageOfChildren semiadditive aggregation with respect to treating NULLs is the same as with Avg function - i.e. it will not count days which had no sales, and if we wanted to count such days, then AverageOfChildren won’t work. Secondly AverageOfChildren is available only in Enterprise Edition. Lastly knowing number of days (or other time periods) in the currently selected time is useful in other calculations as well, we use average because it is simple enough, yet illustrative for our goals.

The simplest and most straightforward way to calculate the number of days in current date would be

Count(Descendants([Ship Date].[Calendar], [Ship Date].[Calendar].[Date]))

Now using this calculation we can also calculate other metrics such as Sales per Day etc. Let’s see it work in the following query

WITH
 MEMBER Measures.[Number Of Days] AS Count(Descendants([Ship Date].[Calendar], [Ship Date].[Calendar].[Date]))
 MEMBER Measures.SalesPerDay  AS [Measures].[Internet Sales Amount] / Measures.[Number Of Days]
SELECT
{[Measures].[Internet Sales Amount], Measures.[Number Of Days], Measures.SalesPerDay} ON 0
, DrillDownMember(
   DrillDownMember(
    DrillDownMember(
     DrillDownMember([Ship Date].[Calendar].[All Periods], [Ship Date].[Calendar].[All Periods])
     ,[Ship Date].[Calendar].[Calendar Year].&[2003])
    ,[Ship Date].[Calendar].[Calendar Semester].&[2003]&[1])
   ,[Ship Date].[Calendar].[Calendar Quarter].&[2003]&[2])
  ON 1
FROM [Adventure Works]

This query shows sales, number of days and sales per day for the report where the user drilled down to year 2003, first semester, second quarter to show months of April, May and June. Works great. But the problems start if the user wants to see SalesPerDay metric not for the entire Q2 and not for individual months, but for combination of April and May. The operation widely known as multiselect.

WITH
 MEMBER Measures.[Number Of Days] AS Count(Descendants([Ship Date].[Calendar], [Ship Date].[Calendar].[Date]))
 MEMBER Measures.SalesPerDay  AS [Measures].[Internet Sales Amount] / Measures.[Number Of Days]
SELECT
{[Measures].[Internet Sales Amount], Measures.[Number Of Days], Measures.SalesPerDay} ON 0
FROM [Adventure Works]
WHERE {[Ship Date].[Calendar].[Month].&[2003]&[4],[Ship Date].[Calendar].[Month].&[2003]&[5]}

Both number of days and sales per day show familiar error ‘Query (2, 45) The MDX function DESCENDANTS failed because the coordinate for the ‘Month Name’ attribute contains a set‘. The piece ‘Query (2, 45)‘ tells us that the problem is in the query at line 2, column 45. That’s in the definition of the [Number of Days] calculated member during the call to the Descendants function (we knew that already from the error text, just in case if there were more than one Descendats function there, we know which one failed). I have written about this particular problem before in the article “Writing multiselect friendly MDX calculations“. Using the techniques described in that article we could rewrite the definition of the calculated member to replace Descendants function with EXISTING operator. The result will look like following:

WITH
 MEMBER Measures.[Number Of Days] AS Count(EXISTING [Ship Date].[Calendar].[Date])
 MEMBER Measures.SalesPerDay  AS [Measures].[Internet Sales Amount] / Measures.[Number Of Days]
SELECT
{[Measures].[Internet Sales Amount], Measures.[Number Of Days], Measures.SalesPerDay} ON 0
FROM [Adventure Works]
WHERE {[Ship Date].[Calendar].[Month].&[2003]&[4],[Ship Date].[Calendar].[Month].&[2003]&[5]}

Now it doesn’t fail and returns correct result for both number of days and sales per day. So are we happy ? No, we are not happy, because not all of the problems are solved. And I am not even talking now about performance hit that we take by replacing Descendants with EXISTING. Given the fact that the Time dimension is usually very small (i.e. 10 years will require no more than 3660 days, which is tiny as far as Analysis Services cares), the performance difference is not going to be very significant. I am talking about how basic scenarios which require either visual totals or custom grouping are going to be broken in more fundamental way than multiselect.

Let’s go back to our first query where we drilled down to 2003, Quarter 3. Now, let’s say we want to hide month of June as if it didn’t exist, and see how the results for Q3, and the entire year 2003 change. This is so called visual totals mode, and Excel always turned it on by default. Other tools such as OWC also allow it. There are many different ways how visual totals can be implemented - either by using ‘Default Visual Mode’ connection string property, or by using VisualTotals MDX function or by using subselects or CREATE SUBCUBE statement - it doesn’t matter. At the end of the day all of them achieve the same thing. Let’s see how our query would look like in this scenario:

WITH
 MEMBER Measures.[Number Of Days] AS Count(Descendants([Ship Date].[Calendar], [Ship Date].[Calendar].[Date]))
 MEMBER Measures.SalesPerDay  AS [Measures].[Internet Sales Amount] / Measures.[Number Of Days]
SELECT
{[Measures].[Internet Sales Amount], Measures.[Number Of Days], Measures.SalesPerDay} ON 0
,  VisualTotals(
    Except(
     DrillDownMember(
      DrillDownMember(
       DrillDownMember(
        DrillDownMember([Ship Date].[Calendar].[All Periods], [Ship Date].[Calendar].[All Periods])
       ,[Ship Date].[Calendar].[Calendar Year].&[2003])
      ,[Ship Date].[Calendar].[Calendar Semester].&[2003]&[1])
     ,[Ship Date].[Calendar].[Calendar Quarter].&[2003]&[2])
    ,[Ship Date].[Calendar].[Month].&[2003]&[6])
   )
   ON 1
FROM [Adventure Works]

Here we hide June by using Except, and then wrap everything in the VisualTotals call to get visual totals. The results are pretty bad. While the physical measure [Internet Sales Amount] works absolutely correctly in the presence of visual totals and shows only sales of April and May, the calculated measures completely ignore it. I.e. the [Number of Days] still shows 91 even though it should be 61. What’s worse, the SalesPerDay metric ends up completely fubared as a result of it. After all it is division of [Internet Sales Amount] by [Number of Days]. So we divide something which took visual totals into account by something that didn’t. The result is neither donkey nor orange. The number simply doesn’t make any sense. At least with multiselect we got an error - so the user didn’t get the result he wanted, but he also didn’t get wrong result either. He got an error, so he knew something was wrong. Here user gets wrong result without any warning. How does he know it is wrong ? And changing the expression to use EXISTING isn’t going help - the result will stay the same. There is a fundamental problem here. With multiselect, the current coordinate reflects it, so MDX expressions can detect presence of multiselect. But visual totals are stealth. It is impossible to detect them directly from MDX, one can only watch their side effects. Generally speaking it is a hard problem, but some people think about solution and you can help !

Luckily, in this particular case there is something that can be done. There is a solution for counting number of days in the current time period. And unlike with some other problems, this solution is actually very elegant. And it also solves some other problems. And it has best performance. And it doesn’t require any MDX ! This may sound like magic, but it is true. The solution is actually quite simple, and it was discovered by many people independently. In particular, Vladimir Shtepa is known for popularizing it in Internet forums. The trick is to create new measure group, which will have only one dimension - Time, and a measure with Aggregation Type ‘Count’ bound to the Day attribute of Time dimension (or any other attribute if we need to count something different from days). The most important setting on this new measure group is to leave IgnoreUnrelatedDimensions=true - this will allow to use this measure across any other dimension in the cube. Let’s call this measure [Number Of Days] and see how our queries perform with it.

Both

WITH
 MEMBER Measures.SalesPerDay  AS [Measures].[Internet Sales Amount] / Measures.[Number Of Days]
SELECT
{[Measures].[Internet Sales Amount], Measures.[Number Of Days], Measures.SalesPerDay} ON 0
FROM [Adventure Works]
WHERE {[Ship Date].[Calendar].[Month].&[2003]&[4],[Ship Date].[Calendar].[Month].&[2003]&[5]}

And

WITH
 MEMBER Measures.SalesPerDay  AS [Measures].[Internet Sales Amount] / Measures.[Number Of Days]
SELECT
{[Measures].[Internet Sales Amount], Measures.[Number Of Days], Measures.SalesPerDay} ON 0
,  VisualTotals(
    Except(
	  DrillDownMember(
		DrillDownMember(
			DrillDownMember(
			  DrillDownMember([Ship Date].[Calendar].[All Periods], [Ship Date].[Calendar].[All Periods])
			  , [Ship Date].[Calendar].[Calendar Year].&[2003])
			, [Ship Date].[Calendar].[Calendar Semester].&[2003]&[1])
		, [Ship Date].[Calendar].[Calendar Quarter].&[2003]&[2])
	  , [Ship Date].[Calendar].[Month].&[2003]&[6])
	)
   ON 1
FROM [Adventure Works]

now work fine and return correct results. But there are more advantages to this approach

* If we need to compute not just number of days, but something more sophisticated, like number of non-holidays - this can be easily done. Just add another measure into the measure group, with AggregationType Sum, and put the value ‘1′ for the days which are considered to be non-holidays and NULL for the days which are holidays.

* It works great with multiple Time dimensions and Role playing dimensions. Indeed, in Adventure Works we have 3 different role playing Time dimensions. By using all of them in the measure group, we will get single measure [Number Of Days] to correctly show the number of days regardless by which Time dimension user slices. With calculated measures approach, different calculated measures need to be created per Time dimension.

* Performance is the best there can be, because now [Number Of Days] is a real measure which is precomputed during processing.

* In addition to multiselect and visual totals, other scenarios such as custom grouping and sliced local cubes - all work correctly.

 

Next Page »