Mile High Tech Con

There’s a brand new business intelligence conference launching next month in Denver, Colorado.  The Mile High Tech Con is a three-day event taking place July 24-26, 2014, and is aimed at business intelligence practitioners, data analysts, and information managers/CIOs.

From the event website:

Featuring three days of sessions and events focusing on Data Warehousing and Business Intelligence. Mile High Tech Con will highlight innovative topics that will inspire strategic decision making while offering relationship building and networking. Mile High Tech Con will foster collaboration and discussion so that attendees attain the opportunity to connect with industry experts and fellow developers while examining current and upcoming trends.

Among the topics to be covered at this conference:

  • Power BI
  • Predictive analytics
  • Biml
  • Master data management
  • Data quality
  • Dimensional modeling

There’s a great lineup of speakers, including Peter Myers, Marc Beacom, Reeves Smith, Bill Fellows, and others.  I even managed to convince them to let me speak as well!  I’ll be delivering three presentations:

  • Real World SSIS (full day seminar)
  • Error Handling Patterns in SSIS (two-hour deep dive)
  • Introduction to Data Quality Services (one-hour presentation)

This is the inaugural event for the Mile High Tech Con.  They have a model similar to that of SQLBits – one day of full-day seminars (Thursday), one day of deep-dive material (Friday), and one community day (Saturday).  Best of all, the community day is absolutely free.  Registration is currently open, and early bird pricing for Thursday and/or Friday is in effect through next week.

I’m looking forward to attending and presenting at this event.  If you’re going to be there, give me a shout and let’s get together for coffee.

Transaction Rollbacks with Identity Values

Have you ever noticed unexpected gaps in the sequence of values in an IDENTITY column?  Even though you’ve wrapped your inserts in transactions and enforce a no-deletion policy, you find missing values in what should be an unbroken sequence of numbers.  The problem may be at least partially related to transaction rollbacks.

Conventional wisdom suggests that rolling back a transaction undoes all of the logged data changes to a table.  While a rollback does remove the data rows included in the transaction, it does not reset the identity value to its previous setting.  We can see this demonstrated in a brief example.

First, let’s create a table with an identity column and insert some data:

CREATE TABLE #idtest
(
theID INT IDENTITY(1,1)
, theValue VARCHAR(20) NULL
)

INSERT #idtest
VALUES ('Plane')
, ('Train')
, ('Automobile')

Now we’ll check the identity value by running:

DBCC CHECKIDENT('#idtest', NORESEED) -- check the current identity value without changing it

Which should return 3, the current identity value of the table.  Next, we’ll start a transaction, insert a few rows, and then roll back our changes.

BEGIN TRAN

INSERT #idtest
VALUES ('Hop')
, ('Skip')
, ('Jump')

ROLLBACK TRAN

We just inserted three rows but rolled back the transaction, so the new rows were never committed.  However, if you check the identity value again, you’ll see it’s been incremented to 6 even though no new rows have been committed to the table.
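
You can see this for yourself by running the identity check again and then inserting one more row to find where the sequence resumes (the 'Boat' value below is just an arbitrary example):

DBCC CHECKIDENT('#idtest', NORESEED) -- now reports a current identity value of 6

INSERT #idtest
VALUES ('Boat')

SELECT theID, theValue
FROM #idtest -- the new row gets theID = 7, leaving a gap at 4 through 6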

This is actually intended behavior and not a bug in the product.  If you think through some concurrency scenarios, you can understand why identity columns are handled this way: identity values are handed out at insert time, and if a rollback gave them back, a concurrent transaction might already have claimed the next values in the sequence, forcing the engine to either block other inserts for the duration of every transaction or risk duplicate values.  Just an FYI, you can reset the identity value using the same DBCC command listed above, but you should make sure you understand the potential effects of doing so before you roll it out to your production systems.
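
For example, this would set the demo table’s identity value back to 3, so the next inserted row receives a 4 (use this with care outside of a demo, since reused identity values can collide with existing rows):

DBCC CHECKIDENT('#idtest', RESEED, 3) -- reseed so the next inserted row gets theID = 4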

Never Delete Data

Should you ever delete data?  In a production environment, do the benefits of deleting old data outweigh the possible risks?

Data quality is important.  Whether you refer to it as data integrity, permanent retention, or simply maintaining a complete audit trail, it can be effectively argued that deleting data from a production database diminishes the big picture of your data.  After all, any data that is worthy of storing, backing up, optimizing, and mining is worth storing permanently.  Deleting data hinders your ability to thoroughly research historical activity, and can skew reports and aggregations on the remaining data.  Storing only rolled-up data, such as end-of-year financial reports, is often not sufficient, because auditors or financial personnel may need to drill down to the lowest level of detail.  Other information, including certain healthcare data, is best kept forever (and in some cases, the law requires as much) to ensure a proper legal record should it be needed for judicial or civil proceedings.

The need to routinely delete data was far more critical when storage was more expensive, in terms of dollars and system time.  Purchasing disks for storage has never been cheaper, and with modern 15000 RPM drives and solid state disks, data access times continue to improve.  Removing data simply for the sake of saving bytes on a platter is not as critical as it was just a few years ago.  Data can be retained indefinitely, in the original store or in a separate archive (another table or a different database altogether).

To be clear, I’m not taking aim at DBAs who use DELETE to eliminate data.  A proper data retention policy involves all levels of an organization, from the CXOs to the technical staff and end users.  And a competent retention policy doesn’t have to mandate that data remain in the RDBMS – information can be stored in the database, database backups, the filesystem, magnetic tape or optical disk, or a combination of several of these.  The specifics of permanent data storage should be dictated by how frequently and how quickly the data would need to be accessed.

There are times when deleting data is expected and even commonplace.  When staging data in temp tables or table variables, one would expect deletion of data during that processing.  Any process that writes data out to an archive store would naturally need to delete data from the original location, though this is better considered a move rather than a delete.  Deleting sensitive data that will never be reported on or reused is also expected, for the protection of customers or clients – purging credit card numbers after a charge is successfully posted falls into this category.
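
As an illustration of the move-rather-than-delete pattern, here is a minimal T-SQL sketch using hypothetical dbo.Orders and dbo.OrdersArchive tables (the table and column names are just placeholders).  The OUTPUT clause captures the deleted rows and writes them to the archive in the same atomic statement:

-- Move orders older than a cutoff date to an archive table.
-- Because the delete and the archive insert happen in one statement,
-- a failure rolls back both, so no rows are lost in between.
DELETE FROM dbo.Orders
OUTPUT deleted.OrderID, deleted.OrderDate, deleted.Amount
INTO dbo.OrdersArchive (OrderID, OrderDate, Amount)
WHERE OrderDate < '2010-01-01';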

Unfortunately, this decision does not reside with database administrators alone, or even with their employing organizations.  Some vendor applications routinely delete older, less frequently used data as part of a purge intended to improve performance or reduce storage requirements.  I recently experienced this with a healthcare vendor during a conversion from their product to a newer system.  During the planning phase of the conversion project, we discovered that the vendor’s system was hard-coded to purge the detail data from old accounts.  Although we were able to reconstruct some of the data by other means, the ability to thoroughly report on that historical data has been permanently and irreversibly diminished.

The bottom line is that you should ask yourself whether you could ever need the data you are deleting.  You shouldn’t just ask whether it is likely that you will need the data again – approaching from this angle will eventually come back to bite you.  A more appropriate question would be whether you can imagine any scenario, however unlikely, that would require you to reference the data in the future.  Eventually your boss/the board/the CFO/the auditors will come calling, and you’ll be glad you have your safety net.