Lakehouse on the Plateau of Performance: From Hype to Practical Value
In which cases a Lakehouse can become the best solution for a business, what functional and technical requirements a solution must satisfy to be considered a full-fledged Lakehouse platform, what a typical implementation path looks like, and how the total cost of data ownership is reduced — Vladislav Podrechnyev, Director of Data Engineering at Cinimex, shares his views.
— What prerequisites would you note for the emergence of a Lakehouse class solution in the Russian market? How has its appearance affected the big data market?
V. Podrechnyev: The emergence of Lakehouse is a natural result of evolution in the data world. Classical DWHs handled regulated reporting and data-mart building well as long as data was well structured and predictable. As the number of sources grew (event logs, streaming, semi-structured and unstructured data), business processes sped up, and demands for self-service analytics tightened, DWHs became economically heavy and operationally “rigid.” High license costs, combined with the need to duplicate data, forced data engineers to look for cheaper and more flexible approaches to data storage.
Thus, a parallel world of Data Lakes arose, offering flexibility in storing any types of data, but lacking transactional integrity, strict governance, and guaranteed reproducibility of results. Sloppy operation and non-ideal architectural practices often turned data lakes into “swamps.”
Amid this, open table formats appeared (Apache Iceberg, Delta Lake, Hudi, Paimon), bringing the behavior of managed tables to file storage: ACID operations, versioning, time travel, and schema evolution. Essentially, files in the Data Lake gained the properties of managed DWH tables, and this became the technical trigger for the appearance of Lakehouse.
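To make the idea concrete, here is a purely conceptual sketch (this is not the actual Iceberg or Delta API): a table-format commit log is essentially a chain of immutable snapshots over immutable data files, which is what makes atomic appends, schema evolution, and time travel possible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """An immutable table version: data-file paths plus a schema."""
    version: int
    files: tuple
    schema: tuple

class TableLog:
    """Toy commit log mimicking how open table formats layer ACID
    semantics and time travel over immutable files in object storage."""
    def __init__(self, schema):
        self._snapshots = [Snapshot(0, (), tuple(schema))]

    def commit_append(self, new_files):
        head = self._snapshots[-1]
        # An atomic commit publishes one new snapshot referencing the old
        # files plus the new ones; readers never observe a half-write.
        self._snapshots.append(
            Snapshot(head.version + 1, head.files + tuple(new_files), head.schema)
        )

    def commit_add_column(self, column):
        head = self._snapshots[-1]
        # Schema evolution is a metadata-only change: no data files rewritten.
        self._snapshots.append(
            Snapshot(head.version + 1, head.files, head.schema + (column,))
        )

    def snapshot(self, version=None):
        """Time travel: read the latest version or any past one by number."""
        return self._snapshots[-1 if version is None else version]

log = TableLog(schema=["id", "amount"])
log.commit_append(["s3://lake/orders/part-0.parquet"])
log.commit_add_column("currency")
```

Real formats persist these snapshots as metadata files alongside the Parquet data, but the reader-visible contract is the same: versioned, atomically updated, schema-evolving tables over immutable files.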
On the business side, drivers include the exponential growth of volumes and types of data, analytics spreading beyond IT into business units, and the rapid rise of AI and LLMs as data consumers.
Thus Lakehouse relieves the pain points of both camps: “expensive and rigid” on the DWH side, “flexible but without guarantees” on the Data Lake side. As a result, we obtain a unified, scalable foundation for a growing range of analytical and product scenarios.
— How has the appearance of Lakehouse influenced team practices?
V. Podrechnyev: With a Lakehouse implementation, data follows a much shorter path and reaches end users more quickly. Where the business previously had to wait through lengthy, labor-intensive processing stages, data can now be obtained and used from a single “source of truth.” As a result, it becomes more accurate and more accessible, and consumers receive it substantially faster.
— What functional and technical requirements must a solution satisfy to be considered a full-fledged Lakehouse platform?
V. Podrechnyev: Lakehouse is built on three key “pillars.” At the base of a mature Lakehouse platform is a unified storage layer in open columnar formats (most often Parquet), on top of which a transactional table layer is implemented. This layer, be it Delta Lake, Apache Iceberg, Hudi, or Paimon, provides ACID guarantees, versioning, time travel, and schema evolution, making it possible to maintain consistency and reproducibility of data without copying it between separate environments. Physically, the data resides in object storage, e.g. S3-compatible stores such as MinIO.
The second mandatory element is a compute layer that supports classic SQL and can operate directly on tables in object storage. In practice, different SQL engines are applied to different tasks: Apache Spark, Trino, Dremio, StarRocks, Flink, Hive, Impala, etc. Each technology has its strengths and limitations: some are better suited to large-scale batch processing, others to interactive low-latency queries, still others to streaming or integration with complex open-source ecosystems. The specific choice depends on the task profile, the load, and the infrastructure.
The third “pillar,” which is often and wrongly omitted from the list of mandatory ones, is the Data Governance layer, which brings order to data usage. Its key components are data and metadata catalogs, which let you see the full data lineage, understand who is responsible at each stage, and know when and why the data entered the system and how it is structured. This helps not only to keep data lakes from turning into “swamps,” but also to make data work more efficient.
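As a minimal, hedged illustration (the names and structure here are hypothetical, not those of any specific catalog product), a metadata catalog that answers “where did this dataset come from and who owns it” is essentially a directed graph of datasets with owners and upstream links:

```python
class Catalog:
    """Toy data catalog: registered datasets with owners and upstream links,
    enough to answer lineage questions such as 'what does this mart depend on?'"""
    def __init__(self):
        self._entries = {}

    def register(self, name, owner, upstream=()):
        self._entries[name] = {"owner": owner, "upstream": tuple(upstream)}

    def lineage(self, name):
        """Walk upstream links to list every dataset this one depends on."""
        seen, stack = [], [name]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            entry = self._entries.get(current)
            if entry:
                stack.extend(entry["upstream"])
        return seen

catalog = Catalog()
catalog.register("raw_events", owner="ingest-team")
catalog.register("clean_events", owner="de-team", upstream=["raw_events"])
catalog.register("sales_mart", owner="bi-team", upstream=["clean_events"])
```

Production catalogs add much more (schemas, quality checks, access policies), but the core value is exactly this traceability: for any mart, you can name its sources and the owner responsible at each step.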
It is important to remember: technologies only work when there is a data culture and structured organizational processes. In practice, only the combination of technological foundation and mature Data Governance makes it possible to build a full and efficient Lakehouse platform.
— How do you position your platform?
V. Podrechnyev: Our approach: we “tailor” high-quality “suits” to order and are not aiming for mass market. In other words, we focus on designing and assembling architectures specific to the client’s tasks, industry, and infrastructure. The differences in Lakehouse for a large bank, an industrial holding, a pharma company, and a retailer will be radical: from a toolset emphasizing streaming to a platform fully optimized for training AI models. We always begin with analysis of the current data landscape, business processes, and pain points, and then choose an optimal tech stack — thereby ensuring compatibility, open formats, and protection from vendor lock-in.
Unlike approaches in which a client is imposed a fixed product with limited functionality, we always build solutions on open standards and integrate both world-class open-source platforms and proven commercial components. This allows us to rapidly respond to evolving requirements, connect new sources, scale infrastructure, and adapt it without expensive migrations. If a client has already introduced expensive proprietary software and is satisfied with it, we will consider it part of the target system to preserve their investments and thus reduce risk.
As an integrator, we are ready to implement third-party boxed platforms as well, providing full support — from requirement gathering to deployment and maintenance.
— What does the architecture of your Lakehouse platform look like?
V. Podrechnyev: At the core of our platform lies a unified physical storage layer in open columnar formats (primarily Parquet) with transactional tables layered above it (using Delta Lake, Apache Iceberg, Hudi, and Paimon).
The compute layer is chosen per client tasks, using SQL engines that operate directly on those tables.
Governance is managed via Data Governance tools.
The platform is built modularly and evolves as a set of replaceable services around a unified data core. We employ microservices, CI/CD, and containerization tools (Docker, Kubernetes) so that components can be scaled independently, updated without downtime, and new scenarios integrated quickly.
Security and availability are baked in from the start: we include multi-level access control, auditing, fault-tolerant deployments, and standard connection protocols.
Because the architecture is based on open formats and standard protocols, it remains vendor-agnostic: components can be swapped without data rebuilding and while preserving manageability.
A key element is the portal as a single entry point into the system. Through it, admins and data owners receive a user-friendly interface for process monitoring, access management, and working with the data and metadata catalog, as well as launching analytical and integration scenarios.
Taken together, this makes the Lakehouse platform reliable and convenient for daily team operation.
— What key business problems does your platform address?
V. Podrechnyev: Among the tasks the platform solves:
— reducing the total cost of data ownership;
— shortening the time from query to analytical result;
— increasing transparency and manageability of information flows;
— preparing data in an ML Ready format for AI scenario building.
This approach gives our clients not just a modern platform that “sits on the shelf,” but a full operational data ecosystem into which processes, people, and technologies integrate organically.
— What does the typical implementation path look like: what do clients come with, and what minimal viable step do you recommend starting with?
V. Podrechnyev: Usually clients don’t come to us saying “build us a Lakehouse,” but with a specific business project for which this architecture is a good fit. It might be Customer 360, building a corporate data platform for certain divisions, recommendation systems, import substitution of foreign solutions, or optimization of regulatory reporting. There are many scenarios, and task phrasing varies, but all are united by one thing: in every case, they need faster and more efficient data work with clear and sustainable economics.
Constraints are almost always similar: make maximum use of the existing tech stack without a major infrastructure overhaul, and stay within budget and schedule so that the first tangible results appear within a few months and justify the investment. It is important to integrate into existing processes so that they are not halted during the “remodel.” The pillars mentioned earlier serve as the foundation for launching a minimum viable scenario: the client quickly sees value and can picture how the platform will scale. The further route is clear: a short pilot on a priority use case, parallel deployment of the platform scaffold, and stepwise addition of new domains and sources with a limited period of coexistence between the old and new environments. After that come the switchover and the decommissioning of the previous solutions.
Most of the time is spent not on technology but on changing people’s working habits. You need to justify the architecture to the business, explain the economics and the risks of the usual approaches, and reduce anxiety about the transition to Lakehouse. Here, management support is critically important: top management must set the strategic priority and the project goals, and then the project acquires a unified vector. A “painless” transition mode helps too: training people for their roles (including the data owner) and setting clear rules such as “if data is not in the catalog, it cannot be consumed.” Then the team sees the benefit and understands that a return to the old processes, e.g. manual Excel exports, will no longer be allowed.
Success is always ensured by three components: a quick and meaningful start on one scenario; transparent architecture on open formats with basic governance; and organizational readiness for change. Then Lakehouse ceases to be just another IT project “gathering dust” and becomes a convenient data tool for the client.
— Do you take full responsibility for supporting staff, both technically and psychologically?
V. Podrechnyev: Yes. As a customer-focused integrator, we always take on the development of data culture at our clients, as well as training and support.
— Let’s do a brief “postmortem” — a case where the implementation was difficult. What was the root cause, what corrections did you make, and what lesson did you draw?
V. Podrechnyev: One of the hardest projects was optimizing regulatory reporting at a large financial institution. We quickly realized the main problem was not in the technology stack but in the organizational part. Data preparation was distributed across different teams, key steps were done manually, and rules were scarcely formalized. Users were accustomed to existing procedures and were not ready to switch to a new format, and political will for rapid change at the outset was lacking.
Another risk factor was high staff turnover: as people changed, priorities and interpretation of requirements changed too.
Ultimately, we built a correction plan focusing not on code but on architecture and processes. We mapped current data flows in detail, documented each manual transformation, and assembled end-to-end data lineage so that everyone could see where each metric came from and how it transformed.
In parallel, we aligned roles and responsibilities for data domains, instituted transparent data publishing rules, and introduced a change window to reboot requirements and reduce chaos. Then we launched a pilot on one priority report with a very limited set of users and a short period of parallel mode. Old and new circuits worked side by side until formal switchover.
The compromise was that we significantly narrowed the first release and shifted some requests to subsequent iterations.
In return, we obtained a stable platform and a predictable change process. The main lesson turned out to be simple: even the best Lakehouse architecture does not deliver effect without governed processes and leadership support. When there are data owners, data catalogs, clear access rules, and political will, technology starts to work at full force. Without that, the platform becomes just a collection of disparate tools.
— Provide 2–3 practical cases from different industries where Lakehouse gave tangible business value.
V. Podrechnyev: It is important to understand that Lakehouse architecture is applicable across very different domains and industries. I will give examples from fundamentally different fields.
Our first encounter with this architectural concept was in 2022, during an ML project for an insurance company. The data and systems volume was so great, and SLA demands on data processing so strict, that we quickly realized: we needed to emphasize efficient data handling. We conducted a huge number of experiments with various data formats and tested many compute engines until we reached the needed performance. After all the investigation, the project was awarded “Best Machine Learning Model for an Insurance Company” in the “Project of the Year” competition by Global CIO.
A second interesting case is the automation of distributor reporting at a pharmaceutical company. Data arrived via email, SFTP, and APIs in different formats, and processing passed through multiple levels in the RDBMS: from raw data to cleaned data to data marts. That still-classical path took a lot of time and resources. We decided to adopt a modern approach and implement a Lakehouse architecture. After that, the data path became significantly shorter and faster, and the total cost of ownership decreased. The final consumers, the BI teams, could build reports on current data with minimal delay. The need for constant physical copying of data between database layers also disappeared.
A third example is a Customer 360 project for marketing in a medical company. The goal was to offer personalized services to patients in a timely manner in order to increase their lifetime value. Before Lakehouse, the analytics and data science teams spent massive amounts of time preparing data from a medical information system (MIS) built on Firebird, a legacy system with limited integration options. We created a unified managed data layer for them, integrated it with the MIS and with the data catalog, gave that layer versioning, and enabled the publishing of ready datasets for analytics and ML. This allowed us to drastically shorten iteration cycles in the ML project and bring the model into production significantly faster.
Another example is a company in a major bank’s ecosystem. In that case, the client came already requesting Lakehouse implementation. The task was to build a virtual workspace for analysts, data engineers, and data scientists. Important conditions included using the corporate file store and giving users the ability to subscribe to datasets and request access themselves. As a result, we delivered a full Lakehouse platform that integrated organically into the existing infrastructure, ensuring security, manageability, and — most importantly — convenient access to data. New use cases now get connected without reworking storage, and the access approval process is transparent and predictable.
— What “before / after” metrics can you share?
V. Podrechnyev: In the insurance company case, our experiments showed tenfold performance differences between a poorly implemented technology and a quality implementation on the best-performing stack. In the Customer 360 case, data scientists and analysts cut the time to prepare data marts from several days to roughly 1–2 hours, freeing time to explore the data and deliver results.
— For what types of workloads is your platform especially “strong”?
V. Podrechnyev: Lakehouse is most efficient where you need to combine the flexibility of a Data Lake with the manageability and predictability of a DWH. Primarily this is analytics over large and heterogeneous datasets: from classical BI to predictive models. If you need to process data in real time, combining streaming and batch processing, Lakehouse allows you to do so without bulky ETL pipelines and unnecessary data duplication.
The second direction is projects where speed of rolling new scenarios into production is critical: anti-fraud, personalized offers, event monitoring, client behavior analysis. The platform ensures a short path from data source to mart or model while preserving transparency, version control, and data quality.
A third direction is preparing data for AI and ML tracks. The ML-Ready layer is not just an add-on but part of its DNA, an integral architectural feature. It delivers clean consistent arrays with versioning and access via standard tools. This is especially critical for LLM or recommendation system projects where data quality directly affects the result.
— It is sometimes said that Lakehouse is not a universal answer to every data problem. Do you agree?
V. Podrechnyev: I fully support that position: Lakehouse is not a “silver bullet”; it does not solve all problems.
In working with data, difficulties generally lie at the level of organizational processes and client-side culture. That culture must be built and sustained. The role of technology is to make data access, governance, and analytics insight generation as convenient as possible. Yet technology is only a tool: it helps, but will not give answers to a person who is not ready to use it.
So it is not a panacea, but a sensible evolution of technical approaches, another step in an ongoing change. The key to success lies beyond the technology itself: in organizational decisions and in the political will that make data work a strategy, a project goal, and a norm for all employees.
— Are there directions not yet “mature” for Lakehouse use?
V. Podrechnyev: I would speak less of directions and more of anti-patterns: cases when Lakehouse is not needed. Any technology or architecture is appropriate only in its own scenarios. If data volumes are measured in gigabytes rather than terabytes, i.e. it is not big data, Lakehouse will bring no benefit: it is a sledgehammer to crack a nut and a mistaken architectural choice. If data is strictly structured, changes very rarely, and a Data Warehouse is already built, there is no sense in shifting to Lakehouse: that would be replacing one system with another with no gain. If an organization is not ready to change its approach to data, build a data culture, and implement Data Governance, then Lakehouse is also not applicable. Governance is the DNA of the system: without a data and metadata catalog, Lakehouse quickly becomes a swamp. But if all those conditions are met, the data is truly abundant, and the company is ready to change and maintain order in its data, then Lakehouse is justified. The spectrum of solvable tasks in that case is enormous: from data engineering and analytics to BI and ML scenarios.
— How does the platform ensure security requirements? Are there “pitfalls” it’s undesirable to step on?
V. Podrechnyev: Because we are not implementing a boxed solution, in all our projects security is part of the Lakehouse architecture, built in from the start rather than an optional “add-on.” We always begin with data classification and the assignment of domain owners, after which we define centralized access policies and apply them to every way of accessing the data. Usually a role-based and attribute-based access model is used, enabling row- and column-level separation and the masking of sensitive fields. Encryption is applied both at rest and in transit. Keys and secrets are managed through corporate secret stores, so they never appear in code or pipelines.
An important part is the process itself: access requests go through a portal with approval steps, expiration times, and mandatory auditing. Compliance control is built jointly with the customer’s information security or privacy teams. Of course, we take into account local requirements for personal data, logging, and data placement.
Another crucial aspect is centralized auditing. We log who accessed what data, what transformations were performed, and which datasets reports were built on. This is supported by the data catalog and data lineage, effectively the history of each metric’s origin. We also employ isolation of environments (dev, test, prod), network segmentation, and distinct service accounts. For high availability, we use fault-tolerant deployments and verified backup and restore procedures.
The “pitfalls” in security are mostly organizational rather than technical: duplicating policies across tools, “grey” data exports that bypass the catalogs, excessive privileges requested “just in case,” the use of real personal data in test environments, or access granted without expiry or regular audits. The checklist is simple: classify your data, detect such cases, and root them out so that personal and other sensitive information is handled properly.
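The row- and column-level idea can be sketched as follows (the policy structure and role names are illustrative assumptions, not those of any specific tool): a central policy decides, per role, which columns are visible and which sensitive fields get masked before data reaches the consumer.

```python
# Illustrative sketch of centralized column-level policies with masking.
# The policy layout and role names are hypothetical, not a specific product.
POLICIES = {
    "analyst": {"visible": {"customer_id", "amount", "phone"}, "mask": {"phone"}},
    "auditor": {"visible": {"customer_id", "amount", "phone"}, "mask": set()},
}

def mask(value):
    """Hide all but the last 4 characters of a sensitive value."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]

def apply_policy(row, role):
    """Project a row through the role's policy: drop invisible columns,
    mask sensitive ones, and pass the rest through unchanged."""
    policy = POLICIES[role]
    result = {}
    for column, value in row.items():
        if column not in policy["visible"]:
            continue  # column-level restriction: the field is simply absent
        result[column] = mask(value) if column in policy["mask"] else value
    return result

row = {"customer_id": 42, "amount": 99.5, "phone": "+70001234567", "ssn": "123-45-6789"}
```

The design point is that the policy lives in one place and is applied at read time for every access path, so adding an engine or a consumer does not require re-implementing the rules.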
— How do you ensure portability and absence of vendor lock-in in practice?
V. Podrechnyev: As I mentioned, we initially build the Lakehouse on open-source technologies or the client’s proprietary solutions. The underlying physical storage formats are classic: Parquet with transactional layers. This means data is not locked inside a proprietary system and can be read by any compatible tools (Spark, Trino, Dremio, StarRocks, etc.) depending on the client’s needs.
In the architecture, we plan for component replacement without reassembling or migrating the data itself. If the client unexpectedly decides to switch the SQL engine or the Data Governance tool, one can simply replace the compute or service layer, keeping storage and formats the same. This approach works in on-prem environments as well as in hybrid and multi-cloud infrastructures.
I should note that among our cloud partners (Yandex Cloud, MTS Web Services, and VK Cloud), the same paradigm of component interchangeability is used for specific tasks. That facilitates the portability of solutions between environments and expands the client’s choice of tools. In our projects, we actively use this compatibility to accelerate deployment and reduce migration risk.
We also always standardize integration points via open APIs and common protocols, not via narrow private connectors that may become bottlenecks. This allows us to quickly hook up new data sources and consumers without vendor lock-in. For the business, this means reduced long-term risk and the freedom to evolve the architecture toward future needs without costly, prolonged migrations.
— By what means is the reduction of total cost of ownership achieved?
V. Podrechnyev: Lakehouse saves money primarily by combining the capabilities of a classical DWH and a Data Lake in one architecture. We avoid duplicating data across separate environments, reducing storage and administration costs. Instead of supporting two parallel platforms, it becomes possible to use one unified platform that supports strict transactional scenarios while still allowing work with raw data.
From a cost standpoint, Lakehouse is usually cheaper than a classical DWH: firstly through the elimination of expensive licenses, secondly through reduced infrastructure duplication. At the same time, it is generally more expensive than a simple Data Lake, because it includes transactional guarantees, governance mechanisms, and reproducibility tools. It is precisely this balance of functionality and cost that gives the business maximum return on investment.
There are also operational advantages: data paths become shorter, technical processes are simplified, and manual operations and bottlenecks decrease. This accelerates the cycle from query to result and allows IT teams to focus on business value rather than on supporting complex infrastructure.
Ultimately, the business gets direct benefit: faster product releases, quicker decision-making, and infrastructure savings without sacrificing functionality. In summary: Lakehouse provides an optimal ratio of cost, flexibility, and speed.
— What do you see as the next stage in Lakehouse evolution?
V. Podrechnyev: Despite the interest and the hype, Lakehouse is only now reaching its plateau of performance and the stage of broad industrial adoption. Many companies are today at the pilot and early-deployment stages, and ahead lies a mass transition to full-scale operation. It is important not just to keep pace, but to build the architecture so that it remains relevant in five or ten years. If before the main task was to gather and structure data, now we can know everything about it, from lineage to quality. The next stage is cultivating a mature data culture within companies and sharply lowering the entry barrier. I believe data work should be as native and intuitive as using a smartphone: clear interfaces, minimal technical barriers, confidence in the correctness of results. The likelihood of erroneous conclusions will diminish through built-in checks, automated quality control, and the emergence of a native language for communicating with data, when requests can be formed in natural human language. Architecturally, this will lead to a new wave of democratization of access: data will cease to be the prerogative of a narrow circle of specialists and will become part of everyday work for every employee and unit.
— Could you share your ideas for how this direction might develop?
V. Podrechnyev: Much of this is already being actively tested, both by us and by many large companies and vendors. If you ask ChatGPT or another LLM a question about an uploaded file, the model will certainly formulate an answer. However, not all solutions that allow a natural-language dialogue with data are yet available for on-premises scenarios. Moreover, compliance and corporate standards often prevent uploading sensitive information into such services. Therefore, many organizations deploy large language models internally: this already gives them the ability to interact with data in a familiar, human way. The market is moving in that direction; pilots and early experiments are underway, but the evolution will take time. We face a very interesting wave of data democratization and expansion of the data universe.
Read the full interview at FutureBanking.
V. Podrechnyev: The emergence of Lakehouse is a natural result of the evolution in the data world. Classical DWHs handled regulated reporting and building data marts well in conditions where data was well structured and more predictable. As the number of sources grew (event logs, streaming, semi and unstructured data), business processes sped up, and demands for self-service analytics tightened, DWHs became economically heavy and operationally “rigid.” High license costs combined with the necessity to duplicate data forced data engineers to look for cheaper and more flexible approaches to data storage.
Thus, a parallel world of Data Lakes arose, offering flexibility in storing any types of data, but lacking transactional integrity, strict governance, and guaranteed reproducibility of results. Sloppy operation and non-ideal architectural practices often turned data lakes into “swamps.”
Amid this, open tabular formats appeared (Iceberg, Delta, Hudi, Paimon), which brought “behavior” of managed tables into file storage: ACID operations, versioning, time travel, and schema evolution. Essentially, files in the Data Lake gained the properties of managed DWH tables — this became a technical trigger for the appearance of Lakehouse.
On the business side, drivers include the exponential growth of volumes and types of data, analytics spreading beyond IT into business units, and the rapid rise of AI and LLMs as data consumers.
Thus Lakehouse relieves pain points from both camps: “expensive and rigid” in DWH, “flexible but without guarantees” in Data Lake. As a result, we obtain a unified and scalable foundation for an expanding funnel of analytical and product scenarios.
— How has the appearance of Lakehouse influenced team practices?
V. Podrechnyev: With Lakehouse implementation, data follows a much shorter path and becomes available to end users more quickly. If before the business had to wait for lengthy and laborious stages of processing, now data can be obtained and used from a single “source of truth.” As a result, it becomes more accurate and accessible, and consumers receive it substantially faster.
— What functional and technical requirements must a solution satisfy to be considered a full-fledged Lakehouse platform?
V. Podrechnyev: Lakehouse is built on three key “pillars.” At the base of a mature Lakehouse platform is a unified storage layer in open columnar formats (most often Parquet), upon which a transactional tabular layer is implemented. This layer — be it Delta Lake, Apache Iceberg, Hudi, or Paimon — ensures ACID guarantees, versioning, time travel, and schema evolution, allowing maintenance of consistency and reproducibility of data without copying between different circuits. Physically, these data reside in object stores, e.g. MinIO.
The second mandatory element is a compute layer supporting classic SQL, able to operate directly over tables in object storage. In practice, different SQL engines are applied for different tasks: Apache Spark, Trino, Dremio, StarRocks, Flink, Hive, Impala, etc. Each technology has its strengths and limitations: some are better for large-scale batch processing, others for interactive low-latency queries, others for streaming or integration with complex open-source ecosystems. The specific choice depends on the task profile, load, and infrastructure.
The third “pillar,” which is often and wrongly omitted from the list of obligatory ones, is the Data Governance layer, enabling order in data usage. Its key components are data and metadata catalogs, which let you see the full data lineage, understand responsibility zones at each stage, when and for what the data entered the system, and how it is structured. This helps not only to prevent data lakes from becoming “swamps,” but also to increase the efficiency of data work.
It is important to remember: technologies only work when there is a data culture and structured organizational processes. In practice, only the combination of technological foundation and mature Data Governance makes it possible to build a full and efficient Lakehouse platform.
— How do you position your platform?
V. Podrechnyev: Our approach: we “tailor” high-quality “suits” to order and are not aiming for mass market. In other words, we focus on designing and assembling architectures specific to the client’s tasks, industry, and infrastructure. The differences in Lakehouse for a large bank, an industrial holding, a pharma company, and a retailer will be radical: from a toolset emphasizing streaming to a platform fully optimized for training AI models. We always begin with analysis of the current data landscape, business processes, and pain points, and then choose an optimal tech stack — thereby ensuring compatibility, open formats, and protection from vendor lock-in.
Unlike approaches in which a client is imposed a fixed product with limited functionality, we always build solutions on open standards and integrate both world-class open-source platforms and proven commercial components. This allows us to rapidly respond to evolving requirements, connect new sources, scale infrastructure, and adapt it without expensive migrations. If a client has already introduced expensive proprietary software and is satisfied with it, we will consider it part of the target system to preserve their investments and thus reduce risk.
As an integrator, we are ready to implement third-party boxed platforms as well, providing full support — from requirement gathering to deployment and maintenance.
— What does the architecture of your Lakehouse platform look like?
V. Podrechnyev: At the core of our platform lies a unified physical storage layer in open columnar formats (primarily Parquet) with transactional tables layered above it (using Delta Lake, Apache Iceberg, Hudi, and Paimon).
The compute layer is chosen to match the client’s tasks, using SQL engines that operate directly on those tables.
Governance is handled by a dedicated Data Governance layer: data and metadata catalogs, lineage tracking, and access policies.
The platform is built modularly and evolves as a set of replaceable services around a single data core. We employ microservices, CI/CD, and containerization tools (Docker, Kubernetes) so that components can be independently scaled, updated without downtime, and quickly extended with new scenarios.
Security and availability are baked in from the start: we include multi-level access control, auditing, fault-tolerant deployments, and standard connection protocols.
Because the architecture is based on open formats and standard protocols, it remains vendor-agnostic: components can be swapped without rebuilding the data and without losing manageability.
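The transactional table layers mentioned above (Delta Lake, Iceberg, Hudi, Paimon) all rest on the same idea: immutable data files plus versioned snapshot metadata, where each commit publishes a new snapshot and time travel is simply reading an older one. A toy, stdlib-only Python sketch of that idea (not any real format's API; all names are hypothetical):

```python
import copy
from typing import Optional

class VersionedTable:
    """Toy model of a transactional table format: every commit produces a new
    immutable snapshot. Real formats (Iceberg, Delta Lake) keep snapshots as
    metadata files layered over immutable Parquet data files."""

    def __init__(self):
        self._snapshots = [[]]          # snapshot 0 is the empty table

    @property
    def current_version(self) -> int:
        return len(self._snapshots) - 1

    def commit(self, new_rows: list) -> int:
        """Atomically append rows by publishing a new snapshot."""
        snapshot = copy.deepcopy(self._snapshots[-1]) + new_rows
        self._snapshots.append(snapshot)
        return self.current_version

    def read(self, version: Optional[int] = None) -> list:
        """Read the latest snapshot, or an older one ("time travel")."""
        return self._snapshots[-1 if version is None else version]

table = VersionedTable()
v1 = table.commit([{"id": 1, "amount": 100}])
v2 = table.commit([{"id": 2, "amount": 250}])
print(len(table.read()))      # 2 rows at the latest version
print(len(table.read(v1)))    # 1 row when reading "as of" version 1
```

Because old snapshots are never mutated, readers always see a consistent version of the table while writers publish new ones, which is what gives file storage the “behavior of managed DWH tables” described earlier.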
A key element is the portal as a single entry point into the system. Through it, admins and data owners receive a user-friendly interface for process monitoring, access management, and working with the data and metadata catalog, as well as launching analytical and integration scenarios.
Taken together, this makes the Lakehouse platform reliable and convenient for daily team operation.
— What key business problems does your platform address?
V. Podrechnyev: Among the tasks the platform solves:
— reducing the total cost of data ownership;
— shortening the time from query to analytical result;
— increasing transparency and manageability of information flows;
— preparing data in an ML Ready format for AI scenario building.
This approach gives our clients not just a modern platform that “sits on the shelf,” but a full operational data ecosystem into which processes, people, and technologies integrate organically.
— What does the typical implementation path look like: what do clients come with, and what minimal viable step do you recommend starting with?
V. Podrechnyev: Usually clients don’t come to us saying “build us a Lakehouse,” but with a specific business project for which this architecture is a good fit. It might be Customer 360, building a corporate data platform for certain divisions, recommendation systems, import substitution of foreign solutions, or optimization of regulatory reporting. There are many scenarios, and task phrasing varies, but all are united by one thing: in every case, they need faster and more efficient data work with clear and sustainable economics.
Constraints are almost always similar: make maximum use of the existing tech stack without a major infrastructure overhaul, and stay within budget and schedule so that the first tangible results appear within a few months and justify the investment. It is also important to integrate into existing processes so they are not halted during the “remodel.” The pillars mentioned earlier serve as the foundation for launching a minimally viable scenario: the client quickly sees value and can picture how the platform will scale. The further route is clear: a short pilot on a priority use case, parallel deployment of the platform scaffold, and stepwise addition of new domains and sources with a limited period during which the old and new environments coexist. After that come the switchover and the decommissioning of the previous solutions.
Most of the time is spent not on technology but on changing people’s habits. You need to justify the architecture to the business, explain the economics and risks of the usual approaches, and reduce anxiety about the transition to Lakehouse. Here, management support is critically important: top management must commit to the strategic priority and project goals, and then the project acquires a unified vector. A “painless” transition mode helps too: training for new roles (including data owner) and clear rules such as “data that is not in the catalog cannot be consumed.” Then the team sees the benefit and understands that a return to old processes, e.g. manual Excel exports, will no longer be allowed.
Success is always ensured by three components: a quick and meaningful start on one scenario; transparent architecture on open formats with basic governance; and organizational readiness for change. Then Lakehouse ceases to be just another IT project “gathering dust” and becomes a convenient data tool for the client.
— Do you take full responsibility for the psychological and technical support of staff?
V. Podrechnyev: Yes — as a customer-focused integrator, we always take on the development of data culture at clients, as well as training and support.
— Let’s do a brief “postmortem” — a case where the implementation was difficult. What was the root cause, what corrections did you make, and what lesson did you draw?
V. Podrechnyev: One of the hardest projects was optimizing regulatory reporting at a large financial institution. We quickly realized the main problem was not in the technology stack but in the organizational part. Data preparation was distributed across different teams, key steps were done manually, and rules were scarcely formalized. Users were accustomed to existing procedures and were not ready to switch to a new format, and political will for rapid change at the outset was lacking.
Another risk factor was high staff turnover: as people changed, priorities and interpretation of requirements changed too.
Ultimately, we built a correction plan focusing not on code but on architecture and processes. We mapped current data flows in detail, documented each manual transformation, and assembled end-to-end data lineage so that everyone could see where each metric came from and how it transformed.
In parallel, we aligned roles and responsibilities for data domains, instituted transparent data publishing rules, and introduced a change window to reset requirements and reduce chaos. Then we launched a pilot on one priority report with a very limited set of users and a short period of parallel operation. The old and new pipelines ran side by side until the formal switchover.
The compromise was that we significantly narrowed the first release and shifted some requests to subsequent iterations.
In return, we obtained a stable platform and a predictable change process. The main lesson turned out to be simple: even the best Lakehouse architecture does not deliver effect without governed processes and leadership support. When there are data owners, data catalogs, clear access rules, and political will, technology starts to work at full force. Without that, the platform becomes just a collection of disparate tools.
— Provide 2–3 practical cases from different industries where Lakehouse gave tangible business value.
V. Podrechnyev: It is important to understand that Lakehouse architecture is applicable across very different domains and industries. I will give examples from fundamentally different fields.
Our first encounter with this architectural concept was in 2022, during an ML project for an insurance company. The volume of data and systems was so great, and the SLA requirements on data processing so strict, that we quickly realized we needed to focus on efficient data handling. We ran a huge number of experiments with various data formats and tested many compute engines until we reached the required performance. Following that work, the project was awarded “Best Machine Learning Model for an Insurance Company” in the “Project of the Year” competition by Global CIO.
A second interesting case is the automation of distributor reporting in a pharmaceutical company. Data arrived via email, SFTP, and APIs in different formats, and processing passed through multiple levels in the RDBMS: from raw data to cleaned data to data marts. That (still classical) path took much time and resources. We decided to adopt a modern approach and implement a Lakehouse architecture. After that, the data path became significantly shorter and faster, and the total cost of ownership decreased. The final consumers — BI — could build reports on current data with minimal delay. Also, the need for constant physical copying of data between DB layers disappeared.
A third example is a Customer 360 project for marketing in a medical company. The goal was to offer personalized services to patients in a timely way to increase their lifetime value. Before Lakehouse, the analytics and data science teams spent massive amounts of time preparing data from a medical information system (MIS) built on Firebird, which was legacy and had limited integration options. We created a unified managed layer for them, integrated it with the MIS and the data catalog, and gave that layer versioning and the ability to publish ready datasets for analytics and ML. This allowed us to drastically shorten iteration cycles in the ML project and bring the model to production significantly faster.
Another example is a company in a major bank’s ecosystem. In that case, the client came already requesting Lakehouse implementation. The task was to build a virtual workspace for analysts, data engineers, and data scientists. Important conditions included using the corporate file store and giving users the ability to subscribe to datasets and request access themselves. As a result, we delivered a full Lakehouse platform that integrated organically into the existing infrastructure, ensuring security, manageability, and — most importantly — convenient access to data. New use cases now get connected without reworking storage, and the access approval process is transparent and predictable.
— What “before / after” metrics can you share?
V. Podrechnyev: In the insurance company case, our experiments showed tenfold performance differences between a poorly implemented technology and a quality implementation on the best-performing stack. In the Customer 360 case, data scientists and analysts reduced the time to prepare data marts from several days to roughly 1–2 hours, which is now the time it takes to explore the data and deliver results.
— For what types of workloads is your platform especially “strong”?
V. Podrechnyev: Lakehouse is most efficient where you need to combine the flexibility of a Data Lake with the manageability and predictability of a DWH. Primarily this is analytics over large and heterogeneous datasets: from classical BI to predictive models. If you need to process data in real time, combining streaming and batch processing, Lakehouse allows you to do so without bulky ETL pipelines and unnecessary data duplication.
The second direction is projects where speed of rolling new scenarios into production is critical: anti-fraud, personalized offers, event monitoring, client behavior analysis. The platform ensures a short path from data source to mart or model while preserving transparency, version control, and data quality.
A third direction is preparing data for AI and ML tracks. The ML-Ready layer is not an add-on but an integral part of the platform’s DNA. It delivers clean, consistent datasets with versioning and access via standard tools. This is especially critical for LLM or recommendation-system projects, where data quality directly affects the result.
The fourth direction is for banks building regulatory reporting or integrating divisions via a single data platform. Here performance is valued alongside built-in access control, auditing, and compliance mechanisms, which make it possible to work with sensitive data without compromising security.
— At one of our forums, an idea was voiced that Lakehouse is not a panacea for all woes. Do you agree?
V. Podrechnyev: I fully support that position: Lakehouse is not a “silver bullet”; it does not solve all problems.
In working with data, the difficulties generally lie at the level of organizational processes and the client’s culture. That culture must be built and sustained. The role of technology is to make data access, governance, and the generation of analytical insights as convenient as possible. Yet technology is only a tool: it helps, but it will not give answers to a person who is not ready to use it.
So it is not a panacea, but a sensible evolution of technical approaches, another step in an ongoing change. The key to success lies beyond the technology itself: in organizational decisions and in the political will that makes data work a strategy, a project goal, and a norm for all employees.
— Are there directions not yet “mature” for Lakehouse use?
V. Podrechnyev: I would speak less of directions and more of anti-patterns: cases when Lakehouse is not needed. Any technology or architecture is appropriate only in its own scenarios. If data volumes are measured in gigabytes rather than terabytes, i.e. it is not big data, Lakehouse will not bring benefit; it would be using a sledgehammer to crack a nut and a mistaken architectural choice. If the data is strictly structured, changes very rarely, and a Data Warehouse is already built, there is no sense in shifting to Lakehouse: that would be replacing one system with another for no gain. If an organization is not ready to change its approach to data, build a data culture, and implement Data Governance, then Lakehouse is also not applicable. Governance is the DNA of the system: without a data and metadata catalog, a Lakehouse quickly becomes a swamp. But if all those conditions are met, the data is truly abundant, and the company is ready to change and maintain order in its data, then Lakehouse is justified. The spectrum of solvable tasks in that case is enormous: from data engineering and analytics to BI and ML scenarios.
— How does the platform ensure security requirements? Are there “pitfalls” it’s undesirable to step on?
V. Podrechnyev: Because we implement not a boxed solution, in all our projects security is part of the Lakehouse architecture, built in from the start rather than an optional “add-on.” We always begin with data classification and the assignment of domain owners, after which we define centralized access policies and apply them to all ways of accessing data. Usually, a role-based and attribute-based access model is used, enabling row- and column-level separation and masking of sensitive fields. Encryption is applied both at rest and in transit. Keys and secrets are managed through corporate secret stores, so they do not appear in code or pipelines.
An important part is the process itself: access requests go through a portal with approval steps, expiration times, and mandatory auditing. Compliance control is built jointly with the customer’s information security (IS) or privacy teams. Of course, we take into account local requirements for personal data, logging, and data placement.
Another crucial aspect is centralized auditing. We log who accessed which data, what transformations were performed, and which datasets reports were built on. This is supported by the data catalog and data lineage, which is effectively the history of a metric’s origin. We also employ isolation of environments (dev, test, prod), network segmentation, and distinct service accounts. For high availability, we use fault-tolerant deployments and verified backup and restore procedures.
The “pitfalls” in security are mostly organizational rather than technical: for example, duplicating policies across tools, “grey” data exports bypassing the catalogs, excessive privileges requested “just in case,” using real personal data in test environments, or granting access without expiry or regular review. The checklist is simple: classify your data, detect such cases, and root them out so that personal and any other sensitive information is handled properly.
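The row- and column-level controls described above can be sketched in a few lines. This is a minimal, stdlib-only Python illustration of attribute-based column masking; the role name `pii_reader`, the column list, and the helper functions are all hypothetical, not a real product's API:

```python
# Columns considered sensitive in this toy policy (hypothetical).
SENSITIVE_COLUMNS = {"phone", "passport"}

def mask_value(value: str, keep_last: int = 4) -> str:
    """Replace all but the last few characters of a sensitive field."""
    if len(value) <= keep_last:
        return "*" * len(value)
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def read_row(row: dict, roles: set) -> dict:
    """Column-level masking: only roles granted 'pii_reader' see raw values."""
    if "pii_reader" in roles:
        return dict(row)
    return {
        col: mask_value(str(val)) if col in SENSITIVE_COLUMNS else val
        for col, val in row.items()
    }

row = {"client_id": 42, "phone": "+79991234567"}
print(read_row(row, roles={"analyst"}))
# {'client_id': 42, 'phone': '********4567'}
print(read_row(row, roles={"analyst", "pii_reader"}))
# {'client_id': 42, 'phone': '+79991234567'}
```

The point of centralizing such a policy, as described above, is that every access path applies the same check, instead of each tool reimplementing its own masking rules.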
— How do you ensure portability and absence of vendor lock-in in practice?
V. Podrechnyev: As I mentioned, we initially build the Lakehouse on open-source technologies or on proprietary solutions the client already uses. The underlying physical storage formats are classic: Parquet with transactional layers on top. This means the data is not locked inside a proprietary system and can be read by any compatible tools (Spark, Trino, Dremio, StarRocks, etc.) depending on the client’s needs.
In the architecture, we plan for component replacement without reassembling or migrating the data itself. If the client unexpectedly decides to switch SQL engines or Data Governance tools, one can simply replace the compute or service layer while keeping the storage and formats the same. This approach works in on-prem environments as well as in hybrid and multi-cloud infrastructures. I should note that our cloud partners (Yandex Cloud, MTS Web Services, and VK Cloud) follow the same paradigm of component interchangeability for specific tasks. That facilitates the portability of solutions between environments and expands the client’s choice of tools. In our projects, we actively use this compatibility to accelerate deployment and reduce migration risk.
We also always standardize integration points via open APIs and common protocols, not via narrow private connectors that may become bottlenecks. This allows us to quickly connect new data sources and consumers without vendor lock-in. For the business, this means reduced long-term risk and the freedom to evolve the architecture toward future needs without costly, prolonged migrations.
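The component-swapping idea can be illustrated as code: if application logic depends only on a narrow engine interface, the compute layer is replaceable while storage and formats stay put. A minimal Python sketch (all class names are hypothetical; `EngineA` and `EngineB` stand in for, say, Spark or Trino clients):

```python
from typing import Protocol

class QueryEngine(Protocol):
    """Any compute engine that can run SQL over the shared open-format tables.
    Swapping engines means swapping this object, not migrating the data."""
    def execute(self, sql: str) -> list: ...

class EngineA:
    def execute(self, sql: str) -> list:
        # Stand-in for a real SQL engine client (e.g. a Trino connection).
        return [{"engine": "A", "sql": sql}]

class EngineB:
    def execute(self, sql: str) -> list:
        # A second engine reading the same tables via the same interface.
        return [{"engine": "B", "sql": sql}]

def run_report(engine: QueryEngine, sql: str) -> list:
    """Application code depends only on the interface, so the compute layer
    can be replaced without touching storage or business logic."""
    return engine.execute(sql)

sql = "SELECT day, revenue FROM marts.daily_sales"
print(run_report(EngineA(), sql)[0]["engine"])  # A
print(run_report(EngineB(), sql)[0]["engine"])  # B
```

The same principle applies at the service layer: as long as integration points are open APIs and common protocols rather than private connectors, replacing a component is a configuration change, not a migration.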
— By what means is the reduction of total cost of ownership achieved?
V. Podrechnyev: Lakehouse primarily saves by combining the capabilities of a classical DWH and a Data Lake in one architecture. We avoid duplicating data across different environments, reducing storage and administration costs. Instead of supporting two parallel platforms, it is possible to use one unified platform that ensures strict transactional scenarios while still allowing work with raw data.
From a cost standpoint, Lakehouse is usually cheaper than a classical DWH, first by eliminating expensive licenses and second by reducing infrastructure duplication. At the same time, it is generally more expensive than a simple Data Lake, because it includes transactional guarantees, governance mechanisms, and reproducibility tools. It is precisely this balance of functionality and cost that gives the business maximum return on investment.
There are also operational advantages: data paths become shorter, technical processes are simplified, and manual operations and bottlenecks decrease. This accelerates the cycle from query to result and allows IT teams to focus on business value rather than on supporting complex infrastructure.
Ultimately, the business gets direct benefit: faster product releases, quicker decision-making, and infrastructure savings without sacrificing functionality. In summary: Lakehouse provides an optimal ratio of cost, flexibility, and speed.
— What do you see as the next stage in Lakehouse evolution?
V. Podrechnyev: Despite the interest and hype, Lakehouse is only now reaching its plateau of performance and a stage of broad industrial adoption. Many companies are today in pilot and early deployment stages, and ahead lies a mass transition to full-scale operation. It is important not just to keep pace, but to build the architecture so it remains relevant in five or ten years. If before the main task was to gather and structure data, now we can know everything about it, from lineage to quality. The next stage is cultivating a mature data culture within companies and sharply lowering the entry barrier. I believe data work should be as native and intuitive as using a smartphone: clear interfaces, minimal technical barriers, confidence in the correctness of results. The likelihood of erroneous conclusions will diminish via built-in checks, automated quality control, and the emergence of a native language for communicating with data, where you can form requests in natural human language. Architecturally, this will lead to a new wave of democratization of access: data will cease to be the prerogative of a narrow circle of specialists and will become part of everyday work for every employee and unit.
— Could you share your ideas for how this direction might develop?
V. Podrechnyev: Much of this is already being actively tested, both by us and by many large companies and vendors. If you ask ChatGPT or another LLM a question about an uploaded file, the model will certainly formulate an answer. However, not all solutions allowing dialogue with data in natural language are yet available for on-premises scenarios. Moreover, compliance and corporate standards often prevent uploading sensitive information into such services. Therefore, many organizations deploy large language models internally: this already gives the ability to interact with data in a familiar, human way. The market is moving in that direction; pilots and early experiments are underway, but the evolution will take time. We face a very interesting wave of data democratization and expansion of the data universe.
Read the full interview at FutureBanking.