Altimetrik
I applied via Campus Placement and was interviewed in Aug 2021. There were 6 interview rounds.
The second round covered both aptitude and coding. The aptitude section mostly consisted of basic problems, along with some data science questions on bias, statistics, and probability.
There were 2 coding problems. The ones I got were easy, and solving both took no more than 15 minutes.
Gradient descent is an optimization algorithm used to minimize the cost function of a machine learning model.
Gradient descent is used to update the parameters of a model to minimize the cost function.
It follows the direction of steepest descent, which is the negative gradient of the cost function.
The learning rate determines the step size of the algorithm.
The formula for gradient descent is: theta = theta - alpha * (1/...
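As a minimal sketch of the update rule described above, assume a one-parameter model y = theta * x with a mean-squared-error cost. The toy data, the function name gradient_descent, and the learning rate are illustrative assumptions, not part of the original answer.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=100):
    """Fit y = theta * x by minimising J = (1/(2m)) * sum((theta*x - y)^2)."""
    theta = 0.0
    m = len(xs)
    for _ in range(steps):
        # Gradient of J with respect to theta.
        gradient = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
        # Step against the gradient; alpha is the learning rate (step size).
        theta -= alpha * gradient
    return theta


# Hypothetical toy data generated by y = 2x, so theta should approach 2.
theta = gradient_descent([1, 2, 3, 4], [2, 4, 6, 8])
print(round(theta, 4))
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow.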
A dictionary can be sorted in ascending order by its keys.
Create a dictionary with key-value pairs
Call sorted() on the dictionary's items() to get a list of (key, value) tuples ordered by key
Pass that list to the dict() constructor to create a new dictionary
Dictionaries preserve insertion order in Python 3.7+, so the new dictionary iterates in key order
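The steps above can be sketched as follows. The helper name sort_dict_by_key and the sample data are hypothetical.

```python
def sort_dict_by_key(data):
    """Return a new dict whose entries are ordered by ascending key."""
    # sorted() on items() yields (key, value) tuples ordered by key.
    return dict(sorted(data.items()))


prices = {"pear": 3, "apple": 5, "fig": 1}
sorted_prices = sort_dict_by_key(prices)
print(sorted_prices)
```

Calling sorted() on the dictionary itself would return only the keys, which is why items() is used here.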
I applied via Naukri.com and was interviewed in Jun 2020. There was 1 interview round.
I applied via Naukri.com and was interviewed in Sep 2024. There was 1 interview round.
To create a pipeline in ADF, you can use the Azure Data Factory UI or a code-based approach.
Use the Azure Data Factory UI to visually create and manage pipelines
Use a code-based approach with JSON to define pipelines and activities
Add activities such as data movement, data transformation, and data processing to the pipeline
Set up triggers and schedules for the pipeline to run automatically
Activities in pipelines include data extraction, transformation, loading, and monitoring.
Data extraction: Retrieving data from various sources such as databases, APIs, and files.
Data transformation: Cleaning, filtering, and structuring data for analysis.
Data loading: Loading processed data into a data warehouse or database.
Monitoring: Tracking the performance and health of the pipeline to ensure data quality and reliability.
Get Metadata is an Azure Data Factory activity that retrieves metadata about a dataset or data source.
Get Metadata can return information about the structure, format, and properties of the data.
Depending on the dataset type, this includes fields such as the column structure and column count, item name and type, size, last-modified time, and the child items of a folder.
This information helps data engineers validate, process, and transform the data, and later activities in the pipeline can consume it.
For example, getmetadata can be used ...
Triggers in databases are special stored procedures that are automatically executed when certain events occur.
Types of triggers include: DML triggers (for INSERT, UPDATE, DELETE operations), DDL triggers (for CREATE, ALTER, DROP operations), and logon triggers.
Triggers can be classified as row-level triggers (executed once for each row affected by the triggering event) or statement-level triggers (executed once for eac...
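A row-level DML trigger can be sketched with Python's built-in sqlite3 module. The accounts and audit_log tables and the trigger name are hypothetical, and SQLite is used only for illustration; trigger syntax and the available trigger types differ between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE audit_log (account_id INTEGER, old_balance INTEGER, new_balance INTEGER);

-- Row-level trigger: the body runs once for every updated row.
CREATE TRIGGER log_balance_change
AFTER UPDATE OF balance ON accounts
FOR EACH ROW
BEGIN
    INSERT INTO audit_log VALUES (OLD.id, OLD.balance, NEW.balance);
END;

INSERT INTO accounts VALUES (1, 100), (2, 200);
""")

# One UPDATE statement touches two rows, so the trigger fires twice.
conn.execute("UPDATE accounts SET balance = balance + 50")
audit_rows = conn.execute(
    "SELECT account_id, old_balance, new_balance FROM audit_log ORDER BY account_id"
).fetchall()
print(audit_rows)
```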
An all-purpose (interactive) cluster is used for interactive workloads, while a job cluster is used for batch processing in Databricks.
An all-purpose cluster is used for ad-hoc queries and exploratory data analysis, and can be shared by several users.
A job cluster is created for running scheduled jobs and batch processing tasks.
An all-purpose cluster is terminated manually or, if auto-termination is configured, after a period of inactivity, while a job cluster is terminated when its job completes.
Normal cluster is more cost-effective for short-liv...
Slowly changing dimensions (SCDs) are data warehouse dimensions whose attribute values change slowly and unpredictably over time.
SCD techniques determine how changes to dimension data are recorded and whether history is kept.
The three most commonly used types are Type 1, Type 2, and Type 3.
Type 1 overwrites old data with new data, Type 2 adds a new record for each change, and Type 3 keeps the previous and current values in separate columns.
Example: A customer's address change is handled as Type 2 when the full address history must be preserved.
Ex...
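The difference between Type 1 and Type 3 can be sketched in plain Python, assuming a dimension row is represented as a dict. The helper names and the previous_ column prefix are illustrative assumptions.

```python
def apply_type1(row, column, new_value):
    """Type 1: overwrite the value in place; no history is kept."""
    updated = dict(row)
    updated[column] = new_value
    return updated


def apply_type3(row, column, new_value):
    """Type 3: keep the prior value in a 'previous_<column>' column."""
    updated = dict(row)
    updated[f"previous_{column}"] = row[column]
    updated[column] = new_value
    return updated


customer = {"customer_id": 1, "city": "Pune"}
type1_row = apply_type1(customer, "city", "Chennai")
type3_row = apply_type3(customer, "city", "Chennai")
print(type1_row)
print(type3_row)
```

Type 3 only remembers one prior value per column, so a second change would discard the oldest one.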
Use Python's 'with' statement to ensure resources are released properly, even when an exception is raised.
Use 'with' statement to automatically close files after use
Helps in managing resources like database connections
Ensures proper cleanup even in case of exceptions
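A minimal sketch of this behaviour, assuming a temporary file named notes.txt (the file name and the simulated error are illustrative):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:  # the folder is deleted on exit
    path = os.path.join(folder, "notes.txt")

    # The file is closed as soon as the block ends.
    with open(path, "w", encoding="utf-8") as writer:
        writer.write("hello")

    # Cleanup also happens when the block exits through an exception.
    try:
        with open(path, encoding="utf-8") as reader:
            raise ValueError("simulated failure")
    except ValueError:
        pass

folder_removed = not os.path.exists(folder)
print(writer.closed, reader.closed, folder_removed)
```

The 'with' statement does not suppress the exception here; it only guarantees the file is closed before the exception reaches the except clause.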
List is mutable, tuple is immutable in Python.
List can be modified after creation, tuple cannot be modified.
List uses square brackets [], tuple uses parentheses ().
Lists are used for collections of items that may need to be changed, tuples are used for fixed collections of items.
Example: list_example = [1, 2, 3], tuple_example = (4, 5, 6)
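A short sketch of the practical consequences, reusing the names from the example above (the coordinates dictionary is an added illustration):

```python
list_example = [1, 2, 3]
list_example.append(4)  # lists can grow and change in place

tuple_example = (4, 5, 6)
try:
    tuple_example[0] = 99  # item assignment is not allowed on a tuple
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# A tuple of hashable items can be a dictionary key; a list cannot.
coordinates = {(0, 0): "origin"}
print(list_example, tuple_example, tuple_was_modified)
```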
Datalake 1 and Datalake 2 are both storage systems for big data, but they may differ in terms of architecture, scalability, and use cases.
Datalake 1 may use a Hadoop-based architecture while Datalake 2 may use a cloud-based architecture like AWS S3 or Azure Data Lake Storage.
Datalake 1 may be more suitable for on-premise data storage and processing, while Datalake 2 may offer better scalability and flexibility for clou...
To read a file in Databricks, you can use the Databricks File System (DBFS) or Spark APIs.
Use dbutils.fs.ls('dbfs:/path/to/file') to list files in DBFS
Use spark.read.format('csv').load('dbfs:/path/to/file') to read a CSV file
Use spark.read.format('parquet').load('dbfs:/path/to/file') to read a Parquet file
Star schema is denormalized with one central fact table surrounded by dimension tables, while snowflake schema is normalized with multiple related dimension tables.
Star schema is easier to understand and query due to denormalization.
Snowflake schema saves storage space by normalizing data.
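Star schema suits data warehousing and OLAP workloads where fast, simple queries matter most.
Snowflake schema is also a data warehouse design rather than an OLTP one; it suits dimensions with deep hierarchies or complex relationships, at the cost of extra joins.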
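A star-schema query can be sketched with Python's built-in sqlite3 module. The table names fact_sales and dim_product and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension: category is stored directly on the product row.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    amount INTEGER
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics'), (3, 'Desk', 'Furniture');
INSERT INTO fact_sales VALUES (1, 1, 900), (2, 2, 500), (3, 3, 200), (4, 1, 950);
""")

# One join from the fact table to the dimension answers the question.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(totals)
```

In a snowflake design, category would live in its own table referenced from dim_product, so the same query would need one more join.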
repartition can increase or decrease the number of partitions, while coalesce only decreases it in Spark
repartition performs a full shuffle and is typically used to increase partitions for parallelism or to rebalance skewed data
coalesce merges existing partitions without a full shuffle, which makes it useful for reducing partition overhead
repartition is more expensive than coalesce as it moves data across the cluster
example: df.repartition(10) vs df.coalesce(5)
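The trade-off can be illustrated with a toy model in plain Python, where a dataset is a list of partitions and each partition is a list of rows. This is only a conceptual stand-in, not Spark's implementation: both functions below are hypothetical, and real Spark chooses which partitions to merge differently.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; rows are never split up."""
    n = min(n, len(partitions))  # coalesce cannot increase the count
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index % n].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row across n partitions (models a full shuffle)."""
    rows = [row for part in partitions for row in part]
    result = [[] for _ in range(n)]
    for index, row in enumerate(rows):
        result[index % n].append(row)
    return result


skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
repartitioned_sizes = [len(p) for p in repartition(skewed, 3)]
print(coalesced_sizes, repartitioned_sizes)
```

In this model coalesce is cheap but inherits the skew of its inputs, while repartition touches every row and produces evenly sized partitions.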
Parquet file format is a columnar storage format used for efficient data storage and processing.
Parquet files store data in a columnar format, which allows for efficient querying and processing of specific columns without reading the entire file.
It supports complex nested data structures like arrays and maps.
Parquet files are highly compressed, reducing storage space and improving query performance.
It is commonly used ...
Query performance can be improved by optimizing indexes, using proper data types, and minimizing data retrieval.
Optimize indexes on frequently queried columns
Use proper data types to reduce storage space and improve query speed
Minimize data retrieval by only selecting necessary columns
Avoid using SELECT * in queries
Use query execution plans to identify bottlenecks and optimize accordingly
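The effect of an index on the execution plan can be sketched with Python's built-in sqlite3 module. The orders table, the index name, and the data are hypothetical, and the exact plan wording varies between SQLite versions and between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "amount INTEGER, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount, notes) VALUES (?, ?, ?)",
    [(i % 50, i, "n/a") for i in range(1000)],
)

# Select only the needed column instead of SELECT *.
query = "SELECT amount FROM orders WHERE customer_id = ?"


def plan(sql):
    """Return the plan detail text that SQLite reports for the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (7,)).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan(query)  # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)  # index lookup on customer_id
print(plan_before)
print(plan_after)
```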
An SCD Type 2 table is used to track historical changes in data by creating a new record for each change.
Contains current and historical data
New records are created for each change
Includes effective start and end dates for each record
Requires additional columns like surrogate keys and version numbers
Used for slowly changing dimensions in data warehousing
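A Type 2 update can be sketched in plain Python, assuming the dimension table is a list of dict rows where end_date of None marks the current version. The column names and the helper apply_type2 are illustrative assumptions.

```python
from datetime import date


def apply_type2(history, customer_id, new_city, change_date):
    """Expire the customer's current row and append a new current row."""
    next_key = max((row["surrogate_key"] for row in history), default=0) + 1
    for row in history:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            row["end_date"] = change_date  # close the old version
    history.append({
        "surrogate_key": next_key,  # new surrogate key for the new version
        "customer_id": customer_id,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,  # None marks the current version
    })
    return history


history = [{
    "surrogate_key": 1, "customer_id": 42, "city": "Pune",
    "start_date": date(2023, 1, 1), "end_date": None,
}]
apply_type2(history, 42, "Chennai", date(2024, 6, 1))
current_rows = [row for row in history if row["end_date"] is None]
print(current_rows)
```

Both versions stay in the table, so a query can report either the current city or the city that was valid on any past date.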
I applied via Naukri.com and was interviewed in Oct 2024. There were 2 interview rounds.
Spark performance problems can arise due to inefficient code, data skew, resource constraints, and improper configuration.
Inefficient code can lead to slow performance, such as using collect() on large datasets.
Data skew can cause uneven distribution of data across partitions, impacting processing time.
Resource constraints like insufficient memory or CPU can result in slow Spark jobs.
Improper configuration settings, su...
I applied via LinkedIn and was interviewed in Jan 2024. There was 1 interview round.
PySpark is the Python API for Apache Spark, an open-source distributed computing engine.
PySpark is used for processing large datasets in parallel across a cluster of computers.
It provides high-level APIs in Python for Spark programming.
PySpark interoperates with other Python libraries such as Pandas and NumPy.
Example: Using PySpark to perform data analysis and machine learning tasks on big data sets.
PySpark SQL is the Spark module for working with structured data through SQL and the DataFrame API.
PySpark SQL allows users to run SQL queries on Spark DataFrames.
It provides a more concise and user-friendly way to interact with data than lower-level Spark RDDs.
Users can use SQL for data manipulation and analysis within the Spark ecosystem.
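To merge 2 dataframes with different schemas, use a join to combine columns, or align the schemas and union the rows.
Use join operations like inner join, outer join, left join, or right join on a shared key, based on the requirement.
To stack rows, align the schemas first, for example by adding the missing columns as nulls and casting mismatched types; in PySpark 3.1+, unionByName with allowMissingColumns=True does the null-filling.
Tools such as Apache Spark, Pandas, or SQL can all merge dataframes with different schemas this way.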
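The schema-alignment step can be sketched in plain Python, assuming each dataframe is represented as a list of dict rows. This is a stand-in for a real dataframe library, and union_by_name is a hypothetical helper.

```python
def union_by_name(rows_a, rows_b):
    """Stack two lists of dict rows, filling missing columns with None."""
    all_rows = rows_a + rows_b
    columns = []
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    return [{name: row.get(name) for name in columns} for row in all_rows]


customers_2023 = [{"id": 1, "name": "Asha"}]
customers_2024 = [{"id": 2, "name": "Ravi", "email": "ravi@example.com"}]
merged = union_by_name(customers_2023, customers_2024)
print(merged)
```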
PySpark Streaming is a scalable and fault-tolerant stream processing engine built on top of Apache Spark.
PySpark Streaming allows near-real-time processing of streaming data in micro-batches.
It provides high-level APIs in Python for creating streaming applications.
PySpark Streaming supports various data sources like Kafka, Flume, Kinesis, etc.
It enables windowed computations and stateful processing for handling streaming data.
Example: C...
I applied via Company Website and was interviewed in Jan 2024. There was 1 interview round.
Spark architecture includes a driver, a cluster manager, and worker nodes for distributed processing.
The driver program builds the execution plan and schedules tasks onto executors running on the worker nodes.
The cluster manager (for example Standalone, YARN, or Kubernetes) allocates resources to the application across worker nodes.
Executors on the worker nodes run the tasks and keep data in memory or on disk for processing.
Example: In a Spark application, the driver progr...
I applied via Recruitment Consultant and was interviewed before Jul 2023. There were 2 interview rounds.
Handling ADF pipelines involves designing, building, and monitoring data pipelines in Azure Data Factory.
Designing data pipelines using ADF UI or code
Building pipelines with activities like copy data, data flow, and custom activities
Monitoring pipeline runs and debugging issues
Optimizing pipeline performance and scheduling triggers
I applied via Naukri.com and was interviewed in Feb 2024. There was 1 interview round.
Basics of SQL and joins