I applied via Referral
Data frames can be created in Python using libraries like Pandas.
Import the Pandas library
Create a dictionary or list of data
Use the pandas DataFrame constructor (pd.DataFrame) to convert the data into a data frame, as sketched below
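A minimal sketch of these three steps, assuming pandas is installed (the column names and values are just illustrative):

import pandas as pd  # step 1: import the pandas library

# step 2: prepare the data as a dictionary of column -> values
data = {'name': ['Alice', 'Bob'], 'age': [30, 25]}

# step 3: convert the dictionary into a DataFrame
df = pd.DataFrame(data)
print(df)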
The question is incomplete and lacks context; it appears to be about creating a DataFrame in a programming language such as Python.
The code snippet provided is incomplete and does not state which programming language is being used.
Assuming Python, the correct syntax to create a DataFrame with the pandas library would be df = pd.DataFrame()
The createDataFrame() function is not a standard pandas function (it belongs to PySpark's SparkSession API), so...
I applied via Recruitment Consultant
I applied via Naukri.com and was interviewed in Nov 2024. There was 1 interview round.
I am a Senior Data Engineer with experience in building scalable data pipelines and optimizing data processing workflows.
Experience in designing and implementing ETL processes using tools like Apache Spark and Airflow
Proficient in working with large datasets and optimizing query performance
Strong background in data modeling and database design
Worked on projects involving real-time data processing and streaming analytics
Decorators in Python are functions that modify the behavior of other functions or methods.
Decorators are defined using the @decorator_name syntax before a function definition.
They can be used to add functionality to existing functions without modifying their code.
Decorators can be used for logging, timing, authentication, and more.
Example: the built-in @staticmethod decorator in Python is used to define a static method in a class; a custom timing decorator is sketched below.
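A minimal, self-contained timing decorator illustrating the logging/timing use case (the function names here are made up):

import time
from functools import wraps

def timed(func):
    # Wraps a function and prints how long each call takes
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(1_000_000)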
SQL query to group by employee ID and combine first name and last name with a space
Use the GROUP BY clause to group by employee ID
Use the CONCAT function to combine first name and last name with a space
Select employee ID and CONCAT(first_name, ' ', last_name) AS full_name; in most SQL dialects the name columns must also appear in the GROUP BY or inside an aggregate
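One way to run this, sketched with Spark SQL via PySpark (the employees table and its columns are assumptions made up for the example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concat_names").getOrCreate()

# Hypothetical employees data registered as a temp view
employees = spark.createDataFrame(
    [(1, "Alice", "Smith"), (2, "Bob", "Jones")],
    ["employee_id", "first_name", "last_name"],
)
employees.createOrReplaceTempView("employees")

result = spark.sql("""
    SELECT employee_id,
           CONCAT(first_name, ' ', last_name) AS full_name
    FROM employees
    GROUP BY employee_id, first_name, last_name
""")
result.show()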
Constructors in Python are special methods used for initializing objects. They are called automatically when a new instance of a class is created.
Constructors are defined using the __init__() method in a class.
They are used to initialize instance variables of a class.
Example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

person1 = Person('Alice', 30)
Indexing in SQL is a technique used to improve the performance of queries by creating a data structure that allows for faster retrieval of data.
Indexes are created on columns in a database table to speed up the retrieval of rows that match a certain condition in a WHERE clause.
Indexes can be created using CREATE INDEX statement in SQL.
Types of indexes include clustered indexes, non-clustered indexes, unique indexes, an...
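A small, self-contained illustration of CREATE INDEX using Python's built-in sqlite3 module (the table and column names are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and an index on the column used in WHERE clauses
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, dept TEXT, salary REAL)")
cur.execute("CREATE INDEX idx_employees_dept ON employees (dept)")

# Queries filtering on dept can now use the index instead of a full table scan
cur.execute("EXPLAIN QUERY PLAN SELECT * FROM employees WHERE dept = ?", ("Engineering",))
print(cur.fetchall())
conn.close()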
Spark works well with Parquet files due to its columnar storage format, efficient compression, and ability to push down filters.
Parquet files are columnar storage format, which aligns well with Spark's processing model of working on columns rather than rows.
Parquet files support efficient compression, reducing storage space and improving read performance in Spark.
Spark can push down filters to Parquet files, allowing f...
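A rough PySpark sketch of the columnar read and filter pushdown points (the path and columns are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_demo").getOrCreate()

df = spark.createDataFrame([(1, "A", 100), (2, "B", 200)], ["id", "category", "amount"])

# Write in columnar Parquet format (compressed with snappy by default)
df.write.mode("overwrite").parquet("/tmp/demo_parquet")

# Column pruning plus filter pushdown: only the needed columns and row groups are read
filtered = (
    spark.read.parquet("/tmp/demo_parquet")
    .select("id", "amount")
    .filter("amount > 150")
)
filtered.explain()  # the physical plan shows the pushed filters on the Parquet scan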
I was interviewed in Aug 2024.
Python and SQL tasks
I was interviewed in Sep 2024.
PySpark is the Python API for big data processing on the Apache Spark framework.
PySpark is used for processing large datasets in parallel.
It provides APIs for data manipulation, querying, and analysis.
Example: Using pyspark to read a CSV file and perform data transformations.
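A minimal version of that example, assuming PySpark is available (the path, column names, and values are made up; a tiny CSV is written first so the read step has something to load):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv_demo").getOrCreate()

# Write a small sample CSV to a placeholder path
spark.createDataFrame(
    [("2024-06-01 10:00:00", "COMPLETED", 250.0), ("2024-06-01 12:30:00", "CANCELLED", 90.0)],
    ["order_ts", "status", "amount"],
).write.mode("overwrite").option("header", True).csv("/tmp/orders_csv")

# Read the CSV file into a DataFrame
orders = spark.read.csv("/tmp/orders_csv", header=True, inferSchema=True)

# Typical transformations: filter, derive a column, aggregate
daily_totals = (
    orders.filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.show()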
Databricks optimisation techniques improve performance and efficiency of data processing on the Databricks platform.
Use cluster sizing and autoscaling to optimize resource allocation based on workload
Leverage Databricks Delta for optimized data storage and processing
Utilize caching and persisting data to reduce computation time
Optimize queries by using appropriate indexing and partitioning strategies
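To make the caching and partitioning points concrete, a hedged PySpark sketch (the dataset, columns, and output path are made up; Databricks-specific features such as Delta OPTIMIZE are left out):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("optim_demo").getOrCreate()

# Small stand-in dataset (in practice this would come from cloud storage or Delta)
events = spark.createDataFrame(
    [(1, "click", "2024-06-01"), (2, "view", "2024-06-02")],
    ["user_id", "event_type", "event_date"],
)

# Cache a DataFrame that several downstream queries reuse
events.cache()
events.count()  # materialise the cache

# Write partitioned by a commonly filtered column so queries can prune partitions
events.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_partitioned")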
Databricks is a unified data analytics platform that provides a collaborative environment for data engineers.
Databricks is built on top of Apache Spark and provides a workspace for data engineering tasks.
It allows for easy integration with various data sources and tools for data processing.
Databricks provides features like notebooks, clusters, and libraries for efficient data engineering workflows.
posted on 26 Oct 2024
I applied via Naukri.com and was interviewed in Sep 2024. There was 1 interview round.
Spark Optimization, Transformation, DLT, DL, Data Governance
Python
SQL
I applied via LinkedIn and was interviewed in Feb 2024. There were 3 interview rounds.
Working with nested JSON using PySpark involves using the StructType and StructField classes to define the schema and then using the select function to access nested fields.
Define the schema using StructType and StructField classes
Use the select function to access nested fields
Use dot notation to access nested fields, for example df.select('nested_field.sub_field')
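A small sketch of that approach (the JSON structure and field names are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("nested_json").getOrCreate()

# Define a schema with a nested struct
schema = StructType([
    StructField("id", IntegerType()),
    StructField("customer", StructType([
        StructField("name", StringType()),
        StructField("city", StringType()),
    ])),
])

data = ['{"id": 1, "customer": {"name": "Alice", "city": "Pune"}}']
df = spark.read.schema(schema).json(spark.sparkContext.parallelize(data))

# Dot notation reaches into the nested struct
df.select("id", "customer.name", "customer.city").show()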
Implementing SCD2 involves tracking historical changes in data over time.
Identify the business key that uniquely identifies each record
Add effective start and end dates to track when the record was valid
Insert new records with updated data and end date of '9999-12-31'
Update end date of previous record when a change occurs
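A simplified, DataFrame-only SCD2 sketch in PySpark (real implementations typically use a Delta MERGE; the table layout and column names here are assumptions):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_demo").getOrCreate()

# Current dimension table: business key, attribute, validity window
dim = spark.createDataFrame(
    [(101, "Pune", "2023-01-01", "9999-12-31")],
    ["customer_id", "city", "start_date", "end_date"],
)

# Incoming change for customer 101
updates = spark.createDataFrame(
    [(101, "Mumbai", "2024-06-01")],
    ["customer_id", "city", "start_date"],
)

# Close the open record for keys that have an incoming change
closed = (
    dim.join(updates.select("customer_id", F.col("start_date").alias("new_start")),
             "customer_id", "left")
    .withColumn("end_date",
                F.when(F.col("new_start").isNotNull(), F.col("new_start"))
                 .otherwise(F.col("end_date")))
    .drop("new_start")
)

# Append the new version with the open-ended end date
new_rows = updates.withColumn("end_date", F.lit("9999-12-31"))
scd2 = closed.unionByName(new_rows)
scd2.show()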
Use a SQL query to select data from table 2 where data exists in table 1
Use a JOIN statement to link the two tables based on a common column
Specify the columns you want to select from table 2
Use a WHERE clause to check for existence of data in table 1
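One way to express that check, sketched with Spark SQL and made-up table names (this version uses EXISTS rather than an explicit JOIN):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exists_demo").getOrCreate()

spark.createDataFrame([(1,), (2,)], ["id"]).createOrReplaceTempView("table1")
spark.createDataFrame([(1, "x"), (3, "y")], ["id", "val"]).createOrReplaceTempView("table2")

# Keep only the table2 rows whose id also exists in table1
result = spark.sql("""
    SELECT t2.*
    FROM table2 t2
    WHERE EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t2.id)
""")
result.show()  # returns only id = 1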
The number of records retrieved after performing joins depends on the type of join - inner, left, right, or outer.
Inner join retrieves only the matching records from both tables
Left join retrieves all records from the left table and matching records from the right table
Right join retrieves all records from the right table and matching records from the left table
Outer join retrieves all records from both tables, filling unmatched columns with NULLs
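A tiny PySpark sketch that makes the counts concrete (two-row tables with one overlapping key):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join_counts").getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l_val"])
right = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r_val"])

# inner -> 1 row, left -> 2 rows, right -> 2 rows, full outer -> 3 rows
for how in ["inner", "left", "right", "full"]:
    print(how, left.join(right, "id", how).count())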
I applied via Recruitment Consultant and was interviewed in Mar 2024. There were 2 interview rounds.
Files can be read in AWS Glue using Data Catalog, crawlers, and Glue ETL jobs.
Use AWS Glue Data Catalog to store metadata information about the files.
Crawlers can automatically infer the schema of the files and populate the Data Catalog.
Glue ETL jobs can then be used to read the files from various sources like S3, RDS, etc.
Supports various file formats like CSV, JSON, Parquet, etc.
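A rough sketch of that flow using the awsglue job libraries (the database and table names are placeholders, and this only runs inside a Glue job environment where a crawler has already populated the Data Catalog):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders_csv"
)

# Convert to a Spark DataFrame for further transformations
df = dyf.toDF()
df.show()

job.commit()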
Duplicate records can be identified using SQL queries by comparing columns and using aggregate functions.
Use GROUP BY clause with COUNT() function to identify duplicate records
Use HAVING clause to filter out records with count greater than 1
Join the table with itself on specific columns to find duplicates
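A quick sketch of the GROUP BY / HAVING approach using Spark SQL (the table and columns are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dupes_demo").getOrCreate()

spark.createDataFrame(
    [(1, "a@x.com"), (2, "b@x.com"), (3, "a@x.com")],
    ["id", "email"],
).createOrReplaceTempView("users")

# Values that appear more than once are duplicates
dupes = spark.sql("""
    SELECT email, COUNT(*) AS cnt
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""")
dupes.show()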
I was interviewed in Apr 2024.
I am a Senior Data Engineer with experience in various projects involving columnar format files in Spark, understanding Spark internals, OLAP vs OLTP, and data warehousing concepts.
Projects: Developed ETL pipelines using Spark for processing large datasets, implemented data quality checks, and optimized query performance.
Columnar format file in Spark: It stores data in columnar format to improve query performance by re...
Software Engineer | 28 salaries | ₹2 L/yr - ₹10 L/yr
Softwaretest Engineer | 20 salaries | ₹3.5 L/yr - ₹7.5 L/yr
Power Apps Developer | 7 salaries | ₹3.8 L/yr - ₹6.5 L/yr
Devops Engineer | 7 salaries | ₹3.5 L/yr - ₹7.2 L/yr
Salesforce Developer | 7 salaries | ₹4.9 L/yr - ₹7.7 L/yr